Concepts and features
`epivaultr` is an R package used to extract data from EpiVault, the Born in Bradford data warehouse. It can be used for quick grabs of specific variables from specific tables. It can also be used to manage an entire data request end to end, from reading a user’s requested variables to writing output files ready for shipping.
This guide explains various concepts and features employed by `epivaultr`.
Working with variables
Variable naming
EpiVault organises data into projects, tables and variables with the following properties:
- An instance of EpiVault contains projects.
- Each project has a unique name within its EpiVault instance.
- A project contains tables.
- Each table has a unique name within its project.
- A table contains observations in rows and variables in columns. The
name of each column is the variable name.
- Each variable has a unique name within its table.
As each variable has a unique name within its table, the variable name can safely be used to refer to the variable within the context of that table.
However, within the wider context of EpiVault, the project name and table name are also needed to refer to the variable uniquely. We separate these with dots, e.g. `project_name.table_name.variable_name`, and call this the fully qualified variable name:
Reference type | Reference | Unique within |
---|---|---|
project name | BiB_CohortInfo | EpiVault |
table name | person_info | project (BiB_CohortInfo) |
variable name | Gender | table (person_info) |
fully qualified variable name | BiB_CohortInfo.person_info.Gender | EpiVault |
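As a minimal sketch of how a fully qualified variable name breaks down into its three parts, the base R snippet below splits names on the dots. `split_qualified_name()` is a hypothetical helper written for this illustration, not an `epivaultr` function, and it assumes that project, table and variable names do not themselves contain dots.

```r
# Split fully qualified variable names into project, table and variable.
# split_qualified_name() is a hypothetical helper, not part of epivaultr.
split_qualified_name <- function(x) {
  parts <- strsplit(x, ".", fixed = TRUE)
  # Assumption: each name has exactly three dot-separated parts
  stopifnot(all(lengths(parts) == 3))
  data.frame(
    project  = vapply(parts, `[`, character(1), 1),
    table    = vapply(parts, `[`, character(1), 2),
    variable = vapply(parts, `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

split_qualified_name("BiB_CohortInfo.person_info.Gender")
```

The result is a one-row data frame with the project, table and variable names in separate columns.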
Variable file formats
Most data requests and complex projects will start by reading a variable list from a file using `read_ev_variables()`.
Although simple data requests can be run directly in code by selecting variables using `make_ev_variables()` or doing a quick table query using `ev_simple_fetch()`, maintaining a variables file is considered more reproducible and future-proof.
`read_ev_variables()` supports delimited text or MS Excel formats, which can be in one of the following structures:
Format | Column 1 | Column 2 | Column 3 |
---|---|---|---|
Single column | Fully qualified variable name | none | none |
Three columns | project name | table name | variable name |
A header row is optional; `read_ev_variables()` will skip over it if the column names are obvious, e.g. `project`, `table`, `variable`, `name` etc.
If `read_ev_variables()` returns an empty result then it may have encountered problems interpreting the structure. Try the following:
- Remove the header row
- Remove any blank rows that occur before the end of the file
- Make sure there are exactly one or three columns
- If there is one column, make sure this contains fully qualified variable names
- If there are three columns, make sure these are project, table and variable in that order. For three-column text formats make sure a recognised delimiter is used, e.g. a tab or a comma.
If the file extension is `csv`, `txt`, or `tsv`, `read_ev_variables()` will try to read the file as a delimited text file, e.g. comma- or tab-separated or fixed width. If it cannot determine the delimiter it will try to read the file line by line.
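To make the two layouts concrete, the sketch below writes the same request once as a single-column file and once as a three-column file, then uses base R to confirm that they describe the same variables. The file contents are illustrative. Either file is the kind of input `read_ev_variables()` expects, but its arguments are not described in this guide, so the call itself is left out.

```r
# Write the same request in both supported layouts, using temporary files
one_col   <- tempfile(fileext = ".txt")
three_col <- tempfile(fileext = ".csv")

# Single column: fully qualified variable names, no header row
writeLines(
  c("BiB_CohortInfo.person_info.Gender",
    "BiB_Baseline.base_m_survey.eth*"),
  one_col
)

# Three columns: project, table, variable (comma-delimited, no header row)
writeLines(
  c("BiB_CohortInfo,person_info,Gender",
    "BiB_Baseline,base_m_survey,eth*"),
  three_col
)

# Read the three-column file with base R and rebuild the qualified names
three <- read.csv(three_col, header = FALSE,
                  col.names = c("project", "table", "variable"),
                  stringsAsFactors = FALSE)
qualified <- paste(three$project, three$table, three$variable, sep = ".")

# Compare against the single-column file
identical(qualified, readLines(one_col))
```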
Wildcards
Wildcards can be used in variable names to capture groups of variables from a single table in one go. `?` matches a single character and `*` matches any number of characters.
Currently, wildcards can only be used in variable names, not table or project names.
Pattern | Result |
---|---|
`BiB_Baseline.base_m_survey.*` | Returns all variables in the table BiB_Baseline_base_m_survey |
`BiB_Baseline.base_m_survey.eth*` | Returns all variables with the prefix eth in the table BiB_Baseline_base_m_survey |
`BiB_Baseline.base_m_survey.agem?_mbqall` | Returns BiB_Baseline.base_m_survey.agemm_mbqall and BiB_Baseline.base_m_survey.agemy_mbqall |
`BiB_Baseline.base_?_survey.*` | Not supported: wildcards can only be used in the variable name part |
Wildcards in the variable name work in both the one-column and three-column variables file formats.
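The matching rules for `?` and `*` can be tried out in base R, where `utils::glob2rx()` converts a wildcard pattern into a regular expression. This is only an illustration of the semantics: `match_wildcard()` is a hypothetical helper, the variable list is partly made up, and `epivaultr` may resolve wildcards differently internally.

```r
# Illustrative variable names for one table. agemm_mbqall and agemy_mbqall
# appear in the examples above; the remaining names are made up.
vars <- c("agemm_mbqall", "agemy_mbqall", "eth0ethgrp", "eth0eth9gp", "mmsex")

# match_wildcard() is a hypothetical helper, not part of epivaultr.
# glob2rx() turns "?" into "any one character" and "*" into "any characters".
match_wildcard <- function(pattern, x) {
  x[grepl(utils::glob2rx(pattern), x)]
}

match_wildcard("agem?_mbqall", vars)  # the two agem?_mbqall variables
match_wildcard("eth*", vars)          # variables with the prefix eth
match_wildcard("*", vars)             # every variable in the table
```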
Variable visibility
Variable visibility is the concept used to manage fine-grained data access permissions in EpiVault. Every variable is assigned a visibility level from 0 to 9 in its variable metadata. The visibility value is the minimum privilege level a user needs in order to access the variable. So:
variable visibility | meaning |
---|---|
0 | All users can see this variable |
5 | All users with a privilege level of 5 and above can see this variable |
9 | Only users with the highest privilege level can see this variable |
And, conversely:
user privilege level | meaning |
---|---|
0 | Can only access variables with visibility level 0 |
5 | Can access all variables with visibility level 5 and below |
9 | Can access all variables |
For the most part, only visibility levels 0 and 9 are used. Variables that any user can access are given visibility level 0. Sensitive variables that only users with elevated privileges can access are given visibility level 9.
When `epivaultr` queries EpiVault, i.e. via the `fetch_` functions, it applies a `visibility` parameter. This defaults to 0, meaning only variables that are visible to all users are returned. If you need to access a sensitive variable, e.g. date of birth, you will need to assign a higher value to the `visibility` parameter, probably 9. For this to work, your user account must have been granted the corresponding elevated privileges within the EpiVault database.
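The rule behind the two tables above is simply that a variable is returned when its visibility level is less than or equal to the level requested. A minimal sketch of that comparison on toy metadata follows; the variable names, the data frame layout and the `visible_to()` helper are all assumptions made for illustration, not `epivaultr` objects.

```r
# Toy variable metadata; names and values are illustrative only
meta <- data.frame(
  variable   = c("Gender", "BirthYear", "DateOfBirth"),
  visibility = c(0, 5, 9),
  stringsAsFactors = FALSE
)

# visible_to() is a hypothetical helper, not part of epivaultr.
# A variable is visible when its level does not exceed the requested level.
visible_to <- function(meta, level) {
  meta$variable[meta$visibility <= level]
}

visible_to(meta, 0)  # default: only variables visible to all users
visible_to(meta, 5)  # adds variables with visibility up to 5
visible_to(meta, 9)  # all variables
```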
Required columns
A variable can be indicated as required in its metadata. Whenever data from a table is queried, all required columns will be returned, as well as those requested.
For many tables, the only required column will be a record ID such as `person_id`. But often there may be other required variables, such as additional row identifiers, or important dimensions such as date or age.
If you need to check for required variables in a data request, you can use `fetch_ev_meta_vars()` to return the variable metadata and inspect the `required` column (1=required). Alternatively, you can access the same variable metadata from an `ev_data` container using `get_ev_metadata()`.
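The effect of required columns on a query can be sketched with toy metadata: the columns returned are the required ones plus those requested. Only the `required` column and its 1=required coding come from this guide; the variable names and the data frame layout are assumptions for illustration.

```r
# Toy variable metadata for one table. The `required` column follows the
# 1 = required convention; the variable names are illustrative only.
meta <- data.frame(
  variable = c("person_id", "age_at_visit", "height", "weight"),
  required = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# The user asks for a single variable
requested <- c("height")

# Columns returned: all required columns plus those requested, no duplicates
returned <- union(meta$variable[meta$required == 1], requested)
returned
```

Here a request for one variable yields three columns, because the two required columns are always included.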
Containers
A container in `epivaultr` is a collection of nested lists. Containers hold various data frames and vectors, bundled together so that they can conveniently be passed to function calls as a single object. The contents can be accessed using a set of `get_` functions.
There are two main types of container:
Container | Description |
---|---|
`ev_variables` | The data request: contains the variables required for a data request, with associated information |
`ev_data` | The data extract: contains the data returned by a data request, with associated metadata |
ev_variables - the data request
Element | Description | Retrieve using |
---|---|---|
`variables` | A vector containing fully qualified variable names. This is the basis of the data request. | `get_ev_variables()` |
`projects` | A vector of names of the projects containing the requested variables. | `get_ev_projects()` |
`tables` | A vector of names of the tables containing the requested variables. | `get_ev_tables()` |
`vars_df` | A data frame constructed from the above information, with columns containing the fully qualified variable name, project name, table name and variable name for each variable requested. | `get_ev_vars_df()` |
ev_data - the data extract
Element | Description | Retrieve using |
---|---|---|
`data` | A list of data frames, one per table returned by the data request. | `get_ev_data()` |
`metadata` | A list of three data frames containing metadata about the data returned: `variable` contains variable-level metadata for all variables; `category` contains value labels used by all the categorical variables in the request; `table` contains metadata about all the tables used in the request. | `get_ev_metadata()` |
`request` | A copy of the `ev_variables` container used for the data request. | `get_ev_request()` |
A note on get_ functions
As the containers are just list objects, you can access the contents directly using `$` notation. However, use of the `get_` functions wherever possible is recommended, as this will be more robust to any future changes in the internal structure of the containers.
For example, say we have an `ev_data` container called `data_request` that contains a data frame called `proj1.tab1`. We can access this and assign it to a data frame called `dat` in two ways:
# using the get_ function - recommended, won't break in future
dat <- get_ev_data(data_request, df_name = "proj1.tab1")
# does the same thing - but not recommended, may break in future
dat <- data_request$data$proj1.tab1
If the internal structure of the containers changes in future, the implementation of the associated `get_` functions will change in parallel, so previous code that uses these should still run OK. But any code that addresses the container elements directly using `$` notation may break.
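The reasoning can be illustrated with a toy container whose internal layout changes between two versions. Everything in this sketch is hypothetical: the constructors, the second layout and `get_toy_data()` are invented for the example and say nothing about how `epivaultr` containers are or will be structured.

```r
# Two toy container layouts; neither reflects epivaultr's real internals
make_container_v1 <- function(df) {
  list(data = list(proj1.tab1 = df))
}
make_container_v2 <- function(df) {
  list(payload = list(tables = list(proj1.tab1 = df)))
}

# get_toy_data() is a hypothetical accessor that is updated alongside the
# container, so callers do not need to know which layout they have
get_toy_data <- function(container, df_name) {
  if (!is.null(container$data)) {
    container$data[[df_name]]
  } else {
    container$payload$tables[[df_name]]
  }
}

df <- data.frame(id = 1:3)
old_layout <- make_container_v1(df)
new_layout <- make_container_v2(df)

# The accessor returns the same data frame for both layouts
identical(get_toy_data(old_layout, "proj1.tab1"),
          get_toy_data(new_layout, "proj1.tab1"))

# Direct access written against the old layout finds nothing in the new one
is.null(new_layout$data$proj1.tab1)
```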