yadg objects
yadg dataschema
A dataschema is an object defining the files and folders to be processed by yadg, as well as the types of parsers and the parser options to be applied. One can think of a dataschema as a representation of a single experiment, containing measurements from multiple sources (or devices) and following a succession of experimental steps.
The current version of the dataschema is implemented as a Pydantic model in the class DataSchema of the dgbowl_schemas.yadg module. Previous versions of the dataschema are available in the same repository.
An example is a simple catalytic test with a temperature ramp. The goal of such an experiment may be to measure the catalytic conversion as a function of temperature, and then calculate the activation energy of the catalytic reaction. The monitored devices and their filetypes are:
the inlet flow and pressure meters -> csv data in foo.csv
the temperature controller -> csv data in bar.csv
the gas chromatograph -> Fusion json data in the ./GC/ folder
Despite these three devices measuring concurrently, we would have to specify three separate steps in the schema to process all relevant output files:
{
    "metadata": {
        "provenance": {
            "type": "manual"
        },
        "version": "4.1"
    },
    "steps": [{
        "parser": "basiccsv",
        "input": {"files": ["foo.csv"]},
        "tag": "flow"
    },{
        "parser": "basiccsv",
        "input": {"files": ["bar.csv"]}
    },{
        "parser": "chromtrace",
        "input": {"folders": ["./GC/"]},
        "parameters": {"filetype": "fusion.json"}
    }]
}
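As a rough illustration of how such a schema is structured, the sketch below loads the same three-step schema with Python's standard json module and lists which parser is applied to which input. The helper summarise_steps is hypothetical and is not part of yadg or dgbowl_schemas; the actual validation is done by the Pydantic model mentioned above.

```python
import json

# The same three-step schema as in the example above, written compactly.
SCHEMA_JSON = """
{
    "metadata": {"provenance": {"type": "manual"}, "version": "4.1"},
    "steps": [
        {"parser": "basiccsv", "input": {"files": ["foo.csv"]}, "tag": "flow"},
        {"parser": "basiccsv", "input": {"files": ["bar.csv"]}},
        {"parser": "chromtrace", "input": {"folders": ["./GC/"]},
         "parameters": {"filetype": "fusion.json"}}
    ]
}
"""


def summarise_steps(schema: dict) -> list:
    """Hypothetical helper: return (parser, first input path) for each step."""
    summary = []
    for step in schema["steps"]:
        # Each step in this example points either at files or at folders.
        sources = step["input"].get("files") or step["input"].get("folders")
        summary.append((step["parser"], sources[0]))
    return summary


schema = json.loads(SCHEMA_JSON)
step_summary = summarise_steps(schema)
for parser, source in step_summary:
    print(f"{parser}: {source}")
```

One step per device output keeps the mapping between parser and files explicit, even though the devices were recording at the same time.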
Note
Further information about the dataschema can be found in the documentation of the dgbowl_schemas.yadg module.
yadg datagram
The datagram is a structured and annotated representation of both raw and
processed data. Here, “raw data” corresponds to data present in the output files
as they come out of an instrument directly, while “processed data” corresponds
to data after any processing – whether the processing is applying a calibration
curve, or a more involved transformation, such as deriving \(Q_0\) and
\(f_0\) from \(\Gamma(f)\) in the yadg.parsers.qftrace
module.
The datagram is designed to be a mirror of the raw data files, with:
uncertainties of measured datapoints
units associated with data
a consistent data structure for timestamped traces
a consistent variable mapping between different filetypes
Additionally, the datagram is annotated by relevant metadata, including:
version information
clear provenance of raw data
documentation of any post-processing
uniform data timestamping between all datagrams
Note
The datagram does not guarantee that the data in the timesteps is normalized. This means entries may or may not be present in all timesteps within a step. For example, if an analyte first appears in the chromatographic traces only after the first few timesteps, the entry corresponding to the concentration of that analyte is not back-filled by yadg for the earlier timesteps.
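In practice, code that reads a datagram has to tolerate such missing entries. The following is a minimal sketch, assuming the timestep layout described later in this section (a uts timestamp and a derived dict per timestep). The analyte key "CO2", the values, and the helper collect are hypothetical and only serve as an illustration.

```python
# Three hypothetical timesteps; the analyte only shows up in the last one.
timesteps = [
    {"uts": 0.0, "derived": {}},
    {"uts": 60.0, "derived": {}},
    {"uts": 120.0, "derived": {"CO2": {"n": 0.02, "s": 0.001, "u": " "}}},
]


def collect(steps, key):
    """Return (uts, value) pairs, with None where the entry is absent."""
    series = []
    for ts in steps:
        entry = ts["derived"].get(key)  # None if the analyte is not present
        series.append((ts["uts"], entry["n"] if entry is not None else None))
    return series


series = collect(timesteps, "CO2")
print(series)
```

Whether a missing entry should be treated as zero, as unknown, or dropped altogether is left to the consumer of the datagram.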
TODO
https://github.com/dgbowl/yadg/issues/4
The specification of the datagram schema should be moved to a Pydantic-based model. This feature is expected to be included in yadg-5.0.
As of yadg-4.0, the datagram is a dict which always contains two top-level entries:
The metadata dict, with information about:
    the version of yadg and the execution command used to generate the datagram;
    a copy of the dataschema used to create the datagram;
    the version of the datagram; and
    the datagram creation timestamp formatted according to ISO 8601.
The steps list[dict], containing the data. The length of this list matches the length of the steps in the dataschema used to generate the datagram. Each element within the steps has further mandatory entries:
    The step-specific metadata, formatted as a dict.
    The data list[dict], containing the actual data, organised as a time series. Each entry in "data" has:
        a Unix timestamp in its uts float entry,
        a filename of the raw data in its fn str entry,
        a raw dict entry containing any data directly from fn,
        a derived dict entry containing any post-processed data.
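The layout described above can be sketched as a plain Python dict. This is an illustration only: the contents of the top-level metadata are left empty because their exact key names are not given here, and the step metadata, values, units, and the helper iter_timesteps are hypothetical.

```python
datagram = {
    "metadata": {},  # contents omitted; exact key names are not specified here
    "steps": [
        {
            "metadata": {"tag": "flow"},  # hypothetical step-specific metadata
            "data": [
                {
                    "uts": 1600000000.0,
                    "fn": "foo.csv",
                    "raw": {"flow": {"n": 15.0, "s": 0.1, "u": "ml/min"}},
                    "derived": {},
                },
                {
                    "uts": 1600000060.0,
                    "fn": "foo.csv",
                    "raw": {"flow": {"n": 15.2, "s": 0.1, "u": "ml/min"}},
                    "derived": {},
                },
            ],
        }
    ],
}


def iter_timesteps(dg):
    """Yield (step index, uts, fn, raw) for every timestep in the datagram."""
    for index, step in enumerate(dg["steps"]):
        for timestep in step["data"]:
            yield index, timestep["uts"], timestep["fn"], timestep["raw"]


rows = list(iter_timesteps(datagram))
for index, uts, fn, raw in rows:
    print(index, uts, fn, raw["flow"]["n"])
```

Because every timestep carries its own uts and fn, each datapoint can be traced back to a file and a point in time without consulting the schema.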
Warning
The post-processing aspects of yadg are deprecated in favour of the dgpost package and will likely be removed from yadg-5.0.
All measurement (floating-point) data has to be provided using the "property": {"n": value, "s": error, "u": "unit"} syntax, where both "n" and "s" are float and "u" is str. The data can be organised in nested data structures; these are validated recursively.
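A recursive check of this convention might look like the sketch below. It is not yadg's validator: find_invalid is a hypothetical helper that treats any dict containing an "n" key as a measurement, handles the scalar case only, and ignores everything that is not a dict. List-valued entries would need an additional branch.

```python
def find_invalid(node: dict, path: str = "") -> list:
    """Return the paths of measurements that break the n/s/u convention."""
    problems = []
    for key, value in node.items():
        here = f"{path}/{key}" if path else key
        if not isinstance(value, dict):
            continue  # non-dict entries are not measurements in this sketch
        if "n" in value:
            # A measurement: "n" and "s" must be float, "u" must be str.
            valid = (
                isinstance(value["n"], float)
                and isinstance(value.get("s"), float)
                and isinstance(value.get("u"), str)
            )
            if not valid:
                problems.append(here)
        else:
            # A nested structure: descend into it.
            problems.extend(find_invalid(value, here))
    return problems


good = {
    "T": {"n": 350.0, "s": 0.5, "u": "K"},
    "inlet": {"p": {"n": 1.0, "s": 0.01, "u": "bar"}},
}
bad = {
    "T": {"n": 350.0, "u": "K"},  # missing "s"
    "inlet": {"p": {"n": "1.0", "s": 0.01, "u": "bar"}},  # "n" is a str
}
good_problems = find_invalid(good)
bad_problems = find_invalid(bad)
print(good_problems, bad_problems)
```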
In most cases, the data will consist of a single value per timestep. However, it is also possible to store lists of data in each timestep. Generally, yadg will store such data under a traces key in the appropriate raw or derived entry:
"raw": {
    "traces": {
        "FID": {
            "t": {"n": [0, 1, 2, 3, 4], "s": [0.1, 0.1, 0.1, 0.1, 0.1], "u": "s"},
            "y": {"n": [5, 6, 9, 9, 4], "s": [0.5, 0.5, 0.5, 0.5, 0.5], "u": " "}
        }
    }
}
The above example shows how a chromatographic trace might be stored. At each timestep, multiple values of "t" and "y" are recorded and stored in a list, along with their uncertainties; the unit applies to each element in the list.
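Reading such a trace back amounts to pairing up the list elements by position. The sketch below uses the FID values from the example above; the helper trace_points is hypothetical and only shows one way of consuming the structure.

```python
raw = {
    "traces": {
        "FID": {
            "t": {"n": [0, 1, 2, 3, 4], "s": [0.1, 0.1, 0.1, 0.1, 0.1], "u": "s"},
            "y": {"n": [5, 6, 9, 9, 4], "s": [0.5, 0.5, 0.5, 0.5, 0.5], "u": " "},
        }
    }
}


def trace_points(trace):
    """Pair elements of "t" and "y" into (t, y, uncertainty of y) tuples."""
    t, y = trace["t"], trace["y"]
    return list(zip(t["n"], y["n"], y["s"]))


points = trace_points(raw["traces"]["FID"])
# Point with the highest signal; max() keeps the first one if there is a tie.
peak = max(points, key=lambda point: point[1])
print(peak, raw["traces"]["FID"]["t"]["u"])
```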
Note
Further information about the datagram can be found in the documentation of the datagram validator function, yadg.core.validators.validate_datagram(), as well as in the documentation of each parser.