yadg objects

yadg dataschema

A dataschema is an object defining the files and folders to be processed by yadg, as well as the types of parsers and the parser options to be applied. One can think of a dataschema as a representation of a single experiment, containing measurements from multiple sources (or devices) and following a succession of experimental steps.

The current version of the dataschema is implemented as a Pydantic model in the class DataSchema of the dgbowl_schemas.yadg module. Previous versions of the dataschema are available in the same repository.

An example is a simple catalytic test with a temperature ramp. The goal of such an experiment may be to measure the catalytic conversion as a function of temperature, and then calculate the activation energy of the catalytic reaction. The monitored devices and their filetypes are:

  • the inlet flow and pressure meters -> csv data in foo.csv

  • the temperature controller -> csv data in bar.csv

  • the gas chromatograph -> Fusion json data in ./GC/ folder

Although these three devices measure concurrently, we have to specify three separate steps in the dataschema to process all relevant output files:

{
    "version": "5.1",
    "metadata": {"provenance": "manual"},
    "step_defaults": {
        "timezone": "Europe/Berlin",
        "locale": "de_DE"
    },
    "steps": [
        {
            "tag": "flow",
            "extractor": {
                "filetype": "basic.csv",
                "locale": "en_GB",
                "parameters": {"sep": ","}
            },
            "input": {"files": ["foo.csv"]}
        },
        {
            "extractor": {
                "filetype": "basic.csv"
            },
            "input": {"files": ["bar.csv"]}
        },
        {
            "extractor": {
                "filetype": "fusion.json"
            },
            "input": {"folders": ["./GC/"]}
        }
    ]
}

As we set step_defaults -> locale to de_DE, the numbers in the localized files (such as the csv data) will be expected to use , as a decimal separator. These step_defaults can be overridden in each step using the extractor entry; see the steps -> [0] -> extractor -> locale entry, which is set to en_GB.
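The effect of the locale setting on number parsing can be sketched in Python. This is an illustrative stand-in only, not yadg's actual implementation:

```python
# Illustrative sketch of locale-dependent number parsing; this is NOT
# yadg's implementation, just a demonstration of the convention.
def parse_localized_float(text: str, locale: str) -> float:
    """Parse a number, treating "," as the decimal separator for de_DE."""
    if locale.startswith("de"):
        # German locales: "." groups thousands, "," separates decimals.
        text = text.replace(".", "").replace(",", ".")
    return float(text)

print(parse_localized_float("1.234,5", "de_DE"))  # -> 1234.5
print(parse_localized_float("1234.5", "en_GB"))   # -> 1234.5
```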

Note

Further information about the dataschema can be found in the documentation of the dgbowl_schemas.yadg module.
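Since the dataschema above is plain JSON, its structure can be inspected with the standard library. A minimal structural check might look as follows; the real validation is performed by the Pydantic model in dgbowl_schemas.yadg:

```python
import json

# The dataschema from the catalytic test example above.
dataschema = json.loads("""
{
    "version": "5.1",
    "metadata": {"provenance": "manual"},
    "step_defaults": {"timezone": "Europe/Berlin", "locale": "de_DE"},
    "steps": [
        {"tag": "flow",
         "extractor": {"filetype": "basic.csv", "locale": "en_GB",
                       "parameters": {"sep": ","}},
         "input": {"files": ["foo.csv"]}},
        {"extractor": {"filetype": "basic.csv"},
         "input": {"files": ["bar.csv"]}},
        {"extractor": {"filetype": "fusion.json"},
         "input": {"folders": ["./GC/"]}}
    ]
}
""")

# One step per device: flow/pressure meters, temperature controller, GC.
assert len(dataschema["steps"]) == 3
# The first step overrides the default locale from step_defaults.
assert dataschema["steps"][0]["extractor"]["locale"] == "en_GB"
```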

yadg DataTree

The DataTree objects generated by yadg are structured and annotated representations of raw data. Here, “raw data” strictly denotes the data present in the parsed files, exactly as produced by the instrument. This “raw data” may therefore include derived quantities (e.g. data processed using a calibration curve in chromatography, or a more involved transformation in electrochemistry), as long as they are present in the parsed files.

The DataTree is designed to be a FAIR representation of the parsed files, with:

  • uncertainties of measured datapoints;

  • units associated with data;

  • a consistent data structure for timestamped traces;

  • a standardised layout for original metadata contained in the parsed files;

  • a consistent variable mapping between different filetypes.

Additionally, the DataTree is annotated by yadg-specific metadata, including:

  • version information;

  • clear provenance of the data;

  • uniform data timestamping within and between all DataTrees.

As of yadg-5.0, the DataTree can be exported as a NetCDF file, using HDF5 groups to store individual steps. In memory, the individual steps are nodes of the DataTree, containing a Dataset. The top level DataTree contains the following metadata stored in its attributes:

  • yadg_version: the version of yadg used to generate the DataTree;

  • yadg_command: the command line invocation of yadg used to create the DataTree;

  • yadg_process_date: the DataTree creation timestamp formatted according to ISO8601;

  • yadg_process_DataSchema: a copy of the dataschema used to create the DataTree.
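Assembled as a plain dictionary, the top-level attributes take a shape like the following; all values here are illustrative placeholders, not real yadg output:

```python
from datetime import datetime, timezone

# Illustrative placeholders for the top-level DataTree attributes;
# the command string and version are made up for this sketch.
attrs = {
    "yadg_version": "5.1.0",
    "yadg_command": "yadg process dataschema.json out.nc",
    "yadg_process_date": datetime.now(timezone.utc).isoformat(),
    "yadg_process_DataSchema": "{... JSON copy of the dataschema ...}",
}

# The process date is an ISO8601 string, e.g. "2024-01-31T12:00:00+00:00".
assert "T" in attrs["yadg_process_date"]
```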

The contents of the attribute fields for each step will vary depending on the extractor used to create the corresponding DataTree node. The following conventions are used:

  • the yadg-provenance metadata include yadg_extract_date and yadg_extract_Extractor containing the ISO8601 timestamp and a copy of the extractor defaults used to create this node;

  • an original_metadata entry containing all extracted metadata present in the original files;

  • the coordinate field uts contains a Unix timestamp (float);

  • uncertainties for entries in data_vars are stored as separate entries with names composed as f"{entry}_std_err":

    • the parent f"{entry}" points to its uncertainty via the ancillary_variables field;

    • the uncertainty links back to f"{entry}" via the standard_name field;

  • nested metadata is serialised to JSON;

  • the use of spaces (and other whitespace characters) in the names of entries is to be avoided;

  • the use of forward slashes (/) in the names of entries is not allowed.

These conventions follow the NetCDF CF Metadata Conventions; see Section 3.4 on Ancillary Data.
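The uncertainty cross-linking convention can be sketched with plain dictionaries standing in for the variable attributes. The variable name flow and the values below are illustrative; in yadg these are attributes on variables of an xarray Dataset:

```python
# Sketch of the ancillary_variables / standard_name cross-links between
# a data variable and its uncertainty (illustrative names and values).
data_vars = {
    "flow": {
        "values": [1.2, 1.3],
        "attrs": {
            "units": "ml/min",
            # parent -> uncertainty
            "ancillary_variables": "flow_std_err",
        },
    },
    "flow_std_err": {
        "values": [0.1, 0.1],
        "attrs": {
            "units": "ml/min",
            # uncertainty -> parent
            "standard_name": "flow standard_error",
        },
    },
}

# The cross-links are consistent in both directions:
parent = "flow"
err = data_vars[parent]["attrs"]["ancillary_variables"]
assert err == f"{parent}_std_err"
assert data_vars[err]["attrs"]["standard_name"].split()[0] == parent
```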