How to use yadg

We have prepared an interactive, Binder-compatible Jupyter notebook showing the installation and example usage of yadg. The latest version of the notebook is archived on Zenodo:

https://doi.org/10.5281/zenodo.6351210

There are two main ways of using yadg:

  1. A limited extractor mode, useful for extracting (meta)data from individual files.

  2. A fully featured parser mode, intended to process all files semantically related to a single “experiment”. This mode requires a dataschema.

Extractor mode

In this mode, yadg can be invoked by providing just the FileType and the path to the input file:

yadg extract filetype infile [outfile]

The infile will then be parsed by yadg into a DataTree and, if successful, saved as a NetCDF file, optionally at the specified outfile location. In addition to any original_metadata stored in the .attrs object of the resulting DataTree, the file will contain yadg-specific metadata, including the annotation of provenance (i.e. yadg extract), filetype information, and the resolved defaults of timezone, locale, and encoding used to create it.
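
Once written, the NetCDF file can be opened back into a DataTree in Python. Below is a minimal sketch of this, assuming a recent version of xarray which provides open_datatree(), and a placeholder output filename of data.nc:

import xarray as xr

# Open the NetCDF file written by "yadg extract" back into a DataTree;
# "data.nc" is a placeholder for the [outfile] used above.
dt = xr.open_datatree("data.nc")

# The yadg-specific metadata (provenance, filetype, timezone, locale,
# encoding) is stored in the attributes of the DataTree.
print(dt.attrs)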

Warning

By default, in extractor mode, yadg assumes the following defaults:

  • timezone is set to the local timezone of the machine running yadg,

  • locale is set to that machine's default LC_NUMERIC locale,

  • encoding of the input files is set to UTF-8 or the extractor default.

Any of these defaults may lead to improper parsing of the input files. Errors due to an incorrect encoding are likely to be immediately obvious, as they usually lead to crashes; locale errors might only become apparent upon inspection of the data (e.g. values parsed using the wrong decimal separator); and incorrect timezone information may lead to errors that are much more subtle. You can specify the correct values for these three parameters, if known, on the command line using:

yadg extract --locale=de_DE --encoding=utf-8 --timezone=Europe/Berlin filetype infile [outfile]

API endpoint for extractor mode

If you want to use yadg in your own code, you should use the common extractors API available in the yadg.extractors module:

yadg.extractors.extract(filetype: str, path: Path | str, timezone: str | None = None, encoding: str | None = None, locale: str | None = None, **kwargs: dict) → DataTree

Extract data and metadata from a path using the supplied filetype.

A wrapper around the extract_from_path() worker function, which creates a default extractor object. Coerces any str provided as path into a Path.

Parameters:
  • filetype – Specifies the filetype. Has to be a filetype supported by the dataschema.

  • path – A Path object pointing to the file to be extracted.

  • timezone – A str containing the TZ identifier, e.g. “Europe/Berlin”.

  • encoding – A str containing the encoding, e.g. “utf-8”.

  • locale – A str containing the locale name, e.g. “de_CH”.
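
A minimal usage sketch of this function is shown below; the filetype string and the file path are placeholders (here assuming an EC-Lab .mpr file), and all three of timezone, encoding, and locale are optional:

from yadg.extractors import extract

# Extract a single file into a DataTree, overriding the default
# timezone, encoding, and locale.
dt = extract(
    filetype="eclab.mpr",          # placeholder filetype, see the sidebar for supported ones
    path="experiment/data.mpr",    # placeholder path to the input file
    timezone="Europe/Berlin",
    encoding="windows-1252",
    locale="de_DE",
)

# Save the extracted DataTree into a NetCDF file.
dt.to_netcdf("data.nc")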

yadg.extractors.extract_from_path(source: Path, extractor: FileType) → DataTree

Extracts data and metadata from the provided path using the supplied extractor.

The individual extractor functionality of yadg is called from here. The data is always returned as a DataTree. The original_metadata entries in the returned objects are flattened using json serialisation. The returned objects have a to_netcdf() as well as a to_dict() method, which can be used to write them into a NetCDF file or convert them into a python dict, respectively.

Parameters:
  • source – A Path object pointing to the file to be extracted.

  • extractor – A FileType object describing the extraction process.

yadg.extractors.extract_from_bytes(source: bytes, extractor: FileType) → DataTree

Extracts data and metadata from the provided raw bytes using the supplied extractor.

The individual extractor functionality of yadg is called from here. The data is always returned as a DataTree. The original_metadata entries in the returned objects are flattened using json serialisation. The returned objects have a to_netcdf() as well as a to_dict() method, which can be used to write them into a NetCDF file or convert them into a python dict, respectively.

Parameters:
  • source – A bytes object containing the raw data to be extracted.

  • extractor – A FileType object describing the extraction process.

Warning

Please do not use the extract() functions from each extractor (e.g. yadg.extractors.eclab.mpr.extract_from_path()) directly. Those are not part of the user-facing API and their function signatures may change between minor or point versions.

Metadata-only extraction

To use yadg to extract and retrieve just the metadata contained in the input file, pass the --meta-only argument:

yadg extract --meta-only filetype infile

The metadata are returned as a .json file, generated using the to_dict() method of xarray.Dataset. They contain a description of the data coordinates (coords), dimensions (dims), and variables (data_vars), including their names, attributes, dtypes, and shapes.
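
A comparable metadata-only dictionary can be assembled in Python from an extracted DataTree. The sketch below is only an illustration under several assumptions: it reuses the extract() call shown above with placeholder arguments, and relies on xarray's Dataset.to_dict(data=False) to skip the array values:

import json
from yadg.extractors import extract

# Placeholder filetype and path, as in the earlier example.
dt = extract(filetype="eclab.mpr", path="experiment/data.mpr")

# Collect coords, dims, data_vars and attrs of every node in the tree,
# without the actual data values.
meta = {node.path: node.to_dataset().to_dict(data=False) for node in dt.subtree}

print(json.dumps(meta, indent=2, default=str))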

The list of supported filetypes that can be extracted using yadg can be found in the left sidebar. For more information about the extractor concept, see the Datatractor project.

Parser mode

The main purpose of yadg is to process a set of raw data files according to a provided dataschema into well-defined, annotated, FAIR-data NetCDF files. To use yadg in this way, it should be invoked as follows:

yadg process infile [outfile]

Where infile corresponds to the json or yaml file containing the dataschema, and the optional outfile is the filename to which the created DataTree should be saved (it defaults to datagram.nc).

In this fully-featured usage pattern via dataschema, the individual extractors can be further configured and combined. The currently implemented extractors are documented in the sidebar.

Dataschema from presets

This alternative form of using yadg in parser mode is especially useful when processing data organised in a consistent folder structure across several experimental runs. The user should prepare a preset file, which is then patched into a dataschema file using the provided folder path:

yadg preset infile folder [outfile]

Where infile is the preset, folder is the folder path for which the preset should be modified, and the optional outfile is the filename to which the created dataschema should be saved.

Alternatively, if the dataschema should be processed immediately, the --process (or -p) switch can be used with the following usage pattern:

yadg preset -p infile folder [outfile.nc]

This syntax will process the created dataschema immediately, and the DataTree will be saved to outfile.nc instead.

Finally, the raw data files in the processed folder can be archived, checksummed, and referenced in the DataTree, by using the following pattern:

yadg preset -p -a infile folder [outfile.nc]

This will create a NetCDF file in outfile.nc as well as an outfile.zip archive containing the whole contents of the specified folder.

Dataschema version updater

If you’d like to update a dataschema from a previous version of yadg to the current latest one, use the following syntax:

yadg update infile [outfile]

This will update the dataschema specified in infile and save it to outfile, if provided.

API for processing dataschema

yadg.core.process_schema(dataschema: DataSchema, strict_merge: bool = False) → DataTree

The main DataSchema processing function of yadg.

Takes in a DataSchema object, updates it to the latest version compatible with the installed version of yadg, processes each step, and returns a single DataTree created from the DataSchema.

Parameters:
  • dataschema – A DataSchema object describing the extraction process.

  • strict_merge – A bool indicating whether the metadata of the files processed in a single step have to be identical. Defaults to False, which means conflicting entries will be dropped.
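
A minimal sketch of calling this function from Python is shown below. It assumes the dataschema is stored in a json file, and that the to_dataschema() validator of the dgbowl_schemas package is used to build the DataSchema object; the filenames are placeholders:

import json
from dgbowl_schemas.yadg import to_dataschema
from yadg.core import process_schema

# Load and validate the dataschema; "schema.json" is a placeholder filename.
with open("schema.json", "r") as infile:
    ds = to_dataschema(**json.load(infile))

# Process every step of the dataschema into a single DataTree and save it.
dt = process_schema(ds, strict_merge=False)
dt.to_netcdf("datagram.nc")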