How to use yadg
We have prepared an interactive, Binder-compatible Jupyter notebook, showing the installation and example usage of yadg. The latest version of the notebook and the direct link to Binder are:
There are two main ways of using yadg:
A limited extractor mode, useful to extract (meta)-data from single, separate files.
A fully featured parser mode, requiring a dataschema, intended to process all files semantically related to a single “experiment”.
Extractor mode
In this mode, yadg can be invoked by providing just the FileType and the path to the input file:
yadg extract filetype infile [outfile]
The infile will then be parsed using yadg and, if successful, saved as a NetCDF file, optionally using the specified outfile location. The resulting NetCDF files will contain annotation of provenance (i.e. yadg extract), filetype information, and the resolved defaults of timezone, locale, and encoding used to create the file.
Warning
The extractor mode has been introduced in yadg-5.0 and its API is not yet stable.
Warning
In extractor mode, yadg assumes the following defaults:
timezone is set to the localtime of the localhost,
locale is set to the default LC_NUMERIC locale of the localhost,
encoding of the input files is set to UTF-8 or the extractor default.
Any of the above defaults might lead to improper parsing of the input files. Errors due to an improper locale might be obvious (e.g. data parsed using wrong decimal separators); incorrect timezone information may lead to errors that are more subtle.
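The effect of a wrong timezone or locale can be illustrated with a minimal Python sketch. It does not use yadg; the helper names to_unix and parse_number, the timestamp, and the numeric string are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_unix(stamp: str, utc_offset_hours: float) -> float:
    """Interpret a naive timestamp string in a fixed-offset timezone."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    naive = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=tz).timestamp()


def parse_number(text: str, decimal_sep: str) -> float:
    """Parse a number written with the given decimal separator."""
    thousands_sep = "." if decimal_sep == "," else ","
    return float(text.replace(thousands_sep, "").replace(decimal_sep, "."))


stamp = "2023-01-15 12:00:00"  # hypothetical timestamp without timezone info

# The same string read as UTC and as UTC+1 differs by one hour: a silent shift.
shift = to_unix(stamp, 0) - to_unix(stamp, 1)
print(shift)

# The same string read with "." or "," as the decimal separator differs
# by three orders of magnitude: an error that is usually easy to spot.
print(parse_number("1,234", "."), parse_number("1,234", ","))
```

The timezone shift leaves every value plausible, which is why it tends to go unnoticed until data from several sources are compared on a common time axis.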
Metadata-only extraction
To use yadg to extract and retrieve just the metadata contained in the input file, pass the --meta-only argument:
yadg extract --meta-only filetype infile
The metadata are returned as a .json file, and are generated using the to_dict() function of xarray.Dataset. They contain a description of the data coordinates (coords), dimensions (dims), and variables (data_vars), and include their names, attributes, dtypes, and shapes.
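As a rough sketch of how such metadata could be inspected, the Python snippet below summarises a small hand-written JSON document. The document is a hypothetical stand-in that follows the coords / dims / data_vars layout described above; the variable names uts and T, their attributes, and the helper summarise are assumptions, and the real output of yadg depends on the input file and the yadg version.

```python
import json

# Hypothetical stand-in for metadata returned by --meta-only; not real yadg output.
example = """
{
  "coords": {
    "uts": {"dims": ["uts"], "attrs": {"units": "s"}, "dtype": "float64", "shape": [3]}
  },
  "attrs": {},
  "dims": {"uts": 3},
  "data_vars": {
    "T": {"dims": ["uts"], "attrs": {"units": "K"}, "dtype": "float64", "shape": [3]}
  }
}
"""


def summarise(meta: dict) -> list:
    """Collect (section, name, dtype, shape) for every coordinate and variable."""
    rows = []
    for section in ("coords", "data_vars"):
        for name, entry in meta.get(section, {}).items():
            shape = tuple(entry.get("shape", ()))
            rows.append((section, name, entry.get("dtype"), shape))
    return rows


rows = summarise(json.loads(example))
for row in rows:
    print(row)
```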
The list of supported filetypes that can be extracted using yadg can be found in the left sidebar. For more information about the extractor concept, see MaRDA Metadata Extractors WG.
Parser mode
The main purpose of yadg is to process a set of raw data files according to a provided dataschema into a well-defined, annotated, FAIR-data file called a datagram. As of yadg-5.0, the datagram is a NetCDF file. To use yadg like this, invoke it as follows:
yadg process infile [outfile]
Here, infile is the json or yaml file containing the dataschema, and the optional outfile is the filename to which the created datagram should be saved (it defaults to datagram.nc).
In this fully-featured usage pattern via dataschema, yadg offloads the responsibility of data extraction and normalisation to its modules, called parsers. The currently implemented parsers are documented in the sidebar.
Dataschema from presets
This alternative form of using yadg in parser mode is especially useful when processing data organised in a consistent folder structure between several experimental runs. The user should prepare a preset file, which then gets patched to a dataschema file using the provided folder path:
yadg preset infile folder [outfile]
Here, infile is the preset, folder is the folder path for which the preset should be modified, and the optional outfile is the filename to which the created dataschema should be saved.
Alternatively, if the dataschema should be processed immediately, the --process (or -p) switch can be used with the following usage pattern:
yadg preset -p infile folder [outfile.json]
This syntax will process the created dataschema immediately, and the datagram will be saved to outfile.json instead.
Finally, the raw data files in the processed folder can be archived, checksummed, and referenced in the datagram by using the following pattern:
yadg preset -p -a infile folder [outfile.json]
This will create a datagram in outfile.json as well as an outfile.zip archive from the whole contents of the specified folder.
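When the same preset is applied to many experiment folders from a script, the three usage patterns above can be assembled programmatically. The Python sketch below only builds the argument list; the helper preset_command and the file and folder names are hypothetical. Because the text shows -a only in combination with -p, the sketch assumes that archiving requires processing and rejects other combinations.

```python
from typing import List, Optional


def preset_command(
    preset: str,
    folder: str,
    outfile: Optional[str] = None,
    process: bool = False,
    archive: bool = False,
) -> List[str]:
    """Assemble a `yadg preset` call following the usage patterns above."""
    # Assumption of this sketch: -a is only shown together with -p.
    if archive and not process:
        raise ValueError("this sketch only supports -a together with -p")
    cmd = ["yadg", "preset"]
    if process:
        cmd.append("-p")
    if archive:
        cmd.append("-a")
    cmd += [preset, folder]
    if outfile is not None:
        cmd.append(outfile)
    return cmd


# Hypothetical preset, folder, and output names.
cmd = preset_command(
    "preset.json", "run_01", "run_01.json", process=True, archive=True
)
print(" ".join(cmd))
```

With yadg installed and on the PATH, the resulting list could be passed to subprocess.run(cmd, check=True) to execute the call.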
Dataschema version updater
If you’d like to update a dataschema from a previous version of yadg to the current latest one, use the following syntax:
yadg update infile [outfile]
This will update the dataschema specified in infile and save it to outfile, if provided.