How to use **yadg** =================== We have prepared an interactive, Binder-compatible Jupyter notebook, showing the installation and example usage of **yadg**. The latest version of the notebook and the direct link to Binder are: .. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.6351210.svg :target: https://doi.org/10.5281/zenodo.6351210 .. image:: https://mybinder.org/badge_logo.svg :target: https://mybinder.org/v2/zenodo/10.5281/zenodo.6351210/?labpath=index.ipynb There are two main ways of using **yadg**: #. A limited `extractor` mode, useful to extract (meta)-data from single, separate files. #. A fully featured `parser` mode, requiring a `dataschema`, intended to process all files semantically related to a single "experiment". .. _extractor mode: `Extractor` mode ---------------- In this mode, **yadg** can be invoked by providing just the `FileType` and the path to the input file: .. code-block:: bash yadg extract filetype infile [outfile] The ``infile`` will be then parsed using **yadg** and, if successful, saved as a |NetCDF|_ file, optionally using the specified ``outfile`` location. The resulting |NetCDF|_ files will contain annotation of provenance (i.e. ``yadg extract``), `filetype` information, and the resolved defaults of `timezone`, `locale`, and `encoding` used to create the file. .. warning:: The `extractor` mode has been introduced in ``yadg-5.0`` and its API is not yet stable. .. warning:: In `extractor` mode, **yadg** assumes the following defaults: - `timezone` is set to the ``localtime`` of the `localhost`, - `locale` is set to the default ``LC.NUMERIC`` locale of the `localhost`, - `encoding` of the input files is set to ``UTF-8`` or the `extractor` default. All of the above options might lead to improper parsing of the input files. Errors due to improper `locale` might be obvious (e.g. data parsed using wrong decimal separators); incorrect `timezone` information may lead to errors that are more subtle. Metadata-only extraction ```````````````````````` To use **yadg** to extract and retrieve just the metadata contained in the input file, pass the ``--meta-only`` argument: .. code-block:: bash yadg extract --meta-only filetype infile The metadata are returned as a ``.json`` file, and are generated using the :func:`~xarray.Dataset.to_dict` function of :class:`xarray.Dataset`. They contain a description of the data coordinates (``coords``), dimensions (``dims``), and variables (``data_vars``), and include their names, attributes, dtypes, and shapes. The list of supported `filetypes` that can be extracted using **yadg** can be found in the left sidebar. For more information about the `extractor` concept, see |marda_extractors|_. .. _parser mode: `Parser` mode ------------- The main purpose of **yadg** is to process a bunch of raw data files according to a provided `dataschema` into a well-defined, annotated, FAIR-data file called `datagram`. As of ``yadg-5.0``, the `datagram` is a |NetCDF|_ file. To use **yadg** like this, it should be invoked as follows: .. code-block:: bash yadg process infile [outfile] Where ``infile`` corresponds to the ``json`` or ``yaml`` file containing the `dataschema`, and the optional ``outfile`` is the filename to which the created `datagram` should be saved (it defaults to ``datagram.nc``). In this fully-featured usage pattern via `dataschema`, **yadg** offloads the responsibility of data extraction and normalisation to its modules, called `parsers`. The currently implemented `parsers` are documented in the sidebar. `Dataschema` from presets ````````````````````````` This alternative form of using **yadg** in `parser` mode is especially useful when processing data organised in a consistent folder structure between several experimental runs. The user should prepare a `preset` file, which then gets patched to a `dataschema` file using the provided folder path: .. code-block:: bash yadg preset infile folder [outfile] Where ``infile`` is the `preset`, ``folder`` is the folder path for which the `preset` should be modified, and the optional ``outfile`` is the filename to which the created `dataschema` should be saved. Alternatively, if the `dataschema` should be processed immediately, the ``--process`` (or ``-p``) switch can be used with the following usage pattern: .. code-block:: bash yadg preset -p infile folder [outfile.json] This syntax will process the created `dataschema` immediately, and the `datagram` will be saved to ``outfile.json`` instead. Finally, the raw data files in the processed ``folder`` can be archived, checksumed, and referenced in the `datagram`, by using the following pattern: .. code-block:: bash yadg preset -p -a infile folder [outfile.json] This will create a `datagram` in ``outfile.json`` as well as a ``outfile.zip`` archive from the whole contents of the specified ``folder``. `Dataschema` version updater ```````````````````````````` If you'd like to update a `dataschema` from a previous version of **yadg** to the current latest one, use the following syntax: .. code-block:: bash yadg update infile [outfile] This will update the `dataschema` specified in ``infile`` and save it to ``outfile``, if provided. .. _NetCDF: https://www.unidata.ucar.edu/software/netcdf/ .. _marda_extractors: https://github.com/marda-alliance/metadata_extractors .. |NetCDF| replace:: ``NetCDF`` .. |marda_extractors| replace:: MaRDA Metadata Extractors WG