dgpost features

Note

For an overview of the data-processing features within dgpost, see the documentation of the dgpost.transform module.

Pandas compatibility

One of the design goals of dgpost was to develop a library that can be used with yadg datagrams and the pd.DataFrames created by dgpost, as well as with any other pd.DataFrame, created e.g. by parsing an xlsx or csv file.

This is achieved by placing some necessary requirements on the functions in the dgpost.transform module. The key requirements are:

  • the function must process pint.Quantity objects,

  • the function must return data in a dict[str, pint.Quantity] format.

If these requirements are met, the decorator function load_data() can be used to either extract data from the supplied pd.DataFrame, or wrap directly supplied data into pint.Quantity objects, and pass those into the called transform function, transparently to the user.

Note

As of dgpost-2.0, dgpost internally converts the loaded tables into pd.DataFrames with a pd.MultiIndex as the column index, if necessary. All namespaces separated by -> are split into a pd.MultiIndex, and the units of those columns are organised accordingly. This means dgpost can read pd.MultiIndexed tables, and can extract data seamlessly from tables with a standard pd.Index as well as with a pd.MultiIndex.
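The namespace splitting described above can be sketched as follows; the column names are made up for illustration, and dgpost performs the equivalent conversion internally:

```python
# Minimal sketch: "->"-separated column names mapped onto a pd.MultiIndex.
import pandas as pd

flat = pd.DataFrame({"raw->T": [300.0], "raw->p": [1.0], "xout->CO2": [0.5]})
# Split each "namespace->name" column into a (namespace, name) tuple.
flat.columns = pd.MultiIndex.from_tuples(
    tuple(c.split("->")) for c in flat.columns
)
# Data in the "raw" namespace is now addressable by its top level.
subset = flat["raw"]
```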

Units and uncertainties

Another key objective of dgpost is to allow and encourage annotating data with units as well as error estimates / uncertainties. The design philosophy here is that by building unit- and uncertainty-awareness into the toolchain, users will be encouraged to use it and, in the case of uncertainties, be more thoughtful about the limitations of their data.

As discussed in the documentation of yadg, when experimental data is loaded from datagrams, it is annotated with units by default. In dgpost, the units for the data in each column in each table are stored as a dict[str, str] in the "units" key of the df.attrs attribute, and they are extracted and exported appropriately when the table is saved.

If the df.attrs attribute does not contain the "units" entry, dgpost assumes the underlying data is unitless, and the default units selected for each function in the dgpost.transform library by its developers are applied to the data. Internally, all units are handled using yadg’s custom pint.UnitRegistry, via the pint library.
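The "units" annotation described above amounts to a plain dict[str, str] alongside the table. A minimal sketch, with made-up column names and units:

```python
# Units for each column stored as dict[str, str] in df.attrs["units"],
# as described above; the columns and units here are illustrative.
import pandas as pd

df = pd.DataFrame({"T": [298.15, 300.0], "p": [1.0, 1.2]})
df.attrs["units"] = {"T": "K", "p": "bar"}

# A table without the "units" entry would be treated as unitless data.
unit_of_T = df.attrs["units"].get("T")
```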

Uncertainties are handled using the linear uncertainty propagation library, uncertainties. As the input data for the functions in the dgpost.transform module is passed using pint.Quantity objects, which support uncertainties.unumpy arrays, uncertainty handling is generally transparent to both user and developer. The notable exceptions here are transformations using fitting functions from the scipy library, where plain float arrays are expected; these cases have to be handled explicitly by the developer.

Uncertainty management, including stripping or setting uncertainties to user-provided values, can be done on a per-column or per-namespace basis using the set_uncertainty() function from the dgpost.transform.table module. Both absolute and relative errors can be supplied. The parser is fully unit-aware.

When saving tables created in dgpost, the units are appended to the column names (csv/xlsx) or stored in the table (pkl/json). When exporting a pd.MultiIndexed table to csv/xlsx, units will be appended to the top-level index. The uncertainties may be optionally dropped completely from the exported table; see the dgpost.utils.save module. This is especially handy for post-processing of tables in spreadsheets.
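The unit-appending behaviour on csv/xlsx export can be sketched as follows; the exact formatting of the exported column names may differ from this illustration:

```python
# Sketch: appending units from df.attrs["units"] to column names,
# as done when exporting to csv/xlsx. The "name [unit]" format here
# is illustrative, not necessarily dgpost's exact output format.
import pandas as pd

df = pd.DataFrame({"T": [300.0], "p": [1.0]})
df.attrs["units"] = {"T": "K", "p": "bar"}
out = df.rename(
    columns={c: f"{c} [{u}]" for c, u in df.attrs["units"].items()}
)
exported_columns = list(out.columns)
```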

Provenance

Provenance tracking is implemented in dgpost using the "meta" entry of the df.attrs attribute of the created pd.DataFrame. This entry is exported when the pd.DataFrame is saved as pkl/json, and contains dgpost version information as well as a copy of the recipe used to create the saved object.
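A minimal sketch of this pattern is shown below; the contents of the "meta" entry are placeholders, and the actual keys written by dgpost may differ:

```python
# Illustrative provenance pattern: a "meta" entry in df.attrs carrying
# version information and a copy of the recipe (contents are placeholders).
import pandas as pd

df = pd.DataFrame({"T": [300.0]})
df.attrs["meta"] = {
    "provenance": {"dgpost": "2.x"},        # version info (placeholder)
    "recipe": {"load": [], "extract": []},  # copy of the recipe used
}
meta = df.attrs["meta"]
```

Since df.attrs is part of the pd.DataFrame object, this metadata survives a round trip through pkl, which is why provenance is only exported in the pkl/json formats.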