extract: extract and interpolate data into tables

Code author: Peter Kraus

The function dgpost.utils.extract.extract() processes the specification below to extract the required data from the supplied datagram.

pydantic model dgbowl_schemas.dgpost.recipe.Extract

Extract columns from loaded files into tables, interpolate as necessary.

JSON schema:
{
   "title": "Extract",
   "description": "Extract columns from loaded files into tables, interpolate as necessary.",
   "type": "object",
   "properties": {
      "into": {
         "title": "Into",
         "type": "string"
      },
      "from": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "From"
      },
      "at": {
         "anyOf": [
            {
               "$ref": "#/$defs/At"
            },
            {
               "type": "null"
            }
         ],
         "default": null
      },
      "columns": {
         "anyOf": [
            {
               "items": {
                  "$ref": "#/$defs/Column"
               },
               "type": "array"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Columns"
      },
      "constants": {
         "anyOf": [
            {
               "items": {
                  "$ref": "#/$defs/Constant"
               },
               "type": "array"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Constants"
      }
   },
   "$defs": {
      "At": {
         "additionalProperties": false,
         "properties": {
            "steps": {
               "default": null,
               "items": {
                  "type": "string"
               },
               "title": "Steps",
               "type": "array"
            },
            "indices": {
               "default": null,
               "items": {
                  "type": "integer"
               },
               "title": "Indices",
               "type": "array"
            },
            "timestamps": {
               "default": null,
               "items": {
                  "type": "number"
               },
               "title": "Timestamps",
               "type": "array"
            }
         },
         "title": "At",
         "type": "object"
      },
      "Column": {
         "additionalProperties": false,
         "properties": {
            "key": {
               "title": "Key",
               "type": "string"
            },
            "as": {
               "title": "As",
               "type": "string"
            }
         },
         "required": [
            "key",
            "as"
         ],
         "title": "Column",
         "type": "object"
      },
      "Constant": {
         "additionalProperties": false,
         "properties": {
            "value": {
               "title": "Value"
            },
            "as": {
               "title": "As",
               "type": "string"
            },
            "units": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Units"
            }
         },
         "required": [
            "value",
            "as"
         ],
         "title": "Constant",
         "type": "object"
      }
   },
   "additionalProperties": false,
   "required": [
      "into"
   ]
}

Config:
  • extra: str = forbid
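
The effect of extra = forbid can be illustrated with a minimal sketch in plain Python. It does not call pydantic or dgbowl_schemas; the helper check_keys and the ALLOWED table are hypothetical and only copy the property names from the JSON schema above.

```python
# Hypothetical stand-in for the "extra = forbid" behaviour. The key names
# are copied from the JSON schema above; nothing here calls pydantic.
ALLOWED = {
    "Extract": {"into", "from", "at", "columns", "constants"},
    "At": {"steps", "indices", "timestamps"},
    "Column": {"key", "as"},
    "Constant": {"value", "as", "units"},
}


def check_keys(model: str, section: dict) -> list[str]:
    """Return the keys of `section` that the named model would not accept."""
    return sorted(set(section) - ALLOWED[model])


good = {"into": "df", "from": "norm", "columns": [{"key": "raw->T_f", "as": "rawT"}]}
bad = {"into": "df", "direct": [{"key": "raw->T_f", "as": "rawT"}]}

print(check_keys("Extract", good))  # no unknown keys are reported
print(check_keys("Extract", bad))   # "direct" is not a field of Extract
```

With extra set to forbid, a section like bad would be rejected at validation time rather than having its unknown key silently ignored.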

Validators:
  • check_one_input  »  all fields

field into: str [Required]

Name of a new or existing (already loaded) table into which the data is extracted.

Validated by:
  • check_one_input

field from_: str | None = None (alias 'from')

Name of the source object for the extracted data.

Validated by:
  • check_one_input

field at: At | None = None

Specification of the steps (or data indices) from which data is to be extracted.

Validated by:
  • check_one_input

field columns: Sequence[Column] | None = None

Specifications for the columns to be extracted, including new headers.

Validated by:
  • check_one_input

field constants: Sequence[Constant] | None = None

Specifications for additional columns containing data constants, including units.

Validated by:
  • check_one_input

validator check_one_input  »  all fields

Note

The keys from and into are not processed by extract(). They should be used by its caller to supply the requested datagram and to assign the returned pd.DataFrame to the correct variable.
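
The division of labour described in this note can be sketched as follows. This is an illustration only: fake_extract is a hypothetical stand-in for extract() that returns a plain dict instead of a pd.DataFrame, and run_extract_sections is a hypothetical caller, not part of dgpost.

```python
# Hypothetical caller loop; `fake_extract` stands in for extract() and
# returns a plain dict of columns instead of a pd.DataFrame.
def fake_extract(obj, spec):
    # Only looks at "columns"; it never sees "from" or "into".
    if obj is None:
        return {}
    return {col["as"]: obj[col["key"]] for col in spec.get("columns", [])}


def run_extract_sections(loaded: dict, sections: list[dict]) -> dict:
    tables = {}
    for section in sections:
        spec = dict(section)                         # do not mutate the recipe
        source = loaded.get(spec.pop("from", None))  # caller resolves "from"
        target = spec.pop("into")                    # caller resolves "into"
        # Merging with dict.update is a simplification made for this sketch.
        tables.setdefault(target, {}).update(fake_extract(source, spec))
    return tables


loaded = {"norm": {"raw->T_f": [300.0, 310.0]}}
sections = [
    {"into": "df", "from": "norm", "columns": [{"key": "raw->T_f", "as": "rawT"}]}
]
tables = run_extract_sections(loaded, sections)
print(tables)
```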

Handling of sparse data depends on the extraction format specified:

  • for direct extraction, if the value is not present at any of the timesteps specified in at, a NaN is added instead

  • for interpolation, if a value is missing at any of the timesteps specified in at or in the pd.DataFrame index, that timestep is masked and the interpolation is performed from neighbouring points
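
The two cases above can be sketched in plain Python. This is not dgpost's implementation: the functions direct and interpolate are hypothetical, the data is a simple timestamp-to-value mapping, and clamping to the end points outside the covered range is an assumption of this sketch.

```python
import math


def direct(data: dict, at: list) -> list:
    """Direct extraction: look each timestep up, NaN where it is absent."""
    return [data.get(t, math.nan) for t in at]


def interpolate(data: dict, at: list) -> list:
    """Linear interpolation onto `at`, masking NaN source points."""
    pts = sorted((t, v) for t, v in data.items() if not math.isnan(v))
    out = []
    for t in at:
        # Outside the covered range, clamp to the nearest end point
        # (an assumption of this sketch).
        if t <= pts[0][0]:
            out.append(pts[0][1])
        elif t >= pts[-1][0]:
            out.append(pts[-1][1])
        else:
            for (t0, v0), (t1, v1) in zip(pts, pts[1:]):
                if t0 <= t <= t1:
                    out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
                    break
    return out


sparse = {0.0: 1.0, 10.0: math.nan, 20.0: 3.0}
direct_out = direct(sparse, [0.0, 5.0, 20.0])        # 5.0 is absent, so NaN
interp_out = interpolate(sparse, [0.0, 10.0, 20.0])  # 10.0 rebuilt from neighbours
print(direct_out)
print(interp_out)
```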

Interpolation of uc.ufloat values is performed separately for the nominal and error components.
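
A minimal sketch of component-wise interpolation, assuming linear interpolation and using (time, nominal, error) tuples as a stand-in for uc.ufloat values. The helpers lerp and interp_pair are hypothetical, and no error propagation is attempted.

```python
def lerp(t, t0, t1, y0, y1):
    """Linear interpolation between (t0, y0) and (t1, y1)."""
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)


def interp_pair(t, left, right):
    """Interpolate nominal and error components independently.

    `left` and `right` are (time, nominal, error) tuples.
    """
    t0, n0, s0 = left
    t1, n1, s1 = right
    return lerp(t, t0, t1, n0, n1), lerp(t, t0, t1, s0, s1)


nominal, error = interp_pair(5.0, (0.0, 1.0, 0.25), (10.0, 3.0, 0.75))
print(nominal, error)
```

Halfway between the two points, both the nominal value and the error are the midpoints of their respective end values.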

Units are added into the attrs dictionary of the pd.DataFrame on a per-column basis.

Data from multiple datagrams can be combined into one pd.DataFrame using a YAML recipe such as the following:

load:
  - as: norm
    path: normalized.dg.json
  - as: sparse
    path: sparse.dg.json
extract:
  - into: df
    from: norm
    at:
      steps: ["a"]
    columns:
      - key: raw->T_f
        as: rawT
  - into: df
    from: sparse
    at:
      steps: ["b1", "b2", "b3"]
    columns:
      - key: derived->xout->*
        as: xout

In this example, the pd.DataFrame is created with an index corresponding to the timestamps of step "a" of the norm datagram. The values specified using columns in the first section are entered directly, with the columns renamed according to as.

In the second section, the data selected from the sparse datagram using the prescription in at are interpolated onto the index of the existing pd.DataFrame.

dgpost.utils.extract.get_step(obj: DataFrame | DataTree | dict | None, at: dict | None = None) → DataFrame | DataTree | list[dict] | None
dgpost.utils.extract.get_constant(spec: dict, ts: Index)
dgpost.utils.extract.extract(obj: dict | DataFrame | DataTree | None, spec: dict, index: Index | None = None) → DataFrame
dgpost.utils.extract.extract_obj(obj: Any, columns: list[dict]) → list[Series]
dgpost.utils.extract.extract_obj(obj: DataFrame, columns: list[dict]) → list[Series]
dgpost.utils.extract.extract_obj(obj: DataTree, columns: list[dict]) → list[Series]
dgpost.utils.extract.extract_obj(obj: list, columns: list[dict]) → list[Series]
dgpost.utils.extract.extract_obj(obj: tuple, columns: list[dict]) → list[Series]