basedataset

class hy2dl.datasetzoo.basedataset.BaseDataset(cfg: Config, time_period: str, gauge_id: str | list[str] | None = None)

Bases: Dataset

Class to read and process data.

This class is inherited by other subclasses (e.g. CAMELS_US, CAMELS_GB, …) to read and process the data. The class contains all the common operations that need to be done, independently of which database is being used.

Parameters:

cfg (Config) – Configuration file.
time_period ({'training', 'validation', 'testing'}) – Defines the period for which the data will be loaded.
gauge_id (Optional[str | list[str]], default=None) – Id of gauge(s) to be loaded.

static collate_fn(batch)

Custom collate function to construct batches.

Because we are using __getitems__ instead of __getitem__, the batch is already constructed inside the __getitems__ function. Therefore, the collate function does not need any further processing and returns the batch as is.
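To illustrate why no further collation is needed, the following minimal sketch mimics, in plain Python, how a batch is fetched when a dataset builds the whole batch in one call. ToyDataset, fetch_batch, and the sample layout are hypothetical names used only for illustration; they are not part of hy2dl, and the real class presumably returns tensors rather than lists.

```python
class ToyDataset:
    """Hypothetical dataset that builds a whole batch in one call."""

    def __init__(self, values):
        self.values = values

    def __len__(self):
        return len(self.values)

    def __getitems__(self, indices):
        # Build the batch directly from the full list of indices.
        return {"x": [self.values[i] for i in indices]}

    @staticmethod
    def collate_fn(batch):
        # The batch is already assembled, so it is returned unchanged.
        return batch


def fetch_batch(dataset, indices, collate_fn):
    """Simplified stand-in for a data loader's batch-fetching logic."""
    if hasattr(dataset, "__getitems__"):
        data = dataset.__getitems__(indices)
    else:
        data = [dataset[i] for i in indices]
    return collate_fn(data)


dataset = ToyDataset([10.0, 20.0, 30.0, 40.0])
batch = fetch_batch(dataset, [0, 2], ToyDataset.collate_fn)
print(batch)  # {'x': [10.0, 30.0]}
```

Because the batch arrives fully formed, an identity collate function avoids stacking the samples a second time.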

static dask_worker_init_fn(worker_id)

Initialization function for Dask workers.

Note: This function is called inside each PyTorch worker.

static flatten_dict_values(d: dict) → list

Flatten the values of a (nested) dictionary into a list.
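As a rough illustration of what flattening nested dictionary values can look like, here is a minimal sketch. The function name flatten_values and the handling of list and scalar leaves are assumptions made for this example; the actual hy2dl implementation may treat these cases differently.

```python
def flatten_values(d):
    """Collect the leaf values of a (possibly nested) dictionary in a list."""
    flat = []
    for value in d.values():
        if isinstance(value, dict):
            # Recurse into nested dictionaries.
            flat.extend(flatten_values(value))
        elif isinstance(value, (list, tuple)):
            # Assumption: list-like leaves contribute their elements.
            flat.extend(value)
        else:
            # Assumption: scalar leaves are appended as single items.
            flat.append(value)
    return flat


inputs = {
    "daily": ["prcp", "tmax"],
    "hourly": {"forcing": ["prcp"], "static": "area"},
}
print(flatten_values(inputs))  # ['prcp', 'tmax', 'prcp', 'area']
```

Note that this sketch keeps duplicates; removing them is a separate step.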

setup_dataset(check_nan=True, path_scaler: Path | str | None = None)

Get data ready for training or evaluation.

This is the function you should call to load and process the dataset, and get it ready for use. It processes, validates, maps, standardizes, and optimizes the dataset that will be sent to the model.

The setup follows these steps:

1. Load data: Either processes the dataset from scratch or loads an existing pre-processed one. Processing the dataset is done in _process_df and includes:
   - reading the raw data
   - selecting the time periods and variables of interest
   - adding additional and lagged features (if specified)
   - reindexing the data to have a continuous time index.
2. Validate samples: Looks for valid samples or loads a pre-computed list. The criteria for valid samples are defined in _valid_samples_mask.
3. Map indexes: Maps the valid samples to the corresponding indexes in the dataset. This is necessary for efficient data loading during training.
4. Calculate statistics: Calculates the data statistics that are used for standardization.
5. Finalize: Runs _finalize_setup() to optimize memory usage and data access speed.

Note: Even if cfg.dataset_in_ram = True, disk-based datasets are kept lazy for sample validation, mapping, and scaler calculation. If cfg.dataset_in_ram = True and enough RAM is available, the datasets will be moved to RAM in _finalize_setup().
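The sample validation and index mapping steps can be pictured with the following minimal sketch. It is not the hy2dl implementation: valid_sample_indexes, the seq_length argument, and the rule "a window is valid if it contains no NaN" are assumptions chosen for illustration, and the real criteria in _valid_samples_mask may differ.

```python
import math


def valid_sample_indexes(series_by_basin, seq_length, check_nan=True):
    """Return (basin, end_index) pairs for every usable window."""
    index_map = []
    for basin, series in series_by_basin.items():
        # A sample ending at `end` needs `seq_length` consecutive time steps.
        for end in range(seq_length - 1, len(series)):
            window = series[end - seq_length + 1 : end + 1]
            if check_nan and any(math.isnan(v) for v in window):
                continue  # Skip windows that contain missing values.
            index_map.append((basin, end))
    return index_map


data = {
    "basin_a": [1.0, 2.0, float("nan"), 4.0, 5.0],
    "basin_b": [0.5, 0.6, 0.7],
}
index_map = valid_sample_indexes(data, seq_length=2)
print(index_map)
# [('basin_a', 1), ('basin_a', 4), ('basin_b', 1), ('basin_b', 2)]
```

With such a map, the i-th training sample can be located with a single list lookup instead of rescanning the time series on every access.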

Parameters:

check_nan (bool, default=True) – Check for nan values during validate_samples.
path_scaler (Path or str, optional) – Path to saved scaler.yml file.

static unique_values(x: list | dict[str, list | dict[str, list]] | None) → list[str]

Retrieve unique values

Parameters:

x (list | dict[str, list | dict[str, list]] | None) – Data to retrieve unique variables from.

Returns:

List of unique values.

Return type:

List[str]
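A minimal sketch of how unique values could be collected from the accepted input shapes is shown below. The name unique_values_sketch, the choice to return an empty list for None, and the decision to preserve first-seen order are all assumptions; the actual method may, for example, sort its result.

```python
def unique_values_sketch(x):
    """Collect unique strings from a list or a (nested) dict of lists."""
    if x is None:
        return []  # Assumption: no input means no values.

    def leaves(obj):
        # Walk nested dictionaries and yield the items of each list.
        if isinstance(obj, dict):
            for value in obj.values():
                yield from leaves(value)
        else:
            yield from obj

    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(leaves(x)))


print(unique_values_sketch(["prcp", "tmax", "prcp"]))
# ['prcp', 'tmax']
print(unique_values_sketch({"daily": ["prcp"], "hourly": {"era5": ["prcp", "tmin"]}}))
# ['prcp', 'tmin']
```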