ProLoaF’s configuration file is written in JSON; as such, whitespace is allowed and ignored by the parser. We mostly use strings, numbers, booleans, and null to specify the paths and parameters of the targeted forecasting project. Numbers and booleans should be unquoted.
For better readability, the examples below assume that your working directory is set to the main project path of the cloned repository.
A new config file can be generated automatically by using:
python src\configmaker.py --new targets/<STATION-NAME>/config.json
or manually.
To modify the file, you can either edit parameters manually or change them with our helper. When using the helper, define your modifications in configmaker.py and then run:
python src\configmaker.py --mod targets/<STATION-NAME>/config.json
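If you prefer not to use the helper, the same kind of modification can be made with plain Python. The following is a generic sketch using the standard json module (not ProLoaF's helper functions), with opsd as the example station:

```python
# Generic sketch using the standard json module (not ProLoaF's helper):
# load a station config, change one parameter, and write it back.
import json

path = "targets/opsd/config.json"
with open(path) as f:
    config = json.load(f)

config["forecast_horizon"] = 24  # example modification
with open(path, "w") as f:
    json.dump(config, f, indent=4)
```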
The default location of the main configuration file is ./targets/, or better ./targets/<STATION>/. Best practice is to create a sub-directory for each forecasting exercise, i.e. each new station.
As the project originated from electrical load forecasting at substation level, the term station or target-station refers to the location or substation identifier from which the measurement data originates.
Most of the example scripts in ProLoaF use the config file for training and evaluation,
as it serves as a central place for parametrization.
At this stage you should have the main config file for your forecasting project: ./targets/<STATION>/config.json.
ProLoaF comes with basic functions to parse,
edit and store the config file. We make use of this when calling e.g. our example training script:
$ python src\train.py -s opsd
The flag -s specifies the station name (= target directory) through the string that follows, i.e. opsd. The train script expects and parses the config.json given in the target directory.
You can also manually specify the path to the config file by adding -c <CONFIG_PATH> to the above command.
Note: If not otherwise specified, the neural network is trained by default via maximum likelihood estimation of Gaussian distribution parameters (i.e. by minimizing the Gaussian negative log-likelihood), yielding a 95% prediction interval. This so-called loss criterion can be changed to any metric that quantifies the (probabilistic) performance of the forecast. A common non-parametric option is the quantile loss. You can apply the quantile loss criterion as follows:
$ python src\train.py -s opsd --quantiles 0.025 0.975
Here we have specified the 95% prediction interval by setting q1 = 0.025 and q2 = 0.975.
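For intuition, the quantile (pinball) loss for a single quantile q can be sketched as follows; this is an illustrative snippet, not ProLoaF's internal implementation:

```python
# Illustrative pinball (quantile) loss for a single quantile q --
# a sketch, not ProLoaF's internal implementation.
import torch

def pinball_loss(y_true: torch.Tensor, y_pred: torch.Tensor, q: float) -> torch.Tensor:
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    diff = y_true - y_pred
    return torch.mean(torch.maximum(q * diff, (q - 1) * diff))

# Training with q1 = 0.025 and q2 = 0.975 fits the lower and upper bounds
# of a 95% prediction interval.
```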
See more detailed descriptions and further loss options in the full list of parameters.
Through the config file, the user specifies the data source location and the directories for logging, for exporting performance analyses and, most importantly, for the trained RNN model binary.
### Path Specs
```json
{
    "data_path": "./data/<FILE-NAME>.csv",
    "evaluation_path": "./oracles/eval_<MODEL-NAME>/",
    "output_path": "./oracles/",
    "exploration_path": "./targets/<STATION-NAME>/tuning.json",
    "log_path": "./logs/"
}
```
The output, exploration, and log paths may stay unchanged, but the data path and evaluation path must be specified.
Note: The data path should point to a CSV file that contains all input data column-wise, in any time resolution. In our example training and evaluation scripts, the first column is treated as datetime information and declared as the pandas datetime index. oracles is the default name of the output directory, in which the prediction model and its predictive performance are stored.
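As an illustration, this is how such a file can be read so that the first column becomes the datetime index (a minimal pandas sketch; the file name is a placeholder):

```python
# Minimal sketch: read the input CSV so that the first column is parsed
# as the pandas datetime index. "<FILE-NAME>" is a placeholder.
import pandas as pd

df = pd.read_csv(
    "./data/<FILE-NAME>.csv",
    index_col=0,       # first column holds the timestamps
    parse_dates=True,  # declare it as a datetime index
)
print(df.index)        # DatetimeIndex(...)
```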
ProLoaF is a machine-learning-based timeseries forecasting project. The supervised learning requires data with a (pandas) datetime index. Typical time resolutions are `ms`, `s`, `m`, `h`, `d`.
Endogenous (lagged) inputs and exogenous (= explanatory) variables that affect the future explained variable are split into multiple windows with the sequence length of history_horizon and fed to the encoder. For a better understanding, we recommend the illustrative guide on a similar sequence-to-sequence architecture authored by Ben Trevett.
The time step size is equal to the underlying timeseries data resolution. The example files apply day-ahead forecasting in hourly resolution.
Note: A forecast that produces a predicted sequence starting from the forecast execution time t and covering the whole next day requires a horizon of >= 24 h, depending on t; e.g. a day-ahead forecast executed at 9 am shall produce a 40-hour horizon.
The following parameters configure the input and output sequence lengths:
```json
{
    "history_horizon": 42,
    "forecast_horizon": 24
}
```
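To illustrate what these two values mean, the following sketch (plain NumPy, not ProLoaF code) slices one encoder/decoder sample pair out of a series:

```python
# Illustration only (plain NumPy, not ProLoaF code): slicing one
# encoder/decoder sample pair from a timeseries.
import numpy as np

history_horizon, forecast_horizon = 42, 24
series = np.arange(100.0)       # stand-in for one input column

t = history_horizon             # forecast execution time step
encoder_window = series[t - history_horizon:t]   # past values fed to the encoder
decoder_window = series[t:t + forecast_horizon]  # future values to predict
assert len(encoder_window) == history_horizon
assert len(decoder_window) == forecast_horizon
```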
In machine learning, we typically split available data to train the model and test its performance. With the training set, the model is parameterized. By checking against validation data, we track the fitting process. In a final step, the test set serves to assess and confirm the predictive power of the trained model. To configure the size of each mentioned set, specify:
- train_split: Given an input dataframe df, all timestamps before the specified split are used for training: df[:int(train_split * df.shape[0])].
- validation_split: For validation during training, we use all data between the train and validation limiters in the dataframe df: df[int(train_split * df.shape[0]):int(validation_split * df.shape[0])].

Note: The test split is set per default through the remaining input data from df: df[int(validation_split * df.shape[0]):].
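The resulting boundaries can be sketched as follows, assuming hypothetical example values train_split = 0.6 and validation_split = 0.8:

```python
# Sketch of the resulting train/validation/test boundaries; the split
# values 0.6 and 0.8 are hypothetical examples.
import pandas as pd

df = pd.DataFrame({"load": range(1000)})
train_split, validation_split = 0.6, 0.8

train_df = df.iloc[:int(train_split * df.shape[0])]
val_df = df.iloc[int(train_split * df.shape[0]):int(validation_split * df.shape[0])]
test_df = df.iloc[int(validation_split * df.shape[0]):]  # remaining data

print(len(train_df), len(val_df), len(test_df))  # 600 200 200
```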
"feature_groups": [
{
"name": "main",
"scaler": [
"robust",
15,
85
],
"features": [
"<COLUMN-IDENTIFIER-1>"
]
}
{
"name": "add",
"scaler": [
"minmax",
-1.0,
1.0
],
"features": [
"<COLUMN-IDENTIFIER-2>"
]
}
{
"name": "aux",
"scaler": null,
"features": [
"<COLUMN-IDENTIFIER-3>",
"<COLUMN-IDENTIFIER-4>"
]
}
]
"<COLUMN-IDENTIFIER-1>",
"<COLUMN-IDENTIFIER-2>"
],
"decoder_features": [
"<COLUMN-IDENTIFIER-3>",
"<COLUMN-IDENTIFIER-4>"
],
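Assuming the scaler entries map onto the scikit-learn scalers of the same name (an assumption for illustration, not confirmed by this document), the three groups above would correspond to:

```python
# Assumption: the scaler entries correspond to scikit-learn scalers of the
# same name. "aux" features (scaler: null) would stay unscaled.
from sklearn.preprocessing import MinMaxScaler, RobustScaler

robust = RobustScaler(quantile_range=(15, 85))    # ["robust", 15, 85]
minmax = MinMaxScaler(feature_range=(-1.0, 1.0))  # ["minmax", -1.0, 1.0]
```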
### RNN Cell Type
- GRU: typically trains faster (per epoch) with results similar to LSTM cells.
- LSTM
```json
{
    "max_epochs": 1,
    "batch_size": 2,
    "learning_rate": 0.0001,
    "core_layers": 1,
    "rel_linear_hidden_size": 1.0,
    "rel_core_hidden_size": 1.0,
    "dropout_fc": 0.4,
    "dropout_core": 0.3
}
```
- max_epochs: …
- batch_size: …
- learning_rate: …
- core_layers: …
- rel_linear_hidden_size: …
- rel_core_hidden_size: …
- dropout_fc: …
- dropout_core: …
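As a rough orientation, the first three values typically plug into a standard PyTorch training loop as sketched below (a generic sketch, not ProLoaF's trainer):

```python
# Generic PyTorch sketch (not ProLoaF's trainer) showing where max_epochs,
# batch_size and learning_rate typically enter a training loop.
import torch

config = {"max_epochs": 1, "batch_size": 2, "learning_rate": 0.0001}

model = torch.nn.Linear(10, 1)  # stand-in for the RNN model
optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
dataset = torch.utils.data.TensorDataset(torch.randn(8, 10), torch.randn(8, 1))
loader = torch.utils.data.DataLoader(dataset, batch_size=config["batch_size"])

for epoch in range(config["max_epochs"]):
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
```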
Configure which hyperparameters are optimized, and specify each parameter's search space, through a separate tuning config.
Note: Best practice is to save the tuning config (tuning.json) in the same directory in which the main config is given. However, by setting a specific exploration_path, the user can point to a different location on the machine.
The cuda_id parameter specifies the ID of the CUDA device (GPU) on which training is run.
### Config Params
The following table summarizes the default parameters of the main config file:

Parameter | Data Type | Value Range
---|---|---
history_horizon | int | > 0
forecast_horizon | int | > 0
### Shell Params Upon Script Execution
Example:
$ python src\train.py -s opsd --rmse
or
$ python src\train.py -s opsd --smoothed_quantiles 0.025 0.975
Parameter | Data Type | Value Range | Short Description
---|---|---|---
--<loss> | string in shell | {mse, mape, rmse, mis, nll_gauss, quantiles, smoothed_quantiles, crps} | Set the loss function for training
--ci | boolean | True or False | Enables execution mode optimized for GitLab’s CI
--logname | str | " " | Name of the run, displayed in Tensorboard
### Tensorboard utilization
This project uses Tensorboard to display in-depth information about each training run. Information about the run itself, such as training time and validation loss, is visible in the Scalars tab. The HParams tab allows sorting runs by hyperparameters and offers parallel coordinates and scatter plots. To define which data should be logged, a log.json of the following structure is used in the targets/ folder of each station:
```json
{
    "features": [
        {
            "name": "time_stamp"
        },
        {
            "name": "train_loss"
        },
        {
            "name": "val_loss"
        },
        {
            "name": "total_time"
        }
    ]
}
```
Install Tensorboard and run the Tensorboard command-line interface as described here. By default, runs are saved in the runs/ directory. To display runs in Tensorboard, run tensorboard --logdir=runs, then open a browser and navigate to 127.0.0.1:6006.