2.3. Data Preparation
This section introduces the data preparation module of PiML. The module configures task-specific settings (the target variable, task type, and sample weight) and the train-test split (the split method, test ratio, and random seed), so that the data can be prepared to match the requirements of a given analysis.
2.3.1. Basic Settings
The basic settings include the target variable, task type, and sample weight variable, as follows.
target
: By default, the last column in the dataset is used as the target variable. If the target variable is in a different column, specify its column name so that the correct data is used for training.
task_type
: The data preparation module in PiML supports two types of tasks: “regression” (continuous response) and “classification” (including binary classification). By default, the task type is determined automatically from the feature type of the target variable. If the target variable is numerical, the task type is assumed to be “regression”. If the target variable is categorical with two categories, the task type is set to “classification”. Specifying the task type explicitly ensures that the module applies models and algorithms suited to the task.
sample_weight
: By default, the sample weight variable is set to None, meaning that no column is used as the sample weight. You can specify a column name in the dataset as the sample weight variable. This assigns different weights to individual samples, which is useful when some samples carry more importance than others in the analysis or modeling process.
exp.data_prepare(target='cnt', task_type='regression', sample_weight=None)
Index | Config | Value |
---|---|---|
0 | Excluded columns | [] |
1 | Target variable | cnt |
2 | Sample weight | None |
3 | Task type | regression |
4 | Split method | random |
5 | Test ratio | 0.2 |
6 | Random state | 0 |
In data preparation, all categorical features are encoded with an ordinal encoder, and numerical features are scaled to the range [0, 1] with a min-max scaler.
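As a rough illustration of this preprocessing (not PiML's internal code), the sketch below ordinal-encodes a categorical column and min-max scales a numerical column in plain Python. The helper names ordinal_encode and min_max_scale, the sorted category order, and the toy values are assumptions made for this example.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    categories = sorted(set(values))
    code_of = {cat: code for code, cat in enumerate(categories)}
    return [code_of[v] for v in values], categories


def min_max_scale(values):
    """Linearly rescale numbers so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy columns (hypothetical values, for illustration only).
season = ["winter", "spring", "summer", "spring"]
temp = [2.0, 12.0, 22.0, 7.0]

season_codes, season_categories = ordinal_encode(season)
temp_scaled = min_max_scale(temp)

print(season_categories)  # ['spring', 'summer', 'winter']
print(season_codes)       # [2, 0, 1, 0]
print(temp_scaled)        # [0.0, 0.5, 1.0, 0.25]
```

Note that an ordinal encoding imposes an arbitrary order on the categories; here the order is alphabetical.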
2.3.2. Train-test Splits
PiML supports four methods for splitting a dataset into training and testing sets. After the split, the distributional difference between the train and test sets is assessed using the energy distance, which quantifies the dissimilarity between the empirical distribution functions (EDFs) of the two datasets. To reduce the computational burden, at most 10000 samples are subsampled from the training and testing sets to calculate the energy distance. Note that categorical features are first encoded using the ordinal encoder and then scaled to the range [0, 1] before the energy distance is calculated.
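For intuition, the energy distance between two samples can be estimated from pairwise Euclidean distances. The sketch below is a plain-Python illustration of the standard multivariate sample estimator combined with subsampling. It is not PiML's implementation: the function names, the toy data, and the cap of 200 rows (instead of 10000) are assumptions chosen to keep the example small.

```python
import math
import random


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (x, y) with x in a, y in b."""
    total = sum(math.dist(x, y) for x in a for y in b)
    return total / (len(a) * len(b))


def energy_distance(a, b):
    """Sample energy distance: sqrt(2*E|X-Y| - E|X-X'| - E|Y-Y'|)."""
    squared = (2.0 * mean_pairwise_distance(a, b)
               - mean_pairwise_distance(a, a)
               - mean_pairwise_distance(b, b))
    return math.sqrt(max(squared, 0.0))  # clip tiny negative rounding error


def subsample(rows, max_rows, rng):
    """Keep at most max_rows rows, drawn without replacement."""
    return rows if len(rows) <= max_rows else rng.sample(rows, max_rows)


rng = random.Random(0)
# Toy 2-D samples already scaled to [0, 1] (hypothetical data).
train = [(rng.random(), rng.random()) for _ in range(300)]
test_same = [(rng.random(), rng.random()) for _ in range(300)]
# Same second feature, but the first feature is confined to [0.5, 1].
test_shifted = [(0.5 + 0.5 * rng.random(), rng.random()) for _ in range(300)]

train_sub = subsample(train, 200, rng)
d_same = energy_distance(train_sub, subsample(test_same, 200, rng))
d_shifted = energy_distance(train_sub, subsample(test_shifted, 200, rng))
print(d_same, d_shifted)
```

A value near zero suggests that the train and test sets follow similar distributions; the shifted test set should give a clearly larger value than the one drawn from the same distribution.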
2.3.2.1. Random Split
The following code uses the random split method to divide the data into training and testing sets. The default test ratio is 0.2, and the default random seed is 0.
exp.data_prepare(target='cnt', task_type='regression', sample_weight=None,
split_method='random', test_ratio=0.2, random_state=0)
Index | Config | Value |
---|---|---|
0 | Excluded columns | [] |
1 | Target variable | cnt |
2 | Sample weight | None |
3 | Task type | regression |
4 | Split method | random |
5 | Test ratio | 0.2 |
6 | Random state | 0 |
The function outputs a configuration table that lists the basic settings of the data preparation.
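To make the roles of the test ratio and the random seed concrete, here is a minimal plain-Python sketch of a seeded random split. The function random_split and its rounding of the test-set size are assumptions for illustration; PiML's own splitting logic may differ in its details.

```python
import random


def random_split(n_samples, test_ratio=0.2, random_state=0):
    """Shuffle indices with a fixed seed and cut off the test portion."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    n_test = round(n_samples * test_ratio)  # rounding rule is an assumption
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_idx, test_idx = random_split(10, test_ratio=0.2, random_state=0)
again_train, again_test = random_split(10, test_ratio=0.2, random_state=0)

print(len(train_idx), len(test_idx))  # 8 2
print(test_idx == again_test)         # True
```

Reusing the same seed reproduces the same split, which is what makes experiments repeatable.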
2.3.2.2. Outer-sample-based Split
This method splits samples based on the Euclidean distance between each sample and the data center. Samples that are far from the center are chosen as the test set. In the following example, the test_ratio parameter selects the farthest 20% of samples as the test set.
exp.data_prepare(target='cnt', task_type='regression', sample_weight=None,
split_method='outer-sample', test_ratio=0.2, random_state=0)
Index | Config | Value |
---|---|---|
0 | Excluded columns | [] |
1 | Target variable | cnt |
2 | Sample weight | None |
3 | Task type | regression |
4 | Split method | outer-sample |
5 | Test ratio | 0.2 |
6 | Random state | 0 |
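The idea behind the outer-sample split can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not PiML's code: the data center is taken to be the per-feature mean, and the function name outer_sample_split and the toy rows are hypothetical.

```python
import math


def outer_sample_split(rows, test_ratio=0.2):
    """Put the rows farthest from the feature-wise mean into the test set."""
    n_features = len(rows[0])
    # Data center taken as the per-feature mean (an assumption of this sketch).
    center = [sum(row[j] for row in rows) / len(rows) for j in range(n_features)]
    # Sort sample indices from farthest to nearest.
    order = sorted(
        range(len(rows)),
        key=lambda i: math.dist(rows[i], center),
        reverse=True,
    )
    n_test = round(len(rows) * test_ratio)
    return sorted(order[n_test:]), sorted(order[:n_test])


# Eight points near (0.5, 0.5) and two outlying points (hypothetical data).
rows = [(0.5, 0.5), (0.4, 0.6), (0.6, 0.4), (0.5, 0.4), (0.0, 1.0),
        (0.45, 0.5), (0.55, 0.5), (0.5, 0.6), (1.0, 0.1), (0.5, 0.55)]
train_idx, test_idx = outer_sample_split(rows, test_ratio=0.2)

print(test_idx)  # [4, 8]
```

Because the test set contains only the outermost samples, this split probes how a model behaves when extrapolating beyond the bulk of the training data.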
2.3.2.3. KMeans-based Split
This method first fits the KMeans clustering algorithm using all predictors. Then, within each cluster, a given percentage of samples is randomly selected as test samples, and the remaining samples are used as training samples. The test ratio is given as a list with one ratio per cluster, and the length of the list determines the number of clusters. In this example, the test_ratio list contains three elements, so the KMeans clustering algorithm uses three clusters (K=3). The test ratios for the first, second, and third clusters are set to 0.0, 1.0, and 0.0, respectively.
exp.data_prepare(target='cnt', task_type='regression', sample_weight=None,
split_method='kmeans', test_ratio=[0.0, 1.0, 0.0], random_state=0)
Index | Config | Value |
---|---|---|
0 | Excluded columns | [] |
1 | Target variable | cnt |
2 | Sample weight | None |
3 | Task type | regression |
4 | Split method | kmeans |
5 | Test ratio | 0.420277 |
6 | Random state | 0 |
This algorithm is useful for generating train and test sets with distinct distributional characteristics. The test ratio in the configuration table shows the actual ratio of the test set over the whole sample.
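The per-cluster sampling step can be sketched as follows. The clustering itself is left out: the cluster labels are assumed to come from an already fitted KMeans model, and the function name cluster_split, the rounding rule, and the toy labels are hypothetical choices for this illustration rather than PiML internals.

```python
import random


def cluster_split(labels, test_ratio, random_state=0):
    """Draw test samples cluster by cluster; test_ratio[k] applies to cluster k."""
    rng = random.Random(random_state)
    train_idx, test_idx = [], []
    for k, ratio in enumerate(test_ratio):
        members = [i for i, label in enumerate(labels) if label == k]
        n_test = round(len(members) * ratio)
        chosen = set(rng.sample(members, n_test))
        test_idx.extend(i for i in members if i in chosen)
        train_idx.extend(i for i in members if i not in chosen)
    return sorted(train_idx), sorted(test_idx)


# Hypothetical cluster labels, as a fitted KMeans model with K=3 might assign them.
labels = [0, 1, 1, 2, 0, 1, 2, 0, 1, 2]
train_idx, test_idx = cluster_split(labels, test_ratio=[0.0, 1.0, 0.0])

print(test_idx)                     # [1, 2, 5, 8]
print(len(test_idx) / len(labels))  # 0.4
```

With ratios of 0.0, 1.0, and 0.0, the whole second cluster becomes the test set, so the overall test ratio equals that cluster's share of the data.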
2.3.2.4. Manual Split
In addition to the aforementioned split methods, PiML also provides support for custom sample index splits. Here is an example of how to use custom sample indices for splitting the training and testing datasets.
import numpy as np

custom_train_idx = np.arange(0, 16000)
custom_test_idx = np.arange(16000, 17379)
exp.data_prepare(target='cnt', task_type='regression', sample_weight=None,
train_idx=custom_train_idx, test_idx=custom_test_idx)
Index | Config | Value |
---|---|---|
0 | Excluded columns | [] |
1 | Target variable | cnt |
2 | Sample weight | None |
3 | Task type | regression |
4 | Split method | manual |
5 | Test ratio | 0.079349 |
6 | Random state | 0 |
You can customize train_idx and test_idx according to your requirements. They contain the indices of the samples assigned to the training and testing sets, respectively, which gives you full control over the composition of both sets.
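Before passing custom indices, it can help to check that the two sets do not overlap and to see which test ratio they imply. The plain-Python sketch below reuses the index ranges from the example above; the total of 17379 samples is inferred from the upper bound of the test range and is an assumption about the dataset size.

```python
# Index ranges from the example above, written with plain range objects.
n_samples = 17379  # assumed dataset size, taken from the end of the test range
train_idx = range(0, 16000)
test_idx = range(16000, n_samples)

# The two sets should not share any sample.
overlap = set(train_idx) & set(test_idx)
# Share of the whole dataset that ends up in the test set.
test_ratio = len(test_idx) / n_samples

print(len(overlap))          # 0
print(round(test_ratio, 6))  # 0.079349
```

The computed ratio matches the test ratio reported in the configuration table for the manual split.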
2.3.3. Examples
The full example code for this section can be found at the following link.