2.1. Data Load

This section introduces the data loader module of PiML. Loading data is usually the first step of an experiment, and the loader lets you start from either a built-in dataset or an external dataset.

2.1.1. Built-in Dataset

Several datasets are shipped with PiML. These are:

  • ‘CoCircles’: Gaussian data with a spherical decision boundary for binary classification, generated via Scikit-Learn.

  • ‘Friedman’: ‘Friedman #1’ regression problem, generated via Scikit-Learn.

  • ‘BikeSharing’: Refer to https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset.

  • ‘TaiwanCredit’: Refer to https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients.

  • ‘CaliforniaHousing_raw’: Refer to https://developers.google.com/machine-learning/crash-course/california-housing-data-description.

  • ‘CaliforniaHousing_trim1’: The ‘CaliforniaHousing_raw’ dataset with the feature ‘AveOccup’ trimmed at an upper threshold of 5.

  • ‘CaliforniaHousing_trim2’: The ‘CaliforniaHousing_raw’ dataset with the features ‘AveRooms’, ‘AveBedrms’, ‘Population’, and ‘AveOccup’ each trimmed at an upper threshold given by the 0.98 quantile.

  • ‘SimuCredit’: A simulated credit dataset for fairness testing.

  • ‘SolasSimu1’: A simulated dataset, modified from the ‘Friedman #1’ regression problem. The covariates used for modeling are ‘Segment’, ‘x1’, ‘x2’, …, and ‘x5’; the response ‘Label’ is binary, making it a classification problem. The remaining variables are demographic attributes used for fairness testing. The data is contributed by Solas-AI (https://github.com/SolasAI/solas-ai-disparity).

  • ‘SolasHMDA’: A preprocessed sample of the 2018 Home Mortgage Disclosure Act (HMDA) data. The HMDA dataset includes information about nearly every home mortgage application in the United States.

You can load any of these datasets with the code below, where the data argument specifies the name of the dataset to be loaded. For example,

exp.data_loader(data="CoCircles")
X0 X1 target
0 -0.783526 0.502161 0.0
1 0.297809 0.658405 1.0
2 0.468272 0.500653 1.0
... ... ...
1997 -0.542930 -0.583517 1.0
1998 -0.871481 -0.491301 0.0
1999 -0.323963 -0.719150 0.0
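
The example above assumes that a PiML experiment object exp has already been created. Following the usual PiML quickstart pattern, a minimal setup (run in a Jupyter notebook, where the loaded data is displayed inline) looks like this:

from piml import Experiment

exp = Experiment()                    # create the experiment container
exp.data_loader(data="CoCircles")     # load a built-in dataset by name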

2.1.2. External Dataset (CSV files)

There are two ways to load CSV files in PiML.

  • In low-code mode, you can simply click the “upload new data” button to upload your data.

  • In high-code mode, you can read your data into a pandas DataFrame and pass it to the data loader. For example:

import pandas as pd

data = pd.read_csv('https://github.com/SelfExplainML/PiML-Toolbox/blob/main/datasets/BikeSharing.csv?raw=true')
exp.data_loader(data=data)
season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed cnt
0 1.0 0.0 1.0 0.0 0.0 6.0 0.0 1.0 0.24 0.2879 0.81 0.0000 16.0
1 1.0 0.0 1.0 1.0 0.0 6.0 0.0 1.0 0.22 0.2727 0.80 0.0000 40.0
2 1.0 0.0 1.0 2.0 0.0 6.0 0.0 1.0 0.22 0.2727 0.80 0.0000 32.0
... ... ... ... ... ... ... ... ... ... ... ... ...
17377 1.0 1.0 12.0 21.0 1.0 1.0 1.0 1.0 0.26 0.2576 0.60 0.1642 90.0
17378 1.0 1.0 12.0 22.0 1.0 1.0 1.0 1.0 0.26 0.2727 0.56 0.1343 61.0
17379 1.0 1.0 12.0 23.0 0.0 1.0 1.0 1.0 0.26 0.2727 0.65 0.1343 49.0
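
Because the loader accepts any pandas DataFrame, ordinary pandas preprocessing can be applied before the data is passed in. The snippet below is only an illustrative sketch: the local file path and the dropped column names are assumptions, not part of the hosted example above.

import pandas as pd

# Illustrative only: the path and column names are placeholders
data = pd.read_csv("./BikeSharing.csv")
data = data.drop(columns=["dteday", "casual", "registered"], errors="ignore")  # drop columns not used for modeling
exp.data_loader(data=data)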

2.1.3. External Dataset (Spark file)

Starting from PiML-0.6.0, we support loading data through the Spark backend. The data should be stored in a file format readable by Spark; it is read into a Spark DataFrame and then subsampled. For example, you can load data from a Parquet file using the following code:

exp.data_loader(data="./myfile.parquet", spark=True, spark_sample_size=10000, spark_random_state=0)
Y X0 X1 X2 X3 X4 X5 X6 X7 X8 X9
0 0.0 -0.162726 -0.989380 -0.977290 0.022444 -0.833418 -0.897849 0.931033 0.718005 -0.695946 -0.998672
1 1.0 0.883336 -0.443349 -0.628205 0.383016 -0.782193 -0.470701 0.950189 0.278926 0.041356 -0.204163
2 1.0 0.549002 -0.718085 0.934676 0.722246 0.235314 -0.914188 0.401711 0.826569 0.049154 -0.291550
3 0.0 -0.759445 0.509802 0.770044 -0.799497 0.517969 -0.965879 0.934110 0.230116 0.104878 -0.408100
4 1.0 0.858583 -0.468189 0.656293 0.970217 0.566793 0.037980 -0.867851 -0.055172 -0.123488 -0.594408
... ... ... ... ... ... ... ... ... ... ... ...
2117 1.0 -0.325774 -0.298574 -0.984911 -0.660747 -0.894813 0.971714 -0.263716 0.386797 0.177803 0.546942
2118 1.0 0.631497 -0.209968 -0.060872 -0.981552 0.319807 -0.552832 -0.256842 0.059649 -0.120317 0.194922
2119 1.0 0.918874 0.502420 0.759211 0.143963 0.851615 0.530987 0.295923 -0.576709 -0.472256 -0.470885
2120 1.0 -0.505041 -0.592865 0.458442 0.022174 -0.396257 0.430562 0.394588 0.286274 0.493732 -0.581601
2121 0.0 0.304829 0.028269 0.903502 -0.400436 -0.932546 -0.409703 0.272170 0.739227 -0.744509 -0.398449

The argument spark=True tells the program to load the data through the Spark backend, data denotes the file path, and spark_sample_size is the expected sample size. By default, we do purely random subsampling, and the sample size is converted into the frac parameter in Spark. For example, if the original data has 100000 rows and spark_sample_size is 10000, then frac is set to 0.1. However, because Spark keeps each row independently with this probability, the actual sample size may differ slightly from the expected one.
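
To make the sampling logic concrete, the following is a rough sketch of the equivalent operations in plain PySpark. The variable names and the exact implementation are assumptions for illustration, not PiML internals.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.parquet("./myfile.parquet")

# Translate the expected sample size into a sampling fraction,
# e.g., 10000 / 100000 = 0.1
spark_sample_size = 10000
frac = spark_sample_size / spark_df.count()

# Purely random subsampling; each row is kept independently with
# probability frac, so the realized row count is only approximate
sample_pdf = spark_df.sample(fraction=frac, seed=0).toPandas()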

If you want to do stratified sampling, you can specify the stratification feature via the spark_sample_by_feature parameter. For example, to stratify the sampling by the feature ‘Y’, use the following code:

exp.data_loader(data="./myfile.parquet", spark=True, spark_sample_size=10000,
                spark_sample_by_feature='Y', spark_random_state=0)
Y X0 X1 X2 X3 X4 X5 X6 X7 X8 X9
0 0.0 -0.162726 -0.989380 -0.977290 0.022444 -0.833418 -0.897849 0.931033 0.718005 -0.695946 -0.998672
1 1.0 0.883336 -0.443349 -0.628205 0.383016 -0.782193 -0.470701 0.950189 0.278926 0.041356 -0.204163
2 1.0 0.549002 -0.718085 0.934676 0.722246 0.235314 -0.914188 0.401711 0.826569 0.049154 -0.291550
3 0.0 -0.759445 0.509802 0.770044 -0.799497 0.517969 -0.965879 0.934110 0.230116 0.104878 -0.408100
4 1.0 0.858583 -0.468189 0.656293 0.970217 0.566793 0.037980 -0.867851 -0.055172 -0.123488 -0.594408
... ... ... ... ... ... ... ... ... ... ... ...
2117 1.0 -0.325774 -0.298574 -0.984911 -0.660747 -0.894813 0.971714 -0.263716 0.386797 0.177803 0.546942
2118 1.0 0.631497 -0.209968 -0.060872 -0.981552 0.319807 -0.552832 -0.256842 0.059649 -0.120317 0.194922
2119 1.0 0.918874 0.502420 0.759211 0.143963 0.851615 0.530987 0.295923 -0.576709 -0.472256 -0.470885
2120 1.0 -0.505041 -0.592865 0.458442 0.022174 -0.396257 0.430562 0.394588 0.286274 0.493732 -0.581601
2121 0.0 0.304829 0.028269 0.903502 -0.400436 -0.932546 -0.409703 0.272170 0.739227 -0.744509 -0.398449

Here, ‘Y’ should be a categorical feature, and it must be one of the columns in the data. By default, the sampling ratios of the different categories are equal. If you want to specify custom ratios, use the spark_sample_fractions parameter, as shown below:

exp.data_loader(data="./myfile.parquet", spark=True, spark_sample_size=10000,
                spark_sample_by_feature='Y', spark_sample_fractions={0.0: 1, 1.0: 5},
                spark_random_state=0)
Y X0 X1 X2 X3 X4 X5 X6 X7 X8 X9
0 0.0 -0.162726 -0.989380 -0.977290 0.022444 -0.833418 -0.897849 0.931033 0.718005 -0.695946 -0.998672
1 1.0 0.883336 -0.443349 -0.628205 0.383016 -0.782193 -0.470701 0.950189 0.278926 0.041356 -0.204163
2 1.0 0.549002 -0.718085 0.934676 0.722246 0.235314 -0.914188 0.401711 0.826569 0.049154 -0.291550
3 0.0 -0.759445 0.509802 0.770044 -0.799497 0.517969 -0.965879 0.934110 0.230116 0.104878 -0.408100
4 1.0 0.858583 -0.468189 0.656293 0.970217 0.566793 0.037980 -0.867851 -0.055172 -0.123488 -0.594408
... ... ... ... ... ... ... ... ... ... ... ...
2117 1.0 -0.325774 -0.298574 -0.984911 -0.660747 -0.894813 0.971714 -0.263716 0.386797 0.177803 0.546942
2118 1.0 0.631497 -0.209968 -0.060872 -0.981552 0.319807 -0.552832 -0.256842 0.059649 -0.120317 0.194922
2119 1.0 0.918874 0.502420 0.759211 0.143963 0.851615 0.530987 0.295923 -0.576709 -0.472256 -0.470885
2120 1.0 -0.505041 -0.592865 0.458442 0.022174 -0.396257 0.430562 0.394588 0.286274 0.493732 -0.581601
2121 0.0 0.304829 0.028269 0.903502 -0.400436 -0.932546 -0.409703 0.272170 0.739227 -0.744509 -0.398449

The spark_sample_fractions parameter is a dictionary whose keys are the categories and whose values are the sampling ratios. In the above example, the ratio of category 0.0 to category 1.0 in the sample is 1:5.
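
The stratified case can likewise be sketched with PySpark’s sampleBy, where the requested ratios are turned into per-category sampling fractions. Again, this is only an illustrative sketch under assumed variable names, not PiML’s actual implementation.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.parquet("./myfile.parquet")

spark_sample_size = 10000
ratios = {0.0: 1, 1.0: 5}   # corresponds to spark_sample_fractions

# Row counts per category in the full data
counts = {row["Y"]: row["count"] for row in spark_df.groupBy("Y").count().collect()}

# Target rows per category, proportional to the requested ratios
total = sum(ratios.values())
targets = {k: spark_sample_size * v / total for k, v in ratios.items()}

# Convert the targets into per-category sampling fractions (capped at 1.0)
fractions = {k: min(1.0, targets[k] / counts[k]) for k in ratios}

sample_pdf = spark_df.sampleBy("Y", fractions=fractions, seed=0).toPandas()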

2.1.4. Examples

The full example code for this section can be found at the following link.