2.1. Data Load¶
This section introduces the data loader module of PiML. Loading data is usually the first step of an experiment, and the loader supports both built-in datasets and external datasets.
2.1.1. Built-in Dataset¶
Several datasets come built into PiML. These are:
‘CoCircles’: Gaussian data with a spherical decision boundary for binary classification, generated via Scikit-Learn.
‘Friedman’: ‘Friedman #1’ regression problem, generated via Scikit-Learn.
‘BikeSharing’: Refer to https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset.
‘TaiwanCredit’: Refer to https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients.
‘CaliforniaHousing_raw’: Refer to https://developers.google.com/machine-learning/crash-course/california-housing-data-description.
‘CaliforniaHousing_trim1’: ‘CaliforniaHousing_raw’ dataset with the feature ‘AveOccup’ trimmed by upper threshold 5.
‘CaliforniaHousing_trim2’: ‘CaliforniaHousing_raw’ dataset with the features ‘AveRooms’, ‘AveBedrms’, ‘Population’, and ‘AveOccup’ trimmed by upper threshold quantile (0.98).
‘SimuCredit’: A simulated credit dataset for fairness testing.
‘SolasSimu1’: A simulated dataset, modified from the ‘Friedman #1’ regression problem. The covariates used for modeling are ‘Segment’, ‘x1’, ‘x2’, …, and ‘x5’; the response ‘Label’ is binary, making it a classification problem. The remaining variables are demographic variables used for testing fairness. The data is contributed by Solas-AI (https://github.com/SolasAI/solas-ai-disparity).
‘SolasHMDA’: A preprocessed sample of the 2018 Home Mortgage Disclosure Act (HMDA) data. The HMDA dataset includes information about nearly every home mortgage application in the United States.
You can load any of these datasets using the code below, where the data argument specifies the dataset name. For example,
exp.data_loader(data="CoCircles")
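Note that exp refers to a PiML Experiment object. As a minimal end-to-end sketch (assuming the standard PiML workflow of creating an Experiment first):

from piml import Experiment

exp = Experiment()                 # initialize a PiML experiment
exp.data_loader(data="CoCircles")  # load the built-in CoCircles dataset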
2.1.2. External Dataset (csv files)¶
There are two ways of loading CSV files in PiML.
In low-code mode, you can simply click the “upload new data” button to upload your data.
In high-code mode, you can wrap your data in a pandas DataFrame and pass it to the data loader. For example:
import pandas as pd

data = pd.read_csv('https://github.com/SelfExplainML/PiML-Toolbox/blob/main/datasets/BikeSharing.csv?raw=true')
exp.data_loader(data=data)
2.1.3. External Dataset (Spark file)¶
Starting from PiML-0.6.0, we support loading data using the Spark backend. The data should be in a format that Spark can read into a DataFrame. For example, you can load data from a Parquet file using the following code:
exp.data_loader(data="./myfile.parquet", spark=True, spark_sample_size=10000, spark_random_state=0)
The argument spark=True tells the program to load the data using the Spark backend, data denotes the file path, and spark_sample_size is the expected sample size. By default, we do purely random subsampling, and the sample size is converted into the frac parameter in Spark. For example, if the original data has 100000 samples and spark_sample_size is 10000, then the frac parameter will be set to 0.1. However, due to the working mechanism of Spark, the actual sample size may differ slightly from the expected one.
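For intuition, this purely random subsampling can be reproduced in plain PySpark as in the sketch below (illustrative only, not PiML's internal code; the file path and sizes are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("./myfile.parquet")

frac = 10000 / df.count()                  # expected sample size / total rows, e.g. 10000 / 100000 = 0.1
sample = df.sample(fraction=frac, seed=0)  # Bernoulli sampling, so the actual size varies slightly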
If you want to do stratified sampling, you can specify the stratification feature using the spark_sample_by_feature parameter. For example, to stratify on the feature ‘Y’, you can use the following code:
exp.data_loader(data="./myfile.parquet", spark=True, spark_sample_size=10000,
spark_sample_by_feature='Y', spark_random_state=0)
Here, ‘Y’ should be a categorical feature, and it must be one of the columns in the data. By default, each category is sampled at the same rate. If you want to specify per-category ratios, you can use the spark_sample_fractions parameter, as shown below:
exp.data_loader(data="./myfile.parquet", spark=True, spark_sample_size=10000,
spark_sample_by_feature='Y', spark_sample_fractions={0.0: 1, 1.0: 5},
spark_random_state=0)
The spark_sample_fractions parameter is a dictionary whose keys are the categories and whose values are the ratios. In the above example, the ratio between categories 0.0 and 1.0 is 1:5.
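In plain PySpark, such stratified sampling can be expressed with DataFrame.sampleBy, which takes per-category sampling fractions. The sketch below is illustrative only; the concrete fractions are made up for this example and do not reflect PiML's exact internal computation:

# Category 1.0 is sampled at 5x the rate of category 0.0,
# mirroring spark_sample_fractions={0.0: 1, 1.0: 5}.
sample = df.sampleBy("Y", fractions={0.0: 0.05, 1.0: 0.25}, seed=0)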
2.1.4. Examples¶
The full example code for this section can be found at the following link.