Data Load (Spark)

Generate a Parquet file for demonstration

import numpy as np
import pandas as pd

np.random.seed(0)
original_df = pd.DataFrame(
    np.hstack([np.random.randint(2, size=(100000, 1)), np.random.uniform(-1, 1, size=(100000, 10))]),
    columns=["Y"] + ["X" + str(i) for i in range(10)]
)
original_df.to_parquet('myfile.parquet')

Experiment initialization

from piml import Experiment
exp = Experiment()

Data loading with 10000 samples (purely random sampling)

exp.data_loader(data="./myfile.parquet", spark=True, spark_sample_size=10000, spark_random_state=0)
        Y        X0        X1        X2        X3        X4        X5  \
0     1.0  0.722108  0.630742 -0.179797  0.046212 -0.622879 -0.087904
1     1.0 -0.100578  0.064457  0.261228  0.298484  0.806278  0.297039
2     0.0  0.306140 -0.979178  0.924876  0.148939  0.794436 -0.026896
3     1.0 -0.597057 -0.736664 -0.261621 -0.585320  0.459879 -0.581889
4     1.0  0.262903 -0.676545 -0.521757  0.051982 -0.439505  0.422332
...   ...       ...       ...       ...       ...       ...       ...
9936  1.0 -0.377520 -0.149886  0.267827 -0.666580 -0.233089 -0.401495
9937  0.0  0.952963 -0.415568  0.688008 -0.225129 -0.169116 -0.190439
9938  0.0  0.601043 -0.634523 -0.282413  0.934208  0.492232 -0.598559
9939  0.0 -0.524421 -0.683701  0.365216 -0.107878  0.095759 -0.190358
9940  0.0 -0.347192 -0.458382  0.415295  0.618297  0.117932  0.873784

            X6        X7        X8        X9
0     0.524907 -0.493184 -0.508395  0.993584
1     0.970116  0.378889  0.077614 -0.703995
2     0.054786  0.890156  0.752913 -0.938883
3     0.791780  0.018643 -0.365007 -0.048767
4    -0.995145  0.868960  0.452228 -0.110775
...        ...       ...       ...       ...
9936  0.543934 -0.444068  0.231257 -0.025619
9937 -0.352821  0.611961 -0.665948 -0.449606
9938 -0.500579  0.161602  0.190709 -0.113384
9939  0.462231  0.612478 -0.587393  0.313063
9940 -0.377266 -0.734037 -0.113734  0.830916

[9941 rows x 11 columns]
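
The loader returned 9941 rows rather than exactly 10000. One plausible explanation, assuming the sampling is fraction-based in the style of Spark's `DataFrame.sample`, is that each row is kept independently with probability `sample_size / total_rows`, so the row count fluctuates around the target. How PiML calls Spark internally is an assumption here, not something this example shows. The NumPy-only sketch below simulates that behaviour; the names `n_rows`, `target`, and `keep` are illustrative.

```python
import numpy as np

n_rows = 100_000   # rows written to myfile.parquet
target = 10_000    # requested sample size
fraction = target / n_rows

rng = np.random.default_rng(0)
# Bernoulli sampling (assumed mechanism): keep each row independently
# with probability `fraction`
keep = rng.random(n_rows) < fraction
n_kept = int(keep.sum())

# The kept count is binomial: mean = target, std = sqrt(n * p * (1 - p))
std = np.sqrt(n_rows * fraction * (1 - fraction))
print(n_kept, round(std, 1))
```

Under this assumption the standard deviation is about 95 rows, so a result such as 9941 is well within the expected spread.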

Data loading with 10000 samples (stratified sampling)

exp.data_loader(data="./myfile.parquet", spark=True, spark_sample_size=10000,
                spark_sample_by_feature='Y', spark_random_state=0)
        Y        X0        X1        X2        X3        X4        X5  \
0     1.0  0.722108  0.630742 -0.179797  0.046212 -0.622879 -0.087904
1     1.0 -0.100578  0.064457  0.261228  0.298484  0.806278  0.297039
2     0.0  0.306140 -0.979178  0.924876  0.148939  0.794436 -0.026896
3     1.0 -0.597057 -0.736664 -0.261621 -0.585320  0.459879 -0.581889
4     1.0  0.262903 -0.676545 -0.521757  0.051982 -0.439505  0.422332
...   ...       ...       ...       ...       ...       ...       ...
9946  1.0 -0.377520 -0.149886  0.267827 -0.666580 -0.233089 -0.401495
9947  0.0  0.952963 -0.415568  0.688008 -0.225129 -0.169116 -0.190439
9948  0.0  0.601043 -0.634523 -0.282413  0.934208  0.492232 -0.598559
9949  0.0 -0.524421 -0.683701  0.365216 -0.107878  0.095759 -0.190358
9950  0.0 -0.347192 -0.458382  0.415295  0.618297  0.117932  0.873784

            X6        X7        X8        X9
0     0.524907 -0.493184 -0.508395  0.993584
1     0.970116  0.378889  0.077614 -0.703995
2     0.054786  0.890156  0.752913 -0.938883
3     0.791780  0.018643 -0.365007 -0.048767
4    -0.995145  0.868960  0.452228 -0.110775
...        ...       ...       ...       ...
9946  0.543934 -0.444068  0.231257 -0.025619
9947 -0.352821  0.611961 -0.665948 -0.449606
9948 -0.500579  0.161602  0.190709 -0.113384
9949  0.462231  0.612478 -0.587393  0.313063
9950 -0.377266 -0.734037 -0.113734  0.830916

[9951 rows x 11 columns]
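
Stratified sampling keeps the class balance of `Y` in the sample close to that of the full data. The pandas sketch below illustrates the idea, assuming the same fraction is applied within every `Y` group (analogous to Spark's `DataFrame.sampleBy`; whether PiML uses that call is an assumption). The small frame `df` is a stand-in for the Parquet data, and unlike fraction-based Spark sampling, pandas returns a fixed count per group.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Stand-in data: a binary target and one feature (illustrative only)
df = pd.DataFrame({
    "Y": np.random.randint(2, size=100_000).astype(float),
    "X0": np.random.uniform(-1, 1, size=100_000),
})

fraction = 10_000 / len(df)
# Sample the same fraction inside each Y group so class shares are preserved
sample = df.groupby("Y").sample(frac=fraction, random_state=0)

full_share = df["Y"].mean()        # share of Y == 1.0 in the full data
sample_share = sample["Y"].mean()  # share of Y == 1.0 in the sample
print(len(sample), round(full_share, 3), round(sample_share, 3))
```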

Data loading with 10000 samples (stratified sampling with given uneven ratios)

exp.data_loader(data="./myfile.parquet", spark=True, spark_sample_size=10000,
                spark_sample_by_feature='Y', spark_sample_fractions={0.0: 1, 1.0: 5},
                spark_random_state=0)
        Y        X0        X1        X2        X3        X4        X5  \
0     1.0  0.722108  0.630742 -0.179797  0.046212 -0.622879 -0.087904
1     1.0 -0.100578  0.064457  0.261228  0.298484  0.806278  0.297039
2     1.0 -0.321004 -0.340024 -0.614322 -0.863309  0.402909 -0.380781
3     1.0 -0.597057 -0.736664 -0.261621 -0.585320  0.459879 -0.581889
4     1.0  0.262903 -0.676545 -0.521757  0.051982 -0.439505  0.422332
...   ...       ...       ...       ...       ...       ...       ...
9954  1.0 -0.802121 -0.425349  0.139584 -0.365250  0.677365  0.962125
9955  1.0 -0.377520 -0.149886  0.267827 -0.666580 -0.233089 -0.401495
9956  0.0  0.952963 -0.415568  0.688008 -0.225129 -0.169116 -0.190439
9957  0.0  0.601043 -0.634523 -0.282413  0.934208  0.492232 -0.598559
9958  1.0 -0.046856  0.428831  0.600180  0.017023 -0.689377 -0.804052

            X6        X7        X8        X9
0     0.524907 -0.493184 -0.508395  0.993584
1     0.970116  0.378889  0.077614 -0.703995
2     0.854092  0.558963 -0.723463  0.290868
3     0.791780  0.018643 -0.365007 -0.048767
4    -0.995145  0.868960  0.452228 -0.110775
...        ...       ...       ...       ...
9954 -0.235756 -0.846328 -0.983629  0.139173
9955  0.543934 -0.444068  0.231257 -0.025619
9956 -0.352821  0.611961 -0.665948 -0.449606
9957 -0.500579  0.161602  0.190709 -0.113384
9958  0.822894  0.489409  0.372789 -0.979409

[9959 rows x 11 columns]
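
One way to read `spark_sample_fractions={0.0: 1, 1.0: 5}` is as relative sampling rates: rows with `Y == 1.0` are drawn five times as often as rows with `Y == 0.0`, with the rates scaled so that the expected total matches `spark_sample_size`. This interpretation is an assumption, not confirmed by the example, although it is consistent with the printed output being dominated by `Y = 1.0`. The sketch below works out the per-class fractions under that reading; `weights`, `scale`, and `expected` are illustrative names.

```python
import numpy as np

np.random.seed(0)
# Stand-in for the Y column of the demonstration data
y = np.random.randint(2, size=100_000).astype(float)

target = 10_000
weights = {0.0: 1, 1.0: 5}  # relative sampling rates (assumed interpretation)

# Rows per class in the full data
counts = {k: int((y == k).sum()) for k in weights}
# Scale the weights so the expected total sample size equals `target`
scale = target / sum(weights[k] * counts[k] for k in weights)
fractions = {k: weights[k] * scale for k in weights}
# Expected number of sampled rows per class
expected = {k: fractions[k] * counts[k] for k in weights}
print(fractions, expected)
```

With roughly balanced classes, this gives about one sixth of the sample to `Y = 0.0` and five sixths to `Y = 1.0`.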

Total running time of the script: ( 0 minutes 39.289 seconds)

Estimated memory usage: 54 MB
