Data Load (Spark)
Generate a Parquet file for demonstration
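import numpy as np
import pandas as pd

# Fix the seed so the generated data is reproducible
np.random.seed(0)
# 100,000 rows: a binary target Y followed by ten features X0..X9 drawn uniformly from [-1, 1)
original_df = pd.DataFrame(
    np.hstack([np.random.randint(2, size=(100000, 1)), np.random.uniform(-1, 1, size=(100000, 10))]),
    columns=["Y"] + ["X" + str(i) for i in range(10)]
)
# Write the data to disk so it can be loaded through Spark below
original_df.to_parquet('myfile.parquet')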
Experiment initialization
from piml import Experiment
exp = Experiment()
Data loading with 10000 samples (purely random sampling)
exp.data_loader(data="./myfile.parquet", spark=True, spark_sample_size=10000, spark_random_state=0)
Y X0 X1 X2 X3 X4 X5 \
0 1.0 0.722108 0.630742 -0.179797 0.046212 -0.622879 -0.087904
1 1.0 -0.100578 0.064457 0.261228 0.298484 0.806278 0.297039
2 0.0 0.306140 -0.979178 0.924876 0.148939 0.794436 -0.026896
3 1.0 -0.597057 -0.736664 -0.261621 -0.585320 0.459879 -0.581889
4 1.0 0.262903 -0.676545 -0.521757 0.051982 -0.439505 0.422332
... ... ... ... ... ... ... ...
9936 1.0 -0.377520 -0.149886 0.267827 -0.666580 -0.233089 -0.401495
9937 0.0 0.952963 -0.415568 0.688008 -0.225129 -0.169116 -0.190439
9938 0.0 0.601043 -0.634523 -0.282413 0.934208 0.492232 -0.598559
9939 0.0 -0.524421 -0.683701 0.365216 -0.107878 0.095759 -0.190358
9940 0.0 -0.347192 -0.458382 0.415295 0.618297 0.117932 0.873784
X6 X7 X8 X9
0 0.524907 -0.493184 -0.508395 0.993584
1 0.970116 0.378889 0.077614 -0.703995
2 0.054786 0.890156 0.752913 -0.938883
3 0.791780 0.018643 -0.365007 -0.048767
4 -0.995145 0.868960 0.452228 -0.110775
... ... ... ... ...
9936 0.543934 -0.444068 0.231257 -0.025619
9937 -0.352821 0.611961 -0.665948 -0.449606
9938 -0.500579 0.161602 0.190709 -0.113384
9939 0.462231 0.612478 -0.587393 0.313063
9940 -0.377266 -0.734037 -0.113734 0.830916
[9941 rows x 11 columns]
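The output above has 9941 rows rather than exactly 10000. One plausible explanation, offered as an assumption and not as a statement about PiML's internals, is that fraction-based sampling keeps each row independently with probability sample_size / n_rows, so the returned count is only approximately the requested size. Spark's own DataFrame.sample documents that the fraction is not guaranteed exactly. The sketch below simulates that behaviour with NumPy only; `bernoulli_sample` is a hypothetical helper introduced for illustration.

```python
import numpy as np


def bernoulli_sample(n_rows, sample_size, seed=0):
    """Return the indices kept when every row is retained independently
    with probability sample_size / n_rows (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    fraction = min(1.0, sample_size / n_rows)
    keep = rng.random(n_rows) < fraction  # one independent coin flip per row
    return np.flatnonzero(keep)


kept = bernoulli_sample(n_rows=100_000, sample_size=10_000, seed=0)
# The count lands near 10000 but is generally not equal to it.
print(len(kept))
```

If an exact row count matters, one option is to take the first N rows of the returned sample after loading.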
Data loading with 10000 samples (stratified sampling)
exp.data_loader(data="./myfile.parquet", spark=True, spark_sample_size=10000,
                spark_sample_by_feature='Y', spark_random_state=0)
Y X0 X1 X2 X3 X4 X5 \
0 1.0 0.722108 0.630742 -0.179797 0.046212 -0.622879 -0.087904
1 1.0 -0.100578 0.064457 0.261228 0.298484 0.806278 0.297039
2 0.0 0.306140 -0.979178 0.924876 0.148939 0.794436 -0.026896
3 1.0 -0.597057 -0.736664 -0.261621 -0.585320 0.459879 -0.581889
4 1.0 0.262903 -0.676545 -0.521757 0.051982 -0.439505 0.422332
... ... ... ... ... ... ... ...
9946 1.0 -0.377520 -0.149886 0.267827 -0.666580 -0.233089 -0.401495
9947 0.0 0.952963 -0.415568 0.688008 -0.225129 -0.169116 -0.190439
9948 0.0 0.601043 -0.634523 -0.282413 0.934208 0.492232 -0.598559
9949 0.0 -0.524421 -0.683701 0.365216 -0.107878 0.095759 -0.190358
9950 0.0 -0.347192 -0.458382 0.415295 0.618297 0.117932 0.873784
X6 X7 X8 X9
0 0.524907 -0.493184 -0.508395 0.993584
1 0.970116 0.378889 0.077614 -0.703995
2 0.054786 0.890156 0.752913 -0.938883
3 0.791780 0.018643 -0.365007 -0.048767
4 -0.995145 0.868960 0.452228 -0.110775
... ... ... ... ...
9946 0.543934 -0.444068 0.231257 -0.025619
9947 -0.352821 0.611961 -0.665948 -0.449606
9948 -0.500579 0.161602 0.190709 -0.113384
9949 0.462231 0.612478 -0.587393 0.313063
9950 -0.377266 -0.734037 -0.113734 0.830916
[9951 rows x 11 columns]
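Stratified sampling by Y draws the same fraction from each class, so the class balance of the sample matches that of the full data. The following is a minimal pandas sketch of that idea on synthetic data shaped like the demonstration file; it is not PiML's or Spark's implementation, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Y": rng.integers(2, size=100_000).astype(float),
    "X0": rng.uniform(-1, 1, size=100_000),
})

fraction = 10_000 / len(df)
# Draw the same fraction of rows from every class of Y
stratified = df.groupby("Y").sample(frac=fraction, random_state=0)

# Share of Y == 1 in the full data versus in the sample
print(df["Y"].mean(), stratified["Y"].mean())
```

Unlike the 9951 rows shown above, this pandas version returns a fixed number of rows per class, so the total differs from 10000 only by rounding.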
Data loading with 10000 samples (stratified sampling with given uneven ratios)
exp.data_loader(data="./myfile.parquet", spark=True, spark_sample_size=10000,
                spark_sample_by_feature='Y', spark_sample_fractions={0.0: 1, 1.0: 5},
                spark_random_state=0)
Y X0 X1 X2 X3 X4 X5 \
0 1.0 0.722108 0.630742 -0.179797 0.046212 -0.622879 -0.087904
1 1.0 -0.100578 0.064457 0.261228 0.298484 0.806278 0.297039
2 1.0 -0.321004 -0.340024 -0.614322 -0.863309 0.402909 -0.380781
3 1.0 -0.597057 -0.736664 -0.261621 -0.585320 0.459879 -0.581889
4 1.0 0.262903 -0.676545 -0.521757 0.051982 -0.439505 0.422332
... ... ... ... ... ... ... ...
9954 1.0 -0.802121 -0.425349 0.139584 -0.365250 0.677365 0.962125
9955 1.0 -0.377520 -0.149886 0.267827 -0.666580 -0.233089 -0.401495
9956 0.0 0.952963 -0.415568 0.688008 -0.225129 -0.169116 -0.190439
9957 0.0 0.601043 -0.634523 -0.282413 0.934208 0.492232 -0.598559
9958 1.0 -0.046856 0.428831 0.600180 0.017023 -0.689377 -0.804052
X6 X7 X8 X9
0 0.524907 -0.493184 -0.508395 0.993584
1 0.970116 0.378889 0.077614 -0.703995
2 0.854092 0.558963 -0.723463 0.290868
3 0.791780 0.018643 -0.365007 -0.048767
4 -0.995145 0.868960 0.452228 -0.110775
... ... ... ... ...
9954 -0.235756 -0.846328 -0.983629 0.139173
9955 0.543934 -0.444068 0.231257 -0.025619
9956 -0.352821 0.611961 -0.665948 -0.449606
9957 -0.500579 0.161602 0.190709 -0.113384
9958 0.822894 0.489409 0.372789 -0.979409
[9959 rows x 11 columns]
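In the output above most rows have Y = 1.0, which is consistent with reading {0.0: 1, 1.0: 5} as relative class weights of 1:5 in the sample. That reading is an assumption here, not something the example confirms. Under it, the weights can be converted into per-class sampling probabilities as sketched below; `per_class_fractions` is a hypothetical helper and not part of PiML.

```python
import numpy as np
import pandas as pd


def per_class_fractions(class_counts, weights, sample_size):
    """Convert relative class weights into per-class sampling fractions
    (hypothetical helper; assumes weights are relative ratios)."""
    total_weight = sum(weights.values())
    fractions = {}
    for label, weight in weights.items():
        target = sample_size * weight / total_weight  # desired rows for this class
        fractions[label] = min(1.0, target / class_counts[label])
    return fractions


rng = np.random.default_rng(0)
y = pd.Series(rng.integers(2, size=100_000).astype(float), name="Y")

fractions = per_class_fractions(y.value_counts().to_dict(), {0.0: 1, 1.0: 5}, 10_000)

# Keep each row independently with its class-specific probability
keep = rng.random(len(y)) < y.map(fractions).to_numpy()
sample = y[keep]
print(sample.value_counts(normalize=True))
```

With these weights roughly five sixths of the sampled rows should belong to class 1.0, and the total again lands near, not exactly at, the requested 10000.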
Total running time of the script: ( 0 minutes 39.289 seconds)
Estimated memory usage: 54 MB