.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples\0_data\plot_4_data_quality.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_0_data_plot_4_data_quality.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_0_data_plot_4_data_quality.py:


Data Quality Check
=====================================

Data quality analysis results, using the BikeSharing dataset as an example.

.. GENERATED FROM PYTHON SOURCE LINES 9-10

Experiment initialization and data preparation

.. GENERATED FROM PYTHON SOURCE LINES 10-19

.. code-block:: default

    from piml import Experiment
    from piml.data.outlier_detection import (PCA, CBLOF, IsolationForest, KMeansTree,
                                             OneClassSVM, KNN, HBOS, ECOD)

    exp = Experiment()
    exp.data_loader(data="BikeSharing", silent=True)
    exp.data_summary(feature_exclude=["yr", "mnth", "temp"], silent=True)
    exp.data_prepare(target="cnt", task_type="regression", silent=True)

.. GENERATED FROM PYTHON SOURCE LINES 20-21

Data integrity check for each column

.. GENERATED FROM PYTHON SOURCE LINES 21-24

.. code-block:: default

    res = exp.data_quality(show='integrity_single_column_check', return_data=True)
    res.data

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                Is Single Value  Null Ratio  Mixed Data Types                Long String   Special Characters         New Categories
                                             Mixed  Categorical : Numerical  Num   Index   Ratio     Example Samples  Num  Example Samples
    season      False            0.000000    False  -                        0     []      0.000000  []               0    []
    hr          False            0.000000    False  -                        0     []      0.000000  []               -    -
    holiday     False            0.000000    False  -                        0     []      0.000000  []               0    []
    weekday     False            0.000000    False  -                        0     []      0.000000  []               -    -
    workingday  False            0.000000    False  -                        0     []      0.000000  []               0    []
    weathersit  False            0.000000    False  -                        0     []      0.000000  []               0    []
    atemp       False            0.000000    False  -                        0     []      0.000000  []               -    -
    hum         False            0.000000    False  -                        0     []      0.000000  []               -    -
    windspeed   False            0.000000    False  -                        0     []      0.000000  []               -    -
    cnt         False            0.000000    False  -                        0     []      0.000000  []               -    -
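
Three of the per-column checks reported above (single value, null ratio, mixed data types) can be approximated in plain Python. The following is a minimal sketch under stated assumptions, not PiML's implementation: the helper ``column_integrity`` and the toy column are hypothetical, and PiML's own definitions of these checks may differ.

.. code-block:: python

    import math


    def column_integrity(values):
        """Return single-value, null-ratio, and mixed-type flags for one column.

        Hypothetical helper for illustration only.
        """
        def is_null(v):
            return v is None or (isinstance(v, float) and math.isnan(v))

        present = [v for v in values if not is_null(v)]
        # Share of entries that are missing (None or NaN).
        null_ratio = (len(values) - len(present)) / len(values) if values else 0.0
        # A column is "mixed" here if it holds both strings and non-strings.
        kinds = {"str" if isinstance(v, str) else "num" for v in present}
        return {
            "is_single_value": len(set(present)) <= 1,
            "null_ratio": null_ratio,
            "mixed_data_types": len(kinds) > 1,
        }


    # Toy column (not taken from BikeSharing): one missing value, one stray string.
    toy = [1.0, 2.0, None, "3", 2.0]
    report = column_integrity(toy)
    print(report)

In the BikeSharing output above, every column reports ``False`` for the single-value and mixed-type checks and a null ratio of zero, so a helper like this would flag nothing on that data.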


.. GENERATED FROM PYTHON SOURCE LINES 25-26

Data integrity check for duplicated samples

.. GENERATED FROM PYTHON SOURCE LINES 26-29

.. code-block:: default

    res = exp.data_quality(show='integrity_duplicated_samples', return_data=True)
    res.data

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                    Leakage  season   hr  holiday  weekday  workingday  weathersit   atemp   hum  windspeed   cnt
    [1507, 9867]      False     1.0  4.0      0.0      2.0         1.0         1.0  0.2727  0.64     0.0000   2.0
    [9822, 17336]     False     1.0  5.0      0.0      0.0         0.0         2.0  0.2273  0.48     0.2985   2.0
    [13559, 13727]     True     3.0  4.0      0.0      2.0         1.0         1.0  0.6061  0.83     0.0896   6.0
    [5598, 14639]     False     3.0  4.0      0.0      5.0         1.0         1.0  0.5606  0.88     0.0000   8.0
    [7958, 8126]      False     4.0  6.0      0.0      6.0         0.0         1.0  0.2576  0.65     0.1045  11.0
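
The output above groups the row indices of identical samples, for example ``[1507, 9867]``. The following is a minimal sketch of that grouping idea in plain Python, not PiML's implementation: the helper ``duplicated_groups`` and the toy rows are hypothetical, and the ``Leakage`` column is not reproduced.

.. code-block:: python

    from collections import defaultdict


    def duplicated_groups(rows):
        """Map each repeated row (as a tuple) to the indices where it occurs."""
        seen = defaultdict(list)
        for idx, row in enumerate(rows):
            seen[tuple(row)].append(idx)
        # Keep only rows that appear more than once.
        return {key: idxs for key, idxs in seen.items() if len(idxs) > 1}


    # Toy rows (hypothetical): rows 0 and 2 are identical.
    rows = [
        [1.0, 4.0, 0.2727, 2.0],
        [1.0, 5.0, 0.2273, 2.0],
        [1.0, 4.0, 0.2727, 2.0],
    ]
    groups = duplicated_groups(rows)
    print(groups)

Exact tuple matching only catches rows that are identical in every column, which is also what the five groups in the output above look like.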


.. GENERATED FROM PYTHON SOURCE LINES 30-31

Data integrity check for correlated features

.. GENERATED FROM PYTHON SOURCE LINES 31-34

.. code-block:: default

    res = exp.data_quality(show='integrity_highly_correlated_features', return_data=True)
    res.data

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                season     hr  holiday  weekday  workingday  weathersit  atemp    hum  windspeed    cnt
    season        1.00   0.01     0.00     0.01        0.00        0.00   0.77   0.16       0.16   0.26
    hr            0.01   1.00     0.00    -0.00        0.00        0.05   0.13  -0.28       0.14   0.51
    holiday       0.00   0.00     1.00     0.10        0.09        0.00   0.03   0.01       0.00   0.03
    weekday       0.01  -0.00     0.10     1.00        0.04        0.00  -0.01  -0.04       0.01   0.03
    workingday    0.00   0.00     0.09     0.04        1.00        0.00   0.05   0.02       0.01   0.03
    weathersit    0.00   0.05     0.00     0.00        0.00        1.00   0.11   0.42       0.08   0.15
    atemp         0.77   0.13     0.03    -0.01        0.05        0.11   1.00  -0.05      -0.04   0.42
    hum           0.16  -0.28     0.01    -0.04        0.02        0.42  -0.05   1.00      -0.29  -0.36
    windspeed     0.16   0.14     0.00     0.01        0.01        0.08  -0.04  -0.29       1.00   0.13
    cnt           0.26   0.51     0.03     0.03        0.03        0.15   0.42  -0.36       0.13   1.00
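
One way to read a matrix like the one above is to list the feature pairs whose absolute value exceeds a cutoff. The following is a minimal sketch of that step only: the helper ``highly_correlated`` and the cutoff of 0.7 are hypothetical rather than PiML defaults, and the correlation measure behind the matrix is not assumed.

.. code-block:: python

    # A few pairwise values copied from the matrix above.
    corr = {
        ("season", "atemp"): 0.77,
        ("hr", "cnt"): 0.51,
        ("weathersit", "hum"): 0.42,
        ("hum", "cnt"): -0.36,
    }


    def highly_correlated(corr, threshold):
        """Return feature pairs whose absolute value exceeds the threshold."""
        return sorted(pair for pair, value in corr.items() if abs(value) > threshold)


    # 0.7 is an illustrative cutoff, not a PiML default.
    flagged = highly_correlated(corr, threshold=0.7)
    print(flagged)

With this cutoff, only the ``season`` and ``atemp`` pair (0.77) is flagged among the copied values.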


.. GENERATED FROM PYTHON SOURCE LINES 35-36

Outlier score distribution plot (PCA)

.. GENERATED FROM PYTHON SOURCE LINES 36-38

.. code-block:: default

    exp.data_quality(method=PCA(), show='od_score_distribution', threshold=0.999, figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_001.png
   :alt: Score Distribution (PCA)
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 39-40

Outlier score distribution plot (CBLOF)

.. GENERATED FROM PYTHON SOURCE LINES 40-42

.. code-block:: default

    exp.data_quality(method=CBLOF(), show='od_score_distribution', threshold=0.999, figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_002.png
   :alt: Score Distribution (CBLOF)
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 43-44

Outlier score distribution plot (IsolationForest)

.. GENERATED FROM PYTHON SOURCE LINES 44-46

.. code-block:: default

    exp.data_quality(method=IsolationForest(), show='od_score_distribution', threshold=0.999, figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_003.png
   :alt: Score Distribution (IsolationForest)
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 47-48

Outlier score distribution plot (KMeansTree)

.. GENERATED FROM PYTHON SOURCE LINES 48-50

.. code-block:: default

    exp.data_quality(method=KMeansTree(), show='od_score_distribution', threshold=0.999, figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_004.png
   :alt: Score Distribution (KMeansTree)
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_004.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 51-52

Outlier score distribution plot (KNN)

.. GENERATED FROM PYTHON SOURCE LINES 52-54

.. code-block:: default

    exp.data_quality(method=KNN(), show='od_score_distribution', threshold=0.999, figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_005.png
   :alt: Score Distribution (KNN)
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_005.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 55-56

Outlier score distribution plot (HBOS)

.. GENERATED FROM PYTHON SOURCE LINES 56-58

.. code-block:: default

    exp.data_quality(method=HBOS(), show='od_score_distribution', threshold=0.999, figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_006.png
   :alt: Score Distribution (HBOS)
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_006.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 59-60

Outlier score distribution plot (ECOD)

.. GENERATED FROM PYTHON SOURCE LINES 60-62

.. code-block:: default

    exp.data_quality(method=ECOD(), show='od_score_distribution', threshold=0.999, figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_007.png
   :alt: Score Distribution (ECOD)
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_007.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 63-64

Outlier score distribution plot (OneClassSVM)

.. GENERATED FROM PYTHON SOURCE LINES 64-66

.. code-block:: default

    exp.data_quality(method=OneClassSVM(), show='od_score_distribution', threshold=0.999, figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_008.png
   :alt: Score Distribution (OneClassSVM)
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_008.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 67-68

Marginal outlier distribution plot (PCA)

.. GENERATED FROM PYTHON SOURCE LINES 68-71

.. code-block:: default

    exp.data_quality(method=PCA(), show='od_marginal_outlier_distribution',
                     threshold=0.999, figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_009.png
   :alt: Outliers distribution (PCA)
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_009.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 72-73

Compare different outlier detection algorithms with a t-SNE projection

.. GENERATED FROM PYTHON SOURCE LINES 73-76

.. code-block:: default

    exp.data_quality(method=[PCA(), CBLOF()], show='od_tsne_comparison',
                     threshold=[0.999, 0.999], figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_010.png
   :alt: t-SNE projection
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_010.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 77-78

Select a method and threshold, then remove the detected outliers (the ``dataset`` argument selects the train, test, or all data)

.. GENERATED FROM PYTHON SOURCE LINES 78-81

.. code-block:: default

    exp.data_quality(method=CBLOF(), show='od_score_distribution', dataset="train",
                     threshold=0.999, remove_outliers=True, figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_011.png
   :alt: Score Distribution (CBLOF)
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_011.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 82-83

Compute the energy distance between the train and test data.

.. GENERATED FROM PYTHON SOURCE LINES 83-85

.. code-block:: default

    exp.data_quality(show='drift_test_info')

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       Train Size  Test Size  Energy Distance
    0       13889       3476         0.000488

.. GENERATED FROM PYTHON SOURCE LINES 86-87

Compare the marginal drift between the train and test data, feature by feature.

.. GENERATED FROM PYTHON SOURCE LINES 87-89

.. code-block:: default

    exp.data_quality(show='drift_test_distance', figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_012.png
   :alt: Population Stability Index (PSI) - Top 30
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_012.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 90-91

Compare the marginal drift between the train and test data for a single feature (here ``atemp``).

.. GENERATED FROM PYTHON SOURCE LINES 91-93

.. code-block:: default

    exp.data_quality(show="drift_test_distance", distance_metric="PSI",
                     psi_buckets='quantile', show_feature="atemp", figsize=(5, 4))

.. image-sg:: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_013.png
   :alt: Distribution plot
   :srcset: /auto_examples/0_data/images/sphx_glr_plot_4_data_quality_013.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (2 minutes 3.914 seconds)

**Estimated memory usage:** 1326 MB


.. _sphx_glr_download_auto_examples_0_data_plot_4_data_quality.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/selfexplainml/piml-toolbox/main?urlpath=lab/tree/./docs/_build/html/notebooks/auto_examples/0_data/plot_4_data_quality.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_4_data_quality.py <plot_4_data_quality.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_4_data_quality.ipynb <plot_4_data_quality.ipynb>`


.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_