2.5. Data Quality (Integrity Check)¶
The data quality module in PiML is designed to identify issues related to data. It comprises three submodules: data integrity check, outlier detection, and train-test data drift detection. This article introduces the data integrity check, which helps ensure data accuracy by validating that data values conform to the expected format, range, and type.
The data integrity check can be run once the data is loaded, and all columns except those removed in the exp.data_summary step will be tested. Three types of data integrity checks are provided: single-column checks, duplicated samples, and highly correlated features. Note that most of the checks in this submodule are built on the deepchecks Python package.
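The following is a minimal sketch of a typical PiML workflow leading up to the integrity checks. The built-in dataset name ("BikeSharing"), the target column ("cnt"), and the exact data_prepare arguments are assumptions for illustration and may need to be adapted to your own data.

# Minimal workflow sketch (illustrative; adjust dataset, target, and arguments)
from piml import Experiment

exp = Experiment()
exp.data_loader(data="BikeSharing")       # load a built-in or custom dataset
exp.data_summary()                        # inspect columns; excluded columns are skipped by the checks
exp.data_prepare(target="cnt", task_type="regression", test_ratio=0.2)  # needed for train-test dependent checks

# the integrity checks described in this section can then be run
exp.data_quality(show="integrity_single_column_check")
exp.data_quality(show="integrity_duplicated_samples")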
2.5.1. Single-column Checks¶
This set of integrity tests is tailored for the analysis of individual columns. It can be run with the following command:
exp.data_quality(show="integrity_single_column_check")
The output of this method is a table that includes the following columns:
‘Is Single Value’: checks whether any columns have only a single unique value across all rows.
‘Null Ratio’: calculates the ratio of ‘null’ or ‘nan’ values in each column.
‘Mixed Data Types’: detects columns containing a mixture of numerical and string values.
‘Mixed’: an indicator of whether the column includes mixed data types.
‘Categorical : Numerical’: when ‘Mixed’ is True, it indicates the ratio of categorical data to numerical data.
‘Long String’: identifies strings with lengths significantly longer than the expected “normal” string lengths.
‘Num’: indicates the number of long-string samples.
‘Index’: specifies the sample index of the long string samples.
‘Special Characters’: checks for the presence of special characters in each column.
‘Ratio’: represents the proportion of samples containing special characters.
‘Example Samples’: displays the top two examples of special characters.
‘New Categories’: identifies categories that appear in the test set but not in the training set. Note that this functionality only works after exp.data_prepare is executed.
‘Num’: indicates the number of samples with new categories.
‘Example Samples’: lists new categories present in the test dataset.
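For intuition, the sketch below recomputes a few of these statistics (single value, null ratio, and mixed data types) with plain pandas on a toy DataFrame. It only illustrates what the corresponding table entries mean; PiML's actual checks are built on the deepchecks package and may differ in detail.

# Illustrative only: roughly recompute 'Is Single Value', 'Null Ratio',
# and 'Mixed Data Types' with plain pandas on a toy DataFrame.
import numpy as np
import pandas as pd

def looks_numeric(x):
    # crude numeric test used only for this illustration
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

df = pd.DataFrame({
    "const": [1, 1, 1, 1],                  # single unique value
    "amount": [10.5, np.nan, 3.2, np.nan],  # 50% missing values
    "mixed": [5, "five", 7, 8],             # numbers and strings mixed
})

for col in df.columns:
    values = df[col].dropna()
    is_single_value = values.nunique() <= 1
    null_ratio = df[col].isna().mean()
    numeric_share = values.map(looks_numeric).mean() if len(values) else 0.0
    is_mixed = 0.0 < numeric_share < 1.0
    print(col, is_single_value, round(null_ratio, 2), is_mixed)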
2.5.2. Duplicated Samples¶
This method includes a test for detecting duplicated samples, i.e., two or more samples whose feature values are identical. See the usage and results below.
exp.data_quality(show="integrity_duplicated_samples")
The output of this command summarizes all detected duplicated samples; each row shows one group of duplicates, and the table will be empty if no duplication is found. The index column lists the indexes of the duplicated samples in the raw data. For example, the first row has the index [1507, 9867], meaning that the samples with indexes 1507 and 9867 are identical, and their shared feature values are listed in the table.
After exp.data_prepare is executed, there will be an additional column called “leakage”. If the same sample is found in both the training and testing sets, the value of ‘leakage’ will be set to True, indicating the presence of data leakage; otherwise, it will be set to False.
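As a rough illustration of what this check and the ‘leakage’ column report, the pandas sketch below groups identical rows and flags groups that span the training and testing indexes. It is a simplified stand-in for the real check, and the train-test split below is assumed for illustration.

# Illustrative only: detect duplicated samples and flag potential train/test leakage.
import pandas as pd

df = pd.DataFrame({
    "x1": [1, 2, 1, 3, 2],
    "x2": ["a", "b", "a", "c", "b"],
})
train_idx = {0, 1, 3}   # assumed split for illustration
test_idx = {2, 4}

groups = df.groupby(list(df.columns), dropna=False).groups  # identical rows share a group
for values, idx in groups.items():
    idx = list(idx)
    if len(idx) > 1:  # a duplicated sample
        leakage = any(i in train_idx for i in idx) and any(i in test_idx for i in idx)
        print({"index": idx, "values": values, "leakage": leakage})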
2.5.4. Examples¶
The full example code for this section can be found at the following link.