2.5. Data Quality (Integrity Check)

The data quality module in PiML is designed to identify issues related to data. It comprises three submodules: data integrity check, outlier detection, and train-test data drift detection. This article focuses on introducing the data integrity check, a critical test ensuring data accuracy by validating that data values align with the anticipated format, range, and type.

The data integrity check can be run once data is loaded; all columns except those removed in the exp.data_summary module are tested. Three types of data integrity checks are provided: single-column checks, duplicated samples, and highly correlated features. Most of the checks in this submodule are built on the deepchecks Python package.

2.5.1. Single-column Checks

This set of integrity tests analyzes each column individually. It can be run with the following command:

exp.data_quality(show="integrity_single_column_check")
            Is Single Value  Null Ratio  Mixed Data Types                Long String   Special Characters         New Categories
                                         Mixed  Categorical : Numerical  Num  Index    Ratio     Example Samples  Num  Example Samples
season      False            0.000000    False  -                        0    []       0.000000  []               0    []
hr          False            0.000000    False  -                        0    []       0.000000  []               -    -
holiday     False            0.000000    False  -                        0    []       0.000000  []               0    []
weekday     False            0.000000    False  -                        0    []       0.000000  []               -    -
workingday  False            0.000000    False  -                        0    []       0.000000  []               0    []
weathersit  False            0.000000    False  -                        0    []       0.000000  []               0    []
atemp       False            0.000000    False  -                        0    []       0.000000  []               -    -
hum         False            0.000000    False  -                        0    []       0.000000  []               -    -
windspeed   False            0.000000    False  -                        0    []       0.000000  []               -    -
cnt         False            0.000000    False  -                        0    []       0.000000  []               -    -

The output of this method is a table that includes the following columns:

  • ‘Is Single Value’: indicates whether the column contains only a single unique value across all rows.

  • ‘Null Ratio’: calculates the ratio of ‘null’ or ‘nan’ values in each column.

  • ‘Mixed Data Types’: detects columns containing a mixture of numerical and string values.

    • ‘Mixed’: an indicator of whether the column includes mixed data types.

    • ‘Categorical : Numerical’: when ‘Mixed’ is True, it indicates the ratio of categorical data to numerical data.

  • ‘Long String’: identifies strings with lengths significantly longer than the expected “normal” string lengths.

    • ‘Num’: indicates the number of long string samples.

    • ‘Index’: specifies the sample index of the long string samples.

  • ‘Special Characters’: this test checks for the presence of special characters in each column.

    • ‘Ratio’: represents the proportion of samples flagged for special characters among all samples.

    • ‘Example Samples’: displays the top two examples of special characters.

  • ‘New Categories’: identifies new categories in the test set. Note that this functionality only works after exp.data_prepare has been executed.

    • ‘Num’: indicates the number of samples belonging to new categories.

    • ‘Example Samples’: lists new categories present in the test dataset.
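
As a rough illustration of what the first three columns of this table report, the sketch below computes them for one column given as a plain Python list. It is not the PiML or deepchecks implementation: the helper names `is_null` and `single_column_summary` are hypothetical, and ignoring missing values when counting unique values is an assumption made here for simplicity.

```python
import math


def is_null(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def single_column_summary(values):
    """Compute 'Is Single Value', 'Null Ratio', and 'Mixed' for one column.

    `values` is a plain list holding the raw entries of a single column.
    """
    present = [v for v in values if not is_null(v)]
    # Count string entries and numeric entries (bool is excluded on purpose).
    n_str = sum(isinstance(v, str) for v in present)
    n_num = sum(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return {
        # True when all non-missing entries share one unique value.
        "is_single_value": len(set(present)) == 1,
        # Share of missing entries among all entries.
        "null_ratio": (len(values) - len(present)) / len(values) if values else 0.0,
        # True when the column holds both strings and numbers.
        "mixed": n_str > 0 and n_num > 0,
        "n_categorical": n_str,
        "n_numerical": n_num,
    }


if __name__ == "__main__":
    print(single_column_summary([1, 1, None, 1]))
    print(single_column_summary([1, "a", 2.0, float("nan")]))
```

In this sketch the ‘Categorical : Numerical’ entry would correspond to comparing `n_categorical` with `n_numerical` when `mixed` is True.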

2.5.2. Duplicated Samples

This method tests for duplicated samples, that is, samples whose values are identical. See the usage and results below.

exp.data_quality(show="integrity_duplicated_samples")
                Leakage  season  hr   holiday  weekday  workingday  weathersit  atemp   hum   windspeed  cnt
[1507, 9867]    False    1.0     4.0  0.0      2.0      1.0         1.0         0.2727  0.64  0.0000     2.0
[9822, 17336]   False    1.0     5.0  0.0      0.0      0.0         2.0         0.2273  0.48  0.2985     2.0
[13559, 13727]  True     3.0     4.0  0.0      2.0      1.0         1.0         0.6061  0.83  0.0896     6.0
[5598, 14639]   False    3.0     4.0  0.0      5.0      1.0         1.0         0.5606  0.88  0.0000     8.0
[7958, 8126]    False    4.0     6.0  0.0      6.0      0.0         1.0         0.2576  0.65  0.1045     11.0

The table above is a summary of all detected duplicated samples. Each row shows a duplicated sample. If no duplication is found, then this table will be empty. The index column shows the duplicated sample indexes (the raw data index). For example, the first row has an index [1507, 9867], which means that the samples with index 1507 and index 9867 are the same, and their feature values are listed in the table.

Once exp.data_prepare has been executed, the table includes an additional column called ‘Leakage’. If the same sample is found in both the training and testing sets, the value of ‘Leakage’ is set to True, indicating the presence of data leakage. Otherwise, it is set to False.
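
The grouping and leakage logic described above can be sketched in plain Python. This is a minimal illustration, not PiML's implementation: the function name `find_duplicates` and the `test_indices` argument are hypothetical, and rows are assumed to be tuples of hashable values without missing entries.

```python
from collections import defaultdict


def find_duplicates(rows, test_indices=None):
    """Group identical rows and optionally flag train/test leakage.

    rows         : list of tuples, one tuple of feature values per sample.
    test_indices : optional iterable of row indexes that belong to the test
                   set; all other rows are treated as training samples.
    """
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[tuple(row)].append(idx)

    test_set = set(test_indices) if test_indices is not None else None
    report = []
    for row, idxs in groups.items():
        if len(idxs) < 2:
            continue  # the row is unique, so there is nothing to report
        entry = {"index": idxs, "values": row}
        if test_set is not None:
            in_test = [i in test_set for i in idxs]
            # Leakage: the same row appears in both the train and test sets.
            entry["leakage"] = any(in_test) and not all(in_test)
        report.append(entry)
    return report


if __name__ == "__main__":
    rows = [(1.0, 4.0), (2.0, 5.0), (1.0, 4.0), (2.0, 5.0), (3.0, 6.0)]
    for entry in find_duplicates(rows, test_indices=[2, 4]):
        print(entry)
```

When `test_indices` is omitted, no leakage flag is produced, which mirrors the behavior described for the table before the train-test split exists.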

2.5.3. Highly Correlated Features

This is a test for detecting highly correlated features. Depending on the data types of each pair of features, different correlation methods are used:

  • numerical-numerical: Spearman’s correlation coefficient is used to assess the strength and direction of the monotonic relationship between two numerical variables. For more information, please refer to feature_select.

  • numerical-categorical: The correlation ratio is used to measure the level of correlation between a numerical variable and a categorical variable. It ranges from 0 to 1, where 0 indicates no relationship and 1 indicates a perfect relationship. Let each observation be \(y_{xi}\), where \(x\) denotes the category and \(i\) indexes the observations within that category. Let \(n_x\) be the number of observations in category \(x\), \(\overline {y}_{x}\) the mean of category \(x\), and \(\overline {y}\) the mean of the whole population. The formula of the correlation ratio \(\eta\) is

\[\begin{aligned} \eta = \sqrt{ {\frac {\sum _{x}n_{x}(\overline {y}_{x}-\overline {y})^{2}}{\sum _{{x,i}}(y_{{xi}}-\overline {y})^{2}}}}. \end{aligned}\]

The numerator is the between-group variability and the denominator is the total variability. In other words, the ratio under the square root, \(\eta^2\), measures the proportion of variance in the numerical variable that can be explained by the categorical variable.

  • categorical-categorical: The symmetric version of Theil’s U (also known as the uncertainty coefficient) is used to evaluate how well one categorical variable explains another. Theil’s U itself is an asymmetric, entropy-based measure, defined in each direction as

\[\begin{aligned} U(X|Y) = \frac{H(X) - H(X|Y)}{H(X)}, \end{aligned}\]
\[\begin{aligned} U(Y|X) = \frac{H(Y) - H(Y|X)}{H(Y)}, \end{aligned}\]

where \(H(X)\) and \(H(Y)\) are the entropies of variables \(X\) and \(Y\), respectively, \(H(X|Y)\) is the conditional entropy of \(X\) given \(Y\), and \(H(Y|X)\) is the conditional entropy of \(Y\) given \(X\). The uncertainty coefficient is not symmetric with respect to the roles of \(X\) and \(Y\): in general, \(U(X|Y) \neq U(Y|X)\). A symmetric measure is therefore defined as the entropy-weighted average of the two:

\[\begin{split}\begin{aligned} U(X, Y) & =\frac{H(X) U(X \mid Y)+H(Y) U(Y \mid X)}{H(X)+H(Y)} \\ & =2\left[\frac{H(X)+H(Y)-H(X, Y)}{H(X)+H(Y)}\right] . \end{aligned}\end{split}\]

Here \(H(X, Y)\) is the joint entropy of \(X\) and \(Y\). The output values lie in [0, 1], where 0 means that neither variable carries any information about the other, and 1 means that each variable is completely determined by the other.
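
To make the two formulas above concrete, here is a small standard-library sketch of the correlation ratio and the symmetric Theil’s U. It is an illustration under stated assumptions rather than the code PiML runs: the function names are hypothetical, inputs are plain Python lists of equal length, and returning 0.0 when a denominator is zero (a constant variable) is a convention chosen for this sketch.

```python
import math
from collections import Counter, defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio eta between a categorical and a numerical variable."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    overall_mean = sum(values) / len(values)
    # Between-group variability: sum_x n_x * (mean_x - overall_mean)^2
    between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    # Total variability: sum over all observations of (y - overall_mean)^2
    total = sum((v - overall_mean) ** 2 for v in values)
    return math.sqrt(between / total) if total > 0 else 0.0


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def symmetric_theils_u(x, y):
    """Symmetric uncertainty 2 * [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)]."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))  # joint entropy of the paired labels
    if h_x + h_y == 0:
        return 0.0  # both variables are constant (convention of this sketch)
    return 2.0 * (h_x + h_y - h_xy) / (h_x + h_y)


if __name__ == "__main__":
    print(correlation_ratio(["a", "a", "b", "b"], [1.0, 1.0, 3.0, 3.0]))
    print(symmetric_theils_u(["a", "a", "b", "b"], ["c", "d", "c", "d"]))
```

The first call uses a numerical variable that is fully determined by the category, and the second uses two categorical variables that are independent in the sample, so they exercise the two ends of the [0, 1] range.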

exp.data_quality(show="integrity_highly_correlated_features")
            season  hr     holiday  weekday  workingday  weathersit  atemp  hum    windspeed  cnt
season      1.00    0.01   0.00     0.01     0.00        0.00        0.77   0.16   0.16       0.26
hr          0.01    1.00   0.00     -0.00    0.00        0.05        0.13   -0.28  0.14       0.51
holiday     0.00    0.00   1.00     0.10     0.09        0.00        0.03   0.01   0.00       0.03
weekday     0.01    -0.00  0.10     1.00     0.04        0.00        -0.01  -0.04  0.01       0.03
workingday  0.00    0.00   0.09     0.04     1.00        0.00        0.05   0.02   0.01       0.03
weathersit  0.00    0.05   0.00     0.00     0.00        1.00        0.11   0.42   0.08       0.15
atemp       0.77    0.13   0.03     -0.01    0.05        0.11        1.00   -0.05  -0.04      0.42
hum         0.16    -0.28  0.01     -0.04    0.02        0.42        -0.05  1.00   -0.29      -0.36
windspeed   0.16    0.14   0.00     0.01     0.01        0.08        -0.04  -0.29  1.00       0.13
cnt         0.26    0.51   0.03     0.03     0.03        0.15        0.42   -0.36  0.13       1.00
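
One way to read such a matrix is to list the feature pairs whose absolute correlation exceeds a cutoff. The sketch below does this for a four-feature subset of the values shown above. The helper `highly_correlated_pairs` and the 0.5 threshold are hypothetical choices for illustration; the source does not state which cutoff, if any, PiML applies.

```python
def highly_correlated_pairs(names, matrix, threshold):
    """Return (feature_a, feature_b, corr) for pairs with |corr| >= threshold.

    Only the upper triangle is scanned, so each pair is reported once and the
    diagonal (self-correlation) is skipped.
    """
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(matrix[i][j]) >= threshold:
                pairs.append((names[i], names[j], matrix[i][j]))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


if __name__ == "__main__":
    # Subset of the correlation matrix shown above.
    names = ["season", "hr", "atemp", "cnt"]
    matrix = [
        [1.00, 0.01, 0.77, 0.26],
        [0.01, 1.00, 0.13, 0.51],
        [0.77, 0.13, 1.00, 0.42],
        [0.26, 0.51, 0.42, 1.00],
    ]
    # threshold=0.5 is an arbitrary, illustrative cutoff.
    print(highly_correlated_pairs(names, matrix, threshold=0.5))
    # [('season', 'atemp', 0.77), ('hr', 'cnt', 0.51)]
```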

2.5.4. Examples

The full example code for this section can be found at the following link.