2.2. Data Summary

Data summary reports basic statistics for each feature and lets you set feature meta-information. Once a dataset is loaded in PiML, this function gives an overview of the data grouped by feature type. It also lets you change feature types and remove specific features.

2.2.1. Summary Statistics

The summary statistics are presented in two panels, one for numerical features and one for categorical features. Each feature's type is determined automatically from its data type and its number of unique values. For instance, if the data type is a string or the number of unique values is less than 5, the feature is treated as categorical; otherwise, it is treated as numerical.
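
The rule described above can be sketched with pandas. This is a minimal illustration, not PiML's actual implementation: the helper name infer_feature_type, its max_categories parameter, and the toy DataFrame are all assumptions made for this example.

```python
import pandas as pd

def infer_feature_type(series: pd.Series, max_categories: int = 5) -> str:
    """Return "categorical" or "numerical" following the rule described above.

    Hypothetical helper: PiML's internal logic may differ in its details.
    """
    # Non-numeric columns (for example strings) are treated as categorical.
    if not pd.api.types.is_numeric_dtype(series):
        return "categorical"
    # Numeric columns with fewer than `max_categories` distinct values
    # (missing values ignored) are also treated as categorical.
    if series.nunique(dropna=True) < max_categories:
        return "categorical"
    return "numerical"

# Toy data: `city` is an invented column used only to show the string case.
df = pd.DataFrame({
    "season": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "weekday": [0, 1, 2, 3, 4, 5, 6, 0],
    "city": ["a", "b", "c", "d", "e", "f", "g", "h"],
})
inferred = {col: infer_feature_type(df[col]) for col in df.columns}
print(inferred)
```

Under this sketch, season (4 unique values) and city (strings) come out as categorical, while weekday (7 unique values) comes out as numerical.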

2.2.1.1. Numerical Features

The following summary statistics are provided for numerical features:

  • n_missing: Number of missing values

  • mean: Mean

  • std: Standard deviation

  • min: Minimum value

  • q1: First quartile

  • median: Median

  • q3: Third quartile

  • max: Maximum value
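
The statistics listed above can be reproduced approximately with pandas. The sketch below is an illustration only: summarize_numerical is a hypothetical helper, and the choices of sample standard deviation (ddof=1) and linear quantile interpolation are pandas defaults assumed here, not confirmed PiML behavior.

```python
import pandas as pd

def summarize_numerical(df: pd.DataFrame) -> pd.DataFrame:
    """Build one summary row per numerical column (hypothetical helper)."""
    rows = []
    for name in df.columns:
        col = df[name]
        rows.append({
            "name": name,
            "n_missing": int(col.isna().sum()),
            "mean": col.mean(),
            "std": col.std(),            # pandas default: sample std (ddof=1)
            "min": col.min(),
            "q1": col.quantile(0.25),    # pandas default: linear interpolation
            "median": col.median(),
            "q3": col.quantile(0.75),
            "max": col.max(),
        })
    return pd.DataFrame(rows)

# Toy column with one missing value; pandas skips NaN in every statistic.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, None]})
num_summary = summarize_numerical(toy)
print(num_summary)
```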

2.2.1.2. Categorical Features

The following summary statistics are provided for categorical features:

  • n_missing: Number of missing values

  • n_unique: Number of unique values

  • top1: The most frequent category and its count

  • top2: The second most frequent category and its count

  • top3: The third most frequent category and its count

  • n_others: Number of samples that fall outside the top 3 categories
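
As a rough illustration of how these fields relate to each other, the sketch below computes them with pandas. The helper summarize_categorical and the toy series are assumptions for this example; the "category : count" formatting and the 0 placeholder for absent ranks simply mirror the output shown in this section.

```python
import pandas as pd

def summarize_categorical(col: pd.Series, top_k: int = 3) -> dict:
    """Summarize one categorical column (hypothetical helper, not PiML code)."""
    # Counts of non-missing values, most frequent first.
    counts = col.value_counts(dropna=True)
    top = counts.head(top_k)
    summary = {
        "name": col.name,
        "n_missing": int(col.isna().sum()),
        "n_unique": int(len(counts)),
    }
    for rank in range(top_k):
        if rank < len(top):
            summary[f"top{rank + 1}"] = f"{top.index[rank]} : {int(top.iloc[rank])}"
        else:
            # Fewer than `top_k` categories: use 0 as a placeholder.
            summary[f"top{rank + 1}"] = 0
    # Samples that belong to none of the top categories.
    summary["n_others"] = int(counts.sum() - top.sum())
    return summary

# Invented toy data with four categories and one missing value.
weather = pd.Series(
    ["clear"] * 4 + ["mist"] * 3 + ["rain"] * 2 + ["storm"] + [None],
    name="weather",
)
weather_summary = summarize_categorical(weather)
binary_summary = summarize_categorical(pd.Series(["no", "no", "yes"], name="flag"))
print(weather_summary)
print(binary_summary)
```

The same arithmetic can be checked against the output in this section: for season, 17379 - 4496 - 4409 - 4242 = 4232, which matches the reported n_others.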

The data summary module can be called using the function exp.data_summary.

exp.data_summary(feature_exclude=[], feature_type={})
Numerical Attributes
   name       n_missing  mean      std       min     q1       median    q3        max
0  mnth       0          6.5378    3.4388    1.0000  4.0000   7.0000    10.0000   12.0000
1  hr         0          11.5468   6.9144    0.0000  6.0000   12.0000   18.0000   23.0000
2  weekday    0          3.0037    2.0058    0.0000  1.0000   3.0000    5.0000    6.0000
3  temp       0          0.4970    0.1926    0.0200  0.3400   0.5000    0.6600    1.0000
4  atemp      0          0.4758    0.1719    0.0000  0.3333   0.4848    0.6212    1.0000
5  hum        0          0.6272    0.1929    0.0000  0.4800   0.6300    0.7800    1.0000
6  windspeed  0          0.1901    0.1223    0.0000  0.1045   0.1940    0.2537    0.8507
7  cnt        0          189.4631  181.3876  1.0000  40.0000  142.0000  281.0000  977.0000

Categorical Attributes
   name        n_missing  n_unique  top1         top2        top3        n_others
0  season      0          4         3.0 : 4496   2.0 : 4409  1.0 : 4242  4232
1  yr          0          2         1.0 : 8734   0.0 : 8645  0           0
2  holiday     0          2         0.0 : 16879  1.0 : 500   0           0
3  workingday  0          2         1.0 : 11865  0.0 : 5514  0           0
4  weathersit  0          4         1.0 : 11413  2.0 : 4544  3.0 : 1419  3

Data Shape:(17379, 13)

2.2.2. Feature Manipulation

In addition to summary statistics, this function lets you manipulate features: you can remove features from the dataset and override the automatically assigned feature types.

2.2.2.1. Remove Features

In the following example, we remove three features: yr, mnth, and temp. The feature_exclude parameter specifies the features to remove. Feature names are case-sensitive and must be passed as a list.

exp.data_summary(feature_exclude=["yr", "mnth", "temp"])
Numerical Attributes
   name       n_missing  mean      std       min     q1       median    q3        max
0  mnth       0          6.5378    3.4388    1.0000  4.0000   7.0000    10.0000   12.0000
1  hr         0          11.5468   6.9144    0.0000  6.0000   12.0000   18.0000   23.0000
2  weekday    0          3.0037    2.0058    0.0000  1.0000   3.0000    5.0000    6.0000
3  temp       0          0.4970    0.1926    0.0200  0.3400   0.5000    0.6600    1.0000
4  atemp      0          0.4758    0.1719    0.0000  0.3333   0.4848    0.6212    1.0000
5  hum        0          0.6272    0.1929    0.0000  0.4800   0.6300    0.7800    1.0000
6  windspeed  0          0.1901    0.1223    0.0000  0.1045   0.1940    0.2537    0.8507
7  cnt        0          189.4631  181.3876  1.0000  40.0000  142.0000  281.0000  977.0000

Categorical Attributes
   name        n_missing  n_unique  top1         top2        top3        n_others
0  season      0          4         3.0 : 4496   2.0 : 4409  1.0 : 4242  4232
1  yr          0          2         1.0 : 8734   0.0 : 8645  0           0
2  holiday     0          2         0.0 : 16879  1.0 : 500   0           0
3  workingday  0          2         1.0 : 11865  0.0 : 5514  0           0
4  weathersit  0          4         1.0 : 11413  2.0 : 4544  3.0 : 1419  3

Data Shape:(17379, 13)
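
Because feature names are case-sensitive, it can help to check a list against the dataset's column names before passing it as feature_exclude. The sketch below is one possible pre-check in plain Python; check_feature_names is a hypothetical helper and not part of PiML, and the column list is assembled from the feature names shown in the summary output above (its order is an assumption).

```python
def check_feature_names(requested, available):
    """Split `requested` into exact matches and names that do not match.

    Hypothetical helper: returns (valid, problems), where `problems` maps each
    unmatched name to a column that differs only by letter case, or to None.
    """
    available_set = set(available)
    lowered = {name.lower(): name for name in available}
    valid, problems = [], {}
    for name in requested:
        if name in available_set:
            valid.append(name)
        else:
            problems[name] = lowered.get(name.lower())
    return valid, problems

# Column names taken from the summary tables in this section.
columns = ["season", "yr", "mnth", "hr", "holiday", "weekday", "workingday",
           "weathersit", "temp", "atemp", "hum", "windspeed", "cnt"]

valid, problems = check_feature_names(["yr", "Mnth", "temperature"], columns)
print(valid)     # names that match a column exactly
print(problems)  # unmatched names with a case-insensitive suggestion, if any
```

Here "Mnth" is flagged with the suggestion "mnth", and "temperature" is flagged with no suggestion, so only "yr" would be safe to pass through unchanged.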

2.2.2.2. Change Feature Types

Instead of relying solely on automatic type detection, you can set feature types manually with the feature_type parameter, a dictionary that maps feature names to types. The available types are “numerical” and “categorical”. For instance, to treat the weekday feature as categorical, you can use the following example:

exp.data_summary(feature_exclude=["yr", "mnth", "temp"], feature_type={"weekday": "categorical"})
Numerical Attributes
   name       n_missing  mean      std       min     q1       median    q3        max
0  mnth       0          6.5378    3.4388    1.0000  4.0000   7.0000    10.0000   12.0000
1  hr         0          11.5468   6.9144    0.0000  6.0000   12.0000   18.0000   23.0000
2  temp       0          0.4970    0.1926    0.0200  0.3400   0.5000    0.6600    1.0000
3  atemp      0          0.4758    0.1719    0.0000  0.3333   0.4848    0.6212    1.0000
4  hum        0          0.6272    0.1929    0.0000  0.4800   0.6300    0.7800    1.0000
5  windspeed  0          0.1901    0.1223    0.0000  0.1045   0.1940    0.2537    0.8507
6  cnt        0          189.4631  181.3876  1.0000  40.0000  142.0000  281.0000  977.0000

Categorical Attributes
   name        n_missing  n_unique  top1         top2        top3        n_others
0  season      0          4         3.0 : 4496   2.0 : 4409  1.0 : 4242  4232
1  yr          0          2         1.0 : 8734   0.0 : 8645  0           0
2  holiday     0          2         0.0 : 16879  1.0 : 500   0           0
3  weekday     0          7         6.0 : 2512   0.0 : 2502  5.0 : 2487  9878
4  workingday  0          2         1.0 : 11865  0.0 : 5514  0           0
5  weathersit  0          4         1.0 : 11413  2.0 : 4544  3.0 : 1419  3

Data Shape:(17379, 13)

By explicitly setting the feature type, you have greater control over how the data is categorized and can ensure it aligns with your specific requirements.
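
Conceptually, a manual setting takes precedence over the automatically inferred type for the named feature and leaves all other features unchanged. The sketch below illustrates that idea in plain Python; apply_type_overrides and ALLOWED_TYPES are hypothetical names for this example, and the validation it performs is an assumption rather than documented PiML behavior.

```python
ALLOWED_TYPES = {"numerical", "categorical"}

def apply_type_overrides(inferred: dict, overrides: dict) -> dict:
    """Return a copy of `inferred` with user-specified types applied.

    Hypothetical helper: rejects unknown feature names and unsupported types.
    """
    resolved = dict(inferred)  # leave the caller's dictionary untouched
    for name, kind in overrides.items():
        if name not in inferred:
            raise KeyError(f"unknown feature: {name!r}")
        if kind not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type {kind!r} for feature {name!r}")
        resolved[name] = kind
    return resolved

# Automatically inferred types for three features from this section.
inferred_types = {"hr": "numerical", "weekday": "numerical", "season": "categorical"}
resolved_types = apply_type_overrides(inferred_types, {"weekday": "categorical"})
print(resolved_types)
```

In this sketch only weekday changes, which parallels the output above where weekday moves from the numerical panel to the categorical panel.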

2.2.3. Examples

The full example code for this section can be found at the following link.