2.2. Data Summary
Data summary reports basic statistics for the dataset and sets meta-information for its features. Once the dataset is loaded in PiML, this function gives an overview of the data grouped by feature type. It also lets you change feature types and remove specific features.
2.2.1. Summary Statistics
The summary statistics are presented in two panels: one for numerical features and one for categorical features. Each feature's type is determined automatically from its data type and its number of unique values. For instance, if the data type is a string or the number of unique values is less than 5, the feature is treated as categorical (which is why numerically coded features such as season appear in the categorical panel of the example output in this section). Otherwise, it is treated as numerical.
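The type-inference rule can be sketched in pandas as follows. This is an illustration, not PiML's actual implementation: the helper name `infer_feature_type` is hypothetical, and the sketch assumes that either condition alone (string data type, or fewer than 5 unique values) is enough to make a feature categorical, which is consistent with numerically coded columns such as season showing up in the categorical panel.

```python
import pandas as pd


def infer_feature_type(series: pd.Series, max_unique: int = 5) -> str:
    """Hypothetical helper: guess a feature type from dtype and cardinality."""
    is_string = pd.api.types.is_string_dtype(series) or series.dtype == object
    few_levels = series.nunique(dropna=True) < max_unique
    return "categorical" if is_string or few_levels else "numerical"


df = pd.DataFrame({
    "season": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0],  # numeric, 4 unique values
    "hr": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],      # numeric, 6 unique values
    "city": ["a", "b", "c", "d", "e", "f"],    # strings
})
types = {col: infer_feature_type(df[col]) for col in df.columns}
print(types)
```

When the automatic guess does not suit a feature, the type can be overridden explicitly, as described under Feature Manipulation.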
2.2.1.1. Numerical Features
The following summary statistics are provided:
n_missing: Number of missing values
mean: Mean
std: Standard deviation
min: Minimum value
q1: First quartile
median: Median
q3: Third quartile
max: Maximum value
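These statistics correspond to standard pandas operations. The sketch below is not PiML's implementation: `numerical_summary` is a hypothetical helper, and it relies on pandas defaults (sample standard deviation, linearly interpolated quantiles), which may differ from what PiML computes internally.

```python
import pandas as pd


def numerical_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row of summary statistics per numerical column."""
    rows = []
    for name in df.columns:
        col = df[name]
        rows.append({
            "name": name,
            "n_missing": int(col.isna().sum()),
            "mean": col.mean(),
            "std": col.std(),           # sample standard deviation (ddof=1)
            "min": col.min(),
            "q1": col.quantile(0.25),   # first quartile
            "median": col.median(),
            "q3": col.quantile(0.75),   # third quartile
            "max": col.max(),
        })
    return pd.DataFrame(rows)


demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, None]})
summary = numerical_summary(demo)
print(summary)
```

Missing values are counted in n_missing and skipped by the remaining statistics, which is the pandas default behavior.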
2.2.1.2. Categorical Features
The following summary statistics are provided for categorical features:
n_missing: Number of missing values
n_unique: Number of unique values
top1: The highest frequency category
top2: The second highest frequency category
top3: The third highest frequency category
n_others: The number of samples other than the top 3
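The categorical statistics can be approximated with a frequency count. The sketch below is an assumption-laden illustration rather than PiML's code: `categorical_summary` is a hypothetical helper, the "category : count" formatting copies what the example output in this section displays, and padding missing top entries with "0" mirrors how features with fewer than three categories appear there.

```python
import pandas as pd


def categorical_summary(col: pd.Series) -> dict:
    """Hypothetical helper: frequency-based summary of one categorical column."""
    counts = col.value_counts()        # sorted by frequency, NaN excluded
    top = [f"{level} : {count}" for level, count in counts.head(3).items()]
    top += ["0"] * (3 - len(top))      # pad when fewer than 3 categories exist
    return {
        "name": col.name,
        "n_missing": int(col.isna().sum()),
        "n_unique": int(col.nunique()),
        "top1": top[0],
        "top2": top[1],
        "top3": top[2],
        "n_others": int(counts.iloc[3:].sum()),  # samples outside the top 3
    }


demo = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"], name="grade")
result = categorical_summary(demo)
print(result)
```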
The data summary module can be called using the function exp.data_summary.
exp.data_summary(feature_exclude=[], feature_type={})
| | name | n_missing | mean | std | min | q1 | median | q3 | max |
---|---|---|---|---|---|---|---|---|---|
0 | mnth | 0 | 6.5378 | 3.4388 | 1.0000 | 4.0000 | 7.0000 | 10.0000 | 12.0000 |
1 | hr | 0 | 11.5468 | 6.9144 | 0.0000 | 6.0000 | 12.0000 | 18.0000 | 23.0000 |
2 | weekday | 0 | 3.0037 | 2.0058 | 0.0000 | 1.0000 | 3.0000 | 5.0000 | 6.0000 |
3 | temp | 0 | 0.4970 | 0.1926 | 0.0200 | 0.3400 | 0.5000 | 0.6600 | 1.0000 |
4 | atemp | 0 | 0.4758 | 0.1719 | 0.0000 | 0.3333 | 0.4848 | 0.6212 | 1.0000 |
5 | hum | 0 | 0.6272 | 0.1929 | 0.0000 | 0.4800 | 0.6300 | 0.7800 | 1.0000 |
6 | windspeed | 0 | 0.1901 | 0.1223 | 0.0000 | 0.1045 | 0.1940 | 0.2537 | 0.8507 |
7 | cnt | 0 | 189.4631 | 181.3876 | 1.0000 | 40.0000 | 142.0000 | 281.0000 | 977.0000 |
| | name | n_missing | n_unique | top1 | top2 | top3 | n_others |
---|---|---|---|---|---|---|---|
0 | season | 0 | 4 | 3.0 : 4496 | 2.0 : 4409 | 1.0 : 4242 | 4232 |
1 | yr | 0 | 2 | 1.0 : 8734 | 0.0 : 8645 | 0 | 0 |
2 | holiday | 0 | 2 | 0.0 : 16879 | 1.0 : 500 | 0 | 0 |
3 | workingday | 0 | 2 | 1.0 : 11865 | 0.0 : 5514 | 0 | 0 |
4 | weathersit | 0 | 4 | 1.0 : 11413 | 2.0 : 4544 | 3.0 : 1419 | 3 |
Data Shape:(17379, 13)
2.2.2. Feature Manipulation
Besides reporting summary statistics, this function lets you manipulate features: you can remove features from the dataset and override the automatically assigned feature types.
2.2.2.1. Remove Features
In the following example, we remove three features: yr, mnth, and temp. The feature_exclude parameter specifies the features to be removed. Feature names are case-sensitive and must be passed as a list.
exp.data_summary(feature_exclude=["yr", "mnth", "temp"])
| | name | n_missing | mean | std | min | q1 | median | q3 | max |
---|---|---|---|---|---|---|---|---|---|
0 | mnth | 0 | 6.5378 | 3.4388 | 1.0000 | 4.0000 | 7.0000 | 10.0000 | 12.0000 |
1 | hr | 0 | 11.5468 | 6.9144 | 0.0000 | 6.0000 | 12.0000 | 18.0000 | 23.0000 |
2 | weekday | 0 | 3.0037 | 2.0058 | 0.0000 | 1.0000 | 3.0000 | 5.0000 | 6.0000 |
3 | temp | 0 | 0.4970 | 0.1926 | 0.0200 | 0.3400 | 0.5000 | 0.6600 | 1.0000 |
4 | atemp | 0 | 0.4758 | 0.1719 | 0.0000 | 0.3333 | 0.4848 | 0.6212 | 1.0000 |
5 | hum | 0 | 0.6272 | 0.1929 | 0.0000 | 0.4800 | 0.6300 | 0.7800 | 1.0000 |
6 | windspeed | 0 | 0.1901 | 0.1223 | 0.0000 | 0.1045 | 0.1940 | 0.2537 | 0.8507 |
7 | cnt | 0 | 189.4631 | 181.3876 | 1.0000 | 40.0000 | 142.0000 | 281.0000 | 977.0000 |
| | name | n_missing | n_unique | top1 | top2 | top3 | n_others |
---|---|---|---|---|---|---|---|
0 | season | 0 | 4 | 3.0 : 4496 | 2.0 : 4409 | 1.0 : 4242 | 4232 |
1 | yr | 0 | 2 | 1.0 : 8734 | 0.0 : 8645 | 0 | 0 |
2 | holiday | 0 | 2 | 0.0 : 16879 | 1.0 : 500 | 0 | 0 |
3 | workingday | 0 | 2 | 1.0 : 11865 | 0.0 : 5514 | 0 | 0 |
4 | weathersit | 0 | 4 | 1.0 : 11413 | 2.0 : 4544 | 3.0 : 1419 | 3 |
Data Shape:(17379, 13)
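Because the names in feature_exclude are case-sensitive, it can help to check them against the dataset's column names before calling exp.data_summary. The following is a minimal sketch of such a check; `check_feature_names` is a hypothetical helper and not part of PiML, and the column list is a shortened stand-in for the real dataset.

```python
def check_feature_names(columns, feature_exclude):
    """Hypothetical helper: return the names in feature_exclude that do not
    exactly (case-sensitively) match any column name."""
    known = set(columns)
    return [name for name in feature_exclude if name not in known]


columns = ["season", "yr", "mnth", "hr", "temp", "cnt"]
unknown = check_feature_names(columns, ["yr", "Mnth", "temp"])
print(unknown)  # ['Mnth'] because the real column is spelled "mnth"
```

An empty result means every name matches a column exactly, so the list can be passed to feature_exclude as written.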
2.2.2.2. Change Feature Types
Instead of relying solely on automatic feature type determination, you can set the feature type manually using the feature_type parameter, which maps feature names to types. The available types are “numerical” and “categorical”. For instance, to treat the weekday feature as categorical, use the following example:
exp.data_summary(feature_exclude=["yr", "mnth", "temp"], feature_type={"weekday": "categorical"})
| | name | n_missing | mean | std | min | q1 | median | q3 | max |
---|---|---|---|---|---|---|---|---|---|
0 | mnth | 0 | 6.5378 | 3.4388 | 1.0000 | 4.0000 | 7.0000 | 10.0000 | 12.0000 |
1 | hr | 0 | 11.5468 | 6.9144 | 0.0000 | 6.0000 | 12.0000 | 18.0000 | 23.0000 |
2 | temp | 0 | 0.4970 | 0.1926 | 0.0200 | 0.3400 | 0.5000 | 0.6600 | 1.0000 |
3 | atemp | 0 | 0.4758 | 0.1719 | 0.0000 | 0.3333 | 0.4848 | 0.6212 | 1.0000 |
4 | hum | 0 | 0.6272 | 0.1929 | 0.0000 | 0.4800 | 0.6300 | 0.7800 | 1.0000 |
5 | windspeed | 0 | 0.1901 | 0.1223 | 0.0000 | 0.1045 | 0.1940 | 0.2537 | 0.8507 |
6 | cnt | 0 | 189.4631 | 181.3876 | 1.0000 | 40.0000 | 142.0000 | 281.0000 | 977.0000 |
| | name | n_missing | n_unique | top1 | top2 | top3 | n_others |
---|---|---|---|---|---|---|---|
0 | season | 0 | 4 | 3.0 : 4496 | 2.0 : 4409 | 1.0 : 4242 | 4232 |
1 | yr | 0 | 2 | 1.0 : 8734 | 0.0 : 8645 | 0 | 0 |
2 | holiday | 0 | 2 | 0.0 : 16879 | 1.0 : 500 | 0 | 0 |
3 | weekday | 0 | 7 | 6.0 : 2512 | 0.0 : 2502 | 5.0 : 2487 | 9878 |
4 | workingday | 0 | 2 | 1.0 : 11865 | 0.0 : 5514 | 0 | 0 |
5 | weathersit | 0 | 4 | 1.0 : 11413 | 2.0 : 4544 | 3.0 : 1419 | 3 |
Data Shape:(17379, 13)
By explicitly setting the feature type, you have greater control over how the data is categorized and can ensure it aligns with your specific requirements.
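To see why the type matters for an integer-coded feature such as weekday, the two views can be contrasted in plain pandas. This is an illustrative sketch on made-up values, independent of PiML: as a numerical feature the column gets a mean, whereas as a categorical feature each code becomes a level with its own frequency.

```python
import pandas as pd

# Made-up day-of-week codes (0 to 6), not taken from the real dataset
weekday = pd.Series([0, 1, 1, 2, 6, 6, 6], name="weekday")

# Treated as numerical: the mean is computable but hard to interpret for day codes
numeric_view = weekday.mean()

# Treated as categorical: frequencies per level are the natural summary
categorical_view = weekday.astype("category").value_counts().sort_index()

print(numeric_view)
print(categorical_view.to_dict())
```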
2.2.3. Examples
The full example code for this section can be found at the following link.