.. Places parent toc into the sidebar

:parenttoc: True

.. include:: ../../includes/big_toc_css.rst

==========================
Data Summary
==========================
Data summary reports basic statistics for each feature and sets feature meta-information. Once the dataset is loaded in PiML, this function gives an overview of the data grouped by feature type. It also allows you to modify feature types and remove specific features.

Summary Statistics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The summary statistics are presented in two separate panels: one for numerical features and another for categorical features. The type of each feature is determined automatically from its data type and its number of unique values. For instance, if the data type is a string and the number of unique values is less than 5, the feature is categorized as categorical. Otherwise, it is regarded as numerical.

Numerical Features
""""""""""""""""""""""""""
The following summary statistics are provided for numerical features:

- n_missing: Number of missing values
- mean: Mean
- std: Standard deviation
- min: Minimum value
- q1: First quartile
- median: Median
- q3: Third quartile
- max: Maximum value

Categorical Features
""""""""""""""""""""""""""
The following summary statistics are provided for categorical features:

- n_missing: Number of missing values
- n_unique: Number of unique values
- top1: The most frequent category
- top2: The second most frequent category
- top3: The third most frequent category
- n_others: The number of samples outside the top 3 categories

The data summary module can be called using the function `exp.data_summary`.

.. jupyter-execute::
   :hide-code:

   from piml import Experiment
   exp = Experiment()
   exp.data_loader(data="BikeSharing", silent=True)

.. jupyter-execute::

   exp.data_summary(feature_exclude=[], feature_type={})

Feature Manipulation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In addition to providing summary statistics, this function lets you manipulate features: you can remove features from the dataset and override the automatically detected feature types.

Remove Features
""""""""""""""""""""""""""""""""""""
In the following example, we remove three features: `yr`, `mnth`, and `temp`. The `feature_exclude` parameter specifies the features to be removed. Feature names are case-sensitive and must be passed as a list.

.. jupyter-execute::

   exp.data_summary(feature_exclude=["yr", "mnth", "temp"])

Change Feature Types
""""""""""""""""""""""""""""""""""""
Instead of relying solely on automatic feature type determination, you can set the feature type manually using the `feature_type` parameter, which maps feature names to types. The available types are "numerical" and "categorical". For instance, the following example sets the type of the `weekday` feature to categorical:

.. jupyter-execute::

   exp.data_summary(feature_exclude=["yr", "mnth", "temp"], feature_type={"weekday": "categorical"})

Setting the feature type explicitly gives you direct control over how each feature is treated in the summary.

Examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The full example code for this section can be found at the following link.

.. topic:: Example

   * :ref:`sphx_glr_auto_examples_0_data_plot_1_data_summary.py`
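
As a rough illustration of what the numerical and categorical summary panels report, the sketch below computes comparable statistics with pandas. It is not PiML's implementation: the helper names ``summarize_numerical`` and ``summarize_categorical`` are hypothetical, and the handling of missing values in ``n_others`` (here, counted over non-missing samples only) is an assumption.

.. code-block:: python

   import pandas as pd


   def summarize_numerical(df):
       """Per-column statistics mirroring the numerical panel (hypothetical helper)."""
       num = df.select_dtypes(include="number")
       return pd.DataFrame({
           "n_missing": num.isna().sum(),
           "mean": num.mean(),
           "std": num.std(),
           "min": num.min(),
           "q1": num.quantile(0.25),
           "median": num.median(),
           "q3": num.quantile(0.75),
           "max": num.max(),
       })


   def summarize_categorical(series):
       """Statistics mirroring the categorical panel (hypothetical helper)."""
       # value_counts drops missing values and sorts by descending frequency.
       counts = series.value_counts()
       top = counts.index[:3].tolist()
       top += [None] * (3 - len(top))  # pad when fewer than 3 categories exist
       return {
           "n_missing": int(series.isna().sum()),
           "n_unique": int(series.nunique()),
           "top1": top[0],
           "top2": top[1],
           "top3": top[2],
           # Assumption: only non-missing samples outside the top 3 are counted.
           "n_others": int(counts.iloc[3:].sum()),
       }

For example, ``summarize_numerical(pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, None]}))`` returns one row for ``x`` with one missing value and a mean of 2.5.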