.. Places parent toc into the sidebar

:parenttoc: True

.. include:: ../../includes/big_toc_css.rst

=================================
Hstats (Friedman's H-statistic)
=================================

The H-statistic measures the interaction strength of two features [Friedman2008]_.

Algorithm Details
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Consider a set of features, represented by :math:`X`, and a fitted model, represented by :math:`\hat{f}`. The H-statistic is defined based on partial dependence, as follows:

.. math::
   \begin{align}
   H_{j k}^2=\frac{\sum_{i=1}^n\left[P D_{j k}\left(x_j^{(i)}, x_k^{(i)}\right)-P D_j\left(x_j^{(i)}\right)-P D_k\left(x_k^{(i)}\right)\right]^2}{\sum_{i=1}^n P D_{j k}^2\left(x_j^{(i)}, x_k^{(i)}\right)}, \tag{1}
   \end{align}

where :math:`j` and :math:`k` are two features in :math:`X`; :math:`x_j^{(i)}` and :math:`x_k^{(i)}` are the values of features :math:`j` and :math:`k` for the :math:`i`-th sample, respectively; and :math:`PD_{jk}(x_j^{(i)}, x_k^{(i)})` is the partial dependence of :math:`\hat{f}` on features :math:`j` and :math:`k` at :math:`(x_j^{(i)}, x_k^{(i)})`.

The H-statistic quantifies the interaction strength between features :math:`j` and :math:`k`: the larger the H-statistic, the stronger the interaction. The H-statistic is symmetric, i.e., :math:`H_{jk}=H_{kj}`.

Usage
^^^^^^^^^^^^^^^^^

The H-statistic can be calculated using PiML's `model_explain` function. The keyword for the H-statistic is "hstats", i.e., we should set `show` = "hstats". Additionally, the following arguments are relevant to this analysis:

- `use_test`: If True, the test data will be used to generate the explanations. Otherwise, the training data will be used. The default value is False.
- `sample_size`: To speed up the computation, we subsample a subset of the data to calculate the partial dependence. The default value is 2000. To use the full data, you can set `sample_size` to be larger than the number of samples in the data.
- `grid_size`: The number of grid points in the partial dependence. The default value is 10.
- `response_method`: For binary classification tasks, the partial dependence is computed by default using the predicted probability instead of log odds. If the model does not have "predict_proba" or `response_method` is set to "decision_function", then the log odds are used as the response.

The following code shows how to calculate the H-statistic of a fitted XGB2 model.

.. jupyter-input::

   exp.model_explain(model="XGB2", show="hstats", sample_size=2000, grid_size=5,
                     figsize=(5, 4))

.. figure:: ../../auto_examples/2_explain/images/sphx_glr_plot_1_pdp_hstats_001.png
   :target: ../../auto_examples/2_explain/plot_1_pdp_hstats.html
   :align: left

The plot above lists the top-10 important interactions. To get the H-statistic of the full list of interactions, we can set `return_data=True`, and the H-statistic of all interactions will be returned as a dataframe.

.. jupyter-input::

   result = exp.model_explain(model="XGB2", show="hstats", sample_size=2000,
                              grid_size=5, return_data=True, figsize=(5, 4))
   result.data
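For intuition about what Equation (1) computes, the H-statistic can also be reproduced directly from empirical partial dependence functions. The sketch below is not part of the PiML API; the toy model `f_hat` and the helper names are illustrative, and the partial dependence functions are mean-centered, as Equation (1) implicitly assumes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "fitted model": an explicit interaction between features 0 and 1,
# and no dependence on feature 2.
def f_hat(X):
    return X[:, 0] + X[:, 1] + 2 * X[:, 0] * X[:, 1]

X = rng.uniform(-1, 1, size=(500, 3))

def pd_1d(j, values):
    # Empirical 1-D partial dependence: average prediction over the data
    # with feature j held fixed at each value.
    out = []
    for v in values:
        Xc = X.copy()
        Xc[:, j] = v
        out.append(f_hat(Xc).mean())
    return np.array(out)

def pd_2d(j, k, vj, vk):
    # Empirical 2-D partial dependence at the paired values (vj, vk).
    out = []
    for a, b in zip(vj, vk):
        Xc = X.copy()
        Xc[:, j] = a
        Xc[:, k] = b
        out.append(f_hat(Xc).mean())
    return np.array(out)

def h_statistic(j, k):
    # Evaluate the PD functions at the observed (x_j, x_k) pairs,
    # mean-center them, then apply Equation (1).
    vj, vk = X[:, j], X[:, k]
    pdj = pd_1d(j, vj)
    pdj -= pdj.mean()
    pdk = pd_1d(k, vk)
    pdk -= pdk.mean()
    pdjk = pd_2d(j, k, vj, vk)
    pdjk -= pdjk.mean()
    num = np.sum((pdjk - pdj - pdk) ** 2)
    den = np.sum(pdjk ** 2)
    return np.sqrt(num / den)

print(h_statistic(0, 1))  # sizeable: features 0 and 1 interact
print(h_statistic(0, 2))  # essentially zero: no interaction
```

Because the toy model contains the term :math:`2 x_0 x_1`, the joint partial dependence of features 0 and 1 is not additive in the two 1-D partial dependence functions, so :math:`H_{01}` is large, while :math:`H_{02}` vanishes.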
Examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. topic:: Example 1: Bike Sharing

   The first example below demonstrates how to use PiML with its high-code APIs for developing machine learning models for the BikeSharing data from the UCI repository, which consists of 17,379 samples of hourly counts of rental bikes in the Capital Bikeshare system. The response `cnt` (hourly bike rental count) is continuous, making this a regression problem.

   * :ref:`sphx_glr_auto_examples_2_explain_plot_1_pdp_hstats.py`

.. topic:: Example 2: SimuCredit

   The second example shows the option to use the test set to generate the explanations.

   * :ref:`sphx_glr_auto_examples_2_explain_plot_6_data_dependent_explain.py`

.. topic:: References

   .. [Friedman2008] Friedman, Jerome H., and Bogdan E. Popescu (2008). Predictive learning via rule ensembles. *The Annals of Applied Statistics*, 2(3), 916-954.