8.2. CaliforniaHousing Data

This example notebook demonstrates how to use PiML in its low-code mode to develop machine learning models for the CaliforniaHousing data, which consists of 20,640 samples and 9 features and is fetched via sklearn.datasets (see the sklearn documentation for details). PiML can load three versions of this data: _raw, _trim1 (trimming only AveOccup), and _trim2 (trimming AveRooms, AveBedrms, Population, and AveOccup). The _trim2 version is used in this example.
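The exact trimming thresholds used by PiML's _trim1 and _trim2 variants are internal to the package, but the idea can be sketched as quantile clipping of a heavily skewed feature. The sketch below uses pandas on a synthetic stand-in for the AveOccup column; the 1st/99th percentile bounds are an assumption, not PiML's actual rule:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for AveOccup, which is strongly right-skewed in the raw data
rng = np.random.default_rng(0)
values = np.append(rng.uniform(1, 5, 995), [200, 350, 500, 800, 1243])
df = pd.DataFrame({"AveOccup": values})

# "Trimming": clip the feature to its 1st-99th percentile range (assumed bounds)
lo, hi = df["AveOccup"].quantile([0.01, 0.99])
df["AveOccup"] = df["AveOccup"].clip(lo, hi)

print(df["AveOccup"].max())  # extreme occupancy outliers are now capped
```

Clipping (rather than dropping rows) keeps the sample count at 20,640 while removing the influence of extreme outliers on model fitting.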

The response MedHouseVal (median house value per block, in log scale) is continuous, making this a regression problem.
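Because the response is modeled on the log scale, predictions must be exponentiated to get back to the original price units. A minimal numpy sketch (the dollar amount is illustrative only):

```python
import numpy as np

median_price = 180_000.0      # illustrative house value in original units
y = np.log(median_price)      # response as the model sees it (log scale)
restored = np.exp(y)          # invert the transform after prediction
print(f"{restored:.0f}")      # back to the original units
```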

Click the ipynb links to run examples in Google Colab.

8.2.1. Load and Prepare Data

[1]:
from piml import Experiment
exp = Experiment()
[3]:
# Choose CaliforniaHousing_trim2
exp.data_loader()
[4]:
exp.data_summary()
[5]:
# Prepare dataset with default settings
exp.data_prepare()
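Under its default settings, exp.data_prepare() partitions the data into training and testing sets (a 20% test ratio with a fixed random seed is assumed here; check the PiML documentation for the actual defaults). An equivalent split can be sketched with scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # toy feature matrix
y = np.arange(50, dtype=float)      # toy continuous target

# 80/20 split with a fixed seed, mirroring the assumed PiML defaults
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))    # 40 training rows, 10 testing rows
```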
[6]:
exp.feature_select()
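PiML's feature selection panel ranks features by their relationship with the response. As a rough stand-alone analog (not PiML's actual method), scikit-learn's univariate F-test selector picks out the informative column in a toy dataset:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 1] + rng.normal(0, 0.1, 200)   # only feature 1 is informative

# Keep the single feature with the strongest univariate F-statistic
selector = SelectKBest(f_regression, k=1).fit(X, y)
print(selector.get_support())   # boolean mask: only feature 1 is selected
```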
[7]:
# Exploratory data analysis, check distribution and correlation
exp.eda()
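The distribution and correlation checks that exp.eda() renders interactively can be approximated with plain pandas. The sketch below uses synthetic columns named after the dataset's features; the generating coefficients are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "MedInc": rng.normal(3.9, 1.9, 500),
    "HouseAge": rng.integers(1, 52, 500).astype(float),
})
# Fabricated response: positively driven by MedInc plus noise
df["MedHouseVal"] = 0.5 * df["MedInc"] + rng.normal(0, 0.5, 500)

print(df.describe())   # marginal distributions (mean, quartiles, range)
print(df.corr())       # Pearson correlation matrix
```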

8.2.2. Train Interpretable Models

[8]:
# First, choose GLM and ReLU-DNN with default settings and click Run;
# then choose only ReLU-DNN and customize it with L1=0.0005. Register the three models.
exp.model_train()
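The two model families can be sketched outside the low-code panel with scikit-learn analogs: a plain linear model standing in for GLM and a small ReLU network standing in for ReLU-DNN. Note this is an approximation, not PiML's implementation, and scikit-learn's MLPRegressor exposes only an L2 penalty (alpha), so the L1=0.0005 setting has no exact counterpart here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 200)

# GLM analog: ordinary least squares
glm = LinearRegression().fit(X, y)

# ReLU-DNN analog: small ReLU network with a weight penalty
# (alpha is an L2 penalty; PiML's ReLU-DNN customization uses L1)
dnn = MLPRegressor(hidden_layer_sizes=(16, 16), alpha=5e-4,
                   max_iter=2000, random_state=0).fit(X, y)

print(glm.score(X, y), dnn.score(X, y))   # in-sample R-squared for each
```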

8.2.3. Interpretability and Explainability

[9]:
# Model-specific inherent interpretation including feature importance, main effects and pairwise interactions.
exp.model_interpret()
[10]:
# Model-agnostic post-hoc explanation by Permutation Feature Importance, PDP (1D and 2D) vs. ALE (1D and 2D), LIME vs. SHAP
exp.model_explain()

8.2.4. Model Diagnostics and Outcome Testing

[11]:
exp.model_diagnose()
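The diagnostics panel reports, among other things, out-of-sample accuracy and residual behavior. A hedged stand-alone version of those two checks with scikit-learn (synthetic data; the 0.2 noise level is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.2, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Held-out accuracy and residual sanity check
pred = model.predict(X_te)
resid = y_te - pred
print("test MSE:", mean_squared_error(y_te, pred))
print("test R2: ", r2_score(y_te, pred))
print("residual mean:", resid.mean())   # should be close to zero
```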

8.2.5. Model Comparison and Benchmarking

[12]:
exp.model_compare()
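Benchmarking registered models side by side amounts to scoring each one under a common protocol. A minimal sketch with scikit-learn cross-validation (the two models and the synthetic linear data are illustrative stand-ins for the registered PiML models):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(0, 0.1, 300)

# Score each candidate with the same 5-fold CV protocol
for name, model in [("GLM", LinearRegression()),
                    ("Tree", DecisionTreeRegressor(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f}")
```

On this linear toy data the GLM analog should score higher; the point is the shared evaluation protocol, not the ranking itself.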