8.2. CaliforniaHousing Data

This example notebook demonstrates how to use PiML in its low-code mode to develop machine learning models for the CaliforniaHousing data, which consists of 20,640 samples and 9 features and is fetched via sklearn.datasets (see the sklearn documentation for details). PiML can load three versions of this data: _raw, _trim1 (trimming only AveOccup), and _trim2 (trimming AveRooms, AveBedrms, Population, and AveOccup). The _trim2 version is used in this example.
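The exact trimming thresholds used by PiML's _trim1 and _trim2 variants are internal to the package, but the idea can be sketched as quantile clipping of a heavily skewed feature. The sketch below uses pandas on a synthetic stand-in for the AveOccup column; the 1st/99th percentile bounds are an assumption, not PiML's actual rule:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for AveOccup, which is strongly right-skewed in the raw data
rng = np.random.default_rng(0)
values = np.append(rng.uniform(1, 5, 995), [200, 350, 500, 800, 1243])
df = pd.DataFrame({"AveOccup": values})

# "Trimming": clip the feature to its 1st-99th percentile range (assumed bounds)
lo, hi = df["AveOccup"].quantile([0.01, 0.99])
df["AveOccup"] = df["AveOccup"].clip(lo, hi)

print(df["AveOccup"].max())  # extreme occupancy outliers are now capped
```

Clipping (rather than dropping rows) keeps the sample count at 20,640 while removing the influence of extreme outliers on model fitting.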

The response MedHouseVal (median house value per block, in log scale) is continuous, making this a regression problem.
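Because the response is modeled on the log scale, predictions must be exponentiated to get back to the original price units. A minimal numpy sketch (the dollar amount is illustrative only):

```python
import numpy as np

median_price = 180_000.0      # illustrative house value in original units
y = np.log(median_price)      # response as the model sees it (log scale)
restored = np.exp(y)          # invert the transform after prediction
print(f"{restored:.0f}")      # back to the original units
```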

Click the ipynb links to run examples in Google Colab.

8.2.1. Load and Prepare Data

[1]:
from piml import Experiment
exp = Experiment()
[3]:
# Choose CaliforniaHousing_trim2
exp.data_loader()
[4]:
exp.data_summary()
[5]:
# Prepare dataset with default settings
exp.data_prepare()
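Under its default settings, exp.data_prepare() partitions the data into training and testing sets (a 20% test ratio with a fixed random seed is assumed here; check the PiML documentation for the actual defaults). An equivalent split can be sketched with scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # toy feature matrix
y = np.arange(50, dtype=float)      # toy continuous target

# 80/20 split with a fixed seed, mirroring the assumed PiML defaults
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))    # 40 training rows, 10 testing rows
```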
[6]:
exp.feature_select()
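PiML's feature selection panel ranks features by their relationship with the response. As a rough stand-alone analog (not PiML's actual method), scikit-learn's univariate F-test selector picks out the informative column in a toy dataset:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 1] + rng.normal(0, 0.1, 200)   # only feature 1 is informative

# Keep the single feature with the strongest univariate F-statistic
selector = SelectKBest(f_regression, k=1).fit(X, y)
print(selector.get_support())   # boolean mask: only feature 1 is selected
```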
[7]:
# Exploratory data analysis, check distribution and correlation
exp.eda()
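The distribution and correlation checks that exp.eda() renders interactively can be approximated with plain pandas. The sketch below uses synthetic columns named after the dataset's features; the generating coefficients are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "MedInc": rng.normal(3.9, 1.9, 500),
    "HouseAge": rng.integers(1, 52, 500).astype(float),
})
# Fabricated response: positively driven by MedInc plus noise
df["MedHouseVal"] = 0.5 * df["MedInc"] + rng.normal(0, 0.5, 500)

print(df.describe())   # marginal distributions (mean, quartiles, range)
print(df.corr())       # Pearson correlation matrix
```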

8.2.2. Train Interpretable Models

[8]:
# First, choose GLM and ReLU-DNN with default settings and click Run;
# then choose only ReLU-DNN and customize it with L1=0.0005. Register the three models.
exp.model_train()
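The two model families can be sketched outside the low-code panel with scikit-learn analogs: a plain linear model standing in for GLM and a small ReLU network standing in for ReLU-DNN. Note this is an approximation, not PiML's implementation, and scikit-learn's MLPRegressor exposes only an L2 penalty (alpha), so the L1=0.0005 setting has no exact counterpart here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 200)

# GLM analog: ordinary least squares
glm = LinearRegression().fit(X, y)

# ReLU-DNN analog: small ReLU network with a weight penalty
# (alpha is an L2 penalty; PiML's ReLU-DNN customization uses L1)
dnn = MLPRegressor(hidden_layer_sizes=(16, 16), alpha=5e-4,
                   max_iter=2000, random_state=0).fit(X, y)

print(glm.score(X, y), dnn.score(X, y))   # in-sample R-squared for each
```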

8.2.3. Interpretability and Explainability

[9]:
# Model-specific inherent interpretation including feature importance, main effects and pairwise interactions.
exp.model_interpret()
[10]:
# Model-agnostic post-hoc explanation by Permutation Feature Importance, PDP (1D and 2D) vs. ALE (1D and 2D), LIME vs. SHAP
exp.model_explain()

8.2.4. Model Diagnostics and Outcome Testing

[11]:
exp.model_diagnose()
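The diagnostics panel reports, among other things, out-of-sample accuracy and residual behavior. A hedged stand-alone version of those two checks with scikit-learn (synthetic data; the 0.2 noise level is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.2, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Held-out accuracy and residual sanity check
pred = model.predict(X_te)
resid = y_te - pred
print("test MSE:", mean_squared_error(y_te, pred))
print("test R2: ", r2_score(y_te, pred))
print("residual mean:", resid.mean())   # should be close to zero
```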

8.2.5. Model Comparison and Benchmarking

[12]:
exp.model_compare()
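Benchmarking registered models side by side amounts to scoring each one under a common protocol. A minimal sketch with scikit-learn cross-validation (the two models and the synthetic linear data are illustrative stand-ins for the registered PiML models):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(0, 0.1, 300)

# Score each candidate with the same 5-fold CV protocol
for name, model in [("GLM", LinearRegression()),
                    ("Tree", DecisionTreeRegressor(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f}")
```

On this linear toy data the GLM analog should score higher; the point is the shared evaluation protocol, not the ranking itself.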