piml.data.outlier_detection.KMeansTree

class piml.data.outlier_detection.KMeansTree(n_components=10, d_reduction_method='pca', alpha=1, max_depth=4, max_leaves=10, min_samples_leaf=50, min_distance=0.2, distance_measure='ReconstErr', distance_measure_param=None, standardization=True, random_state=0)

Recursive unsupervised splitting tree via KMeans (K=2).
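The recursive K=2 splitting idea can be illustrated with a small standalone sketch. This is not the KMeansTree implementation: it leaves out dimension reduction, standardization, and the max_leaves and min_distance checks, and the helper names (two_means, grow) and the deterministic initialization are assumptions made only for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Column-wise mean of a list of points."""
    return [sum(col) / len(col) for col in zip(*points)]


def two_means(points, n_iter=20):
    """Split points into two clusters with Lloyd's algorithm (K=2)."""
    # Deterministic start (an illustrative choice): the point farthest from
    # the overall mean, then the point farthest from that one.
    center = mean(points)
    c0 = max(points, key=lambda p: sq_dist(p, center))
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    left, right = points, []
    for _ in range(n_iter):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break
        c0, c1 = mean(left), mean(right)
    return left, right


def grow(points, depth=0, max_depth=2, min_samples_leaf=2):
    """Return the leaves (lists of points) of a recursive 2-means tree."""
    if depth >= max_depth or len(points) < 2 * min_samples_leaf:
        return [points]
    left, right = two_means(points)
    # Reject the split if either child would be too small.
    if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
        return [points]
    return (grow(left, depth + 1, max_depth, min_samples_leaf)
            + grow(right, depth + 1, max_depth, min_samples_leaf))


# Two well-separated groups of four points each.
points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11]]
leaves = grow(points, max_depth=1)
```

With max_depth=1 the root is split once, so each group ends up in its own leaf. Deeper trees repeat the same split inside every leaf that is still large enough.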

Parameters:
n_components : int, default=10

The number of components in PCA or SparsePCA.

d_reduction_method : {'pca', 'sparse_pca'}, default='pca'

The dimension reduction algorithm.

alpha : float, default=1

The sparsity parameter in SparsePCA.

max_depth : int, default=4

The max depth of the tree.

max_leaves : int, default=10

The max number of leaves.

min_samples_leaf : float, default=50

The minimum number of samples of a leaf node.

min_distance : float, default=0.2

The minimum square root distance for splitting a node.

distance_measure : {'ReconstErr', 'Euclidean', 'PSI', 'KS', 'WD1'} or callable, default='ReconstErr'

The distance measure between two clusters. Five built-in distance measures are provided; you may also pass a callable that calculates the distance between two samples (e.g., one with shape (n_samples1, n_features) and another with shape (n_samples2, n_features)).

distance_measure_param : dict, default=None

The custom parameters of the user-defined distance measure function. Only used when distance_measure is a callable.
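As a rough illustration of a callable distance measure, the sketch below takes two samples with the same number of features and returns a single distance. The function name centroid_distance and its order keyword are hypothetical, and the assumption that entries of distance_measure_param are forwarded as keyword arguments is not confirmed by this page.

```python
def centroid_distance(sample_a, sample_b, order=2):
    """Minkowski distance between the column means of two samples.

    sample_a : rows with shape (n_samples1, n_features)
    sample_b : rows with shape (n_samples2, n_features)
    order    : hypothetical extra parameter, e.g. supplied through
               distance_measure_param={"order": 1}
    """
    mean_a = [sum(col) / len(col) for col in zip(*sample_a)]
    mean_b = [sum(col) / len(col) for col in zip(*sample_b)]
    total = sum(abs(a - b) ** order for a, b in zip(mean_a, mean_b))
    return total ** (1.0 / order)


left = [[0.0, 0.0], [2.0, 0.0]]                     # centroid (1, 0)
right = [[4.0, 4.0], [4.0, 4.0], [4.0, 4.0]]        # centroid (4, 4)

d_euclid = centroid_distance(left, right)           # order=2
d_manhattan = centroid_distance(left, right, order=1)
```

The function only iterates over rows and columns, so it accepts nested lists as well as 2-D NumPy arrays. The two samples may have different numbers of rows.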

standardization : bool, default=True

Whether to standardize covariates before running the algorithm.

random_state : int, default=0

The random seed.

Attributes:
n_features_in_ : int

The number of input features.

is_fitted_ : bool

Indicator of whether the model is fitted.

tree_ : dict

The fitted tree information.

node_count_ : int

The total number of nodes.

leaf_idx_list_ : list of int

The list of leaf node ids.

Methods

decision_function(X[, scale])

Predict the raw outlier scores of X using the fitted detector.

decision_path(X[, scale])

Returns the decision path per sample.

fit(X[, y, sample_weight])

Fit the model.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

get_rule([node_id, left])

Returns the splitting rule of a given node id.

plot_tree([draw_depth, start_node_id, figsize])

Draw the tree diagram.

predict([X, scale, threshold])

Predict raw outlier indicator.

predict_leaf_id(X)

Returns the predicted cluster (leaf node id) per sample.

set_params(**params)

Set the parameters of this estimator.

calculate_spca

dist

decision_function(X, scale=True)
Predict the raw outlier scores of X using the fitted detector.

For consistency, outliers are assigned larger anomaly scores.

Parameters:
X : numpy array of shape (n_samples, n_features)

The input samples. Sparse matrices are accepted only if they are supported by the base estimator.

scale : bool, default=True

If True, scale X before calculating the outlier score.

Returns:
outlier_scores : numpy array of shape (n_samples,)

The anomaly score of the input samples.

decision_path(X, scale=True)

Returns the decision path per sample.

Parameters:
X : np.ndarray of shape (n_samples, n_features)

Data features.

scale : bool, default=True

If True, scale X before calculating the outlier score.

Returns:
path_all : np.ndarray of shape (n_samples, node_count)

The on/off indicator for each (sample, node) pair.
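To make the shape of such an indicator matrix concrete, the following sketch builds one from per-sample lists of visited node ids. It assumes node ids start at 1 (as the get_rule default suggests) and that node id k maps to column k - 1. Both the mapping and the helper name path_indicator are assumptions for illustration, not the library's documented layout.

```python
def path_indicator(paths, node_count):
    """Build an (n_samples, node_count) 0/1 matrix from node-id paths."""
    matrix = [[0] * node_count for _ in paths]
    for row, path in zip(matrix, paths):
        for node_id in path:
            # Assumed layout: node id k is stored in column k - 1.
            row[node_id - 1] = 1
    return matrix


# Three samples in a hypothetical 5-node tree rooted at node 1.
paths = [[1, 2], [1, 3, 4], [1, 3, 5]]
path_all = path_indicator(paths, node_count=5)
```

Each row marks the nodes a sample passes through, so the root column is 1 for every sample and each row has as many 1s as the sample's path length.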

fit(X, y=None, sample_weight=None)

Fit the model.

Parameters:
X : np.ndarray of shape (n_samples, n_features)

Data features.

y : np.ndarray of shape (n_samples,), default=None

Data response.

sample_weight : np.ndarray of shape (n_samples,), default=None

Sample weight.

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

get_rule(node_id=1, left=True)

Returns the splitting rule of a given node id.

Parameters:
node_id : int, default=1

The node id.

left : bool, default=True

If True, return the rule for the left child node; otherwise, the rule for the right child node.

Returns:
rule : str

A string describing the rule, which is a linear inequality.

plot_tree(draw_depth=inf, start_node_id=1, figsize=(10, 6))

Draw the tree diagram.

Parameters:
draw_depth : int, default=inf

The drawing depth starting from the start_node_id.

start_node_id : int, default=1

The node id that the tree diagram begins with.

figsize : tuple, default=(10, 6)

The size of the figure.

predict(X=None, scale=True, threshold=0.9)

Predict raw outlier indicator.

Normal samples are classified as 1 and outliers are classified as -1.

Parameters:
X : numpy array of shape (n_samples, n_features), default=None

The input samples. Sparse matrices are accepted only if they are supported by the base estimator.

scale : bool, default=True

If True, scale X before calculating the outlier score.

threshold : float, default=0.9

The quantile threshold for outliers. For example, with threshold=0.9, samples whose outlier scores exceed the 90% quantile of the whole sample are classified as outliers.

Returns:
outlier_indicator : numpy array of shape (n_samples,)

The binary array indicating whether each sample is an outlier.
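The quantile rule can be sketched without the library. The example below labels scores above the 90% quantile as -1 and the rest as 1. It uses linear interpolation between order statistics (the same convention as NumPy's default quantile). Whether KMeansTree uses that interpolation, and which sample it takes the quantile over, are assumptions here, as are the helper names.

```python
def quantile(values, q):
    """Linearly interpolated quantile of a list of numbers."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def label_outliers(scores, threshold=0.9):
    """Return 1 for normal samples and -1 for scores above the quantile."""
    cutoff = quantile(scores, threshold)
    return [-1 if s > cutoff else 1 for s in scores]


# Nine ordinary scores and one clearly larger score.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0]
labels = label_outliers(scores)
```

Only the last score exceeds the interpolated cutoff, so it is the single sample labeled -1. Raising the threshold toward 1 flags fewer samples.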

predict_leaf_id(X)

Returns the predicted cluster (leaf node id) per sample.

Parameters:
X : np.ndarray of shape (n_samples, n_features)

Data features.

Returns:
prediction : np.ndarray of shape (n_samples,)

The cluster (leaf node id) per sample.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

Examples using piml.data.outlier_detection.KMeansTree

Data Quality Check