`piml.data.outlier_detection`.KMeansTree¶

class piml.data.outlier_detection.KMeansTree(n_components=10, d_reduction_method='pca', alpha=1, max_depth=4, max_leaves=10, min_samples_leaf=50, min_distance=0.2, distance_measure='ReconstErr', distance_measure_param=None, standardization=True, random_state=0)¶

Recursive unsupervised splitting tree via KMeans (K=2).

Parameters:

n_components (int, default=10): The number of components in PCA or SparsePCA.
d_reduction_method ({'pca', 'sparse_pca'}, default='pca'): The dimension reduction algorithm.
alpha (float, default=1): The sparsity parameter in SparsePCA.
max_depth (int, default=4): The max depth of the tree.
max_leaves (int, default=10): The max number of leaves.
min_samples_leaf (float, default=50): The minimum number of samples of a leaf node.
min_distance (float, default=0.2): The minimum square root distance for splitting a node.
distance_measure ({'ReconstErr', 'Euclidean', 'PSI', 'KS', 'WD1'} or callable, default='ReconstErr'): The distance measure between two clusters. Five built-in distance measures are provided; you may also pass a callable that calculates the distance between two given samples (e.g., one with shape n_samples1 * n_features and another with shape n_samples2 * n_features).
distance_measure_param (dict, default=None): The custom parameters of the user-defined distance measure function. Only used when distance_measure is a callable.
standardization (bool, default=True): Whether to standardize covariates before running the algorithm.
random_state (int, default=0): The random seed.
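
As an illustration of the callable form of distance_measure, the sketch below defines a function that takes two samples with the same number of features and returns a single distance. The name `mean_gap_distance` and its `ord` parameter are hypothetical, and the assumption that the entries of distance_measure_param reach the callable as keyword arguments is not confirmed by this page.

```python
import numpy as np


def mean_gap_distance(sample_a, sample_b, ord=2):
    """Norm of the difference between the feature means of two samples.

    sample_a has shape (n_samples1, n_features) and sample_b has shape
    (n_samples2, n_features). This is an illustrative measure, not one of
    the built-in options.
    """
    sample_a = np.asarray(sample_a, dtype=float)
    sample_b = np.asarray(sample_b, dtype=float)
    if sample_a.shape[1] != sample_b.shape[1]:
        raise ValueError("both samples must have the same number of features")
    gap = sample_a.mean(axis=0) - sample_b.mean(axis=0)
    return float(np.linalg.norm(gap, ord=ord))


# Two small samples with different row counts but the same two features.
left = np.array([[0.0, 0.0], [2.0, 0.0]])
right = np.array([[4.0, 4.0], [4.0, 4.0], [4.0, 4.0]])

# Assumption: custom parameters are forwarded as keyword arguments.
params = {"ord": 2}
distance = mean_gap_distance(left, right, **params)
print(distance)
```

Under that assumption, the function and the dictionary would be supplied as distance_measure and distance_measure_param respectively.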

Attributes:

n_features_in_ (int): The number of input features.
is_fitted_ (bool): Indicator of whether the model is fitted.
tree_ (dict): The fitted tree information.
node_count_ (int): The total number of nodes.
leaf_idx_list_ (list of int): The list of leaf node ids.
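
To make the idea of recursive K=2 splitting concrete, here is a minimal NumPy sketch. It is not PiML's implementation: it uses plain Lloyd iterations with a deterministic start, takes the Euclidean distance between the two centroids as a stand-in for the distance measure, numbers children as 2 * id and 2 * id + 1 (an assumption), and omits max_leaves, dimension reduction, and standardization. The helper names `two_means` and `grow` are hypothetical.

```python
import numpy as np


def two_means(X, n_iter=20):
    """Plain Lloyd iterations with K=2 and a deterministic start."""
    first = X[0]
    far = X[np.argmax(((X - first) ** 2).sum(axis=1))]
    centers = np.stack([first, far])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        if labels.min() == labels.max():  # one cluster is empty, stop
            break
        centers = np.stack([X[labels == k].mean(axis=0) for k in (0, 1)])
    return labels, centers


def grow(X, idx, node_id, depth, leaves, cfg):
    """Split the rows in idx recursively; store leaf rows in `leaves`."""
    if depth < cfg["max_depth"] and len(idx) >= 2 * cfg["min_samples_leaf"]:
        labels, centers = two_means(X[idx])
        left, right = idx[labels == 0], idx[labels == 1]
        gap = np.linalg.norm(centers[0] - centers[1])
        big_enough = min(len(left), len(right)) >= cfg["min_samples_leaf"]
        if big_enough and gap >= cfg["min_distance"]:
            grow(X, left, 2 * node_id, depth + 1, leaves, cfg)
            grow(X, right, 2 * node_id + 1, depth + 1, leaves, cfg)
            return
    leaves[node_id] = idx  # no admissible split: this node is a leaf


# Three tight groups of four points each.
offsets = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
X = np.vstack([offsets, offsets + [3.0, 3.0], offsets + [10.0, 0.0]])

cfg = {"max_depth": 2, "min_samples_leaf": 2, "min_distance": 1.0}
leaves = {}
grow(X, np.arange(len(X)), node_id=1, depth=0, leaves=leaves, cfg=cfg)
for node_id in sorted(leaves):
    print(node_id, leaves[node_id].tolist())
```

The third group stays a single leaf because its two sub-clusters are closer than min_distance, which is the role that threshold plays in stopping further splits.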

Methods

`decision_function`(X[, scale])	Predict the raw outlier scores of X using the fitted detector.
`decision_path`(X[, scale])	Returns the decision path per sample.
`fit`(X[, y, sample_weight])	Fit the model.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`get_rule`([node_id, left])	Returns the splitting rule of a given node id.
`plot_tree`([draw_depth, start_node_id, figsize])	Draw the tree diagram.
`predict`([X, scale, threshold])	Predict raw outlier indicator.
`predict_leaf_id`(X)	Returns the predicted cluster (leaf node id) per sample.
`set_params`(**params)	Set the parameters of this estimator.

calculate_spca
dist

decision_function(X, scale=True)¶

Predict the raw outlier scores of X using the fitted detector. For consistency, outliers are assigned larger anomaly scores.

Parameters:

X (numpy array of shape (n_samples, n_features)): The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
scale (bool, default=True): If True, scale X before calculating the outlier score.

Returns:

outlier_scores (numpy array of shape (n_samples,)): The anomaly score of the input samples.

decision_path(X, scale=True)¶

Returns the decision path per sample.

Parameters:

X (np.ndarray of shape (n_samples, n_features)): Data features.
scale (bool, default=True): If True, scale X before calculating the outlier score.

Returns:

path_all (np.ndarray of shape (n_samples, node_count)): The on/off status of each node for each sample.
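
The shape of the returned array can be illustrated with a small sketch that builds an on/off matrix from leaf ids. It assumes the root has id 1 and that the children of node i are 2 * i and 2 * i + 1, and it orders columns by node id; none of these conventions is confirmed by this page, and `path_matrix` is a hypothetical helper.

```python
import numpy as np


def path_matrix(leaf_ids, node_ids):
    """Return an (n_samples, node_count) 0/1 matrix of visited nodes."""
    columns = {node: j for j, node in enumerate(node_ids)}
    paths = np.zeros((len(leaf_ids), len(node_ids)), dtype=int)
    for i, node in enumerate(leaf_ids):
        # Walk from the leaf up to the root (id 1), marking each node.
        while node >= 1:
            paths[i, columns[node]] = 1
            node //= 2
    return paths


node_ids = [1, 2, 3, 4, 5]   # root, two children, two grandchildren
leaf_ids = [4, 5, 3]         # leaf reached by each of three samples
paths = path_matrix(leaf_ids, node_ids)
print(paths)
```

Each row contains a 1 for every node on the route from the root to that sample's leaf and a 0 elsewhere.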

fit(X, y=None, sample_weight=None)¶

Fit the model.

Parameters:

X (np.ndarray of shape (n_samples, n_features)): Data features.
y (np.ndarray of shape (n_samples,), default=None): Data response.
sample_weight (np.ndarray of shape (n_samples,), default=None): Sample weight.

get_metadata_routing()¶

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest): A MetadataRequest encapsulating routing information.

get_params(deep=True)¶

Get parameters for this estimator.

Parameters:

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict): Parameter names mapped to their values.

get_rule(node_id=1, left=True)¶

Returns the splitting rule of a given node id.

Parameters:

node_id (int, default=1): The node id.
left (bool, default=True): Whether to return the rule for the left child node (True) or the right child node (False).

Returns:

rule (str): A string describing the rule, which is a linear inequality.

plot_tree(draw_depth=inf, start_node_id=1, figsize=(10, 6))¶

Draw the tree diagram.

Parameters:

draw_depth (int, default=inf): The drawing depth starting from the start_node_id.
start_node_id (int, default=1): The node id that the tree diagram begins with.
figsize (tuple, default=(10, 6)): The size of the figure.

predict(X=None, scale=True, threshold=0.9)¶

Predict raw outlier indicator.

Normal samples are classified as 1 and outliers are classified as -1.

Parameters:

X (numpy array of shape (n_samples, n_features), default=None): The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
scale (bool, default=True): If True, scale X before calculating the outlier score.
threshold (float, default=0.9): The quantile threshold of outliers. For example, with threshold=0.9, samples whose outlier scores exceed the 90% quantile of the whole sample are classified as outliers.

Returns:

outlier_indicator (numpy array of shape (n_samples,)): The binary array indicating whether each sample is an outlier.
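
The quantile rule described for threshold can be sketched in a few lines of NumPy. This is an illustration only: the helper `label_by_quantile` is hypothetical, and both the interpolation used for the quantile and the set of scores it is computed over are assumptions rather than documented behavior.

```python
import numpy as np


def label_by_quantile(scores, threshold=0.9):
    """Return 1 for normal samples and -1 for scores above the quantile."""
    scores = np.asarray(scores, dtype=float)
    cutoff = np.quantile(scores, threshold)  # linear interpolation by default
    return np.where(scores > cutoff, -1, 1)


# Nine ordinary scores and one clearly larger score.
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.2, 0.1, 0.35, 0.3, 5.0])
labels = label_by_quantile(scores, threshold=0.9)
print(labels)
```

Only the last sample exceeds the 90% quantile here, so it is the only one labeled -1.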

predict_leaf_id(X)¶

Returns the predicted cluster (leaf node id) per sample.

Parameters:

X (np.ndarray of shape (n_samples, n_features)): Data features.

Returns:

prediction (np.ndarray of shape (n_samples,)): The cluster (leaf node id) per sample.

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict): Estimator parameters.

Returns:

self (estimator instance): Estimator instance.

Examples using `piml.data.outlier_detection.KMeansTree`¶

Data Quality Check
