piml.data.outlier_detection.KMeansTree
- class piml.data.outlier_detection.KMeansTree(n_components=10, d_reduction_method='pca', alpha=1, max_depth=4, max_leaves=10, min_samples_leaf=50, min_distance=0.2, distance_measure='ReconstErr', distance_measure_param=None, standardization=True, random_state=0)
Recursive unsupervised splitting tree via KMeans (K=2).
- Parameters:
- n_componentsint, default=10
The number of components in PCA or SparsePCA.
- d_reduction_method{‘pca’, ‘sparse_pca’}, default=’pca’
The dimension reduction algorithm.
- alphafloat, default=1
The sparsity parameter in SparsePCA.
- max_depthint, default=4
The max depth of the tree.
- max_leavesint, default=10
The max number of leaves.
- min_samples_leaffloat, default=50
The minimum number of samples of a leaf node.
- min_distancefloat, default=0.2
The minimum square root distance for splitting a node.
- distance_measure{“ReconstErr”, “Euclidean”, “PSI”, “KS”, “WD1”} or callable function, default=”ReconstErr”
The distance measure between two clusters. Five built-in distance measures are provided; alternatively, pass a callable that calculates the distance between two given samples (e.g., one with shape n_samples1 * n_features and another with shape n_samples2 * n_features).
- distance_measure_paramdict, default=None
The custom parameters of the user-defined distance measure function. Only used when distance_measure is a callable.
- standardizationbool, default=True
Whether to standardize covariates before running the algorithm.
- random_stateint, default=0
The random seed.
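The callable form of distance_measure can be sketched as below. This is a minimal illustration, assuming the callable receives the two samples as positional arguments and that the entries of distance_measure_param are forwarded as keyword arguments; this page does not state the calling convention, so check it against the library before relying on it. The function name `mean_gap` and its `order` parameter are hypothetical.

```python
def mean_gap(sample_a, sample_b, order=2):
    """Minkowski distance of the given order between the column means of two samples.

    Each sample is a sequence of rows (n_samples x n_features). The two samples
    may have different numbers of rows but must share the same n_features.
    """
    def column_means(sample):
        rows = [list(row) for row in sample]
        return [sum(col) / len(rows) for col in zip(*rows)]

    means_a = column_means(sample_a)
    means_b = column_means(sample_b)
    total = sum(abs(a - b) ** order for a, b in zip(means_a, means_b))
    return total ** (1.0 / order)


cluster_a = [[0.0, 0.0], [2.0, 0.0]]                  # column means: (1, 0)
cluster_b = [[4.0, 4.0], [4.0, 4.0], [4.0, 4.0]]      # column means: (4, 4)
gap = mean_gap(cluster_a, cluster_b)                  # Euclidean gap between the means
```

Under the stated assumption, such a function would be supplied as distance_measure=mean_gap together with distance_measure_param={'order': 1} to switch to a Manhattan-style gap.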
- Attributes:
- n_features_in_int
The number of input features.
- is_fitted_bool
Indicator of whether the model is fitted.
- tree_dict
The fitted tree information.
- node_count_int
The total number of nodes.
- leaf_idx_list_list of int
The list of leaf node ids.
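A typical workflow built from the constructor and the methods documented on this page might look like the sketch below. It uses only the signatures listed here; the helper name `flag_outliers`, the choice to cap n_components at the number of features, and the assumption that X is a 2-D numpy array are illustrative, not part of the library.

```python
def flag_outliers(X, threshold=0.9, random_state=0):
    """Fit a KMeansTree on X and return (scores, labels, leaf_ids)."""
    # Imported lazily so this sketch can be defined without piml installed.
    from piml.data.outlier_detection import KMeansTree

    n_features = len(X[0])
    detector = KMeansTree(
        n_components=min(10, n_features),  # assumption: keep components <= features
        max_depth=4,
        min_samples_leaf=50,
        random_state=random_state,
    )
    detector.fit(X)
    scores = detector.decision_function(X)             # larger score = more anomalous
    labels = detector.predict(X, threshold=threshold)  # 1 = normal, -1 = outlier
    leaf_ids = detector.predict_leaf_id(X)             # leaf node id per sample
    return scores, labels, leaf_ids
```

After fitting, attributes such as node_count_ and leaf_idx_list_ describe the size of the resulting tree.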
Methods
decision_function
(X[, scale])Predict the raw outlier score of X using the fitted detector.
decision_path
(X[, scale])Returns the decision path per sample.
fit
(X[, y, sample_weight])Fit the model.
get_metadata_routing
()Get metadata routing of this object.
get_params
([deep])Get parameters for this estimator.
get_rule
([node_id, left])Returns the splitting rule of a given node id.
plot_tree
([draw_depth, start_node_id, figsize])Draw the tree diagram.
predict
([X, scale, threshold])Predict raw outlier indicator.
predict_leaf_id
(X)Returns the predicted cluster (leaf node id) per sample.
set_params
(**params)Set the parameters of this estimator.
calculate_spca
dist
- decision_function(X, scale=True)
- Predict the raw outlier score of X using the fitted detector.
For consistency, outliers are assigned larger anomaly scores.
- Parameters:
- Xnumpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
- scalebool, default=True
If True, scale X before calculating the outlier score.
- Returns:
- outlier_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- decision_path(X, scale=True)
Returns the decision path per sample.
- Parameters:
- Xnp.ndarray of shape (n_samples, n_features)
Data features.
- scalebool, default=True
If True, scale X before calculating the outlier score.
- Returns:
- path_allnp.ndarray of shape (n_samples, node_count)
The on/off status per sample * node.
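The on/off matrix returned by decision_path can be pictured with the sketch below, which marks every node on the way from the root to a sample's leaf. The tree layout, the `parent` mapping, and the convention that column j corresponds to node id j + 1 are assumptions made for illustration; the library's internal node numbering may differ.

```python
def path_matrix(leaf_ids, parent, node_count):
    """Return one 0/1 row per sample marking every node on its root-to-leaf path."""
    rows = []
    for leaf in leaf_ids:
        row = [0] * node_count
        node = leaf
        while node is not None:
            row[node - 1] = 1          # assumption: column j holds node id j + 1
            node = parent.get(node)    # the root has no parent, which ends the walk
        rows.append(row)
    return rows


# Hypothetical tree: node 1 splits into 2 and 3, node 3 splits into 4 and 5.
parent = {2: 1, 3: 1, 4: 3, 5: 3}
paths = path_matrix([2, 4, 5], parent, node_count=5)
```

Each row has shape (node_count,), so stacking the rows gives the (n_samples, node_count) layout described above.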
- fit(X, y=None, sample_weight=None)
Fit the model.
- Parameters:
- Xnp.ndarray of shape (n_samples, n_features)
Data features.
- ynp.ndarray of shape (n_samples,), default=None
Data response.
- sample_weightnp.ndarray of shape (n_samples, ), default=None
Sample weight.
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routingMetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- paramsdict
Parameter names mapped to their values.
- get_rule(node_id=1, left=True)
Returns the splitting rule of a given node id.
- Parameters:
- node_idint, default=1
The node id.
- leftbool, default=True
If True, return the rule for the left child node; otherwise, for the right child node.
- Returns:
- rulestr
A string describing the rule, which is a linear inequality.
- plot_tree(draw_depth=inf, start_node_id=1, figsize=(10, 6))
Draw the tree diagram.
- Parameters:
- draw_depthint, default=inf
The drawing depth starting from the start_node_id.
- start_node_idint, default=1
The node id that the tree diagram begins with.
- figsizetuple, default=(10, 6)
The size of the figure.
- predict(X=None, scale=True, threshold=0.9)
Predict raw outlier indicator.
Normal samples are classified as 1 and outliers are classified as -1.
- Parameters:
- Xnumpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
- scalebool, default=True
If True, scale X before calculating the outlier score.
- thresholdfloat, default=0.9
The quantile threshold for outliers. For example, with threshold=0.9, samples whose outlier scores exceed the 90% quantile of all scores are classified as outliers.
- Returns:
- outlier_indicatornumpy array of shape (n_samples,)
The binary array indicating whether each sample is an outlier.
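The thresholding rule described for predict can be illustrated without the library. The sketch below is an assumption-laden stand-in: it uses linear interpolation between order statistics for the quantile, which may not match the method the library uses, and the helper names `quantile` and `label_by_quantile` are hypothetical.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def label_by_quantile(scores, threshold=0.9):
    """Label scores above the threshold quantile as -1 (outlier), the rest as 1."""
    cutoff = quantile(scores, threshold)
    return [-1 if score > cutoff else 1 for score in scores]


scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0]
labels = label_by_quantile(scores, threshold=0.9)  # only the score 5.0 is flagged
```

Because the cutoff is a quantile of the scores being labelled, roughly a fraction 1 - threshold of the samples is flagged regardless of how extreme the scores are.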
- predict_leaf_id(X)
Returns the predicted cluster (leaf node id) per sample.
- Parameters:
- Xnp.ndarray of shape (n_samples, n_features)
Data features.
- Returns:
- predictionnp.ndarray of shape (n_samples, )
The cluster (leaf node id) per sample.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **paramsdict
Estimator parameters.
- Returns:
- selfestimator instance
Estimator instance.