PCA for Outlier Detection: A surprisingly effective method!

A Surprisingly Effective Method for Outlier Detection in Numeric Data

PCA (Principal Component Analysis) is a widely used technique in the field of data science, primarily for dimensionality reduction and visualization. However, what many might not know is that PCA can also be a powerful tool for outlier detection, as I will explain in this article.

This article is part of a series on outlier detection, which includes discussions on various methods such as FPOF, Counts Outlier Detector, Distance Metric Learning, Shared Nearest Neighbors, and Doping. It also contains excerpts from my book Outlier Detection in Python.

The concept behind PCA is that most datasets have much more variance in some columns than others, as well as correlations between features. This means we often don't need all of the features to represent the data well: in many cases the data can be approximated closely using far fewer dimensions. For example, in a dataset with 100 numeric features, it may be possible to represent the data quite accurately using only 30 or 40 of them, or even fewer.

To achieve this, PCA transforms the data into a new coordinate system whose dimensions, called components, are ordered by the variance they capture: the first component explains the most variance, the second the next most, and so on.

Given the challenges often faced with outlier detection due to the curse of dimensionality, working with fewer features can offer significant benefits. As discussed in Shared Nearest Neighbors and Distance Metric Learning for Outlier Detection, dealing with a large number of features can make outlier detection unreliable, as high-dimensional data leads to inconsistencies in distance calculations between points (which many outlier detectors rely on). PCA can help alleviate these issues.

Surprisingly, using PCA can make it easier to detect outliers. The transformations applied by PCA often reshape the data in a way that makes any unusual points more conspicuous.

An example showcasing this concept can be seen below:


import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Create 100 values for feature A, and a feature B that is A plus a
# small amount of noise, so the two columns are highly correlated
x_data = np.random.random(100)
y_data = np.random.random(100) / 10.0

# Create a dataframe with this data plus two additional points
data = pd.DataFrame({'A': x_data, 'B': x_data + y_data})
data = pd.concat([data,
                  pd.DataFrame([[1.8, 1.8], [0.5, 0.1]], columns=['A', 'B'])],
                 ignore_index=True)

# Use PCA to transform the data to another 2D space
pca = PCA(n_components=2)
pca.fit(data)
print(pca.explained_variance_ratio_)

# Create a dataframe with the PCA-transformed data
new_data = pd.DataFrame(pca.transform(data), columns=['0', '1'])

This snippet illustrates the transformation; the explained_variance_ratio_ output shows that the first component captures nearly all of the variance. In the transformed space the two injected points become easy to spot: (1.8, 1.8) follows the correlation between A and B but is extreme along the first component, while (0.5, 0.1) breaks the correlation and stands out on the second component.
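
As a minimal sketch (not part of the original example), the unusual points can then be flagged with even a very simple test on the transformed columns, such as a z-score threshold per component:

# Illustrative only: flag rows that sit more than 3 standard
# deviations from the mean on any component
z_scores = (new_data - new_data.mean()) / new_data.std()
print(new_data[(z_scores.abs() > 3.0).any(axis=1)])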

PyODKernelPCA

PyOD offers a class called PyODKernelPCA, which is a wrapper around scikit-learn’s KernelPCA class. It performs kernel PCA, allowing nonlinear transformations of the data, which can make outliers easier to separate in complex scenarios.

To utilize PyODKernelPCA effectively, understanding kernel functions and their impact on data transformation is crucial. These functions, similar to those used in SVM models, can efficiently reshape the space, making outliers more distinguishable.

Scikit-learn offers several kernels, including linear, polynomial, radial basis function (RBF), sigmoid, and cosine, each suited to different data characteristics. Choosing the right kernel is essential for getting the most out of the transformation.
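
One practical way to choose (a sketch using scikit-learn’s KernelPCA directly; the kernel names are scikit-learn’s standard options) is simply to transform the data under each candidate kernel and inspect, or plot, the results:

from sklearn.decomposition import KernelPCA

# Transform the same data under several candidate kernels; in practice
# you would plot each result to see which separates the outliers best
for kernel in ['linear', 'poly', 'rbf', 'sigmoid', 'cosine']:
    kpca = KernelPCA(n_components=2, kernel=kernel)
    transformed = kpca.fit_transform(data)
    print(kernel, transformed[:3])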

Continuing from the previous example, let’s explore how non-linear transformations using PyODKernelPCA can enhance outlier detection capabilities:

from pyod.models.kpca import PyODKernelPCA

kpca = PyODKernelPCA(kernel='rbf')

With a well-chosen kernel, the transformed space can separate outliers far more cleanly than the original feature space.
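
A sketch of how this might be used end to end (PyODKernelPCA follows scikit-learn’s fit/transform convention, since it wraps KernelPCA; the distance-from-center scoring here is purely illustrative):

# Fit the kernel PCA and transform the data into the new space
kpca.fit(data)
transformed = kpca.transform(data)

# Any standard detector can then run on the transformed space; as a
# trivial example, score each point by its distance from the center
center = transformed.mean(axis=0)
scores = np.linalg.norm(transformed - center, axis=1)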

The PCA Detector

PyOD offers two PCA-based outlier detectors, PCA and KPCA. Both score records by their reconstruction error, measured with Euclidean distance: the first few components capture the main patterns in the data, so points that are poorly represented by those components, deviating significantly from the dominant patterns, receive high outlier scores.

Performing outlier detection using PCA requires careful consideration of the components that best represent the data, as outliers are generally detected based on deviations from these dominant patterns.
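
A minimal sketch of fitting PyOD’s PCA detector (the n_components and contamination values are illustrative; the attribute names follow PyOD’s standard detector API):

from pyod.models.pca import PCA as PCADetector

# Fit the detector; contamination is the expected fraction of outliers
det = PCADetector(n_components=2, contamination=0.02)
det.fit(data)

scores = det.decision_scores_  # outlier score for each training row
labels = det.labels_           # 0 = inlier, 1 = flagged outlier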

It’s important to remove obvious outliers before fitting the PCA detector, so that the model accurately captures the underlying distribution of the typical data. A fast, interpretable detector such as ECOD works well for this preprocessing pass.
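
A sketch of that two-pass workflow (the contamination fractions are illustrative; ECOD is imported from its standard PyOD location):

from pyod.models.ecod import ECOD
from pyod.models.pca import PCA as PCADetector

# First pass: use ECOD to drop the most obvious outliers
ecod = ECOD(contamination=0.02)
ecod.fit(data)
clean_data = data[ecod.labels_ == 0]

# Second pass: fit the PCA detector on the cleaned data, then score
# every row, including those removed in the first pass
det = PCADetector(contamination=0.02)
det.fit(clean_data)
scores = det.decision_function(data)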

In the next article, we will delve deeper into conducting tests on PCA-transformed data, exploring various outlier detection techniques, and assessing their efficiency in terms of speed, memory usage, and accuracy.

Stay tuned for more insights into leveraging PCA for powerful outlier detection strategies.
