Hands-On Python: Handling Three Types of Anomalies with Scikit-learn (with Code Examples)

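Anomaly detection is one of the more interesting and challenging areas of data science. Imagine you are monitoring transactions on a large e-commerce platform and notice that one user's purchasing behavior departs sharply from the normal pattern. It could be a fraudulent transaction, a system error, or even a new business opportunity. That is the appeal of anomaly detection: it helps us find problems, and it can also surface hidden value.

In the Python ecosystem, Scikit-learn provides capable tools for a wide range of anomaly detection scenarios. This article focuses on the three most common anomaly types: point anomalies, contextual anomalies, and collective anomalies. Rather than giving a theoretical overview, it walks through complete code examples, from data preparation to model evaluation.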

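1. Anomaly Detection Basics and Environment Setup

Before starting, make sure the environment is set up correctly. You need the following Python libraries: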

pip install numpy pandas matplotlib scikit-learn seaborn

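At its core, anomaly detection means identifying data points that do not fit the expected pattern. By their nature, anomalies fall into three categories:

Point anomalies: a single data point deviates markedly from the overall distribution
Contextual anomalies: a data point is anomalous only within a specific context
Collective anomalies: a group of data points behaves anomalously as a whole

Below is a simple data-generation function that we will use to create sample data containing all three anomaly types: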

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_blobs

    def generate_anomaly_data(n_samples=1000, random_state=42):
        """Return (X, y) with labels 0=normal, 1=point, 2=contextual, 3=collective."""
        rng = np.random.RandomState(random_state)  # seeded so every run gives the same data

        # Normal data: a single Gaussian blob
        X_normal, _ = make_blobs(n_samples=n_samples, centers=1,
                                 cluster_std=1.0, random_state=random_state)

        # Point anomalies: scattered uniformly over a wide range
        point_anomalies = rng.uniform(low=-10, high=10, size=(20, 2))

        # Contextual anomalies: copies of 20 normal points shifted by +15 on both axes
        context_idx = rng.choice(len(X_normal), 20, replace=False)
        context_anomalies = X_normal[context_idx] + 15

        # Collective anomalies: a tight cluster offset from the normal blob
        collective_anomalies, _ = make_blobs(n_samples=30, centers=1,
                                             cluster_std=0.3, random_state=random_state)
        collective_anomalies += [15, 0]

        # Stack everything and build the matching label vector
        X = np.vstack([X_normal, point_anomalies, context_anomalies, collective_anomalies])
        y = np.array([0] * len(X_normal) + [1] * len(point_anomalies)
                     + [2] * len(context_anomalies) + [3] * len(collective_anomalies))
        return X, y

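2. Point Anomaly Detection in Practice

Point anomalies are the easiest type to understand: they are the "odd ones out" in a crowd and stand out at a glance. Scikit-learn offers several algorithms for this kind of problem.

2.1 Isolation Forest

Isolation Forest rests on a simple principle: anomalous points are easier to "isolate" than normal ones. Here is how to use it: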

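    from sklearn.ensemble import IsolationForest
    from sklearn.metrics import classification_report

    # Generate the data
    X, y = generate_anomaly_data()
    point_mask = (y == 1)  # point anomalies carry label 1

    # Train the model
    clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
    clf.fit(X)

    # Predict: scikit-learn returns -1 for outliers and 1 for inliers
    preds = clf.predict(X)
    preds = np.where(preds == -1, 1, 0)  # map -1/1 to 1/0

    # Evaluate against the point-anomaly labels only; contextual and collective
    # anomalies that the model flags are counted as false positives here
    print(classification_report(point_mask, preds))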
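
The "easier to isolate" idea can be checked on a toy dataset. The sketch below is not part of the pipeline above: it builds its own small array (`X_demo`, an illustrative name) with one obvious outlier and inspects `score_samples`, where lower values mean a point was isolated in fewer splits.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 inliers around the origin plus one obvious outlier (toy data, not the article's dataset)
X_demo = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_demo)
scores = forest.score_samples(X_demo)  # lower score = easier to isolate = more anomalous

most_anomalous = int(np.argmin(scores))
print("index of most anomalous point:", most_anomalous)
print("its score vs. median score:", scores[most_anomalous], np.median(scores))
```

Ranking points by this score, instead of relying only on the -1/1 output of `predict`, makes it possible to choose the alarm threshold after the fact.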

2.2 Local Outlier Factor (LOF)

LOF identifies anomalies by comparing the local density of each data point with that of its neighbors:

    from sklearn.neighbors import LocalOutlierFactor

    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)

    # In its default mode LOF only offers fit_predict, which scores the training data
    preds = lof.fit_predict(X)
    preds = np.where(preds == -1, 1, 0)  # map -1/1 to 1/0
    print(classification_report(point_mask, preds))

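Tip: In practice, set the contamination parameter according to your estimate of the anomaly ratio. You can start with "auto" and let the algorithm determine the threshold itself.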
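
To see what the contamination parameter does, the following sketch fits Isolation Forest on toy data that contains no injected anomalies at all (`X_demo` and `flagged_by_rate` are illustrative names). Under these assumptions the fraction of flagged training points simply tracks the value passed in, which is why a poor estimate translates directly into false alarms or misses.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X_demo = rng.normal(0, 1, size=(1000, 2))  # toy data with no injected anomalies

flagged_by_rate = {}
for contamination in (0.01, 0.05, 0.10):
    model = IsolationForest(contamination=contamination, random_state=0)
    # fit_predict returns -1 for points below the score threshold set by contamination
    flagged_by_rate[contamination] = float((model.fit_predict(X_demo) == -1).mean())
    print(f"contamination={contamination:.2f} -> "
          f"flagged fraction={flagged_by_rate[contamination]:.3f}")
```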

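3. Contextual Anomaly Detection

Contextual anomalies are more subtle: they look anomalous only in a particular setting. For example, wearing a down jacket in summer is normal in the Arctic but anomalous in the tropics.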
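
The same idea can be shown numerically. The sketch below uses hypothetical temperature readings, with `season` as the context and `temp` as the behavior (both names are assumptions for illustration). A reading of 30 degrees is unremarkable against the whole dataset but stands out once it is compared only with other winter readings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical daily temperatures: "season" is the context, "temp" is the behavior
df = pd.DataFrame({
    "season": ["summer"] * 100 + ["winter"] * 100,
    "temp": np.concatenate([rng.normal(30, 2, 100), rng.normal(0, 2, 100)]),
})
df.loc[199, "temp"] = 30.0  # a summer-like reading recorded in winter

# z-score against all data vs. z-score against the same season only
global_z = (df["temp"] - df["temp"].mean()) / df["temp"].std()
context_z = df.groupby("season")["temp"].transform(lambda s: (s - s.mean()) / s.std())

print("global z-score:    ", round(global_z[199], 2))
print("in-context z-score:", round(context_z[199], 2))
```

Conditioning on the context before scoring is the common thread in contextual methods; the sliding windows used for time series play the same role as the `groupby` here.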

3.1 Contextual Anomaly Detection on Time Series

For time series data, we can cut the series into sliding windows and apply One-Class SVM to them:

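    from sklearn.svm import OneClassSVM
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import classification_report

    # Build a noisy sine-wave time series
    np.random.seed(42)
    time = np.arange(500)
    values = np.sin(time * 0.1) + np.random.normal(0, 0.1, 500)

    # Inject contextual anomalies
    values[200:210] += 2  # short spike
    values[300:320] -= 1  # short dip

    # Turn the series into overlapping sliding windows (one row per window)
    window_size = 10

    def create_features(values, window_size=10):
        X = []
        for i in range(len(values) - window_size):
            X.append(values[i:i + window_size])
        return np.array(X)

    X_ts = create_features(values, window_size)

    # Window i covers values[i:i + window_size]; label it anomalous
    # if it overlaps one of the injected segments
    y_ts = np.zeros(len(X_ts))
    y_ts[191:210] = 1  # windows overlapping values[200:210]
    y_ts[291:320] = 1  # windows overlapping values[300:320]

    # Train the model
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_ts)
    ocsvm = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1)
    ocsvm.fit(X_scaled)

    # Predict
    preds = ocsvm.predict(X_scaled)
    preds = np.where(preds == -1, 1, 0)  # map -1/1 to 1/0
    print(classification_report(y_ts, preds))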

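3.2 Context-Based Clustering

DBSCAN labels points that do not belong to any dense region as noise, and those noise points can be treated as anomalies: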

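    from sklearn.cluster import DBSCAN
    from sklearn.metrics import classification_report

    # Regenerate the 2-D dataset from section 1 so that X and y refer to the blob data
    X, y = generate_anomaly_data()

    dbscan = DBSCAN(eps=0.5, min_samples=10)
    clusters = dbscan.fit_predict(X)

    # Treat noise points (cluster label -1) as anomalies
    preds = np.where(clusters == -1, 1, 0)
    context_mask = (y == 2)  # contextual anomalies carry label 2
    print(classification_report(context_mask, preds))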
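
How DBSCAN assigns the noise label is easiest to see on a small, hand-built example. The sketch below uses two dense toy clusters and two isolated points (all names and coordinates are illustrative assumptions), with the same `eps` and `min_samples` as above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
cluster_a = rng.normal([0, 0], 0.3, size=(100, 2))
cluster_b = rng.normal([5, 5], 0.3, size=(100, 2))
isolated = np.array([[2.5, 2.5], [10.0, -3.0]])  # far from both clusters
X_demo = np.vstack([cluster_a, cluster_b, isolated])

labels_demo = DBSCAN(eps=0.5, min_samples=10).fit_predict(X_demo)
noise_idx = np.where(labels_demo == -1)[0]

print("cluster labels found:", sorted(set(labels_demo.tolist()) - {-1}))
print("noise indices:", noise_idx)
```

Points on the sparse fringe of a cluster can also end up as noise, so in practice `eps` and `min_samples` usually need tuning against the density of the normal data.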

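4. Collective Anomaly Detection

A collective anomaly is a group of data points that together form an anomalous pattern, even though each point may look normal on its own. This type is common in network security and industrial equipment monitoring.

4.1 Autoencoder-Based Detection

An autoencoder learns the normal pattern of the data, so regions with a large reconstruction error may contain collective anomalies. This example uses Keras, which is not part of Scikit-learn and must be installed separately: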

    from keras.models import Model
    from keras.layers import Dense, Input
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import classification_report

    # Prepare the data: regenerate the 2-D dataset and scale it to [0, 1]
    X, y = generate_anomaly_data()
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # Build the autoencoder
    input_dim = X_scaled.shape[1]
    encoding_dim = 1  # smaller than input_dim (2) so the network has to compress
    input_layer = Input(shape=(input_dim,))
    encoder = Dense(encoding_dim, activation="relu")(input_layer)
    decoder = Dense(input_dim, activation="sigmoid")(encoder)
    autoencoder = Model(inputs=input_layer, outputs=decoder)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=32, shuffle=True)

    # Compute the per-sample reconstruction error
    reconstructions = autoencoder.predict(X_scaled)
    mse = np.mean(np.power(X_scaled - reconstructions, 2), axis=1)

    # Flag the 5% of samples with the largest error as anomalies
    threshold = np.quantile(mse, 0.95)
    preds = (mse > threshold).astype(int)
    collective_mask = (y == 3)  # collective anomalies carry label 3
    print(classification_report(collective_mask, preds))

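4.2 Time-Window Statistics

For time series, a sliding-window statistic is a simple baseline: each point is compared with the mean and standard deviation of the window that precedes it.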

    import matplotlib.pyplot as plt

    def sliding_window_anomaly_detection(values, window_size=30, z_threshold=3):
        """Flag each point whose z-score against the preceding window exceeds z_threshold."""
        anomalies = np.zeros(len(values), dtype=int)
        for i in range(len(values) - window_size):
            window = values[i:i + window_size]
            mean, std = np.mean(window), np.std(window)
            if std == 0:
                continue  # constant window: the z-score is undefined
            z_score = abs((values[i + window_size] - mean) / std)
            if z_score > z_threshold:
                anomalies[i + window_size] = 1
        return anomalies

    # Apply the detector
    np.random.seed(42)
    values = np.sin(np.arange(500) * 0.1) + np.random.normal(0, 0.1, 500)
    values[200:220] = 0  # inject a collective anomaly: a flat-lined segment
    anomalies = sliding_window_anomaly_detection(values)

    # Visualize
    plt.figure(figsize=(12, 6))
    plt.plot(values, label="Value")
    plt.scatter(np.where(anomalies == 1)[0], values[anomalies == 1],
                color="red", label="Detected Anomalies")
    plt.legend()
    plt.show()
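
Because every value in a flat-lined segment lies inside the normal range, a per-point z-score may not react to it. One alternative, sketched here under assumed settings, is to score the window itself: the rolling standard deviation collapses when the signal stops moving. The window length and the `threshold` value are illustrative choices for this toy signal, not recommended defaults.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
series = pd.Series(np.sin(np.arange(500) * 0.1) + rng.normal(0, 0.1, 500))
series.iloc[200:220] = 0.0  # flat-line segment: each value alone is within the normal range

window = 10  # assumed window length, shorter than the anomalous run
rolling_std = series.rolling(window).std()

# Flag windows whose spread is far below what the noise level would produce
threshold = 0.02  # assumed cut-off, picked by eye for this toy signal
flagged_ends = np.where((rolling_std < threshold).to_numpy())[0]
print("window end indices flagged:", flagged_ends)
```

Each flagged index marks the last position of a window, so the anomalous run extends back `window - 1` steps from the first flagged index.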

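5. Model Evaluation and Tuning

Choosing the right evaluation metric is critical in anomaly detection. Because anomaly detection is usually an imbalanced classification problem, accuracy is not the best metric.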
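
A quick worked example shows why. Assume 1,000 samples of which 10 are anomalies, and a "detector" that never raises an alarm (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 samples of which 10 are anomalies (label 1)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros_like(y_true)  # predicts "normal" for every sample

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
print("accuracy:", acc)  # 0.99
print("recall:  ", rec)  # 0.0
```

The model scores 99% accuracy while catching none of the anomalies, which is why the metrics below look at the anomalous class specifically.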

5.1 Comparing Evaluation Metrics

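Metric	When to use	Strength	Weakness
Precision	False alarms are costly	Focuses on how many flagged points are truly anomalous	May miss real anomalies
Recall	Missed anomalies are costly	Captures as many anomalies as possible	May produce more false alarms
F1 Score	Precision and recall must be balanced	Summarizes model performance in one number	Sensitive to imbalanced data
ROC AUC	Comparing different models	Independent of the decision threshold	Can look overly optimistic on imbalanced data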
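
Threshold-independent metrics need a continuous anomaly score rather than the -1/1 output of `predict`. The sketch below, on toy data with illustrative names, negates Isolation Forest's `score_samples` so that higher means more anomalous and then computes ROC AUC. Average precision is added as an assumption of this example; it is a common companion metric when the classes are imbalanced.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, size=(500, 2)),    # inliers
                    rng.uniform(-8, 8, size=(25, 2))])  # scattered anomalies
y_demo = np.array([0] * 500 + [1] * 25)

model = IsolationForest(random_state=0).fit(X_demo)
anomaly_score = -model.score_samples(X_demo)  # negate so that higher = more anomalous

roc = roc_auc_score(y_demo, anomaly_score)
ap = average_precision_score(y_demo, anomaly_score)
print(f"ROC AUC: {roc:.3f}  average precision: {ap:.3f}")
```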

5.2 Parameter Tuning Example

Taking Isolation Forest as an example, we can use GridSearchCV to optimize its parameters:

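    from sklearn.ensemble import IsolationForest
    from sklearn.metrics import f1_score
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    X, y = generate_anomaly_data()
    point_mask = (y == 1).astype(int)

    param_grid = {
        "n_estimators": [50, 100, 200],
        "max_samples": ["auto", 0.5, 0.8],
        "contamination": [0.01, 0.05, 0.1],
    }

    def anomaly_f1(estimator, X, y):
        # IsolationForest predicts -1/1, so map to 1/0 before scoring;
        # the built-in "f1" scorer would compare -1/1 against the 0/1 labels
        preds = np.where(estimator.predict(X) == -1, 1, 0)
        return f1_score(y, preds)

    # The anomalies sit at the end of X, so shuffle and stratify the folds
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    clf = GridSearchCV(IsolationForest(random_state=42), param_grid,
                       scoring=anomaly_f1, cv=cv)
    clf.fit(X, point_mask)
    print("Best parameters:", clf.best_params_)
    print("Best F1 score:", clf.best_score_)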

Note: In practice, adjust the parameter search space to the size of the data and the available compute. For large datasets, consider RandomizedSearchCV.
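
A minimal sketch of the randomized variant follows. It samples a fixed number of parameter combinations from distributions instead of trying every grid cell. The toy data, the `anomaly_f1` helper, the sampling ranges, and `n_iter=5` are all assumptions chosen to keep the example fast.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, size=(300, 2)), rng.uniform(-8, 8, size=(15, 2))])
y_demo = np.array([0] * 300 + [1] * 15)

def anomaly_f1(estimator, X, y):
    """Map the detector's -1/1 output to 1/0 before computing F1."""
    preds = np.where(estimator.predict(X) == -1, 1, 0)
    return f1_score(y, preds, zero_division=0)

search = RandomizedSearchCV(
    IsolationForest(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 200),      # integers in [50, 200)
        "max_samples": uniform(0.3, 0.7),      # fractions in [0.3, 1.0]
        "contamination": uniform(0.01, 0.14),  # fractions in [0.01, 0.15]
    },
    n_iter=5,  # number of sampled combinations; the cost scales with this, not the grid size
    scoring=anomaly_f1,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)
print("Best F1 score:", search.best_score_)
```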

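6. A Practical Anomaly Detection Case Study

Let us bring the techniques together in a case study on e-commerce transaction data. Suppose we have the following features:

Transaction amount
Transaction time (hour of day)
Product category
User's historical purchase frequency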

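    import numpy as np
    import pandas as pd

    # Simulate e-commerce transaction data
    np.random.seed(42)
    n_samples = 1000

    # Normal transactions
    amount = np.random.lognormal(3, 0.5, n_samples)
    hour = np.random.randint(0, 24, n_samples)
    category = np.random.choice(5, n_samples)
    frequency = np.random.poisson(5, n_samples)

    # Inject the three anomaly types into the last 60 rows
    # Point anomalies: extremely large amounts
    amount[-20:] = np.random.uniform(10000, 50000, 20)

    # Contextual anomalies: frequent early-morning purchases of a rarely bought category
    hour[-40:-20] = 3
    category[-40:-20] = 4    # the least common category
    frequency[-40:-20] = 20  # far above the average

    # Collective anomalies: many purchases of one category within a short time span
    hour[-60:-40] = np.random.randint(10, 12, 20)
    category[-60:-40] = 2
    amount[-60:-40] = np.random.uniform(500, 1000, 20)

    # Build the DataFrame
    data = pd.DataFrame({
        "amount": amount,
        "hour": hour,
        "category": category,
        "frequency": frequency,
    })

    # Label the anomalies
    labels = np.zeros(n_samples)
    labels[-20:] = 1     # point anomalies
    labels[-40:-20] = 2  # contextual anomalies
    labels[-60:-40] = 3  # collective anomalies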

6.1 Feature Engineering

    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline

    # Standardize the numeric features and one-hot encode the categorical one
    numeric_features = ["amount", "hour", "frequency"]
    numeric_transformer = Pipeline([
        ("scaler", StandardScaler())
    ])

    categorical_features = ["category"]
    categorical_transformer = Pipeline([
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ])

    preprocessor = ColumnTransformer([
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ])

    X_processed = preprocessor.fit_transform(data)
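
One refinement the pipeline above does not apply: it treats the hour as an ordinary number, so 23:00 and 00:00 end up far apart even though they are neighbors on the clock. A common workaround, sketched here as an optional assumption and not as part of the original pipeline, is to encode the hour as a point on a circle.

```python
import numpy as np
import pandas as pd

hours = pd.Series([23, 0, 12], name="hour")

# Map each hour to a point on the unit circle so that 23h and 0h become neighbors
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
encoded = np.column_stack([hour_sin, hour_cos])

def dist(i, j):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encoded[i] - encoded[j]))

print("23h vs 0h:", round(dist(0, 1), 3))
print("0h vs 12h:", round(dist(1, 2), 3))
```

The two resulting columns would replace the raw hour among the numeric features if this encoding were adopted.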

6.2 A Combined Anomaly Detection Model

We can combine several detectors to improve performance:

    from sklearn.ensemble import IsolationForest
    from sklearn.svm import OneClassSVM
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.metrics import classification_report

    # Hard voting is computed by hand: scikit-learn's VotingClassifier needs
    # labels at fit time, and these detectors are trained without any
    detectors = {
        "if": IsolationForest(contamination=0.1, random_state=42),
        "ocsvm": OneClassSVM(nu=0.1),
        "lof": LocalOutlierFactor(n_neighbors=20, contamination=0.1),
    }

    # fit_predict returns -1 for outliers and 1 for inliers; map to 1/0 votes
    votes = np.column_stack([
        np.where(detector.fit_predict(X_processed) == -1, 1, 0)
        for detector in detectors.values()
    ])

    # A sample is flagged when at least two of the three detectors agree
    preds = (votes.sum(axis=1) >= 2).astype(int)
    print(classification_report(labels > 0, preds))

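In real projects, I have found that combining several detection algorithms usually performs better than a single model, especially when the data contains different types of anomalies. For example, Isolation Forest is sensitive to point anomalies, while One-Class SVM is better at capturing contextual anomalies. Combining them through voting or stacking can give more robust detection results.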
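
Besides voting on hard labels, detectors can be combined at the score level. The sketch below is a simple rank-averaging scheme, which is an assumption of this example and not full stacking: each detector's raw score is converted to ranks so that the different scales become comparable, and the ranks are averaged. The toy data and all variable names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, size=(300, 2)),   # inliers (indices 0-299)
                    rng.uniform(-8, 8, size=(15, 2))])  # anomalies (indices 300-314)

# Each detector exposes its raw score differently; negate so that higher = more anomalous
if_score = -IsolationForest(random_state=0).fit(X_demo).score_samples(X_demo)
svm_score = -OneClassSVM(nu=0.05, gamma="scale").fit(X_demo).decision_function(X_demo).ravel()
lof_score = -LocalOutlierFactor(n_neighbors=20).fit(X_demo).negative_outlier_factor_

# The scores live on different scales, so average their ranks instead of the raw values
ranks = np.column_stack([rankdata(s) for s in (if_score, svm_score, lof_score)])
ensemble_score = ranks.mean(axis=1)

top_k = np.argsort(ensemble_score)[-15:]  # indices of the 15 highest-ranked points
print("top-ranked indices:", np.sort(top_k))
```

Compared with hard voting, the averaged ranks preserve how strongly each detector objects to a point, so the final threshold can still be tuned afterwards.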
