From Excel to AUC: A Step-by-Step Practical Guide to sklearn.metrics.roc_auc_score for Data Science Beginners

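In data science, evaluating model performance is a skill every practitioner has to master. AUC (Area Under the ROC Curve) is one of the most important metrics for binary classifiers, yet beginners often get stuck on it: the theory makes sense, but it is unclear what to do with real data. This guide walks through that "last mile" step by step, from a table of data in Excel to a final AUC value.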

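1. Understanding the Core Idea of AUC

AUC measures how well a model separates positive from negative samples. Imagine you are a doctor deciding from test results whether a patient is ill: AUC evaluates how reliably your diagnostic score ranks sick patients above healthy ones. The value lies between 0 and 1; 0.5 corresponds to random guessing, so a useful model scores above 0.5. As a rough rule of thumb:

0.5: equivalent to random guessing
0.7-0.8: decent
0.8-0.9: very good
above 0.9: excellent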
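
One way to make this concrete: AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes that directly in plain Python. The function name pairwise_auc and the toy data are illustrative assumptions, not part of scikit-learn.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: two negatives and two positives
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Three of the four positive-negative pairs are ordered correctly
auc = pairwise_auc(labels, scores)
print(f"AUC: {auc:.2f}")
```

On real data roc_auc_score does the same job far more efficiently; the quadratic loop here is only meant to show what the number means.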

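Common misconceptions:

Assuming a higher AUC always means a better model (the business context matters too)
Confusing AUC with accuracy
Ignoring the effect of class imbalance on AUC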
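
To see why AUC and accuracy should not be confused, consider a hypothetical model that gives every sample the same low score on an imbalanced dataset. The sketch below uses made-up data and a hand-written AUC helper (an assumption for illustration): accuracy looks high because most samples are negative, while AUC shows the model cannot rank at all.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of correctly ordered positive-negative pairs (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 9 negatives and 1 positive; the "model" outputs the same score for everyone
labels = [0] * 9 + [1]
scores = [0.1] * 10

# Accuracy at the usual 0.5 threshold: every sample is predicted negative
preds = [int(s > 0.5) for s in scores]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

auc = pairwise_auc(labels, scores)
print(f"Accuracy: {accuracy:.2f}, AUC: {auc:.2f}")
```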

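2. Preparing Your Environment

Before computing anything, make sure the Python environment has the required packages. Creating a dedicated environment with Anaconda is recommended (matplotlib and seaborn are included because later sections plot with them):

    conda create -n auc_env python=3.8
    conda activate auc_env
    pip install pandas scikit-learn openpyxl matplotlib seaborn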

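A typical Excel sheet of model outputs might contain these columns:

id	label	pred_label	pred_score	model_predict_scores
1	0	0	0.15	[0.85, 0.15]
2	1	1	0.92	[0.08, 0.92]

Key points:

label: the true label (0 or 1)
pred_score: the model's predicted probability of the positive class, not the probability of whichever class was predicted
model_predict_scores: the predicted probabilities for both classes, usually stored as a list in the order [negative, positive]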
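
Because exports sometimes store the probability of the predicted class in pred_score instead of the positive-class probability, a quick consistency check against model_predict_scores can catch the mix-up early. The following is a minimal sketch on hand-written rows; the helper name inconsistent_ids and the third row are hypothetical and only illustrate the idea.

```python
import ast

# Hypothetical rows as they might come out of a spreadsheet (lists stored as text)
rows = [
    {"id": 1, "pred_score": 0.15, "model_predict_scores": "[0.85, 0.15]"},
    {"id": 2, "pred_score": 0.92, "model_predict_scores": "[0.08, 0.92]"},
    # This row stores the probability of the predicted (negative) class by mistake
    {"id": 3, "pred_score": 0.70, "model_predict_scores": "[0.70, 0.30]"},
]


def inconsistent_ids(rows, positive_index=1, tol=1e-9):
    """Return ids whose pred_score differs from the positive-class probability."""
    bad = []
    for row in rows:
        probs = ast.literal_eval(row["model_predict_scores"])
        if abs(row["pred_score"] - probs[positive_index]) > tol:
            bad.append(row["id"])
    return bad


print(inconsistent_ids(rows))
```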

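3. From Excel to Python: Reading and Preparing the Data

Reading the Excel file with pandas is the most direct approach:

    import pandas as pd

    # Read the Excel file (openpyxl handles .xlsx)
    df = pd.read_excel('your_data.xlsx', engine='openpyxl')

    # Look at the first few rows
    print(df.head())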

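Solutions to common problems:

String conversion: when the model_predict_scores column is stored as text, convert it to a list of numbers:

    import ast

    # Safely parse strings such as "[0.85, 0.15]" into lists
    df['model_predict_scores'] = df['model_predict_scores'].apply(
        lambda x: ast.literal_eval(x) if isinstance(x, str) else x
    )

Missing values:

    # Count missing values per column
    print(df.isnull().sum())

    # Simple approach: drop rows that contain missing values
    df = df.dropna()

Data type validation:

    # Make sure label is an integer type
    df['label'] = df['label'].astype(int)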

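4. Extracting the Right Inputs

roc_auc_score needs two core inputs:

y_true: the true labels
y_score: the prediction scores (usually the probability of the positive class)

Extracting them correctly:

    # Option 1: use pred_score directly (if it already is the positive-class probability)
    y_true = df['label']
    y_score = df['pred_score']

    # Option 2: take the positive-class probability from model_predict_scores,
    # assuming the positive class is the second element (index 1)
    y_score_from_array = df['model_predict_scores'].apply(lambda x: x[1])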

Important checks:

    # Check the range of y_score
    print(f"Score range: {y_score.min()} - {y_score.max()}")

    # Check the label distribution
    print("Label distribution:")
    print(y_true.value_counts())

5. Computing AUC and Interpreting the Result

Now for the actual computation:

    from sklearn.metrics import roc_auc_score

    # Compute AUC
    auc_score = roc_auc_score(y_true, y_score)
    print(f"AUC: {auc_score:.4f}")

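Guide to interpreting the result:

AUC range	Model quality	Suggested action
0.5-0.6	Barely better than random guessing	Check the data and the model's basic assumptions
0.6-0.7	Marginal	Consider feature engineering or a different model
0.7-0.8	Decent	May be usable in production, with room for improvement
0.8-0.9	Very good	Meets most business needs
0.9-1.0	Excellent	Check for overfitting or data leakage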

Going further:

Plot the ROC curve for a visual view of the result:

    from sklearn.metrics import RocCurveDisplay
    import matplotlib.pyplot as plt

    RocCurveDisplay.from_predictions(y_true, y_score)
    plt.title('ROC curve')
    plt.show()
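
It can help to see where the points on that curve come from. The sketch below, in plain Python with hypothetical helper names (roc_points, trapezoid_area), lowers the threshold one distinct score at a time, records the false positive rate and true positive rate at each step, and integrates the resulting curve with the trapezoid rule. It is a teaching aid under those assumptions, not how scikit-learn is implemented internally.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold score by score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples that share a score cross the threshold together
        while i < len(pairs) and pairs[i][0] == threshold:
            tp += pairs[i][1]
            fp += 1 - pairs[i][1]
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(labels, scores)
area = trapezoid_area(points)
print(points)
print(f"Area under the curve: {area:.2f}")
```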

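Compute a confidence interval (using bootstrapping):

    import numpy as np

    rng = np.random.default_rng(42)
    n_bootstraps = 1000
    auc_scores = []
    for _ in range(n_bootstraps):
        # Resample row positions with replacement
        indices = rng.integers(0, len(y_true), len(y_true))
        if y_true.iloc[indices].nunique() < 2:
            continue  # AUC is undefined when a resample contains only one class
        auc_scores.append(roc_auc_score(y_true.iloc[indices], y_score.iloc[indices]))

    lower, upper = np.percentile(auc_scores, [2.5, 97.5])
    print(f"AUC 95% confidence interval: {lower:.3f} - {upper:.3f}")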

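6. Common Pitfalls and Fixes

Pitfall 1: wrong y_score format

Symptom: an abnormal AUC (for example, close to 0 or 1)
Fix: make sure y_score holds positive-class probabilities, not predicted class labels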
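
Passing hard 0/1 predictions instead of probabilities throws away the ranking among samples on the same side of the threshold, which usually changes the AUC. The sketch below illustrates this on made-up data with a hand-written helper (pairwise_auc is an assumption used to keep the example dependency-free).

```python
def pairwise_auc(labels, scores):
    """AUC as the share of correctly ordered positive-negative pairs (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1, 1]
probabilities = [0.2, 0.6, 0.55, 0.7, 0.9]

# Thresholding at 0.5 collapses the scores into class labels
hard_labels = [int(p > 0.5) for p in probabilities]

auc_from_probabilities = pairwise_auc(labels, probabilities)
auc_from_hard_labels = pairwise_auc(labels, hard_labels)
print(f"AUC from probabilities: {auc_from_probabilities:.4f}")
print(f"AUC from hard labels:   {auc_from_hard_labels:.4f}")
```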

Pitfall 2: extremely imbalanced data

Symptom: the AUC looks fine but business results are poor
Fix: also look at the precision-recall curve

    import matplotlib.pyplot as plt
    from sklearn.metrics import precision_recall_curve

    precision, recall, _ = precision_recall_curve(y_true, y_score)
    plt.plot(recall, precision)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('PR curve')
    plt.show()

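Pitfall 3: misuse on multiclass problems

Symptom: a ValueError such as "multiclass format is not supported" or "multi_class must be in ('ovo', 'ovr')", depending on the scikit-learn version
Fix: for multiclass problems, pass the per-class probability matrix as y_score and set the multi_class parameter ('ovr' or 'ovo')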

Pitfall 4: data leakage

Symptom: a suspiciously high AUC
Fix: make sure the training set and test set are strictly separated

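7. A Worked Example: From Raw Data to a Complete Analysis

Let us consolidate the steps with a complete example. Suppose we have the output of an e-commerce customer churn model:

Data overview:

    df = pd.read_excel('customer_churn.xlsx')
    print(f"Shape: {df.shape}")
    print(df.sample(3))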

Data cleaning:

    # Drop rows with missing label or score
    df = df.dropna(subset=['label', 'pred_score'])

    # Convert the label to integers
    df['label'] = df['label'].astype(int)

    # Remove out-of-range scores
    df = df[(df['pred_score'] >= 0) & (df['pred_score'] <= 1)]

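Feature engineering (optional):

    import numpy as np

    # Add feature transformations here if needed, for example a log transform.
    # Note: a monotonic transform of the score itself does not change the AUC.
    df['log_pred_score'] = np.log(df['pred_score'] + 1e-6)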
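
A log transform is strictly increasing, so applying it to a prediction score leaves the ordering of the samples unchanged, and AUC depends only on that ordering. The short stdlib sketch below checks the ordering claim on made-up scores; the variable names are illustrative.

```python
import math

scores = [0.1, 0.4, 0.35, 0.8, 0.05]
log_scores = [math.log(s + 1e-6) for s in scores]

# Sample indices sorted from lowest to highest score, before and after the transform
order_before = sorted(range(len(scores)), key=lambda i: scores[i])
order_after = sorted(range(len(log_scores)), key=lambda i: log_scores[i])

# Identical rankings imply identical AUC for any set of labels
same_ranking = order_before == order_after
print(order_before, order_after, same_ranking)
```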

Model evaluation:

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import (
        roc_auc_score, accuracy_score,
        confusion_matrix, classification_report
    )

    y_true = df['label']
    y_score = df['pred_score']
    y_pred = (y_score > 0.5).astype(int)  # hard labels at the default 0.5 threshold

    # Compute the metrics
    print(f"AUC: {roc_auc_score(y_true, y_score):.4f}")
    print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
    print("\nClassification report:")
    print(classification_report(y_true, y_pred))

    # Plot the confusion matrix
    sns.heatmap(confusion_matrix(y_true, y_pred), annot=True, fmt='d')
    plt.title('Confusion matrix')
    plt.show()

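Business interpretation:
- Use the AUC to judge how well the model separates the classes
- Use the confusion matrix to analyze the cost of misclassification
- Choose the best probability threshold (not necessarily 0.5)

    # Find the threshold that maximizes F1
    from sklearn.metrics import f1_score

    thresholds = np.linspace(0, 1, 100)
    f1_scores = [
        f1_score(y_true, (y_score > t).astype(int), zero_division=0)
        for t in thresholds
    ]
    best_threshold = thresholds[np.argmax(f1_scores)]
    print(f"Best F1 threshold: {best_threshold:.3f}")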

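8. Performance Optimization and Advanced Techniques

Computing AUC can become slow on very large datasets. Some suggestions:

Use a more efficient data structure:

    # Convert pandas Series to NumPy arrays
    y_true_np = y_true.values
    y_score_np = y_score.values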

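Parallel computation (useful for bootstrapping on large datasets):

    from joblib import Parallel, delayed

    def bootstrap_auc(y_true, y_score, n_iter=100, seed=42):
        def _one_iter(iter_seed):
            # Each iteration gets its own generator so workers do not share random state
            rng = np.random.default_rng(iter_seed)
            indices = rng.integers(0, len(y_true), len(y_true))
            return roc_auc_score(y_true[indices], y_score[indices])

        return Parallel(n_jobs=-1)(
            delayed(_one_iter)(seed + i) for i in range(n_iter)
        )

    auc_dist = bootstrap_auc(y_true_np, y_score_np)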

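Chunked computation (streaming scenarios). Note that the mean of per-chunk AUCs only approximates the AUC of the full dataset, because positive-negative pairs that span two chunks are never compared:

    def incremental_auc(y_true, y_score, chunk_size=1000):
        aucs = []
        for i in range(0, len(y_true), chunk_size):
            chunk_true = y_true[i:i+chunk_size]
            chunk_score = y_score[i:i+chunk_size]
            if len(np.unique(chunk_true)) < 2:
                continue  # AUC is undefined for a single-class chunk
            aucs.append(roc_auc_score(chunk_true, chunk_score))
        return np.mean(aucs)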
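
Averaging AUC over chunks is not the same as computing AUC over all the data at once. The toy example below, using a hand-written helper (pairwise_auc, an illustrative assumption) and made-up data, shows two chunks that are each perfectly ranked while the combined data is not, because scores drift between chunks.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of correctly ordered positive-negative pairs (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Two chunks; scores in the second chunk are shifted upward as a whole
chunk_1 = ([0, 1], [0.1, 0.2])
chunk_2 = ([0, 1], [0.8, 0.9])

mean_chunk_auc = (pairwise_auc(*chunk_1) + pairwise_auc(*chunk_2)) / 2

all_labels = chunk_1[0] + chunk_2[0]
all_scores = chunk_1[1] + chunk_2[1]
global_auc = pairwise_auc(all_labels, all_scores)

print(f"Mean of chunk AUCs: {mean_chunk_auc:.2f}")
print(f"AUC on all data:    {global_auc:.2f}")
```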

GPU acceleration (using cuML, which requires a RAPIDS installation and a supported GPU):

    import cupy as cp
    from cuml.metrics import roc_auc_score as gpu_roc_auc_score

    # Move the data to the GPU
    y_true_gpu = cp.asarray(y_true_np)
    y_score_gpu = cp.asarray(y_score_np)

    gpu_auc = gpu_roc_auc_score(y_true_gpu, y_score_gpu)

9. Wrapping the Workflow in a Reusable Function

For convenience later on, the whole workflow can be wrapped in a function:

    import pandas as pd
    from sklearn.metrics import roc_auc_score

    def calculate_auc_from_excel(file_path, sheet_name=0, true_col='label',
                                 score_col='pred_score', needs_cleaning=True):
        """
        Full pipeline for computing AUC from an Excel file.

        Parameters:
            file_path: path to the Excel file
            sheet_name: worksheet name or index
            true_col: name of the true-label column
            score_col: name of the prediction-score column
            needs_cleaning: whether to clean the data first

        Returns:
            auc_score: the computed AUC
            df: the processed DataFrame
        """
        # Read the data
        try:
            df = pd.read_excel(file_path, sheet_name=sheet_name, engine='openpyxl')
        except Exception as e:
            raise ValueError(f"Failed to read Excel file: {e}") from e

        # Make sure the required columns exist (checked even without cleaning)
        required_cols = [true_col, score_col]
        missing_cols = [col for col in required_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")

        if needs_cleaning:
            # Drop rows with missing values
            df = df.dropna(subset=required_cols)

            # Convert data types; unparseable scores become NaN and are dropped
            df[true_col] = df[true_col].astype(int)
            df[score_col] = pd.to_numeric(df[score_col], errors='coerce')
            df = df.dropna(subset=[score_col])

            # Remove scores outside [0, 1]
            df = df[(df[score_col] >= 0) & (df[score_col] <= 1)]

        # Compute AUC
        try:
            auc_score = roc_auc_score(df[true_col], df[score_col])
        except ValueError as e:
            raise ValueError(f"AUC computation failed: {e}") from e

        return auc_score, df

Usage example:

    # Compute AUC with the function
    try:
        auc, processed_df = calculate_auc_from_excel(
            'customer_churn.xlsx',
            true_col='churn_label',
            score_col='churn_probability'
        )
        print(f"Done, AUC: {auc:.4f}")
        print(f"Rows after processing: {len(processed_df)}")
    except Exception as e:
        print(f"Processing failed: {e}")

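10. Extensions: AUC Variants for Other Scenarios

Although this guide focuses on binary AUC, the idea extends to other settings:

AUC for multiclass problems:

    # One-vs-rest (OvR) strategy
    from sklearn.preprocessing import label_binarize

    # Assumes y_true holds labels from 3 classes and y_scores is an
    # (n_samples, 3) array of per-class probabilities
    y_true_multiclass = label_binarize(y_true, classes=[0, 1, 2])

    auc_scores = []
    for i in range(3):
        auc = roc_auc_score(y_true_multiclass[:, i], y_scores[:, i])
        auc_scores.append(auc)

    print(f"Per-class AUC: {auc_scores}")
    print(f"Macro-average AUC: {np.mean(auc_scores):.4f}")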

AUC for ranking problems (the fraction of positive-negative pairs ranked correctly):

    def rank_auc(y_true, y_score):
        pos_scores = y_score[y_true == 1]
        neg_scores = y_score[y_true == 0]

        correct = 0.0
        total = 0
        for p in pos_scores:
            for n in neg_scores:
                total += 1
                if p > n:
                    correct += 1
                elif p == n:
                    correct += 0.5  # count ties as half, matching roc_auc_score

        return correct / total if total > 0 else 0.5

    print(f"Ranking AUC: {rank_auc(y_true, y_score):.4f}")

AUC for time-series predictions:

    # For time-ordered data it can be useful to track AUC per time window
    def time_aware_auc(y_true, y_score, timestamps, time_window='30D'):
        df = pd.DataFrame({
            'true': y_true,
            'score': y_score,
            'time': timestamps
        })

        # Compute AUC for each time window
        window_aucs = []
        for _, group in df.groupby(pd.Grouper(key='time', freq=time_window)):
            # Require enough samples and both classes, otherwise AUC is undefined
            if len(group) > 10 and group['true'].nunique() == 2:
                auc = roc_auc_score(group['true'], group['score'])
                window_aucs.append(auc)

        return window_aucs

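11. Comparison with Other Metrics

AUC is useful, but decisions should rest on several metrics together:

Metric	Strengths	Weaknesses	Typical use
AUC	Threshold-independent, summarizes overall ranking quality	Can look overly optimistic on extremely imbalanced data	Model selection, early evaluation
Accuracy	Intuitive	Misleading on imbalanced data	Balanced datasets
F1 score	Balances precision and recall	Depends on the chosen threshold	When both error types matter
Precision	Focuses on how reliable positive predictions are	Ignores how negatives are predicted	When false positives are costly
Recall	Focuses on finding all positives	May come at the cost of more false positives	When false negatives are costly
Area under the PR curve	More sensitive on imbalanced data	Somewhat harder to interpret	Highly imbalanced data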

Suggested combination:

    from sklearn.metrics import (
        roc_auc_score, accuracy_score, f1_score,
        precision_score, recall_score, average_precision_score
    )

    y_pred = y_score > 0.5  # hard predictions at the 0.5 threshold

    metrics = {
        'AUC': roc_auc_score(y_true, y_score),
        'Accuracy': accuracy_score(y_true, y_pred),
        'F1': f1_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'AP': average_precision_score(y_true, y_score)  # average precision
    }

    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")

12. Automated Report Generation

To make results easy to share and archive, the evaluation report can be generated automatically:

    from contextlib import redirect_stdout
    from io import StringIO

    import matplotlib.pyplot as plt
    import pandas as pd
    from matplotlib.backends.backend_pdf import PdfPages
    from sklearn.metrics import (
        PrecisionRecallDisplay, RocCurveDisplay, accuracy_score,
        f1_score, precision_score, recall_score, roc_auc_score
    )

    def generate_auc_report(y_true, y_score, output_path='auc_report.pdf'):
        y_pred = y_score > 0.5

        # Capture the text part of the report
        report_buffer = StringIO()
        with redirect_stdout(report_buffer):
            print("Model evaluation report\n" + "=" * 50)
            print(f"\nEvaluated at: {pd.Timestamp.now()}")
            print(f"\nSamples: {len(y_true)}")
            print(f"Positive rate: {y_true.mean():.2%}")
            print("\nKey metrics:")
            metrics = {
                'AUC': roc_auc_score(y_true, y_score),
                'Accuracy': accuracy_score(y_true, y_pred),
                'F1': f1_score(y_true, y_pred),
                'Precision': precision_score(y_true, y_pred),
                'Recall': recall_score(y_true, y_pred)
            }
            for name, value in metrics.items():
                print(f"- {name}: {value:.4f}")
        report_text = report_buffer.getvalue()

        # Write charts and text into ONE PdfPages object; opening the same
        # path a second time would overwrite the pages written first
        with PdfPages(output_path) as pdf:
            # ROC curve
            disp = RocCurveDisplay.from_predictions(y_true, y_score)
            disp.ax_.set_title('ROC curve')
            pdf.savefig(disp.figure_)
            plt.close(disp.figure_)

            # Precision-recall curve
            disp = PrecisionRecallDisplay.from_predictions(y_true, y_score)
            disp.ax_.set_title('PR curve')
            pdf.savefig(disp.figure_)
            plt.close(disp.figure_)

            # Score distribution by class
            fig, ax = plt.subplots()
            ax.hist(y_score[y_true == 0], bins=30, alpha=0.5, label='Negative')
            ax.hist(y_score[y_true == 1], bins=30, alpha=0.5, label='Positive')
            ax.legend()
            ax.set_title('Score distribution')
            pdf.savefig(fig)
            plt.close(fig)

            # Text summary page
            fig = plt.figure(figsize=(8, 11))
            fig.text(0.1, 0.9, report_text, ha='left', va='top', fontsize=8)
            pdf.savefig(fig)
            plt.close(fig)

        print(f"Report saved to: {output_path}")

    # Usage example
    generate_auc_report(y_true, y_score)

13. Best Practices in Production

When integrating AUC computation into a production system, keep the following in mind:

Logging:

    import logging

    logging.basicConfig(
        filename='model_monitoring.log',
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )

    def log_auc(y_true, y_score, model_name='default'):
        auc = roc_auc_score(y_true, y_score)
        logging.info(
            f"Model {model_name} evaluation - AUC: {auc:.4f} "
            f"samples: {len(y_true)} positive rate: {y_true.mean():.2%}"
        )
        return auc

Monitoring changes in AUC:

    from datetime import datetime

    class AUCMonitor:
        def __init__(self, window_size=7):
            self.results = []
            self.window_size = window_size

        def add_result(self, auc, timestamp=None):
            if timestamp is None:
                timestamp = datetime.now()
            self.results.append((timestamp, auc))
            # Keep only the most recent window_size results
            self.results = self.results[-self.window_size:]

        def check_degradation(self, threshold=0.05):
            """True if AUC fell by more than `threshold` (absolute) since the previous result."""
            if len(self.results) < 2:
                return False
            current = self.results[-1][1]
            previous = self.results[-2][1]
            return (previous - current) > threshold

    # Usage example
    monitor = AUCMonitor()
    monitor.add_result(0.85)
    monitor.add_result(0.82)  # a drop of 0.03, below the 0.05 threshold
    if monitor.check_degradation():
        print("Warning: AUC dropped by more than 0.05!")

Automated alerting:

    import smtplib
    from email.mime.text import MIMEText

    def send_alert(subject, message, to_emails):
        msg = MIMEText(message)
        msg['Subject'] = subject
        msg['From'] = 'monitoring@yourcompany.com'
        msg['To'] = ', '.join(to_emails)

        with smtplib.SMTP('smtp.yourcompany.com') as server:
            server.send_message(msg)

    # Combine with the monitor
    if monitor.check_degradation(threshold=0.1):
        send_alert(
            "Model performance alert",
            f"Model AUC dropped by more than 0.10, current value: {monitor.results[-1][1]:.4f}",
            ['data_team@yourcompany.com']
        )

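14. Debugging: When AUC Is Not What You Expect

When the AUC looks wrong, work through these steps:

Validate the inputs:
- Check that y_true contains only 0 and 1, and that both classes are present
- Confirm that y_score is numeric and that higher values mean "more likely positive" (probabilities should fall in [0, 1], although roc_auc_score itself only uses the ordering)
- Make sure the two arrays have the same length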
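
The checks above can be bundled into a small helper that returns a list of problems instead of failing on the first one. This is a minimal stdlib sketch; the function name validate_auc_inputs and the messages are assumptions for illustration.

```python
def validate_auc_inputs(y_true, y_score):
    """Return a list of human-readable problems; an empty list means the inputs look fine."""
    problems = []
    if len(y_true) != len(y_score):
        problems.append(f"length mismatch: {len(y_true)} labels vs {len(y_score)} scores")
    unexpected = set(y_true) - {0, 1}
    if unexpected:
        problems.append(f"unexpected label values: {sorted(unexpected)}")
    if len(set(y_true)) < 2:
        problems.append("only one class present, AUC is undefined")
    # NaN is the only float value that is not equal to itself
    if any(s is None or s != s for s in y_score):
        problems.append("missing or NaN scores")
    return problems


ok = validate_auc_inputs([0, 1, 1], [0.2, 0.7, 0.9])
bad = validate_auc_inputs([1, 1], [0.3])
print(ok)
print(bad)
```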

Inspect the distributions:

    print("True label distribution:")
    print(pd.Series(y_true).value_counts())
    print("\nPrediction score summary:")
    print(pd.Series(y_score).describe())

    plt.figure(figsize=(10, 4))

    plt.subplot(1, 2, 1)
    plt.hist(y_score[y_true == 0], bins=30, alpha=0.5, label='Negative')
    plt.hist(y_score[y_true == 1], bins=30, alpha=0.5, label='Positive')
    plt.legend()
    plt.title('Score distribution')

    # Pass the axes explicitly; otherwise from_predictions opens a new figure
    ax = plt.subplot(1, 2, 2)
    RocCurveDisplay.from_predictions(y_true, y_score, ax=ax)
    ax.set_title('ROC curve')

    plt.tight_layout()
    plt.show()

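Fixes for common problems:
Problem 1: AUC = 0.5
- Likely cause: the model has not learned any pattern
- Fix: review feature engineering and the training process
Problem 2: AUC < 0.5
- Likely cause: the labels or scores are inverted
- Fix: check which class is coded as 1 and which probability column was used; inverting the scores with y_score = 1 - y_score confirms the diagnosis
Problem 3: AUC = 1.0
- Likely cause: data leakage, or overlap between the evaluation and training sets
- Fix: review the data splitting process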
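
For Problem 2, inverting the scores reverses every pairwise comparison, so the AUC of 1 - y_score is exactly one minus the original AUC. The sketch below checks this on made-up data with a hand-written helper (pairwise_auc is an illustrative assumption).

```python
def pairwise_auc(labels, scores):
    """AUC as the share of correctly ordered positive-negative pairs (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Scores that mostly rank positives BELOW negatives
labels = [1, 1, 0, 0]
scores = [0.2, 0.4, 0.3, 0.9]

auc_original = pairwise_auc(labels, scores)
auc_inverted = pairwise_auc(labels, [1 - s for s in scores])
print(f"Original AUC: {auc_original:.2f}, inverted AUC: {auc_inverted:.2f}")
```

Inverting is a diagnostic more than a cure: if it helps, the underlying label coding or score column should be fixed at the source.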

Verify with cross-validation:

    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    # Assumes X is the feature matrix
    model = RandomForestClassifier()
    cv_auc = cross_val_score(model, X, y_true, scoring='roc_auc', cv=5)
    print(f"Cross-validated AUC: {cv_auc.mean():.4f} (±{cv_auc.std():.4f})")

15. From Theory to Practice: A Complete Workflow Example

Let us walk through a simulated customer churn case that covers everything from data preparation to AUC computation:

Generate simulated data:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification

    # Generate simulated data
    X, y = make_classification(
        n_samples=1000,
        n_features=10,
        n_classes=2,
        weights=[0.9, 0.1],  # imbalanced data
        random_state=42
    )

    # Simulate model-predicted probabilities
    np.random.seed(42)
    y_score = np.random.rand(len(y)) * 0.3           # baseline random scores
    y_score[y == 1] += np.random.rand(y.sum()) * 0.7  # positives get higher scores

    # Build a DataFrame and save it to Excel
    df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
    df['label'] = y
    df['pred_score'] = y_score
    df.to_excel('churn_simulation.xlsx', index=False)

Full analysis workflow:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # 1. Read the data
    df = pd.read_excel('churn_simulation.xlsx')

    # 2. Inspect the data
    print(f"Shape: {df.shape}")
    print(f"Label distribution:\n{df['label'].value_counts()}")

    # 3. Compute AUC of the stored scores
    auc = roc_auc_score(df['label'], df['pred_score'])
    print(f"Initial AUC: {auc:.4f}")

    # 4. Train a model (for comparison)
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop(['label', 'pred_score'], axis=1),
        df['label'],
        test_size=0.3,
        random_state=42
    )
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)

    # 5. Evaluate the model on the held-out test set
    test_probs = model.predict_proba(X_test)[:, 1]
    model_auc = roc_auc_score(y_test, test_probs)
    print(f"Model AUC: {model_auc:.4f}")

    # 6. Feature importance
    importances = pd.DataFrame({
        'feature': X_train.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    print("\nFeature importance:")
    print(importances)

    # 7. Save predictions (these include training rows, so do not
    #    evaluate the model on this column)
    df['model_pred_score'] = model.predict_proba(
        df.drop(['label', 'pred_score'], axis=1)
    )[:, 1]
    df.to_excel('churn_with_model_predictions.xlsx', index=False)

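Applying the result to business decisions:

    # Pick a threshold using precision and recall
    from sklearn.metrics import precision_recall_curve

    precisions, recalls, thresholds = precision_recall_curve(y_test, test_probs)

    # precisions/recalls have one more entry than thresholds, so drop the last
    # point; the np.maximum guard avoids 0/0 when both are zero
    p, r = precisions[:-1], recalls[:-1]
    f1_scores = 2 * p * r / np.maximum(p + r, 1e-12)
    best_idx = np.argmax(f1_scores)
    best_threshold = thresholds[best_idx]

    print(f"Best F1 threshold: {best_threshold:.4f}")
    print(f"Precision at that threshold: {p[best_idx]:.4f}")
    print(f"Recall at that threshold: {r[best_idx]:.4f}")

    # Apply the threshold to a business decision
    def decision_cost(score, label, threshold=best_threshold, fp_cost=10, fn_cost=100):
        """Cost of one decision: fp_cost for a false alarm, fn_cost for a missed churner."""
        predicted_churn = score >= threshold
        if predicted_churn and label == 0:
            return fp_cost  # retention offer sent to a customer who would have stayed
        if not predicted_churn and label == 1:
            return fn_cost  # churner we failed to flag
        return 0            # correct decision

    # Evaluate the business cost
    costs = [decision_cost(s, y) for s, y in zip(test_probs, y_test)]
    print(f"Average cost per decision: {np.mean(costs):.2f}")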

16. Integration with Other Tools

AUC computation fits easily alongside other data science tools:

MLflow (model tracking):

    import mlflow
    import mlflow.sklearn

    with mlflow.start_run():
        # Log parameters
        mlflow.log_param("model_type", "RandomForest")

        # Log metrics
        mlflow.log_metric("AUC", model_auc)
        mlflow.log_metric("F1", f1_score(y_test, test_probs > best_threshold))

        # Log the ROC chart
        disp = RocCurveDisplay.from_predictions(y_test, test_probs)
        disp.ax_.set_title('ROC curve')
        mlflow.log_figure(disp.figure_, "roc_curve.png")
        plt.close(disp.figure_)

        # Log the model
        mlflow.sklearn.log_model(model, "model")

Airflow (automated monitoring):

    from datetime import datetime

    from airflow import DAG
    # Airflow 2 import path; older releases used airflow.operators.python_operator
    from airflow.operators.python import PythonOperator

    def monitor_auc():
        # Put the AUC computation and monitoring logic here
        pass

    dag = DAG(
        'daily_auc_monitoring',
        schedule_interval='@daily',
        start_date=datetime(2023, 1, 1)
    )

    monitor_task = PythonOperator(
        task_id='monitor_auc',
        python_callable=monitor_auc,
        dag=dag
    )

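FastAPI (real-time service):

    from typing import List

    from fastapi import FastAPI
    from pydantic import BaseModel
    from sklearn.metrics import roc_auc_score

    app = FastAPI()

    class PredictionInput(BaseModel):
        features: List[float]

    class EvaluationInput(BaseModel):
        y_true: List[int]
        y_score: List[float]

    @app.post("/predict")
    async def predict(payload: PredictionInput):
        # In a real service, `model` is a pre-trained classifier loaded at startup
        score = model.predict_proba([payload.features])[0, 1]
        return {"prediction_score": float(score)}  # convert the NumPy float for JSON

    @app.post("/evaluate")
    async def evaluate(payload: EvaluationInput):
        auc = roc_auc_score(payload.y_true, payload.y_score)
        return {"auc": float(auc)}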

17. Better Visuals: Interactive AUC Analysis

Create interactive charts with Plotly:

    import plotly.express as px
    from sklearn.metrics import roc_curve

    # Compute the ROC curve data
    fpr, tpr, _ = roc_curve(y_true, y_score)

    # Build an interactive ROC curve
    fig = px.line(
        x=fpr, y=tpr,
        title='ROC curve (AUC = {:.3f})'.format(roc_auc_score(y_true, y_score)),
        labels={'x': 'False Positive Rate', 'y': 'True Positive Rate'}
    )

    # Add the diagonal reference line
    fig.add_shape(
        type='line', line=dict(dash='dash'),
        x0=0, x1=1, y0=0, y1=1
    )
    fig.show()

    # Histogram of scores by true label
    fig2 = px.histogram(
        x=y_score, color=y_true.astype(str),
        nbins=50, barmode='overlay',
        title='Score distribution',
        labels={'x': 'Prediction score', 'color': 'True label'}
    )
    fig2.show()

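Advanced visualization techniques:

Threshold slider:

    import seaborn as sns
    from ipywidgets import interact
    from sklearn.metrics import confusion_matrix, roc_curve

    fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
    auc = roc_auc_score(y_true, y_score)

    def plot_at_threshold(threshold):
        plt.figure(figsize=(10, 4))

        # ROC curve, marking the operating point closest to the chosen score threshold
        plt.subplot(1, 2, 1)
        plt.plot(fpr, tpr, label='ROC curve')
        plt.plot([0, 1], [0, 1], 'k--')
        idx = np.argmin(np.abs(roc_thresholds - threshold))
        plt.scatter(fpr[idx], tpr[idx], c='red', s=100)
        plt.title(f'ROC curve (AUC={auc:.3f})')
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')

        # Confusion matrix at this threshold
        plt.subplot(1, 2, 2)
        pred = y_score > threshold
        cm = confusion_matrix(y_true, pred)
        sns.heatmap(cm, annot=True, fmt='d')
        plt.title(f'Confusion matrix at threshold={threshold:.2f}')

        plt.tight_layout()
        plt.show()

    interact(plot_at_threshold, threshold=(0.0, 1.0, 0.05))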

Comparing multiple models:

    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    models = {
        'Random Forest': RandomForestClassifier(random_state=42),
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'SVM': SVC(probability=True)
    }

    plt.figure(figsize=(8, 6))
    for name, model in models.items():
        model.fit(X_train, y_train)
        # Prefer probabilities; fall back to decision_function scores
        if hasattr(model, 'predict_proba'):
            probs = model.predict_proba(X_test)[:, 1]
        else:
            probs = model.decision_function(X_test)

        fpr, tpr, _ = roc_curve(y_test, probs)
        auc = roc_auc_score(y_test, probs)
        plt.plot(fpr, tpr, label=f'{name} (AUC={auc:.3f})')

    plt.plot([0, 1], [0, 1], '
