Crystal Clear! Oversampling vs. Undersampling!!

Hello, everyone~

Today let's talk about oversampling and undersampling.

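In a binary classification problem, let the training samples be $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$.

If the class distribution is imbalanced (for example, the positive class makes up only a small fraction), standard training can bias the model heavily toward the majority class, so it performs poorly on the minority class (low recall, many missed positives).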

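We usually train the model under the empirical risk minimization (ERM) framework:

$$\hat{f} = \arg\min_{f} \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)$$

where $f$ is the model and $\ell$ is the loss function (log loss, hinge loss, and so on).

Under class imbalance, the two classes contribute unequally to this empirical risk, so the model tends to optimize the majority-class error rate.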
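To make the unequal contribution concrete, the mean loss can be split by class as $\hat{R} = \frac{n_0}{n}\hat{R}_0 + \frac{n_1}{n}\hat{R}_1$. The sketch below checks this on a made-up example; the helper name `risk_by_class` and the 90/10 toy data are assumptions for illustration only.

```python
def risk_by_class(losses, labels):
    """Split the mean loss into per-class parts: R = sum_c (n_c / n) * R_c."""
    n = len(losses)
    parts = {}
    for c in sorted(set(labels)):
        class_losses = [loss for loss, y in zip(losses, labels) if y == c]
        parts[c] = {
            "share": len(class_losses) / n,                      # n_c / n
            "mean_loss": sum(class_losses) / len(class_losses),  # R_c
        }
    return parts


# Hypothetical 0-1 losses of a classifier that always predicts the majority class 0
labels = [0] * 90 + [1] * 10
losses = [0.0] * 90 + [1.0] * 10

parts = risk_by_class(losses, labels)
overall = sum(losses) / len(losses)
recombined = sum(p["share"] * p["mean_loss"] for p in parts.values())

print(parts)
print(overall, recombined)
```

The overall risk looks small even though every minority sample is misclassified, because the minority class only carries a 10% share of the average.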

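Starting from Bayes decision theory, the optimal classifier minimizes the expected cost.

Let the cost matrix (rows: true class, columns: predicted class) be

$$C = \begin{pmatrix} 0 & C_{\mathrm{FP}} \\ C_{\mathrm{FN}} & 0 \end{pmatrix}$$

The cost of wrongly assigning $y=0$ to class 1 is $C_{\mathrm{FP}}$ (false positive cost);
the cost of wrongly assigning $y=1$ to class 0 is $C_{\mathrm{FN}}$ (false negative cost).

The Bayes decision rule (using posterior probabilities) is:

Predict 1 when the following holds:

$$C_{\mathrm{FN}}\, P(y=1 \mid x) > C_{\mathrm{FP}}\, P(y=0 \mid x) \iff P(y=1 \mid x) > \frac{C_{\mathrm{FP}}}{C_{\mathrm{FP}} + C_{\mathrm{FN}}}$$

Using Bayes' formula, $P(y=c \mid x) = \dfrac{p(x \mid y=c)\,\pi_c}{p(x)}$ with class priors $\pi_c = P(y=c)$, we get:

$$\frac{p(x \mid y=1)}{p(x \mid y=0)} > \frac{C_{\mathrm{FP}}\,\pi_0}{C_{\mathrm{FN}}\,\pi_1}$$

This means the priors and the cost ratio both directly affect the optimal decision threshold.

In imbalanced data, $\pi_1$ is small, so a directly trained model that approximates the posterior will lean toward predicting $y=0$, which lowers minority-class recall.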
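As a small numeric illustration of how costs and priors move the threshold, here is a sketch of the two threshold forms of the Bayes rule. The function names and the cost and prior values are hypothetical, chosen only to show the effect.

```python
def posterior_threshold(c_fp, c_fn):
    """Predict 1 when P(y=1|x) exceeds this value."""
    return c_fp / (c_fp + c_fn)


def likelihood_ratio_threshold(c_fp, c_fn, pi1):
    """Predict 1 when p(x|y=1) / p(x|y=0) exceeds this value (pi1 = P(y=1))."""
    return (c_fp * (1.0 - pi1)) / (c_fn * pi1)


# Equal costs give the usual 0.5 cut on the posterior
t_equal = posterior_threshold(c_fp=1.0, c_fn=1.0)
# A false negative that costs 9x a false positive lowers the cut
t_fn_costly = posterior_threshold(c_fp=1.0, c_fn=9.0)

# With equal costs, a rare positive class demands much stronger evidence
lr_balanced = likelihood_ratio_threshold(1.0, 1.0, pi1=0.5)
lr_imbalanced = likelihood_ratio_threshold(1.0, 1.0, pi1=0.1)

print(t_equal, t_fn_costly, lr_balanced, lr_imbalanced)
```

With a 10% positive prior and equal costs, the likelihood ratio has to exceed roughly 9 before class 1 is predicted, which is exactly the pull toward the majority class described above.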

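Oversampling

Oversampling mitigates imbalance by increasing the effective weight of minority-class samples.

Common implementations include:

Random oversampling: simply duplicate minority-class samples;
SMOTE: generate synthetic samples by interpolating between a minority sample and its nearest neighbors;
ADASYN: adaptively generate minority samples, focusing on hard regions;
Borderline-SMOTE and other variants: generate samples near the decision boundary to strengthen boundary learning.

The core idea is to give the minority class more weight in the empirical risk, which changes the direction of learning and the resulting boundary.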

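The equivalent risk-weighting view:

Oversampling (duplicating minority samples) is equivalent to using sample weights in many convex models.

Rewrite the empirical risk in weighted form:

$$\hat{R}_w(f) = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i\, \ell\big(f(x_i), y_i\big)$$

Setting the weight of every minority sample to $w_i = r$ (equivalent to having each minority sample appear $r$ times in the training set) while keeping majority weights at 1 raises the minority class's influence on the optimization.

This weighting has the same objective as random oversampling, but random oversampling is more exposed to higher model variance and overfitting (especially with tree models and KNN).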
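The equivalence between duplication and weighting can be checked numerically for any fixed set of predictions. The sketch below uses log loss; the predicted probabilities and the replication factor `r = 3` are made up for illustration.

```python
import math


def log_loss(p, y):
    """Log loss of one sample, where p is the predicted P(y=1|x)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def weighted_risk(probs, labels, weights):
    """Normalized weighted empirical risk."""
    total = sum(weights)
    return sum(w * log_loss(p, y) for p, y, w in zip(probs, labels, weights)) / total


labels = [0, 0, 0, 0, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.6]  # hypothetical model outputs
r = 3                              # replication factor for the minority class

# View 1: keep the data as-is and weight minority samples by r
weights = [r if y == 1 else 1 for y in labels]
risk_weighted = weighted_risk(probs, labels, weights)

# View 2: physically duplicate each minority sample so it appears r times
dup_probs, dup_labels = [], []
for p, y in zip(probs, labels):
    copies = r if y == 1 else 1
    dup_probs += [p] * copies
    dup_labels += [y] * copies
risk_duplicated = weighted_risk(dup_probs, dup_labels, [1] * len(dup_labels))

print(risk_weighted, risk_duplicated)
```

Both views give the same objective value, so a deterministic optimizer sees the same problem. The difference in practice comes from randomness in which samples get duplicated and from models that do not reduce to a weighted sum of losses.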

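Equivalence and intercept correction in logistic regression:

Taking log loss as an example, the log-likelihood of the binary logistic model is:

$$\mathcal{L}(\beta_0, \beta) = \sum_{i=1}^{n} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big], \qquad p_i = \sigma(\beta_0 + \beta^\top x_i)$$

Oversampling the minority class (here $y=1$) by a factor of $r$ is equivalent to giving minority samples weight $r$, so the objective becomes:

$$\mathcal{L}_r(\beta_0, \beta) = \sum_{i=1}^{n} \Big[ r\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big]$$

The well-known case-control sampling result states that in logistic regression, as long as the sampling does not change the class-conditional distribution $p(x \mid y)$, the slope estimates $\beta$ are typically unaffected (or only mildly affected), but the intercept $\beta_0$ shifts with the change in prior proportions and needs a post-hoc correction.

The correction formula is:

$$\hat{\beta}_0^{\text{corr}} = \hat{\beta}_0 - \log\frac{\tilde{\pi}_1}{1 - \tilde{\pi}_1} + \log\frac{\pi_1}{1 - \pi_1}$$

where $\tilde{\pi}_1$ is the class prior in the training set (after oversampling) and $\pi_1$ is the true prior of the target deployment environment or test set. This shows that oversampling can change the model's probability calibration: you need to adjust the intercept with the true prior, or change the threshold at decision time.

If oversampling by duplication changes the positive-class proportion from $\pi_1$ to $\tilde{\pi}_1$ (after oversampling) with replication factor $r$, then:

$$\tilde{\pi}_1 = \frac{r\,\pi_1}{r\,\pi_1 + (1 - \pi_1)}$$

Equivalently, the prior odds are multiplied by $r$, so the correction above reduces to subtracting $\log r$ from the fitted intercept. This directly changes the "effective prior" seen during training, and with it the baseline bias the model fits.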
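The two formulas above fit together in a few lines. The following is a minimal sketch, assuming the minority class is $y=1$ and that oversampling is plain duplication; the function names and the fitted intercept value `-0.3` are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def oversampled_prior(pi1, r):
    """Positive-class proportion after duplicating each positive sample r times."""
    return r * pi1 / (r * pi1 + (1.0 - pi1))


def corrected_intercept(b0_hat, pi_train, pi_true):
    """Shift a logistic intercept from the training prior back to the true prior."""
    return b0_hat - logit(pi_train) + logit(pi_true)


pi_true = 0.10                             # true positive rate in deployment
r = 9.0                                    # replication factor
pi_train = oversampled_prior(pi_true, r)   # roughly balanced after oversampling

b0_hat = -0.3                              # hypothetical intercept fitted on oversampled data
b0_corr = corrected_intercept(b0_hat, pi_train, pi_true)

print(pi_train, b0_corr, b0_hat - math.log(r))
```

The corrected intercept matches `b0_hat - log(r)`, which is the "odds multiplied by r" statement written out numerically.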

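SMOTE generation formula:

SMOTE interpolates between a minority sample and one of its nearest neighbors:

$$x_{\text{new}} = x_i + \lambda\,(x_{\text{nn}} - x_i), \qquad \lambda \sim \mathcal{U}(0, 1)$$

where $x_i$ is a minority-class sample and $x_{\text{nn}}$ is one of its $k$ nearest neighbors within the minority class.

SMOTE improves coverage of the minority-class manifold: it generates continuously distributed synthetic points along the segments between neighboring minority samples (including regions near the boundary), which reduces the overfitting caused by exact duplication.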
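The interpolation step can be sketched without any library. This is a simplified, dependency-free illustration of the formula, not the imbalanced-learn implementation: the function name `smote_sketch`, the random choice of base sample, and the toy points are all assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating toward a random k-nearest neighbor.

    Assumes `minority` holds at least two points of equal dimension.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # k nearest minority neighbors of x_i, excluding x_i itself
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x_i, p))
        x_nn = rng.choice(others[:k])
        lam = rng.random()  # lambda drawn from [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x_i, x_nn)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
synthetic = smote_sketch(minority, n_new=20, k=3, seed=42)
print(synthetic[:3])
```

Every synthetic point lies on a segment between two real minority points, so it stays inside the region already spanned by the minority class. That is also why SMOTE can create implausible points when the minority class has several separate clusters or noisy outliers.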

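Focus, strengths, and weaknesses of oversampling:

Focus:

Raise minority-class recall and F1;
Improve how well the decision boundary covers the minority class;
Improve the representation of the minority class while keeping all majority-class data.

Strengths:

Does not discard majority-class information;
SMOTE-style methods can densify samples near the boundary, which helps nonlinear models.

Weaknesses:

Random oversampling can overfit the minority class (repeated samples);
Sample-generating methods need care (they can introduce noise or implausible points);
It changes the training prior, which can hurt probability calibration and calls for later correction or threshold tuning.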

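Undersampling

Undersampling achieves balance by reducing the effective weight of majority-class samples.

Common methods include:

Random undersampling: randomly discard majority-class samples;
NearMiss: keep the majority-class samples that are closest to minority-class samples;
Tomek Links: find pairs of opposite-class samples that are each other's nearest neighbor and remove the majority-class member (or both) to clean up boundary noise;
Editing methods (such as Edited Nearest Neighbours): use local consistency to remove points that are not representative of the majority class.

From the weighted-risk view, undersampling amounts to lowering the weight of majority-class samples or directly reducing their number.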
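Random undersampling is simple enough to write out directly. The sketch below is an illustration under stated assumptions: the function name is made up, and the `ratio` argument follows the convention of a target `n_minority / n_majority` ratio after resampling, which is how imbalanced-learn interprets a float `sampling_strategy`.

```python
import random


def random_undersample(samples, labels, ratio=1.0, minority_label=1, seed=0):
    """Keep all minority samples and a random subset of the majority class.

    After resampling, n_minority / n_majority is approximately `ratio`.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]
    n_keep = min(len(majority_idx), int(len(minority_idx) / ratio))
    kept = sorted(minority_idx + rng.sample(majority_idx, n_keep))
    return [samples[i] for i in kept], [labels[i] for i in kept]


labels = [0] * 90 + [1] * 10
samples = list(range(100))  # stand-in feature values

X_res, y_res = random_undersample(samples, labels, ratio=0.5)
print(len(y_res), y_res.count(0), y_res.count(1))
```

Here 70 of the 90 majority samples are thrown away, which is the information loss the rest of this section is concerned with.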

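Effect of undersampling on variance and the boundary:

The advantages of undersampling are faster training and relief from data-scale problems, and it can put more emphasis on samples near the boundary (when NearMiss, Tomek Links, or similar methods are used). However, reducing the number of majority-class samples increases estimation variance and can lead to underfitting or an unstable boundary, especially when the majority class contains important structure.

From an optimization standpoint, undersampling changes the majority-class distribution in the objective, so the model pays more attention to minority-class errors, but it may also lose the global structure of the majority class (for example, multiple clusters or long-tail patterns) and reduce overall generalization.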

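Focus, strengths, and weaknesses of undersampling:

Focus:

Fast training and lower compute cost;
A more symmetric boundary between the classes;
When the majority class is highly redundant, remove the redundancy and improve minority-class performance.

Strengths:

Simple and efficient, well suited to settings with a huge majority class;
Boundary-cleaning methods (such as Tomek Links) can make the boundary cleaner.

Weaknesses:

Discards majority-class information, increases estimation variance, and may underfit;
If the majority class has complex structure, undersampling hurts overall performance and stability.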

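Oversampling vs. undersampling: core differences and equivalence

In convex models (such as logistic regression), oversampling and undersampling can both be viewed as "sample weight adjustment". In theory both amount to changing the prior or the cost structure (in other words, changing the weights in the empirical risk).

But oversampling tends to preserve the full structure of the majority class, whereas undersampling tends to sacrifice majority-class structure in exchange for balance.

Both can change the "effective prior proportion" of the training set and shift probability calibration. This needs to be corrected afterward using the true prior or an adjusted threshold.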
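For models that output probabilities, the prior-based correction can be applied to the predictions themselves rather than to an intercept. The sketch below assumes resampling changed only the class proportions and left $p(x \mid y)$ intact; the function names and the 0.5 / 0.1 priors are illustrative assumptions.

```python
def correct_probability(p, pi_train, pi_true):
    """Re-express P(y=1|x) estimated under pi_train as a probability under pi_true."""
    pos = p * pi_true / pi_train
    neg = (1.0 - p) * (1.0 - pi_true) / (1.0 - pi_train)
    return pos / (pos + neg)


def equivalent_threshold(pi_train, pi_true, t=0.5):
    """Cut on the uncorrected output that matches a cut t on the corrected one."""
    # The correction is a monotone map, so its inverse is the same map with priors swapped.
    return correct_probability(t, pi_train=pi_true, pi_true=pi_train)


pi_train, pi_true = 0.5, 0.1   # balanced after resampling, 10% positives in reality

p_raw = 0.5
p_corrected = correct_probability(p_raw, pi_train, pi_true)
t_raw = equivalent_threshold(pi_train, pi_true, t=0.5)

print(p_corrected, t_raw)
```

In this example a raw output of 0.5 from the model trained on balanced data corresponds to about 0.1 under the true prior, and thresholding the corrected probability at 0.5 is the same as thresholding the raw output at about 0.9. Whether to correct at all depends on the costs: leaving the raw 0.5 cut in place is equivalent to treating false negatives as more expensive.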

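In nonlinear models (trees, kernel SVM, KNN), the two are no longer simply equivalent. Oversampling is more prone to overfitting the minority class (especially with random duplication); SMOTE-style methods are more reasonable but need tuning. Undersampling may give a cleaner boundary but loses the diversity and global structure of the majority class.

What each approach emphasizes:

Oversampling: keeps the majority class and strengthens the minority class, with an emphasis on recall and boundary coverage;
Undersampling: reduces data size and redundancy and speeds up training, with an emphasis on a balanced boundary at the cost of majority-class information.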

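Metrics and the impact of the sampling strategy

In imbalanced settings, accuracy alone is unreliable.

Definitions of common metrics (TP, FP, TN, FN are the confusion-matrix counts, with the minority class as the positive class):

Accuracy:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Recall (TPR):

$$\mathrm{Recall} = \mathrm{TPR} = \frac{TP}{TP + FN}$$

Precision:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

F1:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Balanced Accuracy:

$$\mathrm{Balanced\ Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$

MCC (Matthews Correlation Coefficient):

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

ROC curve (TPR vs. FPR) and PR curve (Precision vs. Recall), where

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$

The ROC curve traces the relationship between TPR and FPR as the threshold varies. The PR curve is more sensitive when the positive prior $\pi_1 = P(y=1)$ is very small, and under strong imbalance it usually reflects minority-class performance better.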
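A quick way to see why accuracy misleads is to compute all of these metrics from confusion-matrix counts. The sketch below does that in plain Python; the two sets of counts are invented for illustration, and returning 0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary metrics from confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": mcc,
    }


# A classifier that always predicts the majority class on a 90/10 split
always_negative = binary_metrics(tp=0, fp=0, tn=90, fn=10)
# Hypothetical counts for a model that trades some false positives for recall
recall_oriented = binary_metrics(tp=8, fp=12, tn=78, fn=2)

print(always_negative)
print(recall_oriented)
```

The always-negative classifier reaches 0.9 accuracy with zero recall, a balanced accuracy of 0.5, and an MCC of 0. The second model has lower accuracy but is clearly more useful on the minority class, which only the other metrics reveal.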

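Oversampling often raises Recall and F1. Undersampling raises Balanced Accuracy in some cases, but overly aggressive undersampling can hurt overall Accuracy and MCC. Both affect the shape of the ROC/PR curves and the AUC.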

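Full example

As before, we use a synthetic dataset to build a clearly imbalanced binary classification task.

The code trains three models:

Baseline model (no resampling);
Oversampling (SMOTE);
Undersampling (RandomUnderSampler).

The results are shown together in a single figure, and we explain what the figure means and its key takeaways.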

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# 1. Generate synthetic (imbalanced) data
X, y = make_classification(
    n_samples=2500,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=2,
    weights=[0.90, 0.10],  # strong imbalance: about 90% majority, 10% minority
    class_sep=1.25,        # class separation
    flip_y=0.03,           # label noise: 3% of samples get a randomly assigned class
    random_state=42
)

# Train/test split (stratified to preserve the class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

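# 2. Standardize (fit on the training set only, then apply to the test set)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# 3. Three training-set variants: original (no resampling), SMOTE oversampling, random undersampling
# A float sampling_strategy is the target ratio n_minority / n_majority after resampling.
smote = SMOTE(sampling_strategy=0.5, k_neighbors=7, random_state=42)  # grow the minority class to 50% of the majority size
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)     # shrink the majority class to 2x the minority size

X_train_sm, y_train_sm = smote.fit_resample(X_train_s, y_train)
X_train_ru, y_train_ru = rus.fit_resample(X_train_s, y_train)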

# 4. Train three SVM models (RBF kernel)
svc_base = SVC(kernel='rbf', probability=True, C=1.0, gamma='scale', random_state=42)
svc_smote = SVC(kernel='rbf', probability=True, C=1.0, gamma='scale', random_state=42)
svc_rus = SVC(kernel='rbf', probability=True, C=1.0, gamma='scale', random_state=42)

svc_base.fit(X_train_s, y_train)
svc_smote.fit(X_train_sm, y_train_sm)
svc_rus.fit(X_train_ru, y_train_ru)

# 5. Grid for drawing decision boundaries (built in the original coordinates, then standardized for prediction)
x_min, x_max = X[:,0].min()-1.0, X[:,0].max()+1.0
y_min, y_max = X[:,1].min()-1.0, X[:,1].max()+1.0
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 400),
                     np.linspace(y_min, y_max, 400))
grid = np.c_[xx.ravel(), yy.ravel()]
grid_s = scaler.transform(grid)  # standardized grid fed to the models

# Decision-function values on the grid, used for the contours
Z_base = svc_base.decision_function(grid_s).reshape(xx.shape)
Z_smote = svc_smote.decision_function(grid_s).reshape(xx.shape)
Z_rus = svc_rus.decision_function(grid_s).reshape(xx.shape)

# 6. Test-set evaluation: ROC and PR
y_score_base = svc_base.predict_proba(X_test_s)[:,1]
y_score_smote = svc_smote.predict_proba(X_test_s)[:,1]
y_score_rus = svc_rus.predict_proba(X_test_s)[:,1]

fpr_base, tpr_base, _ = roc_curve(y_test, y_score_base)
fpr_smote, tpr_smote, _ = roc_curve(y_test, y_score_smote)
fpr_rus, tpr_rus, _ = roc_curve(y_test, y_score_rus)
auc_base = auc(fpr_base, tpr_base)
auc_smote = auc(fpr_smote, tpr_smote)
auc_rus = auc(fpr_rus, tpr_rus)

prec_base, rec_base, _ = precision_recall_curve(y_test, y_score_base)
prec_smote, rec_smote, _ = precision_recall_curve(y_test, y_score_smote)
prec_rus, rec_rus, _ = precision_recall_curve(y_test, y_score_rus)
ap_base = average_precision_score(y_test, y_score_base)
ap_smote = average_precision_score(y_test, y_score_smote)
ap_rus = average_precision_score(y_test, y_score_rus)

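# 7. Colors and styles for the plots (illustrative values; any valid matplotlib colors work)
cmap_bg = ListedColormap(["#F8D7DA", "#D6E4F5"])  # background: predicted 0 (light red), predicted 1 (light blue)
colors_classes = {0: "#D62728", 1: "#1F77B4"}     # majority red, minority blue
boundary_colors = {"base": "#2CA02C", "smote": "#9467BD", "rus": "#FF7F0E"}

# 8. Map the SMOTE and undersampled data back to the original coordinates for plotting
X_train_sm_orig = scaler.inverse_transform(X_train_sm)
X_train_ru_orig = scaler.inverse_transform(X_train_ru)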

# 9. Visualization
plt.figure(figsize=(16, 12), dpi=120)

# Panel 1: original training data + baseline decision boundary
ax1 = plt.subplot(2, 2, 1)
ax1.contourf(xx, yy, Z_base > 0, alpha=0.12, cmap=cmap_bg)
ax1.contour(xx, yy, Z_base, levels=[0], colors=[boundary_colors["base"]], linewidths=2.5)

# Scatter plot of the original training set
mask0 = (y_train==0)
mask1 = (y_train==1)
ax1.scatter(X_train[mask0,0], X_train[mask0,1], s=18, c=colors_classes[0], label="Train Majority(0)", edgecolor="white", alpha=0.8)
ax1.scatter(X_train[mask1,0], X_train[mask1,1], s=30, c=colors_classes[1], label="Train Minority(1)", marker="^", edgecolor="black", alpha=0.9)

ax1.set_title("Panel 1: Original training set + baseline SVM boundary", fontweight="bold")
ax1.set_xlabel("Feature 1")
ax1.set_ylabel("Feature 2")
ax1.legend(loc="best", frameon=True)

# Panel 2: data after SMOTE oversampling + decision boundary
ax2 = plt.subplot(2, 2, 2)
ax2.contourf(xx, yy, Z_smote > 0, alpha=0.12, cmap=cmap_bg)
ax2.contour(xx, yy, Z_smote, levels=[0], colors=[boundary_colors["smote"]], linewidths=2.5)

mask0_sm = (y_train_sm==0)
mask1_sm = (y_train_sm==1)
ax2.scatter(X_train_sm_orig[mask0_sm,0], X_train_sm_orig[mask0_sm,1], s=15, c=colors_classes[0], label="SMOTE Majority(0)", alpha=0.7)
ax2.scatter(X_train_sm_orig[mask1_sm,0], X_train_sm_orig[mask1_sm,1], s=25, c=colors_classes[1], label="SMOTE Minority(1)", marker="^", edgecolor="black", alpha=0.85)

ax2.set_title("Panel 2: SMOTE-oversampled training set + SVM boundary", fontweight="bold")
ax2.set_xlabel("Feature 1")
ax2.set_ylabel("Feature 2")
ax2.legend(loc="best", frameon=True)

# Panel 3: data after random undersampling + decision boundary
ax3 = plt.subplot(2, 2, 3)
ax3.contourf(xx, yy, Z_rus > 0, alpha=0.12, cmap=cmap_bg)
ax3.contour(xx, yy, Z_rus, levels=[0], colors=[boundary_colors["rus"]], linewidths=2.5)

mask0_ru = (y_train_ru==0)
mask1_ru = (y_train_ru==1)
ax3.scatter(X_train_ru_orig[mask0_ru,0], X_train_ru_orig[mask0_ru,1], s=22, c=colors_classes[0], label="Under Majority(0)", edgecolor="white", alpha=0.8)
ax3.scatter(X_train_ru_orig[mask1_ru,0], X_train_ru_orig[mask1_ru,1], s=30, c=colors_classes[1], label="Under Minority(1)", marker="^", edgecolor="black", alpha=0.9)

ax3.set_title("Panel 3: Randomly undersampled training set + SVM boundary", fontweight="bold")
ax3.set_xlabel("Feature 1")
ax3.set_ylabel("Feature 2")
ax3.legend(loc="best", frameon=True)

# Panel 4: ROC and PR curves compared (both curve types overlaid in one panel)
ax4 = plt.subplot(2, 2, 4)
# ROC curves (solid lines, left y-axis)
ax4.plot(fpr_base, tpr_base, color=boundary_colors["base"], lw=2.0, label=f"Baseline ROC (AUC={auc_base:.3f})")
ax4.plot(fpr_smote, tpr_smote, color=boundary_colors["smote"], lw=2.0, label=f"SMOTE ROC (AUC={auc_smote:.3f})")
ax4.plot(fpr_rus, tpr_rus, color=boundary_colors["rus"], lw=2.0, label=f"Under ROC (AUC={auc_rus:.3f})")
ax4.plot([0,1],[0,1], color="gray", lw=1.0, linestyle="--", alpha=0.6)  # chance-level ROC diagonal
ax4.set_xlim([0.0, 1.0])
ax4.set_ylim([0.0, 1.05])
ax4.set_xlabel("FPR (ROC, solid) / Recall (PR, dotted)")  # the x-axis is shared by both curve types
ax4.set_ylabel("TPR")
ax4.set_title("Panel 4: ROC and PR curves overlaid", fontweight="bold")

# Overlay the PR curves on a secondary y-axis (dotted lines, right y-axis)
ax4_twin = ax4.twinx()
ax4_twin.plot(rec_base, prec_base, color=boundary_colors["base"], lw=2.0, linestyle=":", label=f"Baseline PR (AP={ap_base:.3f})")
ax4_twin.plot(rec_smote, prec_smote, color=boundary_colors["smote"], lw=2.0, linestyle=":", label=f"SMOTE PR (AP={ap_smote:.3f})")
ax4_twin.plot(rec_rus, prec_rus, color=boundary_colors["rus"], lw=2.0, linestyle=":", label=f"Under PR (AP={ap_rus:.3f})")
ax4_twin.set_ylim([0.0, 1.05])
ax4_twin.set_ylabel("Precision")

# Combined legend: merge the handles from the main and secondary axes
lines_main, labels_main = ax4.get_legend_handles_labels()
lines_twin, labels_twin = ax4_twin.get_legend_handles_labels()
ax4.legend(lines_main + lines_twin, labels_main + labels_twin, loc="lower right", frameon=True)

plt.tight_layout()
plt.show()