Hey everyone~
Today let's talk about Adam, a genuinely important piece of the optimization toolbox.
The core idea of Adam (Adaptive Moment Estimation) is to combine momentum's smoothed estimate of the first moment (the gradient mean) with RMSProp's adaptive scaling based on the second moment (the mean of squared gradients), so that the learning rate adapts dynamically per parameter dimension, while bias correction keeps the early-step estimates stable.
Adam's usual default hyperparameters are α = 0.001, β1 = 0.9, β2 = 0.999, and ε = 1e-8. It tends to be robust with sparse gradients, non-stationary objectives, and noisy settings.
The underlying logic
Define the gradient as g_t = ∇_θ L_t(θ_{t−1}), where L_t is the loss on the t-th mini-batch.
First-moment momentum (as in the momentum method, smoothing the expected gradient):
m_t = β1·m_{t−1} + (1 − β1)·g_t
Unrolling gives the geometrically weighted average form:
m_t = (1 − β1) Σ_{i=1..t} β1^{t−i} g_i
If the g_i are i.i.d. with expectation E[g], then:
E[m_t] = (1 − β1^t)·E[g]
Because the zero initialization makes the early estimates too small, we apply a bias correction:
m̂_t = m_t / (1 − β1^t)
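To see the bias correction at work, here's a tiny standalone numeric check (not part of the case study below): feed a constant gradient g into the first-moment recursion. The raw m_t equals (1 − β1^t)·g, so it starts badly underestimated, while the corrected m̂_t recovers g exactly.

```python
import numpy as np

beta1, g = 0.9, 1.0
m = 0.0
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g   # first-moment recursion
    m_hat = m / (1 - beta1 ** t)      # bias correction
# After 10 steps the raw estimate is still biased low:
print(m)      # 1 - 0.9**10 ≈ 0.6513
print(m_hat)  # exactly 1.0 (up to float error)
```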
Second moment (the RMSProp idea, smoothing the expected squared gradient):
v_t = β2·v_{t−1} + (1 − β2)·g_t²
Similarly, E[v_t] = (1 − β2^t)·E[g²].
The bias-corrected version is v̂_t = v_t / (1 − β2^t).
Normalized update (per-dimension adaptive scaling): dimensions with large v̂_t are scaled down harder, while small-variance dimensions get larger steps, yielding an "adaptive learning rate":
θ_t = θ_{t−1} − α·m̂_t / (√v̂_t + ε)
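Putting the four formulas together, a single Adam step can be sketched in a few lines of NumPy (a standalone toy, separate from the full case study below; the quadratic objective, learning rate, and step count here are just illustrative choices):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment smoothing
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment smoothing
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2, whose gradient is 2*theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.05)
print(theta)  # ends up far closer to the optimum 0 than the start value 5.0
```

Notice that early on m̂_t/√v̂_t ≈ ±1, so the step size is roughly lr regardless of the gradient's raw scale; that is the "adaptive" part in action.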
Intuitively, m̂_t estimates the average gradient direction (direction smoothing with memory), while √v̂_t approximates the root mean square (RMS) of the gradient and is used to normalize its magnitude, giving each parameter dimension its own adaptive step size; ε avoids division by zero and keeps things numerically stable when the variance is tiny.
Relation to SGD and RMSProp:
SGD + momentum: v_t = μ·v_{t−1} + g_t, θ_t = θ_{t−1} − α·v_t. Only first-moment smoothing; every dimension shares a single global learning rate.
RMSProp: v_t = β·v_{t−1} + (1 − β)·g_t², θ_t = θ_{t−1} − α·g_t / (√v_t + ε). Only second-moment adaptive scaling, with no first-moment momentum.
Adam combines both, and uses bias correction to fix the early estimation bias.
Convergence and practical notes:
β1 controls the momentum memory length and β2 controls the second-moment smoothing; the larger β2 is, the steadier but slower-reacting the step size becomes.
Bias correction matters most early in training; without it, m_t and v_t start from zero initialization and are severely underestimated.
Weight decay should be decoupled (AdamW): first shrink the weights with θ ← θ − α·λ·θ, then do the Adam update, so the L2 term never gets mixed into the second-moment estimate of the gradient.
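The decoupled weight decay mentioned above can be sketched like this (a minimal illustration with textbook-style defaults, not a definitive AdamW implementation): the decay multiplies the parameter directly instead of being added to the gradient, so it never contaminates v_t.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW-style step: weight decay acts on theta directly, not via the gradient."""
    theta = theta * (1 - lr * weight_decay)   # decoupled decay: shrink the weights first
    m = beta1 * m + (1 - beta1) * grad        # the moments see only the raw gradient,
    v = beta2 * v + (1 - beta2) * grad ** 2   # so the decay never leaks into v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# With a zero gradient the Adam part is inert (m = v = 0 gives a zero step),
# so theta just shrinks geometrically by (1 - lr * weight_decay) per step.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    theta, m, v = adamw_step(theta, 0.0, m, v, t)
print(theta)  # (1 - 1e-5) ** 100 ≈ 0.999
```

Contrast this with naive L2 regularization, where λθ is added to grad and therefore inflates v_t, weakening the decay exactly on the dimensions with large gradients.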
A complete example
Let's hand-write a NumPy MLP with two ReLU hidden layers and train it for binary classification on a synthetic "two-color spiral" dataset, using SGD+Momentum, RMSProp, and Adam, comparing training curves and decision boundaries.
Finally, we'll also visualize the dynamics of Adam's m_t, v_t, and effective learning rate.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
np.random.seed(42)
# Dataset: two-color spiral
def make_spiral(n=2000, noise=0.20):
n_per = n // 2
theta = np.linspace(0.5, 4*np.pi, n_per)
r = theta
x0 = np.c_[r*np.cos(theta), r*np.sin(theta)] + noise*np.random.randn(n_per, 2)
x1 = np.c_[r*np.cos(theta + np.pi), r*np.sin(theta + np.pi)] + noise*np.random.randn(n_per, 2)
X = np.vstack([x0, x1]) / (4*np.pi) * 2.0
y = np.hstack([np.zeros(n_per), np.ones(n_per)])
idx = np.random.permutation(n)
return X[idx], y[idx]
X, y = make_spiral(n=2000, noise=0.25)
# Model: MLP with two ReLU hidden layers
def init_params(hidden=32):
params = {
'W1': np.random.randn(2, hidden) * 0.8,
'b1': np.zeros((1, hidden)),
'W2': np.random.randn(hidden, hidden) * 0.5,
'b2': np.zeros((1, hidden)),
'W3': np.random.randn(hidden, 1) * 0.2,
'b3': np.zeros((1, 1))
}
return params
def relu(x): return np.maximum(0, x)
def relu_grad(x): return (x > 0).astype(x.dtype)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def forward(params, X):
z1 = X @ params['W1'] + params['b1']
a1 = relu(z1)
z2 = a1 @ params['W2'] + params['b2']
a2 = relu(z2)
z3 = a2 @ params['W3'] + params['b3']
yhat = sigmoid(z3)
cache = {'X': X, 'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2, 'z3': z3, 'yhat': yhat}
return yhat, cache
def bce_loss(yhat, y):
eps = 1e-8
y = y.reshape(-1, 1)
return -np.mean(y*np.log(yhat+eps) + (1-y)*np.log(1-yhat+eps))
def backward(params, cache, y):
N = cache['X'].shape[0]
y = y.reshape(-1, 1)
    dz3 = (cache['yhat'] - y)  # dL/dz3 = sigmoid - y; divided by N below to match the mean in the loss
dW3 = (cache['a2'].T @ dz3) / N
db3 = np.sum(dz3, axis=0, keepdims=True) / N
da2 = dz3 @ params['W3'].T
dz2 = da2 * relu_grad(cache['z2'])
dW2 = (cache['a1'].T @ dz2) / N
db2 = np.sum(dz2, axis=0, keepdims=True) / N
da1 = dz2 @ params['W2'].T
dz1 = da1 * relu_grad(cache['z1'])
dW1 = (cache['X'].T @ dz1) / N
db1 = np.sum(dz1, axis=0, keepdims=True) / N
grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2, 'W3': dW3, 'b3': db3}
return grads
def copy_params(params):
return {k: v.copy() for k, v in params.items()}
def flatten_dict(d):
return np.concatenate([v.ravel() for v in d.values()])
# Optimizer implementations
class OptimizerSGDM:
def __init__(self, params, lr=0.05, momentum=0.9):
self.lr = lr
self.momentum = momentum
self.v = {k: np.zeros_like(v) for k, v in params.items()}
def step(self, params, grads):
upd_mag = []
for k in params:
self.v[k] = self.momentum * self.v[k] + grads[k]
update = self.lr * self.v[k]
params[k] -= update
upd_mag.append(np.mean(np.abs(update)))
return np.mean(upd_mag)
class OptimizerRMSProp:
def __init__(self, params, lr=0.003, beta2=0.999, eps=1e-8):
self.lr = lr
self.beta2 = beta2
self.eps = eps
self.v = {k: np.zeros_like(v) for k, v in params.items()}
def step(self, params, grads):
upd_mag = []
for k in params:
self.v[k] = self.beta2 * self.v[k] + (1 - self.beta2) * (grads[k] ** 2)
update = self.lr * grads[k] / (np.sqrt(self.v[k]) + self.eps)
params[k] -= update
upd_mag.append(np.mean(np.abs(update)))
return np.mean(upd_mag)
class OptimizerAdam:
def __init__(self, params, lr=0.005, beta1=0.9, beta2=0.999, eps=1e-8,
track_key=None, track_index=(0,0)):
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
self.m = {k: np.zeros_like(v) for k, v in params.items()}
self.v = {k: np.zeros_like(v) for k, v in params.items()}
self.t = 0
        # Track m, v, and their bias-corrected values for one parameter entry
self.track_key = track_key
self.track_index = track_index
self.m_trace = []
self.mhat_trace = []
self.v_trace = []
self.vhat_trace = []
self.eff_lr_trace = []
def step(self, params, grads):
self.t += 1
upd_mag = []
for k in params:
self.m[k] = self.beta1 * self.m[k] + (1 - self.beta1) * grads[k]
self.v[k] = self.beta2 * self.v[k] + (1 - self.beta2) * (grads[k] ** 2)
mhat = self.m[k] / (1 - self.beta1 ** self.t)
vhat = self.v[k] / (1 - self.beta2 ** self.t)
            eff_lr = self.lr / (np.sqrt(vhat) + self.eps)  # per-dimension effective LR (before multiplying by mhat)
update = self.lr * mhat / (np.sqrt(vhat) + self.eps)
params[k] -= update
upd_mag.append(np.mean(np.abs(update)))
if self.track_key == k:
idx = self.track_index
self.m_trace.append(self.m[k][idx])
self.mhat_trace.append(mhat[idx])
self.v_trace.append(self.v[k][idx])
self.vhat_trace.append(vhat[idx])
self.eff_lr_trace.append(eff_lr[idx])
return np.mean(upd_mag)
# Training loop
def train_model(opt_class, init_params, X, y, epochs=60, batch_size=64, opt_kwargs=None):
params = copy_params(init_params)
optimizer = opt_class(params, **(opt_kwargs or {}))
n = X.shape[0]
logs = {'loss': [], 'acc': [], 'grad_norm': [], 'mean_update': []}
for ep in range(epochs):
idx = np.random.permutation(n)
Xs, ys = X[idx], y[idx]
for i in range(0, n, batch_size):
xb = Xs[i:i+batch_size]; yb = ys[i:i+batch_size]
yhat, cache = forward(params, xb)
grads = backward(params, cache, yb)
            # gradient norm
grad_norm = np.linalg.norm(flatten_dict(grads))
mu = optimizer.step(params, grads)
        # End of epoch: evaluate on the full dataset
yhat_all, _ = forward(params, X)
loss = bce_loss(yhat_all, y)
pred = (yhat_all.reshape(-1) > 0.5).astype(int)
acc = np.mean(pred == y)
logs['loss'].append(loss)
logs['acc'].append(acc)
logs['grad_norm'].append(grad_norm)
logs['mean_update'].append(mu)
        # Optional logging, commented out to keep the output clean
        # print(f"Epoch {ep+1}: loss={loss:.4f}, acc={acc:.3f}")
return params, optimizer, logs
# Initialize the parameters once and copy them to every optimizer for a fair comparison
# (renamed to base_params to avoid shadowing the init_params function)
base_params = init_params(hidden=32)
# Train with the three optimizers
params_sgd, opt_sgd, logs_sgd = train_model(OptimizerSGDM, base_params, X, y, epochs=70,
                                            opt_kwargs={'lr':0.05,'momentum':0.9})
params_rms, opt_rms, logs_rms = train_model(OptimizerRMSProp, base_params, X, y, epochs=70,
                                            opt_kwargs={'lr':0.003,'beta2':0.999,'eps':1e-8})
params_adam, opt_adam, logs_adam = train_model(OptimizerAdam, base_params, X, y, epochs=70,
                                               opt_kwargs={'lr':0.005,'beta1':0.9,'beta2':0.999,'eps':1e-8,
                                                           'track_key':'W1','track_index':(0,0)})
# Figure 1: training-curve comparison (multi-panel)
epochs = np.arange(1, len(logs_adam['loss'])+1)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
ax1, ax2, ax3, ax4 = axes.ravel()
# Panel A: loss curves
ax1.plot(epochs, logs_sgd['loss'], color='tab:blue', label='SGD+Momentum', marker='o', alpha=0.85)
ax1.plot(epochs, logs_rms['loss'], color='tab:orange', label='RMSProp', marker='s', alpha=0.85)
ax1.plot(epochs, logs_adam['loss'], color='tab:green', label='Adam', marker='^', alpha=0.95)
ax1.set_title('Training loss (BCE)')
ax1.set_xlabel('Epoch'); ax1.set_ylabel('Loss'); ax1.grid(True, linestyle='--', alpha=0.35)
ax1.legend(loc='upper right')
# Panel B: accuracy curves
ax2.plot(epochs, logs_sgd['acc'], color='tab:blue', label='SGD+Momentum', alpha=0.85)
ax2.plot(epochs, logs_rms['acc'], color='tab:orange', label='RMSProp', alpha=0.85)
ax2.plot(epochs, logs_adam['acc'], color='tab:green', label='Adam', alpha=0.95)
ax2.set_title('Training accuracy')
ax2.set_xlabel('Epoch'); ax2.set_ylabel('Accuracy'); ax2.grid(True, linestyle=':', alpha=0.35)
ax2.legend(loc='lower right')
# Panel C: gradient norm
ax3.plot(epochs, logs_sgd['grad_norm'], color='tab:blue', label='SGD+Momentum', alpha=0.85)
ax3.plot(epochs, logs_rms['grad_norm'], color='tab:orange', label='RMSProp', alpha=0.85)
ax3.plot(epochs, logs_adam['grad_norm'], color='tab:green', label='Adam', alpha=0.95)
ax3.set_title('Gradient L2 norm at the end of each epoch')
ax3.set_xlabel('Epoch'); ax3.set_ylabel('||grad||'); ax3.grid(True, linestyle='-.', alpha=0.35)
ax3.legend(loc='upper right')
# Panel D: mean update magnitude (effective step size)
ax4.plot(epochs, logs_sgd['mean_update'], color='tab:blue', label='SGD+Momentum', alpha=0.85)
ax4.plot(epochs, logs_rms['mean_update'], color='tab:orange', label='RMSProp', alpha=0.85)
ax4.plot(epochs, logs_adam['mean_update'], color='tab:green', label='Adam', alpha=0.95)
ax4.set_title('Mean parameter-update magnitude')
ax4.set_xlabel('Epoch'); ax4.set_ylabel('Mean |Δθ|'); ax4.grid(True, linestyle='--', alpha=0.35)
ax4.legend(loc='upper right')
plt.tight_layout()
plt.show()
# Figure 2: decision-boundary comparison (colored probability background + scatter overlay)
def predict_prob(params, X):
yhat, _ = forward(params, X)
return yhat.reshape(-1)
# Evaluation grid
x_min, x_max = X[:,0].min()-0.5, X[:,0].max()+0.5
y_min, y_max = X[:,1].min()-0.5, X[:,1].max()+0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 400), np.linspace(y_min, y_max, 400))
grid = np.c_[xx.ravel(), yy.ravel()]
Z_sgd = predict_prob(params_sgd, grid).reshape(xx.shape)
Z_rms = predict_prob(params_rms, grid).reshape(xx.shape)
Z_adam = predict_prob(params_adam, grid).reshape(xx.shape)
fig2, axs = plt.subplots(1, 3, figsize=(18, 6))
cmaps = ['plasma', 'inferno', 'viridis']
titles = ['SGD+Momentum decision boundary', 'RMSProp decision boundary', 'Adam decision boundary']
Zs = [Z_sgd, Z_rms, Z_adam]
for i, ax in enumerate(axs):
im = ax.imshow(Zs[i], origin='lower', extent=(x_min, x_max, y_min, y_max),
cmap=cmaps[i], alpha=0.85, vmin=0, vmax=1, interpolation='bilinear')
    ax.contour(xx, yy, Zs[i], levels=[0.5], colors=['white'], linewidths=2.0)
    c0 = (y==0)
    c1 = (y==1)
    ax.scatter(X[c0,0], X[c0,1], s=16, c='deepskyblue', edgecolors='k', alpha=0.9, label='Class 0')
    ax.scatter(X[c1,0], X[c1,1], s=16, c='orangered', edgecolors='k', alpha=0.9, label='Class 1')
    ax.set_title(titles[i])
ax.set_xlim(x_min, x_max); ax.set_ylim(y_min, y_max)
ax.grid(True, linestyle=':', alpha=0.3)
ax.legend(loc='lower right', fontsize=9)
fig2.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
plt.tight_layout()
plt.show()
# Figure 3: Adam internals (m and v before vs. after bias correction)
m_series = np.array(opt_adam.m_trace).reshape(-1)
mhat_series = np.array(opt_adam.mhat_trace).reshape(-1)
v_series = np.array(opt_adam.v_trace).reshape(-1)
vhat_series = np.array(opt_adam.vhat_trace).reshape(-1)
eff_lr_series = np.array(opt_adam.eff_lr_trace).reshape(-1)
steps = np.arange(1, len(m_series)+1)
fig3, axx = plt.subplots(3, 1, figsize=(12, 12), sharex=True)
axx[0].plot(steps, m_series, color='tab:blue', label='m_t (uncorrected)', alpha=0.8)
axx[0].plot(steps, mhat_series, color='tab:red', label='m̂_t (bias-corrected)', alpha=0.8)
axx[0].set_ylabel('m vs m̂'); axx[0].set_title("Bias-correction dynamics of Adam's first moment")
axx[0].grid(True, linestyle='--', alpha=0.3); axx[0].legend(loc='upper right')
axx[1].plot(steps, v_series, color='tab:blue', label='v_t (uncorrected)', alpha=0.8)
axx[1].plot(steps, vhat_series, color='tab:red', label='v̂_t (bias-corrected)', alpha=0.8)
axx[1].set_ylabel('v vs v̂'); axx[1].set_title("Bias-correction dynamics of Adam's second moment")
axx[1].grid(True, linestyle='-.', alpha=0.3); axx[1].legend(loc='upper right')
axx[2].plot(steps, eff_lr_series, color='tab:green', label='per-step effective LR α/(√v̂ + ε)', alpha=0.9)
axx[2].set_xlabel('Optimization Steps'); axx[2].set_ylabel('Eff. LR')
axx[2].set_title('Adam per-dimension effective learning rate (tracked entry)')
axx[2].grid(True, linestyle=':', alpha=0.3); axx[2].legend(loc='upper right')
plt.tight_layout()
plt.show()
As you can see, the code implements the core update of all three optimizers; Adam includes bias correction and additionally records m_t, v_t, and the effective learning rate for one chosen parameter entry.

Figure 1 (four panels) compares training loss, accuracy, gradient norm, and mean update magnitude across the three, making Adam's stable and relatively fast convergence easy to see.

Figure 2 (decision boundaries) overlays the raw data points on a colored probability background with the 0.5 contour, showing the shape and fit of each optimizer's final boundary; Adam's is typically smoother and more robust to noise.

Figure 3 shows m and v before and after bias correction: in the early phase the uncorrected values are significantly too small, while the corrected ones reflect the true statistics more accurately and keep the effective-learning-rate curve sensible, which is the source of Adam's early-phase numerical stability.
Summary
All in all, Adam simply fuses the ideas of momentum and RMSProp in a clever way: it remembers the direction while tuning its own per-dimension learning rate.
It applies bias correction early in training to keep the tiny initial estimates from causing erratic steps, so it converges both stably and quickly.