AMS Machine Learning Course: Basic Machine Learning for Predicting Thunderstorm Rotation - Linear Regression


This article is translated from the AMS Machine Learning Python tutorial, with some modifications.

Lagerquist, R., and D.J. Gagne II, 2019: “Basic machine learning for predicting thunderstorm rotation: Python tutorial”. https://github.com/djgagne/ams-ml-python-course/blob/master/module_2/ML_Short_Course_Module_2_Basic.ipynb.

This article follows the previous post:

AMS Machine Learning Course: Basic Machine Learning for Predicting Thunderstorm Rotation - Data

Linear Regression

Linear regression fits the following equation to the training data; a one-line NumPy sketch of the prediction appears after the symbol list.

$$ \hat{y} = \beta_0 + \sum\limits_{j = 1}^{M} \beta_j x_j $$

  • x_j is the j-th predictor variable
  • beta_j is the coefficient for the j-th predictor, adjusted during training
  • M is the number of predictor variables
  • beta_0 is the bias coefficient (intercept), adjusted during training
  • hat{y} is the prediction of the target variable, in this case the storm's maximum future vorticity, in units of s^-1
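
In NumPy terms, the prediction for every example can be written in one line. This is only a sketch with hypothetical names (X, beta, and beta_0 are not defined in the notebook):

import numpy as np

X = np.array([[0.5, -1.2], [1.0, 0.3]])   # hypothetical predictor matrix (N = 2, M = 2)
beta = np.array([0.002, -0.001])          # hypothetical coefficients beta_1 ... beta_M
beta_0 = 0.0005                           # hypothetical intercept

y_hat = beta_0 + X @ beta                 # predicted max future vorticity for each example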

The weights (beta_0 and beta_j) are trained to minimize the mean squared error (MSE). This is why linear regression is also called "least-squares linear regression".

\begin{equation*} \textrm{MSE} = \frac{1}{N} \sum\limits_{i = 1}^{N} (\hat{y}_i - y_i)^2 \end{equation*}

  • y_i is the actual target value for the i-th example; in this case one example is one storm cell at one time, or one "storm object"
  • hat{y}_i is the predicted target value for the i-th example
  • N is the number of training examples

Combining the two equations above yields the following, where x_{ij} is the value of the j-th predictor for the i-th example:

\begin{equation*} \textrm{MSE} = \frac{1}{N} \sum\limits_{i = 1}^{N} (\beta_0 + \sum\limits_{j = 1}^{M} \beta_j x_{ij} - y_i)^2 \end{equation*}

The derivatives of the MSE with respect to the model coefficients are:

\begin{equation*} \frac{\partial}{\partial \beta_0}(\textrm{MSE}) = \frac{2}{N} \sum\limits_{i = 1}^{N} (\hat{y}_i - y_i) \end{equation*}

\begin{equation*} \frac{\partial}{\partial \beta_j}(\textrm{MSE}) = \frac{2}{N} \sum\limits_{i = 1}^{N} x_{ij} (\hat{y}_i - y_i) \end{equation*}

During training, the weights (beta_0 and beta_j) are adjusted over many iterations.

After each iteration, the "gradient-descent rule" (shown below) is applied, where alpha, the learning rate, lies in (0, 1]; a minimal NumPy sketch of the full loop follows the two update equations.

\begin{equation*} \beta_0 \leftarrow \beta_0 - \alpha \frac{\partial}{\partial \beta_0}(\textrm{MSE}) \end{equation*}

\begin{equation*} \beta_j \leftarrow \beta_j - \alpha \frac{\partial}{\partial \beta_j}(\textrm{MSE}) \end{equation*}
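
The following is a minimal NumPy sketch of this training loop, not the tutorial's actual implementation (which relies on scikit-learn); the names X, y, alpha and num_iterations are illustrative:

import numpy as np

def fit_linear_regression_gd(X, y, alpha=0.01, num_iterations=1000):
    # X: (N, M) predictor matrix; y: (N,) target vector
    num_examples, num_predictors = X.shape
    beta_0 = 0.
    beta = np.zeros(num_predictors)

    for _ in range(num_iterations):
        predictions = beta_0 + X @ beta      # y-hat for every example
        signed_errors = predictions - y      # (y-hat_i - y_i)

        # Gradients of the MSE, as in the equations above
        grad_beta_0 = 2. / num_examples * np.sum(signed_errors)
        grad_beta = 2. / num_examples * (X.T @ signed_errors)

        # Gradient-descent rule
        beta_0 -= alpha * grad_beta_0
        beta -= alpha * grad_beta

    return beta_0, beta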

Linear Regression: Example

The next cell does the following:

  • Trains a linear-regression model (with default hyperparameters) to predict the maximum future rotation of each storm
  • Evaluates the model on the training and validation datasets

For the training and validation data, this tutorial reports the following statistics:

Mean absolute error (MAE)

\begin{equation*} \frac{1}{N} \sum\limits_{i = 1}^{N} \lvert \hat{y}_i - y_i \rvert \end{equation*}

Mean squared error (MSE)

\begin{equation*} \frac{1}{N} \sum\limits_{i = 1}^{N} (\hat{y}_i - y_i)^2 \end{equation*}

Bias (mean signed error)

\begin{equation*} \frac{1}{N} \sum\limits_{i = 1}^{N} (\hat{y}_i - y_i) \end{equation*}

MAE skill score. Defined below, where MAE is the model's MAE and MAE_{climo} is the MAE obtained by always forecasting the "climatology" (the mean target value in the training data).

\begin{equation*} \textrm{MAE skill score} = \frac{\textrm{MAE}_{\textrm{climo}} - \textrm{MAE}}{\textrm{MAE}_{\textrm{climo}}} \end{equation*}

MSE skill score. Defined analogously:

\begin{equation*} \textrm{MSE skill score} = \frac{\textrm{MSE}_{\textrm{climo}} - \textrm{MSE}}{\textrm{MSE}_{\textrm{climo}}} \end{equation*}
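
As a toy illustration with made-up numbers: if MAE_{climo} were 1.0e-3 s^-1 and the model's MAE were 7.7e-4 s^-1, the MAE skill score would be (1.0e-3 - 7.7e-4) / 1.0e-3 = 0.23, i.e. a 23% improvement over climatology; a negative score would mean the model is worse than always predicting the climatology.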

Finally, this cell plots a reliability curve, which shows the conditional mean observation for each forecast value. This lets you identify conditional bias (bias that occurs for certain forecast values).

Creating the Linear Regression Model Object

Here lambda1 and lambda2 are the L1 and L2 regularization weights; depending on how they are set, one of four model types is returned:

  • LinearRegression: plain linear regression (no regularization)
  • Ridge: ridge regression (L2 only)
  • Lasso: lasso regression (L1 only)
  • ElasticNet: elastic-net regression (both L1 and L2)
# Imports used throughout this article; the data tables and TARGET_NAME
# are carried over from the previous article in this series.
import numpy as np
import pandas as pd
from matplotlib import pyplot
from sklearn import linear_model

# Coefficients smaller than this are treated as zero
LAMBDA_TOLERANCE = 1e-10

# Fixed random seed so repeated runs give (nearly) identical results
RANDOM_SEED = 6695

# Convert (lambda1, lambda2) into the (alpha, l1_ratio) arguments expected
# by scikit-learn's ElasticNet
def _lambdas_to_sklearn_inputs(lambda1, lambda2):
    return lambda1 + lambda2, lambda1 / (lambda1 + lambda2)

def setup_linear_regression(lambda1=0., lambda2=0.):
    assert lambda1 >= 0
    assert lambda2 >= 0
    
    # No regularization: return a plain LinearRegression object
    if lambda1 < LAMBDA_TOLERANCE and lambda2 < LAMBDA_TOLERANCE:
        return linear_model.LinearRegression(fit_intercept=True)
    
    # L2 regularization only: return a Ridge object
    if lambda1 < LAMBDA_TOLERANCE:
        return linear_model.Ridge(
            alpha=lambda2, fit_intercept=True,
            random_state=RANDOM_SEED)
    
    # L1 regularization only: return a Lasso object
    if lambda2 < LAMBDA_TOLERANCE:
        return linear_model.Lasso(
            alpha=lambda1, fit_intercept=True,
            random_state=RANDOM_SEED)
    
    # Both L1 and L2 regularization: return an ElasticNet object
    alpha, l1_ratio = _lambdas_to_sklearn_inputs(
        lambda1=lambda1, lambda2=lambda2)
    
    return linear_model.ElasticNet(
        alpha=alpha, 
        l1_ratio=l1_ratio, 
        fit_intercept=True, 
        random_state=RANDOM_SEED
    )

Create a linear regression model object:

linreg_model_object = setup_linear_regression(
    lambda1=0.,
    lambda2=0.,
)
linreg_model_object
LinearRegression()
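
By the same logic, non-zero regularization weights would return the other model types. The lambda values below are arbitrary illustrations, not part of the original notebook:

setup_linear_regression(lambda1=0., lambda2=1e-3)    # Ridge(alpha=0.001, ...)
setup_linear_regression(lambda1=1e-3, lambda2=0.)    # Lasso(alpha=0.001, ...)
setup_linear_regression(lambda1=1e-3, lambda2=1e-3)  # ElasticNet(alpha=0.002, l1_ratio=0.5, ...)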

Training the Model

The train_linear_regression function trains the linear model.

Note: every scikit-learn model object has a fit method used for training.

def train_linear_regression(
    model, 
    training_predictor_table: pd.DataFrame,
    training_target_table: pd.DataFrame
):
    model.fit(
        X=training_predictor_table.values,
        y=training_target_table[TARGET_NAME].values
    )

    return model

Train the model on the training set:

_ = train_linear_regression(
    model=linreg_model_object,
    training_predictor_table=training_predictor_table,
    training_target_table=training_target_table,
)

Predictions

Compute the predicted target values for the training set.

Note: every scikit-learn model object has a predict method for computing predictions.

training_predictions = linreg_model_object.predict(
    training_predictor_table.values
)

Compute the mean of the target variable over the training data (the "climatology"):

mean_training_target_value = np.mean(
    training_target_table[TARGET_NAME].values
)
mean_training_target_value
0.001910818440106315

Evaluating the Model

Functions Used

The evaluate_regression function evaluates a regression model, printing the statistics and drawing the plots.

MAE_KEY = 'mean_absolute_error'
MSE_KEY = 'mean_squared_error'
MEAN_BIAS_KEY = 'mean_bias'
MAE_SKILL_SCORE_KEY = 'mae_skill_score'
MSE_SKILL_SCORE_KEY = 'mse_skill_score'


def evaluate_regression(
    target_values:np.ndarray, 
    predicted_target_values:np.ndarray, 
    mean_training_target_value:float,
    verbose=True, 
    create_plots=True, 
    dataset_name=None
):
    signed_errors = predicted_target_values - target_values
    
    # Mean bias (mean signed error)
    mean_bias = np.mean(signed_errors)
    
    # Mean absolute error
    mean_absolute_error = np.mean(np.absolute(signed_errors))
    
    # Mean squared error
    mean_squared_error = np.mean(signed_errors ** 2)

    climo_signed_errors = mean_training_target_value - target_values
    climo_mae = np.mean(np.absolute(climo_signed_errors))
    climo_mse = np.mean(climo_signed_errors ** 2)
    
    # MAE skill score: improvement over climatology (at most 1; negative if worse than climatology)
    mae_skill_score = (climo_mae - mean_absolute_error) / climo_mae
    
    # MSE skill score: improvement over climatology (at most 1; negative if worse than climatology)
    mse_skill_score = (climo_mse - mean_squared_error) / climo_mse

    evaluation_dict = {
        MAE_KEY: mean_absolute_error,
        MSE_KEY: mean_squared_error,
        MEAN_BIAS_KEY: mean_bias,
        MAE_SKILL_SCORE_KEY: mae_skill_score,
        MSE_SKILL_SCORE_KEY: mse_skill_score
    }

    if verbose or create_plots:
        dataset_name = dataset_name[0].upper() + dataset_name[1:]
    
    # Print the statistics
    if verbose:
        print(
            f'{dataset_name} MAE (mean absolute error) '
            f'= {evaluation_dict[MAE_KEY]:.3e} s^-1')
        print(
            f'{dataset_name} MSE (mean squared error) '
            f'= {evaluation_dict[MSE_KEY]:.3e} s^-2'
        )
        print(
            f'{dataset_name} bias (mean signed error) '
            f'= {evaluation_dict[MEAN_BIAS_KEY]:.3e} s^-1'
        )

        print(
            f'{dataset_name} MAE skill score (improvement over climatology) '
            f'= {evaluation_dict[MAE_SKILL_SCORE_KEY]:.3f}'
        )

        print(
            f'{dataset_name} MSE skill score (improvement over climatology) '
            f'= {evaluation_dict[MSE_SKILL_SCORE_KEY]:.3f}'
        )

    if not create_plots:
        return evaluation_dict
    
    # Plot the reliability curve
    figure_object, axes_object = pyplot.subplots(
        1, 1, 
        figsize=(10, 10)
    )

    plot_regression_relia_curve(
        observed_values=target_values, 
        forecast_values=predicted_target_values,
        num_bins=20, 
        figure_object=figure_object, 
        axes_object=axes_object
    )

    axes_object.set_xlabel(r'Forecast value (s$^{-1}$)')
    axes_object.set_ylabel(r'Conditional mean observation (s$^{-1}$)')

    title_string = f'{dataset_name} reliability curve for max future vorticity'
    axes_object.set_title(title_string)
    pyplot.show()

    return evaluation_dict

Plotting Functions

The plot_regression_relia_curve function plots the reliability curve for the regression.

DEFAULT_NUM_BINS = 20

def plot_regression_relia_curve(
    observed_values, 
    forecast_values, 
    num_bins: int=DEFAULT_NUM_BINS,
    figure_object=None, 
    axes_object=None
):
    # Get the data needed for the reliability curve:
    #     mean forecast value in each bin
    #     mean observed value in each bin
    #     number of examples in each bin
    mean_forecast_by_bin, mean_observation_by_bin, num_examples_by_bin = (
        _get_points_in_regression_relia_curve(
            observed_values=observed_values, 
            forecast_values=forecast_values,
            num_bins=num_bins)
    )

    if figure_object is None or axes_object is None:
        figure_object, axes_object = pyplot.subplots(
            1, 1, 
            figsize=(10, 10)
        )
    
    # Plot the inset histogram of mean forecast values by bin
    _plot_forecast_hist_for_regression(
        figure_object=figure_object, 
        mean_forecast_by_bin=mean_forecast_by_bin,
        num_examples_by_bin=num_examples_by_bin)

    max_forecast_or_observed = max([
        np.max(forecast_values), 
        np.max(observed_values)
    ])
    
    # Plot the perfect-reliability line (y = x)
    perfect_x_coords = np.array(
        [0., max_forecast_or_observed]
    )
    perfect_y_coords = perfect_x_coords + 0.
    axes_object.plot(
        perfect_x_coords, 
        perfect_y_coords, 
        color=np.full(3, 152. / 255),
        linestyle='dashed', 
        linewidth=2,
    )

    # Plot the actual reliability curve
    # Find the indices of non-NaN values and plot only those
    real_indices = np.where(
        np.invert(
            np.logical_or(
                np.isnan(mean_forecast_by_bin), 
                np.isnan(mean_observation_by_bin),
            )
        )
    )[0]

    axes_object.plot(
        mean_forecast_by_bin[real_indices],
        mean_observation_by_bin[real_indices],
        color=np.array([228, 26, 28], dtype=float) / 255,
        linestyle='solid', 
        linewidth=3
    )

    axes_object.set_xlabel('Forecast value')
    axes_object.set_ylabel('Conditional mean observation')
    axes_object.set_xlim(0., max_forecast_or_observed)
    axes_object.set_ylim(0., max_forecast_or_observed)

Notes:

  • np.logical_or: element-wise logical OR
  • np.invert: element-wise logical NOT (a small example of this NaN-filtering pattern follows)
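
A tiny self-contained illustration (the arrays a and b are made up):

import numpy as np

a = np.array([1., np.nan, 3.])
b = np.array([np.nan, 2., 3.])

real_indices = np.where(
    np.invert(np.logical_or(np.isnan(a), np.isnan(b)))
)[0]

real_indices    # -> array([2]); only index 2 is NaN-free in both arrays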

The _get_points_in_regression_relia_curve function creates the data points needed for the reliability curve.

def _get_points_in_regression_relia_curve(
    observed_values: np.ndarray, 
    forecast_values: np.ndarray,
    num_bins: int,
):
    # Bin index for each forecast value
    inputs_to_bins = _get_histogram(
        input_values=forecast_values, 
        num_bins=num_bins,
        min_value=np.min(forecast_values),
        max_value=np.max(forecast_values)
    )
    
    # Mean forecast value in each bin
    mean_forecast_by_bin = np.full(num_bins, np.nan)
    
    # Mean observed value in each bin
    mean_observation_by_bin = np.full(num_bins, np.nan)
    
    # Number of examples in each bin
    num_examples_by_bin = np.full(num_bins, -1, dtype=int)

    for k in range(num_bins):
        # Indices of examples whose forecasts fall in bin k
        these_example_indices = np.where(inputs_to_bins == k)[0]
        
        num_examples_by_bin[k] = len(these_example_indices)

        mean_forecast_by_bin[k] = np.mean(
            forecast_values[these_example_indices]
        )

        mean_observation_by_bin[k] = np.mean(
            observed_values[these_example_indices]
        )

    return (
        mean_forecast_by_bin, 
        mean_observation_by_bin, 
        num_examples_by_bin,
    )

The _get_histogram function computes histogram assignments, returning an array of the same length as the input, where each value is the index of the bin that input belongs to.

def _get_histogram(
    input_values: np.ndarray, 
    num_bins: int, 
    min_value: float, 
    max_value: float,
) -> np.array:
    bin_cutoffs = np.linspace(
        min_value, 
        max_value, 
        num=num_bins + 1
    )

    inputs_to_bins = np.digitize(
        input_values, 
        bin_cutoffs, 
        right=False
    ) - 1

    inputs_to_bins[inputs_to_bins < 0] = 0
    inputs_to_bins[inputs_to_bins > num_bins - 1] = num_bins - 1

    return inputs_to_bins

Notes:

np.digitize returns, for each value in the input array, the index of the bin that value falls into.

By default (right=False), bin i satisfies bins[i-1] <= x < bins[i].

If x falls outside the range of bins, the returned index is 0 or len(bins).
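
A small self-contained example of this behavior (the cutoffs and values are made up), including the clipping applied in _get_histogram:

import numpy as np

bin_cutoffs = np.linspace(0., 1., num=5)        # [0., 0.25, 0.5, 0.75, 1.]
values = np.array([-0.1, 0.1, 0.5, 0.99, 1.2])

indices = np.digitize(values, bin_cutoffs, right=False) - 1   # -> [-1, 0, 2, 3, 4]

# Clip out-of-range values into the first and last bins, as _get_histogram does
num_bins = 4
indices[indices < 0] = 0
indices[indices > num_bins - 1] = num_bins - 1
indices    # -> array([0, 0, 2, 3, 3])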

The _plot_forecast_hist_for_regression function draws the inset histogram in the lower-right corner of the reliability diagram.

def _plot_forecast_hist_for_regression(
    figure_object, 
    mean_forecast_by_bin: np.ndarray, 
    num_examples_by_bin: np.ndarray,
):
    # Fraction of examples in each bin
    bin_frequencies = (
        num_examples_by_bin.astype(float) / np.sum(num_examples_by_bin)
    )

    num_bins = len(num_examples_by_bin)
    forecast_bin_width = (
        (np.max(mean_forecast_by_bin) - np.min(mean_forecast_by_bin)) /
        (num_bins - 1)
    )

    inset_axes_object = figure_object.add_axes([
        0.575, 
        0.225,
        0.3, 
        0.25
    ])
    
    # Draw the bar chart
    inset_axes_object.bar(
        mean_forecast_by_bin, 
        bin_frequencies, 
        forecast_bin_width,
        color=np.array([228, 26, 28], dtype=float) / 255, 
        edgecolor=np.full(3, 0.),
        linewidth=2
    )
    
    # Axis ticks
    max_y_tick_value = _floor_to_nearest(
        1.05 * np.max(bin_frequencies), 
        0.1
    )
    num_y_ticks = 1 + int(np.round(
        max_y_tick_value / 0.1
    ))

    y_tick_values = np.linspace(
        0, max_y_tick_value, 
        num=num_y_ticks
    )
    pyplot.yticks(
        y_tick_values, 
        axes=inset_axes_object)
    pyplot.xticks(
        np.linspace(0, 0.02, num=11), 
        axes=inset_axes_object,
        rotation=90.
    )
    
    # Axis limits
    inset_axes_object.set_xlim(
        0, 
        np.max(mean_forecast_by_bin) + forecast_bin_width
    )
    inset_axes_object.set_ylim(
        0, 
        1.05 * np.max(bin_frequencies)
    )

The _floor_to_nearest function rounds a number down to the nearest multiple of increment.

def _floor_to_nearest(
    input_value_or_array, 
    increment,
):
    return increment * np.floor(input_value_or_array / increment)
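
For example (made-up inputs):

_floor_to_nearest(3.7, 0.5)                     # -> 3.5
_floor_to_nearest(np.array([0.26, 1.9]), 1.)    # -> array([0., 1.])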

Plotting the Reliability Curves

Training set

_ = evaluate_regression(
    target_values=training_target_table[TARGET_NAME].values,
    predicted_target_values=training_predictions,
    mean_training_target_value=mean_training_target_value,
    dataset_name='training'
)
Training MAE (mean absolute error) = 7.714e-04 s^-1
Training MSE (mean squared error) = 1.112e-06 s^-2
Training bias (mean signed error) = 1.846e-19 s^-1
Training MAE skill score (improvement over climatology) = 0.315
Training MSE skill score (improvement over climatology) = 0.521

Validation set

validation_predictions = linreg_model_object.predict(
    validation_predictor_table.values
)

_ = evaluate_regression(
    target_values=validation_target_table[TARGET_NAME].values,
    predicted_target_values=validation_predictions,
    mean_training_target_value=mean_training_target_value,
    dataset_name='validation'
)
Validation MAE (mean absolute error) = 7.505e-04 s^-1
Validation MSE (mean squared error) = 1.048e-06 s^-2
Validation bias (mean signed error) = -9.007e-07 s^-1
Validation MAE skill score (improvement over climatology) = 0.316
Validation MSE skill score (improvement over climatology) = 0.540

Linear Regression: Coefficients

The next cell plots the coefficients of the linear-regression model. If predictor x_{j} has a positive (negative) coefficient, the prediction increases (decreases) as x_{j} increases.

Keep in mind that all predictors have been normalized to the same scale (z-scores), so predictors with larger coefficient magnitudes are generally more important.

Also, note that the model uses every predictor (all coefficients are non-zero).
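
A quick way to check this numerically is to pair each coefficient with its predictor name and sort by magnitude. This is a sketch, assuming the linreg_model_object and training_predictor_table objects defined earlier:

import pandas as pd

coefficient_series = pd.Series(
    linreg_model_object.coef_,
    index=list(training_predictor_table)
)

# Largest-magnitude coefficients first; with z-score-normalized predictors,
# these are (roughly) the most influential predictors
print(coefficient_series.reindex(
    coefficient_series.abs().sort_values(ascending=False).index
))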

Functions Used

def plot_model_coefficients(
    model, 
    predictor_names
):
    # Get the coefficients
    coefficients = model.coef_
    num_dimensions = len(coefficients.shape)
    if num_dimensions > 1:
        coefficients = coefficients[0, ...]
    
    # Determine the y-coordinates from the number of predictors
    num_predictors = len(predictor_names)
    y_coords = np.linspace(
        0, 
        num_predictors - 1, 
        num=num_predictors, 
        dtype=float
    )

    _, axes_object = pyplot.subplots(
        1, 1, 
        figsize=(10, 10)
    )
    
    # Draw the horizontal bar chart
    axes_object.barh(
        y_coords, 
        coefficients, 
        color=np.array([27, 158, 119], dtype=float) / 255,
        edgecolor=np.array([27, 158, 119], dtype=float) / 255, 
        linewidth=2
    )

    pyplot.xlabel('Coefficient')
    pyplot.ylabel('Predictor variable')

    pyplot.yticks([], [])
    x_tick_values, _ = pyplot.xticks()
    pyplot.xticks(x_tick_values, rotation=90)
    
    # Set the x-axis limits
    x_min = np.percentile(coefficients, 1.)
    x_max = np.percentile(coefficients, 99.)
    pyplot.xlim([x_min, x_max])
    
    # Add predictor-name labels
    for j in range(num_predictors):
        axes_object.text(
            0, 
            y_coords[j], 
            predictor_names[j], 
            color=np.array([217, 95, 2], dtype=float) / 255,
            horizontalalignment='center', 
            verticalalignment='center',
            fontsize=14
        )

Plot the coefficients:

plot_model_coefficients(
    model=linreg_model_object,
    predictor_names=list(training_predictor_table)
)
pyplot.show()

References

https://github.com/djgagne/ams-ml-python-course

AMS Machine Learning Course

AMS Machine Learning Course: Data Analysis and Preprocessing

AMS Machine Learning Course: Basic Machine Learning for Predicting Thunderstorm Rotation - Data