Random Forest Example: Cloud-resolving model sensitivity

In this example, we will use an ensemble of large-domain simulations of realistic shallow cloud fields to explore the sensitivity of shallow precipitation to local changes in the environment.

The simulation data we use for training the emulator is taken from a recent study Dagan and Stier (2020), where they performed ensemble daily simulations for one month-long period during December 2013 over the ocean to the East of Barbados, such that they samepled variability associated with shallow convection. Each day of the month consisted of two runs, both forced by realistic boundary conditions taken from reanalysis, but with different cloud droplet number concentrations (CDNC) to represent clean and polluted conditions. The altered CDNC was found to have little impact on the precipitation rate in the simulations, and so we simply treat the CDNC change as a perturbation to the initial conditions, and combine the two CDNC runs from each day together to increase the amount of data available for training the emulator. At hourly resolution, this provides us with 1488 data points.

However, given that the amount of precipitation is strongly tied to the local cloud regime, not fully controlling for cloud regime can introduce spurious correlations when training the emulator. As such we also filter out all hours which are not associated with shallow convective clouds. To do this, we consider domain-mean vertical profiles of total cloud water content (liquid + ice), q\(_{t}\), and filter out all hours where the vertical sum of q\(_{t}\) below 600hPa exceeds 10\(^{-6}\) kg/kg. This condition allows us to filter out hours associated with the onset and development of deep convection in the domain, as well as masking out hours with high cirrus layers or hours dominated by transient mesoscale convective activity which is advected in by the boundary conditions. After this, we are left with 850 hourly data point which meet our criteria and can be used to train the emulator.

In this example we emulate hourly precip using domain-mean features: liquid water path (LWP), geopotential height at 700hPa (z:math:_{700}), Estimated Inversion Strength (EIS), sea-surface temperature (SST) and the vertical pressure velocity at 700hPa (w700).

4d9933f8e73244ab95a35c944dd1a507

References:

Dagan, G. and Stier, P.: Ensemble daily simulations for elucidating cloud–aerosol interactions under a large spread of realistic environmental conditions, Atmos. Chem. Phys., 20, 6291–6303, https://doi.org/10.5194/acp-20-6291-2020, 2020.

[1]:
import numpy as np
import pandas as pd
import iris

from utils import get_crm_data
from esem.utils import plot_results, prettify_plot, add_121_line, leave_one_out

from esem import rf_model

from matplotlib import pyplot as plt
plt.style.use('default')
%matplotlib inline

Concatenate 20cdnc and 200cdnc runs into one dataframe

[2]:
df_crm = get_crm_data()
df_crm
[2]:
precip pres_msl LWP WS10 lhfl_s shfl_s LTS w500 w600 w700 wmax850 wstd850 zg500 zg600 zg700 rh850 rh700 u_shear EIS SST
0 0.004593 101407.410 0.035898 6.639860 -167.53857 5.745860 13.180252 -0.014463 -0.012311 -0.010275 -0.000024 0.000947 56627.516 42694.220 30541.566 67.243774 60.067740 -4.662799 0.989443 301.173248
1 0.006900 101356.266 0.044468 6.822748 -176.93939 4.438721 13.279678 -0.015064 -0.012710 -0.008676 0.000030 0.000382 56572.645 42640.473 30488.172 69.299180 58.453730 -4.322696 1.130803 301.173248
2 0.008916 101316.420 0.051559 6.798490 -182.61697 3.649221 13.333527 -0.014811 -0.012014 -0.006025 0.000642 0.000511 56525.613 42593.594 30442.703 71.522900 56.912193 -3.925541 1.242463 301.173248
3 0.008932 101270.490 0.057509 6.756970 -188.87599 3.033055 13.328018 -0.013470 -0.012141 -0.004758 0.001519 0.000476 56471.332 42539.062 30390.662 74.115690 55.652990 -3.556973 1.304206 301.173248
4 0.016204 101256.270 0.064226 6.763690 -194.85498 2.826119 13.317032 -0.010917 -0.011119 -0.003158 0.003252 0.000958 56443.758 42510.805 30364.852 77.510765 54.434470 -3.319007 1.362710 301.173248
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
845 0.063121 101309.750 0.064794 8.253145 -191.23718 12.219704 10.142059 -0.024480 -0.006400 -0.007968 -0.000044 0.001105 56084.273 42214.140 30222.945 83.696740 77.278465 -5.993636 -2.696190 300.126465
846 0.064601 101303.110 0.063914 8.326073 -192.57118 11.947702 10.162674 -0.019426 0.000300 -0.003904 -0.000034 0.000588 56071.547 42206.734 30214.715 84.196236 77.536760 -5.848422 -2.673406 300.126465
847 0.046773 101332.234 0.059974 8.404624 -193.80084 12.372276 10.166580 -0.014384 0.004355 -0.000284 -0.000251 0.000650 56079.492 42225.984 30236.312 84.394960 77.754560 -5.663757 -2.643809 300.126465
848 0.056623 101394.280 0.062895 8.385845 -192.18195 13.336615 10.149658 -0.016936 0.002702 0.000667 0.000013 0.000509 56117.140 42272.740 30286.500 84.437530 78.009740 -5.427930 -2.635981 300.126465
849 0.064975 101438.690 0.069100 8.429897 -192.28928 13.679647 10.164475 -0.021005 0.000413 0.000406 -0.000102 0.000838 56146.297 42305.730 30322.684 84.389620 78.030650 -5.215088 -2.612224 300.126465

850 rows × 20 columns

Extract the precipitation timeseries as target data

[3]:
precip = df_crm['precip'].to_numpy().reshape(-1 ,1)

Visualize the precipitation landscape

In the ensemble, shallow precipitation is highly correlated with many different physical features. Most obviously there is a high correlation with liquid water path (LWP), 10-metre windspeed (WS10) and geopotential height at 700hPa (z_{700}).

We can use these correlations to create “collapsing spaces” for investigating the relationships between shallow precipitation and the local meteorological environment.

[4]:
fig, axs = plt.subplots(ncols=3, figsize=(13,5), dpi=100, gridspec_kw={'width_ratios':[1,1,1]})

axs[0].scatter(df_crm['LWP'], df_crm['precip'], s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')
axs[0].set_xlabel('LWP [kg/m$^{2}$]')
axs[0].set_ylabel('Precip [mm/hr]')
prettify_plot(axs[0])

axs[1].scatter(df_crm['WS10'], df_crm['precip'], s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')
axs[1].set_xlabel('WS10 [m/s]')
axs[1].set_ylabel('Precip [mm/hr]')
prettify_plot(axs[1])

axs[2].scatter(df_crm['zg700'], df_crm['precip'], s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')
axs[2].set_xlabel('z700 [m]')
axs[2].set_ylabel('Precip [mm/hr]')
prettify_plot(axs[2])

fig.tight_layout()
#plt.savefig("Figs/1hr_D13shal_lwp-ws-pr.png", dpi=200)
../_images/examples_CRM_Emulation_with_RandomForest_7_0.png

Also, good to note that each of these predictors ``(LWP, WS10, z700)`` are mutually uncorrelated (see plots below)

[5]:
fig, axs = plt.subplots(ncols=3, figsize=(13,4), dpi=100, gridspec_kw={'width_ratios':[1,1,1]})

axs[0].scatter(df_crm['LWP'], df_crm['WS10'], s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')
axs[0].set_xlabel('LWP [kg/m$^{2}$]')
axs[0].set_ylabel('WS10 [m/s]')
prettify_plot(axs[0])

axs[1].scatter(df_crm['LWP'], df_crm['zg700'], s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')
axs[1].set_xlabel('LWP [kg/m$^{2}$]')
axs[1].set_ylabel('z700 [m]')
prettify_plot(axs[1])

axs[2].scatter(df_crm['WS10'], df_crm['zg700'], s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')
axs[2].set_xlabel('z700 [m]')
axs[2].set_ylabel('WS10 [m/s]')
prettify_plot(axs[2])

fig.tight_layout()
#plt.savefig("Figs/1hr_D13shal_lwp-ws-pr.png", dpi=200)
../_images/examples_CRM_Emulation_with_RandomForest_9_0.png

Stratifying precip by pairs of these predictors creates nice “collapsing spaces”

Nice for illustrating how the emulated surface compares to the raw data

[6]:
fig, axs = plt.subplots(ncols=3, figsize=(13,5), dpi=100, gridspec_kw={'width_ratios':[1,1,0.05]})

sc1 = axs[0].scatter(df_crm['LWP'], df_crm['zg700'], c=df_crm['precip'],
                vmin=0, vmax=0.12, s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')

axs[0].set_xlabel(r'LWP [kg m$^{-2}$]')
axs[0].set_ylabel(r'z$_{700}$ [m]')

axs[0].set_title("Hourly output: Dec2013, shallow clouds")
axs[0].legend(loc='lower right')

prettify_plot(axs[0])


sc2 = axs[1].scatter(df_crm['LWP'], df_crm['WS10'], c=df_crm['precip'],
                vmin=0, vmax=0.12, s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')

axs[1].set_xlabel(r'LWP [kg m$^{-2}$]')
axs[1].set_ylabel(r'10m-windspeed [m s$^{-1}$]')

axs[1].set_title("Hourly output: Dec2013, shallow clouds")
axs[1].legend(loc='lower right')

prettify_plot(axs[1])

fig.colorbar(sc1, cax=axs[2], label=r'precip [mm hr$^{-1}$]')

fig.tight_layout()
#plt.savefig("Figs/1hr_D13shal_lwp-ws-pr.png", dpi=200)
../_images/examples_CRM_Emulation_with_RandomForest_11_0.png

Emulation

Our aim is to emulate shallow precipitation as a function of the environmental conditions, and then plot the predictions in LWP-z700 space to compare with the scatter points above.

To do this we choose a set of predictors which are typical “cloud-controlling factors” such as SST, Estimated Inversion Strength, vertical velocity at 700 hPa, LWP and z700. Other variables could also be chosen and it’s worth exploring this just to get a sense for how the model behaves.

After validating the model using Leave-One-Out cross-validation, we then retrain the model using the full dataset, and use this model to predict the precipitation across a wide range of values. Finally, for the purpose of plotting in LWP-z700 space, we reduce the dimensionality of our final prediction by averaging over all features with aren’t LWP or z700. This gives us a smooth field to compare with the scatter points.

[7]:
params = df_crm.loc[:, ['LWP', 'zg700', 'EIS', 'SST', 'w700']]

print("The input params are: \n", params, "\n")
The input params are:
           LWP      zg700       EIS         SST      w700
0    0.035898  30541.566  0.989443  301.173248 -0.010275
1    0.044468  30488.172  1.130803  301.173248 -0.008676
2    0.051559  30442.703  1.242463  301.173248 -0.006025
3    0.057509  30390.662  1.304206  301.173248 -0.004758
4    0.064226  30364.852  1.362710  301.173248 -0.003158
..        ...        ...       ...         ...       ...
845  0.064794  30222.945 -2.696190  300.126465 -0.007968
846  0.063914  30214.715 -2.673406  300.126465 -0.003904
847  0.059974  30236.312 -2.643809  300.126465 -0.000284
848  0.062895  30286.500 -2.635981  300.126465  0.000667
849  0.069100  30322.684 -2.612224  300.126465  0.000406

[850 rows x 5 columns]

LeaveOneOut cross-validation and plotting

[8]:
%%time

# Ignore the mountain of warnings
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

outp = np.asarray(leave_one_out(Xdata=params, Ydata=precip, model='RandomForest', n_estimators=50, random_state=0))


Wall time: 2min 12s
<timed exec>:6: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
[9]:
truth_RF, pred_RF = outp[:,0], outp[:,1]
[10]:
from sklearn.metrics import mean_squared_error

""" Validation plot """
fig, ax = plt.subplots(dpi=150)

plot_results(ax, truth_RF, pred_RF, title="Random Forest Validation, Hourly data: NARVAL1_Shallow")

fig.tight_layout()
../_images/examples_CRM_Emulation_with_RandomForest_17_0.png

Now, retrain model on all data, and extrapolate over whole space

[11]:
X_train = params.to_numpy()
Y_train = precip.ravel()
model = rf_model(X_train, Y_train)
[12]:
model.train()
[13]:
%%time

# Now, make grid for plotting RF predictions
# more n_points means higher resolution, but takes exponentially longer
n_points = 30

min_vals = params.min()
max_vals = params.max()

# For uniform prediction over full params space
space=np.linspace(min_vals, max_vals, n_points)

# Reshape to (N,D)
reshape_to_ND = np.transpose(space)
Xs_uniform = np.meshgrid(*reshape_to_ND)
test = np.array([_.flatten() for _ in Xs_uniform]).T

# Predict
predictions,_ = model.predict(test)
predictions   = predictions.reshape(Xs_uniform[0].shape)

# Now, take mean over all parameters except [LWP, z700], assumed to be first 2 indices
predictions_reduced = np.mean(predictions, axis=tuple(range(2, predictions.ndim)))
Wall time: 1min 35s
[14]:
# Now, make grid for plotting RF predictions

LWP_grid   = np.linspace(min_vals['LWP'], max_vals['LWP'], num=n_points)
zg700_grid = np.linspace(min_vals['zg700'], max_vals['zg700'], num=n_points)

lwp, zg = np.meshgrid(LWP_grid, zg700_grid)

Plot!

[15]:
fig, ax = plt.subplots(ncols=3, figsize=(7,4), dpi=250, gridspec_kw={'width_ratios':[1, 0.05, 0.05]})
fig.suptitle("Hourly output: Dec2013, shallow clouds")

cp = ax[0].pcolormesh(LWP_grid, zg700_grid,
                      predictions_reduced, vmin=0, vmax=0.12, alpha=1)

fig.colorbar(cp, cax=ax[1], orientation='vertical', shrink=0.05, label=r'Precip [mm hr$^{-1}$]')

"""Overlap errors"""
ax[0].scatter(df_crm['LWP'], df_crm['zg700'], c=df_crm['precip'],
                    vmin=0, vmax=0.12, s=30, marker='o', edgecolors="None")

ers = ax[0].scatter(df_crm['LWP'], df_crm['zg700'], c=(truth_RF-pred_RF)/(truth_RF+pred_RF), facecolors="None",
                    vmin=-1, vmax=1, s=30, marker='o', lw=0.7, alpha=0.5, cmap='seismic')
ers.set_facecolors("None")
fig.colorbar(ers, cax=ax[2], label=r'$\frac{\mathrm{Truth-Prediction}}{\mathrm{Truth+Prediction}}$')

for idx, _ in enumerate(ax[:1]):
    _.set_xlabel(r'LWP [kg m$^{-2}$]')
    _.set_ylabel(r'z$_{700}$ [m]')

    if idx==0:
        _.set_xlim(min_vals['LWP'], max_vals['LWP'])
        _.set_ylim(min_vals['zg700'], max_vals['zg700'])


fig.tight_layout()
fig.subplots_adjust(top=0.85) # Put this AFTER tight_layout() call !

"""Add validation plot as inset to first axis"""
axins = ax[0].inset_axes([0.79, 0.79, 0.2, 0.2], # x0, y0, width, height
                          transform=ax[0].transAxes)
axins.scatter(truth_RF, pred_RF, s=2, alpha=0.3)
add_121_line(axins)
axins.set_xlabel('Model', position=(0.5,0), fontsize=8, labelpad=-0.01)
axins.set_xticks([0, 0.05, 0.1])
axins.set_xticklabels(labels=[0, 0.05, 0.1], fontdict={'fontsize':6})

axins.set_ylabel('Emulator', position=(0,0.5), fontsize=8, labelpad=-0.01)
axins.set_yticks([0, 0.05, 0.1])
axins.set_yticklabels(labels=[0, 0.05, 0.1], fontdict={'fontsize':6})

axins.tick_params(axis='both', which='major', pad=0.01)

#plt.savefig("./Figs/1hr_D13shal_lwp-zg700-pr_w-errorpoints.png", dpi=300)
C:\Users\duncan\AppData\Local\Temp/ipykernel_21624/1073200818.py:4: MatplotlibDeprecationWarning: shading='flat' when X and Y have the same dimensions as C is deprecated since 3.3.  Either specify the corners of the quadrilaterals with X and Y, or pass shading='auto', 'nearest' or 'gouraud', or set rcParams['pcolor.shading'].  This will become an error two minor releases later.
  cp = ax[0].pcolormesh(LWP_grid, zg700_grid,
../_images/examples_CRM_Emulation_with_RandomForest_24_1.png
[ ]: