Random Forest Example: Cloud-resolving model sensitivity¶

In this example, we will use an ensemble of large-domain simulations of realistic shallow cloud fields to explore the sensitivity of shallow precipitation to local changes in the environment.

The simulation data we use for training the emulator is taken from a recent study Dagan and Stier (2020), where they performed ensemble daily simulations for one month-long period during December 2013 over the ocean to the East of Barbados, such that they samepled variability associated with shallow convection. Each day of the month consisted of two runs, both forced by realistic boundary conditions taken from reanalysis, but with different cloud droplet number concentrations (CDNC) to represent clean and polluted conditions. The altered CDNC was found to have little impact on the precipitation rate in the simulations, and so we simply treat the CDNC change as a perturbation to the initial conditions, and combine the two CDNC runs from each day together to increase the amount of data available for training the emulator. At hourly resolution, this provides us with 1488 data points.

However, given that the amount of precipitation is strongly tied to the local cloud regime, not fully controlling for cloud regime can introduce spurious correlations when training the emulator. As such we also filter out all hours which are not associated with shallow convective clouds. To do this, we consider domain-mean vertical profiles of total cloud water content (liquid + ice), q\(_{t}\), and filter out all hours where the vertical sum of q\(_{t}\) below 600hPa exceeds 10\(^{-6}\) kg/kg. This condition allows us to filter out hours associated with the onset and development of deep convection in the domain, as well as masking out hours with high cirrus layers or hours dominated by transient mesoscale convective activity which is advected in by the boundary conditions. After this, we are left with 850 hourly data point which meet our criteria and can be used to train the emulator.

In this example we emulate hourly precip using domain-mean features: liquid water path (LWP), geopotential height at 700hPa (z:math:_{700}), Estimated Inversion Strength (EIS), sea-surface temperature (SST) and the vertical pressure velocity at 700hPa (w700).

4d9933f8e73244ab95a35c944dd1a507

References:

Dagan, G. and Stier, P.: Ensemble daily simulations for elucidating cloud–aerosol interactions under a large spread of realistic environmental conditions, Atmos. Chem. Phys., 20, 6291–6303, https://doi.org/10.5194/acp-20-6291-2020, 2020.

[1]:

import numpy as np
import pandas as pd
import iris

from utils import get_crm_data
from esem.utils import plot_results, prettify_plot, add_121_line, leave_one_out

from esem import rf_model

from matplotlib import pyplot as plt
plt.style.use('default')
%matplotlib inline

Concatenate 20cdnc and 200cdnc runs into one dataframe

[2]:

df_crm = get_crm_data()
df_crm

[2]:

	precip	pres_msl	LWP	WS10	lhfl_s	shfl_s	LTS	w500	w600	w700	wmax850	wstd850	zg500	zg600	zg700	rh850	rh700	u_shear	EIS	SST
0	0.004593	101407.410	0.035898	6.639860	-167.53857	5.745860	13.180252	-0.014463	-0.012311	-0.010275	-0.000024	0.000947	56627.516	42694.220	30541.566	67.243774	60.067740	-4.662799	0.989443	301.173248
1	0.006900	101356.266	0.044468	6.822748	-176.93939	4.438721	13.279678	-0.015064	-0.012710	-0.008676	0.000030	0.000382	56572.645	42640.473	30488.172	69.299180	58.453730	-4.322696	1.130803	301.173248
2	0.008916	101316.420	0.051559	6.798490	-182.61697	3.649221	13.333527	-0.014811	-0.012014	-0.006025	0.000642	0.000511	56525.613	42593.594	30442.703	71.522900	56.912193	-3.925541	1.242463	301.173248
3	0.008932	101270.490	0.057509	6.756970	-188.87599	3.033055	13.328018	-0.013470	-0.012141	-0.004758	0.001519	0.000476	56471.332	42539.062	30390.662	74.115690	55.652990	-3.556973	1.304206	301.173248
4	0.016204	101256.270	0.064226	6.763690	-194.85498	2.826119	13.317032	-0.010917	-0.011119	-0.003158	0.003252	0.000958	56443.758	42510.805	30364.852	77.510765	54.434470	-3.319007	1.362710	301.173248
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
845	0.063121	101309.750	0.064794	8.253145	-191.23718	12.219704	10.142059	-0.024480	-0.006400	-0.007968	-0.000044	0.001105	56084.273	42214.140	30222.945	83.696740	77.278465	-5.993636	-2.696190	300.126465
846	0.064601	101303.110	0.063914	8.326073	-192.57118	11.947702	10.162674	-0.019426	0.000300	-0.003904	-0.000034	0.000588	56071.547	42206.734	30214.715	84.196236	77.536760	-5.848422	-2.673406	300.126465
847	0.046773	101332.234	0.059974	8.404624	-193.80084	12.372276	10.166580	-0.014384	0.004355	-0.000284	-0.000251	0.000650	56079.492	42225.984	30236.312	84.394960	77.754560	-5.663757	-2.643809	300.126465
848	0.056623	101394.280	0.062895	8.385845	-192.18195	13.336615	10.149658	-0.016936	0.002702	0.000667	0.000013	0.000509	56117.140	42272.740	30286.500	84.437530	78.009740	-5.427930	-2.635981	300.126465
849	0.064975	101438.690	0.069100	8.429897	-192.28928	13.679647	10.164475	-0.021005	0.000413	0.000406	-0.000102	0.000838	56146.297	42305.730	30322.684	84.389620	78.030650	-5.215088	-2.612224	300.126465

850 rows × 20 columns

Extract the precipitation timeseries as target data

[3]:

precip = df_crm['precip'].to_numpy().reshape(-1 ,1)

Visualize the precipitation landscape¶

In the ensemble, shallow precipitation is highly correlated with many different physical features. Most obviously there is a high correlation with liquid water path (LWP), 10-metre windspeed (WS10) and geopotential height at 700hPa (z_{700}).

We can use these correlations to create “collapsing spaces” for investigating the relationships between shallow precipitation and the local meteorological environment.

[4]:

fig, axs = plt.subplots(ncols=3, figsize=(13,5), dpi=100, gridspec_kw={'width_ratios':[1,1,1]})

axs[0].scatter(df_crm['LWP'], df_crm['precip'], s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')
axs[0].set_xlabel('LWP [kg/m$^{2}$]')
axs[0].set_ylabel('Precip [mm/hr]')
prettify_plot(axs[0])

axs[1].scatter(df_crm['WS10'], df_crm['precip'], s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')
axs[1].set_xlabel('WS10 [m/s]')
axs[1].set_ylabel('Precip [mm/hr]')
prettify_plot(axs[1])

axs[2].scatter(df_crm['zg700'], df_crm['precip'], s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')
axs[2].set_xlabel('z700 [m]')
axs[2].set_ylabel('Precip [mm/hr]')
prettify_plot(axs[2])

fig.tight_layout()
#plt.savefig("Figs/1hr_D13shal_lwp-ws-pr.png", dpi=200)

../_images/examples_CRM_Emulation_with_RandomForest_7_0.png

Also, good to note that each of these predictors ``(LWP, WS10, z700)`` are mutually uncorrelated (see plots below)

[5]:

fig, axs = plt.subplots(ncols=3, figsize=(13,4), dpi=100, gridspec_kw={'width_ratios':[1,1,1]})

axs[0].scatter(df_crm['LWP'], df_crm['WS10'], s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')
axs[0].set_xlabel('LWP [kg/m$^{2}$]')
axs[0].set_ylabel('WS10 [m/s]')
prettify_plot(axs[0])

axs[1].scatter(df_crm['LWP'], df_crm['zg700'], s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')
axs[1].set_xlabel('LWP [kg/m$^{2}$]')
axs[1].set_ylabel('z700 [m]')
prettify_plot(axs[1])

axs[2].scatter(df_crm['WS10'], df_crm['zg700'], s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')
axs[2].set_xlabel('z700 [m]')
axs[2].set_ylabel('WS10 [m/s]')
prettify_plot(axs[2])

fig.tight_layout()
#plt.savefig("Figs/1hr_D13shal_lwp-ws-pr.png", dpi=200)

../_images/examples_CRM_Emulation_with_RandomForest_9_0.png

Stratifying precip by pairs of these predictors creates nice “collapsing spaces”

Nice for illustrating how the emulated surface compares to the raw data

[6]:

fig, axs = plt.subplots(ncols=3, figsize=(13,5), dpi=100, gridspec_kw={'width_ratios':[1,1,0.05]})

sc1 = axs[0].scatter(df_crm['LWP'], df_crm['zg700'], c=df_crm['precip'],
                vmin=0, vmax=0.12, s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')

axs[0].set_xlabel(r'LWP [kg m$^{-2}$]')
axs[0].set_ylabel(r'z$_{700}$ [m]')

axs[0].set_title("Hourly output: Dec2013, shallow clouds")
axs[0].legend(loc='lower right')

prettify_plot(axs[0])


sc2 = axs[1].scatter(df_crm['LWP'], df_crm['WS10'], c=df_crm['precip'],
                vmin=0, vmax=0.12, s=30, marker='o', alpha=0.7, label=r'CDNC$_{20}$ and CDNC$_{200}$')

axs[1].set_xlabel(r'LWP [kg m$^{-2}$]')
axs[1].set_ylabel(r'10m-windspeed [m s$^{-1}$]')

axs[1].set_title("Hourly output: Dec2013, shallow clouds")
axs[1].legend(loc='lower right')

prettify_plot(axs[1])

fig.colorbar(sc1, cax=axs[2], label=r'precip [mm hr$^{-1}$]')

fig.tight_layout()
#plt.savefig("Figs/1hr_D13shal_lwp-ws-pr.png", dpi=200)

Emulation¶

Our aim is to emulate shallow precipitation as a function of the environmental conditions, and then plot the predictions in LWP-z700 space to compare with the scatter points above.

To do this we choose a set of predictors which are typical “cloud-controlling factors” such as SST, Estimated Inversion Strength, vertical velocity at 700 hPa, LWP and z700. Other variables could also be chosen and it’s worth exploring this just to get a sense for how the model behaves.

After validating the model using Leave-One-Out cross-validation, we then retrain the model using the full dataset, and use this model to predict the precipitation across a wide range of values. Finally, for the purpose of plotting in LWP-z700 space, we reduce the dimensionality of our final prediction by averaging over all features with aren’t LWP or z700. This gives us a smooth field to compare with the scatter points.

[7]:

params = df_crm.loc[:, ['LWP', 'zg700', 'EIS', 'SST', 'w700']]

print("The input params are: \n", params, "\n")

The input params are:
           LWP      zg700       EIS         SST      w700
0    0.035898  30541.566  0.989443  301.173248 -0.010275
1    0.044468  30488.172  1.130803  301.173248 -0.008676
2    0.051559  30442.703  1.242463  301.173248 -0.006025
3    0.057509  30390.662  1.304206  301.173248 -0.004758
4    0.064226  30364.852  1.362710  301.173248 -0.003158
..        ...        ...       ...         ...       ...
845  0.064794  30222.945 -2.696190  300.126465 -0.007968
846  0.063914  30214.715 -2.673406  300.126465 -0.003904
847  0.059974  30236.312 -2.643809  300.126465 -0.000284
848  0.062895  30286.500 -2.635981  300.126465  0.000667
849  0.069100  30322.684 -2.612224  300.126465  0.000406

[850 rows x 5 columns]

LeaveOneOut cross-validation and plotting¶

[8]:

%%time

# Ignore the mountain of warnings
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

outp = np.asarray(leave_one_out(Xdata=params, Ydata=precip, model='RandomForest', n_estimators=50, random_state=0))

Wall time: 2min 12s

<timed exec>:6: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.

[9]:

truth_RF, pred_RF = outp[:,0], outp[:,1]

[10]:

from sklearn.metrics import mean_squared_error

""" Validation plot """
fig, ax = plt.subplots(dpi=150)

plot_results(ax, truth_RF, pred_RF, title="Random Forest Validation, Hourly data: NARVAL1_Shallow")

fig.tight_layout()

../_images/examples_CRM_Emulation_with_RandomForest_17_0.png

Now, retrain model on all data, and extrapolate over whole space¶

[11]:

X_train = params.to_numpy()
Y_train = precip.ravel()
model = rf_model(X_train, Y_train)

[12]:

model.train()

[13]:

%%time

# Now, make grid for plotting RF predictions
# more n_points means higher resolution, but takes exponentially longer
n_points = 30

min_vals = params.min()
max_vals = params.max()

# For uniform prediction over full params space
space=np.linspace(min_vals, max_vals, n_points)

# Reshape to (N,D)
reshape_to_ND = np.transpose(space)
Xs_uniform = np.meshgrid(*reshape_to_ND)
test = np.array([_.flatten() for _ in Xs_uniform]).T

# Predict
predictions,_ = model.predict(test)
predictions   = predictions.reshape(Xs_uniform[0].shape)

# Now, take mean over all parameters except [LWP, z700], assumed to be first 2 indices
predictions_reduced = np.mean(predictions, axis=tuple(range(2, predictions.ndim)))

Wall time: 1min 35s

[14]:

# Now, make grid for plotting RF predictions

LWP_grid   = np.linspace(min_vals['LWP'], max_vals['LWP'], num=n_points)
zg700_grid = np.linspace(min_vals['zg700'], max_vals['zg700'], num=n_points)

lwp, zg = np.meshgrid(LWP_grid, zg700_grid)

Plot!¶

[15]:

fig, ax = plt.subplots(ncols=3, figsize=(7,4), dpi=250, gridspec_kw={'width_ratios':[1, 0.05, 0.05]})
fig.suptitle("Hourly output: Dec2013, shallow clouds")

cp = ax[0].pcolormesh(LWP_grid, zg700_grid,
                      predictions_reduced, vmin=0, vmax=0.12, alpha=1)

fig.colorbar(cp, cax=ax[1], orientation='vertical', shrink=0.05, label=r'Precip [mm hr$^{-1}$]')

"""Overlap errors"""
ax[0].scatter(df_crm['LWP'], df_crm['zg700'], c=df_crm['precip'],
                    vmin=0, vmax=0.12, s=30, marker='o', edgecolors="None")

ers = ax[0].scatter(df_crm['LWP'], df_crm['zg700'], c=(truth_RF-pred_RF)/(truth_RF+pred_RF), facecolors="None",
                    vmin=-1, vmax=1, s=30, marker='o', lw=0.7, alpha=0.5, cmap='seismic')
ers.set_facecolors("None")
fig.colorbar(ers, cax=ax[2], label=r'$\frac{\mathrm{Truth-Prediction}}{\mathrm{Truth+Prediction}}$')

for idx, _ in enumerate(ax[:1]):
    _.set_xlabel(r'LWP [kg m$^{-2}$]')
    _.set_ylabel(r'z$_{700}$ [m]')

    if idx==0:
        _.set_xlim(min_vals['LWP'], max_vals['LWP'])
        _.set_ylim(min_vals['zg700'], max_vals['zg700'])


fig.tight_layout()
fig.subplots_adjust(top=0.85) # Put this AFTER tight_layout() call !

"""Add validation plot as inset to first axis"""
axins = ax[0].inset_axes([0.79, 0.79, 0.2, 0.2], # x0, y0, width, height
                          transform=ax[0].transAxes)
axins.scatter(truth_RF, pred_RF, s=2, alpha=0.3)
add_121_line(axins)
axins.set_xlabel('Model', position=(0.5,0), fontsize=8, labelpad=-0.01)
axins.set_xticks([0, 0.05, 0.1])
axins.set_xticklabels(labels=[0, 0.05, 0.1], fontdict={'fontsize':6})

axins.set_ylabel('Emulator', position=(0,0.5), fontsize=8, labelpad=-0.01)
axins.set_yticks([0, 0.05, 0.1])
axins.set_yticklabels(labels=[0, 0.05, 0.1], fontdict={'fontsize':6})

axins.tick_params(axis='both', which='major', pad=0.01)

#plt.savefig("./Figs/1hr_D13shal_lwp-zg700-pr_w-errorpoints.png", dpi=300)

C:\Users\duncan\AppData\Local\Temp/ipykernel_21624/1073200818.py:4: MatplotlibDeprecationWarning: shading='flat' when X and Y have the same dimensions as C is deprecated since 3.3.  Either specify the corners of the quadrilaterals with X and Y, or pass shading='auto', 'nearest' or 'gouraud', or set rcParams['pcolor.shading'].  This will become an error two minor releases later.
  cp = ax[0].pcolormesh(LWP_grid, zg700_grid,

../_images/examples_CRM_Emulation_with_RandomForest_24_1.png

[ ]:

Random Forest Example: Cloud-resolving model sensitivity¶

Visualize the precipitation landscape¶

Emulation¶

LeaveOneOut cross-validation and plotting¶

Now, retrain model on all data, and extrapolate over whole space¶

Plot!¶

esem

Navigation

Related Topics