Financial Applications of Reinforcement Learning¶
Advancements in Reinforcement Learning can be applied in a financial setting. In this subsection, two financial applications are introduced. The first applies the G-learning algorithm, described in the previous section, in a goal-based wealth management environment. The second presents a model for optimal consumption, life insurance and investment whose solution is characterized by a PDE with a terminal condition. This model can be extended in multiple ways and can be applied in a broad range of financial scenarios, making it an attractive model for consumers. Its main problem is the curse of dimensionality, which makes the Hamilton-Jacobi-Bellman (HJB) equation of the model intractable in higher dimensions. The deep BSDE method, presented in the previous section, can be applied to this problem, thereby solving the HJB equation in higher dimensions. Although the method originates from the optimal control literature, it shares many features with reinforcement learning algorithms.
G-learner for goal-based retirement plan optimization¶
Before the algorithm is applied in the retirement plan optimization setting, the G-learner needs to be restated for the finite-horizon (terminal) setting. Recall from the G-learner section that the action free-energy function is
where
The optimal action policy can be derived from \(F^{\pi}(s)\) by maximizing the function as in (15) and then evaluating \(F^{\pi}(s)\) at that point (16). Substituting (16) into (15) gives us the optimal action policy
These three equations (22), (23) and (24) form a system of equations for G-learning that should be solved self-consistently for \(\pi(a_t|s_t)\), \(H^{\pi}_t(s_t,a_t)\) and \(F^{\pi}_t(s_t)\) by backward recursion for \(t = T-1, \ldots, 0\), with terminal conditions
The system of equations can be reduced to a non-linear equation when the rewards are observed.
This is the soft relaxation of the Bellman optimality equation for the action-value Q-function, described in the G-learner section.
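In the standard notation of the G-learning literature, this system takes the following form (a hedged reconstruction for reference only, with inverse temperature \(\beta\), discount factor \(\gamma\), one-step reward \(\hat{R}_t\) and prior policy \(\pi_0\); the document's own equations (22)–(24) remain the authoritative statement):

$$
\begin{aligned}
H^{\pi}_t(s_t, a_t) &= \hat{R}_t(s_t, a_t) + \gamma\, \mathbb{E}\!\left[ F^{\pi}_{t+1}(s_{t+1}) \mid s_t, a_t \right], \\
F^{\pi}_t(s_t) &= \frac{1}{\beta} \log \sum_{a_t} \pi_0(a_t \mid s_t)\, e^{\beta H^{\pi}_t(s_t, a_t)}, \\
\pi(a_t \mid s_t) &= \pi_0(a_t \mid s_t)\, e^{\beta \left( H^{\pi}_t(s_t, a_t) - F^{\pi}_t(s_t) \right)}.
\end{aligned}
$$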
Before the G-learning algorithm can be applied, the signal process of the environment should be presented. Therefore, in the first subsection we build a pre-specified target for our portfolio optimization together with the return specifications. The algorithm can then optimize its behavior by adapting to the target.
The target of the investment portfolio¶
Following the assumptions and notation of the model in [DH20] for the investment portfolio, the model can be described as follows. The positions in the assets \(n = 1, 2, \ldots, N\) of the portfolio are given in dollar values, denoted as a vector \(s_t\) with components \((s_t)_n\) for the dollar value of asset \(n\) at the beginning of period \(t\). The first asset \(n=1\) of the portfolio is a risk-free bank account with risk-free interest rate \(r_f\), while the other assets are risky with uncertain returns \(r_t\), whose expected values are \(\overline{r}_t\). The covariance matrix of the returns is \(\Sigma_r\), of size \((N-1) \times (N-1)\). The trades of the investment portfolio at time step \(t\) are denoted \(u_t\), while \(c_t\) is the cash installment into the plan at time \(t\). Both \(u_t\) and \(c_t\) need to be optimized, and this pair \((c_t, u_t)\) can thus be considered the action variables of the dynamic optimization problem corresponding to the retirement plan.
A pre-specified target value \(\hat{P}_{t+1}\) is set at time \(t\) for the next time step \(t+1\). The target value \(\hat{P}_{t+1}\) is set at step \(t\) so that it exceeds the next-step portfolio value \(V_{t+1} = (1+ r_t)(s_t + u_t)\). The reward of the environment is defined such that a penalty is incurred when the portfolio value falls short of the target value.
where \(\left(\hat{P}_{t+1} - (1+r_t)(s_t + u_t)\right)_+ = \max \left( \hat{P}_{t+1} - (1+r_t)(s_t + u_t), 0 \right)\). The first term is the installment amount \(c_t\) at the beginning of time period \(t\), the second is the penalty incurred if the portfolio underperforms the target value, and the last term is an approximation of transaction costs using a convex function with parameter matrix \(\Omega\), which also serves as an \(L_2\) regularization (see next section).
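To make the reward structure concrete, the sketch below evaluates this one-step cost for given positions, trades and realized returns. It is a minimal illustration rather than the document's implementation: the penalty weight `lam` (the parameter \(\lambda\) listed later), the sign convention (a cost to be minimized) and all variable names are assumptions made for the example.

```python
import numpy as np

def one_step_cost(s_t, u_t, c_t, r_t, P_hat_next, lam, Omega):
    """Sketch of the one-step cost in (25): cash installment, target-shortfall
    penalty and convex transaction costs (sign convention assumed here)."""
    V_next = (1.0 + r_t) @ (s_t + u_t)            # next-period portfolio value
    shortfall = max(P_hat_next - V_next, 0.0)     # (P_hat_{t+1} - (1+r_t)(s_t+u_t))_+
    return c_t + lam * shortfall + u_t @ Omega @ u_t

# Illustrative numbers only.
s = np.array([100.0, 50.0, 50.0])
u = np.array([5.0, -2.0, 3.0])
r = np.array([0.02, 0.05, 0.04])
print(one_step_cost(s, u, c_t=6.0, r_t=r, P_hat_next=220.0, lam=0.5, Omega=0.001 * np.eye(3)))
```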
We modify (25) in two ways, for two reasons. First, the two decision variables \(c_t\) and \(u_t\) are not independent and are linked by the following constraint:
thereby setting the total change in all positions equal to the cash installment at time \(t\). Second, the \(\max(\cdot,0)\) operator is difficult to work with under the expectation and is therefore approximated with a quadratic function. This gives us
The benefits of the adapted value function are twofold. First, it resolves the constraint between the cash injection \(c_t\) and the trades \(u_t\), reducing the dimensionality of the optimization problem. Second, it makes the reward function highly tractable by turning it into a quadratic function of the actions \(u_t\). The only drawback is that the penalization is symmetric, penalizing both \(V_{t+1} \gg \hat{P}_{t+1}\) and \(V_{t+1} \ll \hat{P}_{t+1}\). To mitigate this drawback, we only consider target values considerably higher than the expectation of the next-period portfolio value. A good choice for the target value \(\hat{P}_{t+1}\) is, for example, a linear combination of a portfolio-independent benchmark \(B_t\) and fixed-rate growth \(\eta\) of the current portfolio:
where \(0 \leq \rho \leq 1\) is the relative weight of the two terms and \(\eta > 1\) defines the desired growth rate of the current portfolio. Note that \(B_t\) and \(\eta\) need to be sufficiently large for (26) to be a reasonable proxy of (25). The advantage of this target portfolio is that both parameters can be learned from the observed behavior of a financial agent using Inverse Reinforcement Learning.
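As a small sketch of this choice of target (the exact placement of the growth factor, applied here to the post-trade portfolio value, is an assumption made for illustration):

```python
def target_value(s_t, u_t, B_next, rho, eta):
    """Convex combination of a portfolio-independent benchmark B_next and
    fixed-rate growth eta applied to the post-trade portfolio value
    (the latter placement is an illustrative assumption)."""
    return rho * B_next + (1.0 - rho) * eta * (s_t + u_t).sum()
```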
Equation (26) can be written in quadratic form once we decompose the return as \(r_t = \overline{r}_t + \tilde{\epsilon}_t\), where the first component is the expected return and \(\tilde{\epsilon}_t\) is idiosyncratic noise with covariance \(\Sigma_r\) of size \((N-1) \times (N-1)\).
where
The free parameters defining the reward function are thus \(\lambda\), \(\eta\), \(\rho\) and \(\Omega\).
Applying G-learning to the investment portfolio¶
A semi-analytical formulation of G-learning is applied, following Dixon et al. [DH20]. First, a functional form of the value function as a quadratic form of \(s_t\) is used:
The dynamic equation is written as follows
where the expected returns \(\overline{r}_t\) are available as the output of a separate statistical model, such as a factor model framework. The coefficients of (29) are computed backward in time, starting from the last maturity \(t = T-1\). At time step \(t = T-1\), the reward function can be optimized analytically by the following action:
where \(\tilde{\Sigma}_{T-1}\) is defined as
Notice that the last term \(\Omega\), which represents the convex costs in (25), creates an \(L_2\) regularization of the matrix inversion in (32).
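A small numerical sketch of this regularized inversion follows. The quadratic and linear terms below are schematic stand-ins for the exact expressions in (31)–(32); the only point illustrated is that adding the convex-cost matrix \(\Omega\) to the system matrix acts as ridge-style (\(L_2\)) regularization of the solve.

```python
import numpy as np

def terminal_action(A_quad, b_lin, Omega):
    """Maximize a quadratic objective  -0.5 u^T A u + b^T u - u^T Omega u  over the action u.
    A_quad and b_lin are schematic stand-ins for the covariance-driven terms of (31)-(32);
    Omega keeps the linear system well conditioned (L2 regularization)."""
    return np.linalg.solve(A_quad + 2.0 * Omega, b_lin)

# Illustrative numbers: a nearly singular A_quad is stabilized by Omega.
A = np.array([[1.0, 0.999], [0.999, 1.0]])
b = np.array([0.10, 0.09])
print(terminal_action(A, b, Omega=0.01 * np.eye(2)))
```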
Now that the reward function is optimized for time step \(t = T-1\), the coefficients \(F_{T-1}^{(ss)}\), \(F_{T-1}^{(s)}\) and \(F_{T-1}^{(0)}\) of the value function can be calculated. We know that at time step \(t = T-1\) the value function must equal the reward function, \( F_{T-1}^{\pi}(s_{T-1}) = \hat{R}_{T-1}\). Indeed, by plugging (31) back into (27) and comparing the result with (29), we obtain the following terminal conditions for the parameters of (29):
Any other time step \(t = T-2, \ldots, 0\) is computed by backward recursion following the Bellman equation. First, the conditional expectation of the next-period F-function is computed using (30):
where \(\overline{F}_{t+1}^{(ss)} = \mathbb{E}[F_{t+1}^{(ss)}]\), and similarly for \(\overline{F}_{t+1}^{(x)}\) and \(\overline{F}_{t+1}^{(0)}\). Now we plug the conditional expectation of the F-function and the reward function into the Bellman equation
Notice that both the reward function (27) and the conditional expectation of the F-function are quadratic functions of \(x_t\) and \(u_t\) in the Bellman equation. This means that the action-value function is also a quadratic function of \(x_t\) and \(u_t\):
where
Now that the action-value function is computed, we only need to calculate the F-function for the current step. Recall from (16) that the F-function can be expressed in terms of the prior policy distribution \(\pi_0(u_t|x_t)\) and the action-value function \(H_t^{\pi}(s_t, u_t)\).
Let the prior policy be a Gaussian:
where the mean value is a linear function of the state \(x_t\).
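For concreteness, sampling from such a prior can be sketched as follows; the parameter names `u_hat`, `v_hat` and `Sigma_p` are illustrative placeholders for the mean intercept, mean slope and covariance of \(\pi_0\).

```python
import numpy as np

def sample_prior_action(x_t, u_hat, v_hat, Sigma_p, rng=None):
    """Draw u_t ~ pi_0(u_t | x_t) = N(u_hat + v_hat @ x_t, Sigma_p):
    a Gaussian prior policy whose mean is linear in the state x_t."""
    rng = rng or np.random.default_rng()
    return rng.multivariate_normal(u_hat + v_hat @ x_t, Sigma_p)
```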
By applying the n-dimensional Gaussian integration formula, the integration over \(u_t\) in (35) can be performed analytically
where \(|A|\) denotes the determinant of matrix \(A\).
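For reference, the n-dimensional Gaussian integration formula used here is the standard identity, valid for a symmetric positive-definite matrix \(A\):

$$
\int_{\mathbb{R}^n} \exp\!\left(-\tfrac{1}{2}\, u^{\top} A\, u + b^{\top} u\right) du
= \sqrt{\frac{(2\pi)^n}{|A|}}\; \exp\!\left(\tfrac{1}{2}\, b^{\top} A^{-1} b\right).
$$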
Once the Gaussian integration is performed, we can again compare the resulting expression with (29) and obtain the following coefficients:
with auxiliary parameters
Now from (24) we know that the optimal policy is
By applying the action-value function \(G^{\pi}_t(x_t,u_t)\) and the entropy-regularized value function (36) to the optimal policy, we obtain the update of the policy parameters for time step \(t\).
where
The G-learning algorithm, like any other RL algorithm, follows a generalized policy iteration (GPI) scheme: it alternates a policy evaluation and a policy improvement process. Namely, equations (33) and (36) evaluate the current policy, while (37) improves the policy. These processes work in tandem at each time step \(t\) until the convergence criterion is met. Note that there is an additional step to calculate the optimal contribution at each time step, because of the budget constraint developed in the subsection on building the target: by simply summing the components of the optimal action \(u_t\) at time step \(t\), we obtain the optimal cash contribution. A numerical example of the G-learner algorithm for a goal-based retirement plan can be found in [DHB20].
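Structurally, the resulting algorithm is a backward sweep that alternates the two processes. The sketch below only shows this control flow, with the coefficient updates of (33)–(37) abstracted into caller-supplied functions, plus the recovery of the cash contribution from the budget constraint; it is a schematic outline, not the document's implementation.

```python
import numpy as np

def g_learning_sweep(T, policy_params, evaluate_policy, improve_policy, terminal_coeffs):
    """Structural sketch of one G-learning backward sweep (t = T-2, ..., 0, with the
    terminal coefficients fixed at t = T-1).  `evaluate_policy` stands for the
    policy-evaluation updates (eqs. (33), (36)) and `improve_policy` for the
    policy-improvement update (eq. (37)); neither is spelled out here.
    In practice the sweep is repeated until the convergence criterion is met."""
    F_coeffs = {T - 1: terminal_coeffs}
    for t in reversed(range(T - 1)):
        F_coeffs[t] = evaluate_policy(t, policy_params, F_coeffs[t + 1])  # policy evaluation
        policy_params = improve_policy(t, policy_params, F_coeffs[t])     # policy improvement
    return policy_params, F_coeffs

def cash_contribution(u_t):
    """Budget constraint from the target subsection: c_t = 1^T u_t."""
    return np.asarray(u_t).sum()
```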
Reinforcement learning in optimal control: applying the Deep BSDE method¶
For the Deep BSDE method to be useful in financial scenarios, a financial model whose solution is characterized by a PDE with a terminal condition should first be derived. In this chapter, we introduce such a model: the optimal consumption, investment and life insurance model. Before applying the model following Ye [Ye06], an introduction to the model is given together with various extensions of Ye's formulation.
Optimal consumption, investment and life insurance model¶
The first person to include uncertain lifetime and life insurance decisions in a discrete life-cycle model was Yaari [Yaa65]. He explored the model using a utility function without bequest (Fisher utility function) and a utility function with bequest (Marshall utility function) over a bounded lifetime. In both cases, he looked at the implications of including life insurance. Although Yaari’s model was revolutionary in the sense that the uncertainty of life could now be modeled, Leung [Leu94] found that the constraints laid upon the Fisher utility function were not adequate and led to terminal wealth depletion. Richard [Ric75] applied the methodology of Merton [Mer69, Mer75] to the problem setting of Yaari in a continuous time frame. Unfortunately, Richard’s model had one deficiency: the bounded lifetime is incompatible with the dynamic programming approach used in Merton’s model. As an individual approaches his maximal possible lifetime \(T\), he will be inclined to buy an infinite amount of life insurance. To circumvent this, Richard imposed an artificial condition on the terminal value, but due to the recursive nature of dynamic programming, modifying the last value implies modifying the whole result. Ye [Ye06] solved the problem by abandoning the bounded random lifetime and replacing it with a random variable taking values in \([0,\infty)\). The models that replaced the bounded lifetime are known as intertemporal models, as they do not consider the whole lifetime of an individual but rather the planning horizon of the consumer. Note that the general setting of Ye [Ye06] covers a wide range of model variables while still remaining flexible across different financial settings. On this account, it is a good baseline for confronting the issues concerning current models of financial planning.
After Ye [Ye06], various models have been proposed, each giving rise to unique solutions to the consumption, investment and insurance problem. The first distinctive setting involves multiple agents. For example, Bruhn and Steffensen [BS11] analyzed the optimization problem for couples with correlated lifetimes who nominate their partner as their beneficiary, using a copula and common-shock model, while Wei et al. [WCJW20] studied optimization strategies for a household with economically and probabilistically dependent persons. Another setting uses certain constraints to better describe the financial situation of consumers. Namely, Kronborg and Steffensen [KS15] discussed two constraints: a capital constraint under which savings cannot drop below zero, and a constraint imposing a minimum return on savings. A third setting describes models that analyze the financial market and insurance market in a pragmatic environment. A good illustration is the study of Shen and Wei [SW16], who incorporate all stochastic processes involved in the investment and insurance market, with all randomness described by a Brownian motion filtration. An interesting body of models deals with time-inconsistent preferences. In this framework, consumers do not have a time-consistent rate of preference as assumed in the economic literature; rather, there exists a divergence between earlier intentions and later choices (De-Paz et al. [DPMSNR14]). This concept is predominantly described in psychology: rewards presented closer to the present are discounted proportionally less than rewards further in the future. An application of time-inconsistent preferences to the consumption, investment and insurance optimization can be found in Chen and Li [CL20] and De-Paz et al. [DPMSNR14].
The model specifications¶
In this section, I will set out the dynamics of the baseline model. The dynamics follow primarily from the paper of Ye [Ye06].
Let the state of the economy be represented by a standard Brownian motion \(W(t)\), the state of the consumer’s wealth be characterized by a finite-state multi-dimensional continuous-time Markov chain \(X(t)\), and let the time of death be defined by a non-negative random variable \(\tau\). All are defined on a given probability space \((\Omega, \mathcal{F}, \mathbb{P})\), and \(W(t)\) is independent of \(\tau\). Let \(T < \infty\) be a fixed planning horizon, which can be seen as the end of the consumer's working life. Let \(\mathbb{F} = \{\mathcal{F}_t, t \in [0,T]\}\) be the \(\mathbb{P}\)-augmentation of the filtration \(\sigma\{W(s), s<t\}\), \(\forall t \in [0,T]\), so that \(\mathcal{F}_t\) represents the information available at time \(t\). The economy consists of a financial market and an insurance market. In the following section I will construct these markets separately, following Ye [Ye06].
The financial market consist of a risk-free security \(B(t)\) and a risky security \(S(t)\), which evolve according to
where \(\mu(t), r(t), \sigma(t): [0,T] \to R\) are continuous deterministic functions (constants \(\mu, \sigma, r > 0\) in the simplest case), with \(\sigma(t)\) satisfying \(\sigma^2(t) \ge k, \forall t \in [0,T]\).
For the insurance market, the random variable \(\tau\) first needs to be modeled. Assume that \(\tau\) has a probability density function \(f(t)\) and a probability distribution function given by
assuming \(\tau\) is independent of the filtration \(\mathbb{F}\).
From the probability distribution function, we can define the survival function as follows
The hazard function is the instantaneous death rate for the consumer at time t and is defined by
where \(\lambda(t): [0,\infty[ \to R^+\) is a continuous, deterministic function with \(\int_0^\infty \lambda(t) dt = \infty\).
Subsequently, the survival and probability density function can be characterized by
with conditional probability described as
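These relations translate directly into code. The sketch below computes the survival function, the density and the conditional survival probability by numerical integration of the hazard rate; the Gompertz-type hazard and its parameters at the end are an illustrative choice, not the document's.

```python
import numpy as np
from scipy.integrate import quad

def survival(t, hazard):
    """S(t) = exp(-int_0^t lambda(s) ds): probability of surviving beyond t."""
    integral, _ = quad(hazard, 0.0, t)
    return np.exp(-integral)

def density(t, hazard):
    """f(t) = lambda(t) * S(t): density of the time of death."""
    return hazard(t) * survival(t, hazard)

def conditional_survival(s, t, hazard):
    """P(tau > s | tau > t) = exp(-int_t^s lambda(u) du) for s >= t."""
    integral, _ = quad(hazard, t, s)
    return np.exp(-integral)

# Example with a Gompertz-type hazard (illustrative parameters only).
gompertz = lambda t: 0.0005 * np.exp(0.08 * t)
print(survival(40.0, gompertz), conditional_survival(45.0, 40.0, gompertz))
```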
Now that \(\tau\) has been modeled, the life insurance market can be constructed. Let’s assume that life insurance is offered continuously and that it provides coverage for an infinitesimally small period of time. In return, the consumer pays a premium rate \(p\) when he enters into a life insurance contract so that he can insure his future income. In compensation, he will receive a total benefit of \(\frac{p}{\eta(t)}\) when he dies at time \(t\), where \(\eta : [0,T] \to R^+ \) is a continuous, deterministic function.
Both markets are now described, and the wealth process \(X(t)\) of the consumer can be constructed. Given an initial wealth \(x_0\), the consumer receives income \(i(t)\) \(\forall t \in [0,\tau \wedge T]\), satisfying \(\int_0^{\tau \wedge T} i(u)du < \infty\). At time \(t\) he needs to choose a premium rate \(p(t)\), a consumption rate \(c(t)\) and the amount of wealth \(\theta(t)\) that he invests in the risky asset \(S(t)\). So given the processes \(\theta\), \(c\), \(p\) and \(i\), there is a wealth process \(X(t)\) \(\forall t \in [0, \tau \wedge T] \) determined by
If \(t=\tau\), the consumer receives the insured amount \(\frac{p(t)}{\eta(t)}\). Given his wealth \(X(t)\) at time \(t\), his total legacy will be
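A minimal simulation sketch of the wealth process follows, under the assumption (consistent with the description above and with Ye's formulation) that the dynamics are \(dX(t) = [r X(t) + \theta(t)(\mu - r) + i(t) - c(t) - p(t)]\,dt + \sigma \theta(t)\,dW(t)\) with constant market coefficients, and that the legacy is \(Z(t) = X(t) + p(t)/\eta(t)\); the callables passed in are illustrative placeholders for the strategy and income processes.

```python
import numpy as np

def simulate_wealth(x0, T, n_steps, r, mu, sigma, income, consume, premium, theta, eta, rng=None):
    """Euler-Maruyama sketch of the wealth process X(t) and the legacy Z(t).
    income, consume, premium, theta, eta are callables of time (illustrative names)."""
    rng = rng or np.random.default_rng()
    dt = T / n_steps
    X = np.empty(n_steps + 1)
    X[0] = x0
    for k in range(n_steps):
        t = k * dt
        dW = rng.normal(0.0, np.sqrt(dt))
        drift = r * X[k] + theta(t) * (mu - r) + income(t) - consume(t) - premium(t)
        X[k + 1] = X[k] + drift * dt + sigma * theta(t) * dW
    legacy = X + np.array([premium(k * dt) / eta(k * dt) for k in range(n_steps + 1)])
    return X, legacy
```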
The predicament for the consumer is that he needs to choose the optimal rates \(c\), \(p\) and \(\theta\) from the set \(\mathcal{A}\) of admissible strategies, defined by
such that his expected utility from consumption, from legacy when \(\tau \leq T\) and from terminal wealth when \(\tau > T\) is maximized.
where \(U(c,t)\) is the utility function of consumption, \(B(Z,t)\) is the utility function of legacy and \(L(X)\) is the utility function of terminal wealth. \(V(x)\) is called the value function, and the consumer wants to maximize his value function by choosing the optimal strategy \((c,p,\theta)\) from \(\mathcal{A}\). The optimal strategy is found using the dynamic programming technique described in the following section.
Dynamic programming principle¶
To solve the consumer’s problem the value function needs to be restated in a dynamic programming form.
The value function becomes
Because \(\tau\) is independent of the filtration, the value function can be rewritten as
The optimization problem is now converted from a random closing time point to a fixed closing time point. The mortality rate can also be seen as a discounting function for the consumer, as he weights utility by the probability of survival.
Following the dynamic programming principle, we can rewrite this equation as the value function at time \(s\) plus the value created from time step \(t\) to time step \(s\). This enables us to view the optimization problem in a time-step setting, giving us the incremental value gained at each point in time.
The Hamilton-Jacobi-Bellman (HJB) equation can be derived from the dynamic programming principle and reads as follows
where
Proofs for deriving the HJB equation, the dynamic programming principle and the conversion from a random closing time point to a fixed closing time point can be found in Ye [Ye06].
A strategy is optimal if
The first-order conditions for a regular interior maximum are
The second order conditions are
This optimal control problem has been solved analytically by Ye [Ye06] for the constant relative risk aversion (CRRA) utility function. To solve (38), the Deep BSDE method can be used. The Deep BSDE method was the first deep learning-based numerical algorithm to solve general nonlinear parabolic PDEs in high dimensions.
Remember the general form of PDEs which the Deep BSDE method solves:
with some terminal condition \(u(T,x) = g(x)\). Equation (38) can thus be reformulated in this general form:
The Deep BSDE method can thus be applied.
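A minimal sketch of a Deep BSDE solver for a PDE of this general form is given below (PyTorch, with illustrative network sizes). The callables `mu_fn`, `sigma_fn`, `f_fn` and `g_fn` are user-supplied assumptions encoding the drift, diffusion, nonlinearity and terminal condition of the specific problem, and a scalar or elementwise diffusion is assumed for simplicity. Training minimizes the mismatch with the terminal condition \(u(T,x) = g(x)\); the learned parameter `y0` then approximates the solution value \(u(0, x_0)\).

```python
import torch
import torch.nn as nn

class DeepBSDE(nn.Module):
    """Sketch of the Deep BSDE method for a semilinear parabolic PDE with
    terminal condition u(T, x) = g(x).  g_fn should return shape (batch, 1)."""

    def __init__(self, dim, n_steps, T, mu_fn, sigma_fn, f_fn, g_fn, hidden=32):
        super().__init__()
        self.dim, self.n_steps, self.dt = dim, n_steps, T / n_steps
        self.mu_fn, self.sigma_fn, self.f_fn, self.g_fn = mu_fn, sigma_fn, f_fn, g_fn
        self.y0 = nn.Parameter(torch.zeros(1))        # u(0, x0), learned
        self.z0 = nn.Parameter(torch.zeros(1, dim))   # sigma^T grad u at t = 0, learned
        self.z_nets = nn.ModuleList([                 # Z_t at the interior time steps
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU(),
                          nn.Linear(hidden, dim))
            for _ in range(n_steps - 1)
        ])

    def forward(self, x0, batch_size):
        x = x0.expand(batch_size, self.dim).clone()
        y = self.y0.expand(batch_size, 1)
        z = self.z0.expand(batch_size, self.dim)
        for k in range(self.n_steps):
            t = k * self.dt
            dw = torch.randn(batch_size, self.dim) * self.dt ** 0.5
            # BSDE step: Y_{k+1} = Y_k - f(t, X, Y, Z) dt + Z . dW
            y = y - self.f_fn(t, x, y, z) * self.dt + (z * dw).sum(dim=1, keepdim=True)
            # Forward SDE step: X_{k+1} = X_k + mu dt + sigma dW
            x = x + self.mu_fn(t, x) * self.dt + self.sigma_fn(t, x) * dw
            if k < self.n_steps - 1:
                z = self.z_nets[k](x)
        return y, x

    def loss(self, x0, batch_size):
        y_T, x_T = self.forward(x0, batch_size)
        return ((y_T - self.g_fn(x_T)) ** 2).mean()   # match the terminal condition

# Usage sketch (problem-specific callables and x0 must be supplied):
# model = DeepBSDE(dim=10, n_steps=20, T=1.0, mu_fn=mu_fn, sigma_fn=sigma_fn, f_fn=f_fn, g_fn=g_fn)
# opt = torch.optim.Adam(model.parameters(), lr=1e-2)
# for _ in range(2000):
#     opt.zero_grad(); loss = model.loss(x0, batch_size=64); loss.backward(); opt.step()
```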