What is a gradient flow in the probability space?
Given an energy functional \(\mathcal{E}(\rho)\) on a probability space \(\mathcal{P}(\Omega)\) equipped with a metric \(\mathcal{G}(\rho)\), written \((\mathcal{P}(\Omega), \mathcal{G}(\rho))\), the gradient flow of \(\mathcal{E}\) is the evolution whose rate of change is minus the inverse metric applied to the differential (first variation) of the energy functional \[\begin{equation} \partial_t \rho_t = -\mathcal{G}(\rho_t)^{-1} \frac{\delta \mathcal{E}(\rho_t)}{\delta \rho_t}. \end{equation}\] Here, \(\rho_t\) is the distribution at time \(t\).
Intuitively, this means that the density \(\rho_t\) follows the trajectory of steepest descent on the energy functional \(\mathcal{E}(\rho)\). To define this steepest descent we need a notion of gradient. The gradient, in turn, depends on the selected geometry of the space and is computed according to the selected metric.
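The dependence of the gradient on the metric is easier to see in a finite-dimensional analogue. The sketch below is only an illustration and not part of the probability-space setting: the quadratic energy, the matrix `A`, and the metric `G_other` are hypothetical choices. It computes the gradient as the inverse metric applied to the differential, mirroring the formula above.

```python
import numpy as np

# Finite-dimensional analogue (illustrative assumption): E(x) = 0.5 * x^T A x on R^2.
A = np.array([[3.0, 0.0], [0.0, 1.0]])

def differential(x):
    # Vector of partial derivatives dE/dx; it does not depend on the metric.
    return A @ x

def metric_gradient(x, G):
    # Gradient with respect to the metric G: solve G g = dE, i.e. g = G^{-1} dE.
    return np.linalg.solve(G, differential(x))

x = np.array([1.0, 1.0])
G_euclid = np.eye(2)                             # Euclidean geometry
G_other = np.array([[4.0, 0.0], [0.0, 0.25]])    # hypothetical alternative metric

g_euclid = metric_gradient(x, G_euclid)
g_other = metric_gradient(x, G_other)

# Rate of change of E when moving along -g: dE . (-g). Negative means descent.
slope_euclid = -differential(x) @ g_euclid
slope_other = -differential(x) @ g_other

print("Euclidean gradient:", g_euclid)
print("Gradient under G_other:", g_other)
print("Initial energy slopes:", slope_euclid, slope_other)
```

Both directions decrease the energy, but they point in different directions: changing the metric changes which path counts as steepest.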
If we take the Kullback-Leibler divergence \(D_{KL}\) as the energy functional and the Wasserstein metric \(\mathcal{W}\) as the (information) metric, the resulting gradient flow, known as the Wasserstein gradient flow, is the Fokker-Planck equation.
In this case the inverse metric acts on a function \(\Phi\) as \(\mathcal{G}(\rho_t)^{-1} \Phi = -\nabla \cdot (\rho_t \nabla \Phi)\), and we can derive the Fokker–Planck equation as follows:
\[\begin{align} \partial_t \rho_t &= - \text{grad}^{\mathcal{W}} D_{KL}(\rho_t ||\rho_{ss})\\ &= \nabla \cdot \left( \rho_t \nabla \left( f + \log \rho_t +1 \right) \right)\\ &= \nabla \cdot \left( \rho_t \nabla f\right) + \nabla \cdot \nabla \rho_t\\ &= \nabla \cdot \left( \rho_t \nabla f\right) + \Delta \rho_t. \end{align}\]
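In the above derivation we have assumed that the stationary density is given by \(\rho_{ss} \propto e^{-f}\), so that the differential of the energy is \(\frac{\delta \mathcal{E}(\rho_t)}{\delta \rho_t} = \log \rho_t + f + 1\) up to an additive constant, which the gradient \(\nabla\) removes. The third line uses the identity \(\rho_t \nabla \log \rho_t = \nabla \rho_t\).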
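As a numerical check of the resulting equation \(\partial_t \rho_t = \nabla \cdot (\rho_t \nabla f) + \Delta \rho_t\), the following is a minimal one-dimensional sketch. Everything specific in it is an assumption made for illustration: the potential \(f(x) = x^2/2\), the grid, the initial density, the explicit finite-volume scheme with zero-flux boundaries, and the helper names `simulate_fokker_planck` and `kl_divergence`. It evolves the density and compares the Kullback-Leibler divergence to \(\rho_{ss} \propto e^{-f}\) before and after.

```python
import numpy as np

def simulate_fokker_planck(f_prime, rho0, x, dt, n_steps):
    """Explicit conservative scheme for d/dt rho = d/dx (rho f' + d/dx rho)."""
    dx = x[1] - x[0]
    x_mid = 0.5 * (x[:-1] + x[1:])            # cell interfaces
    rho = rho0.copy()
    for _ in range(n_steps):
        rho_mid = 0.5 * (rho[:-1] + rho[1:])  # density averaged onto interfaces
        flux = rho_mid * f_prime(x_mid) + (rho[1:] - rho[:-1]) / dx
        flux = np.concatenate(([0.0], flux, [0.0]))  # zero flux at both ends
        rho = rho + dt * (flux[1:] - flux[:-1]) / dx
    return rho

def kl_divergence(rho, rho_ref, dx):
    """Riemann-sum approximation of the integral of rho * log(rho / rho_ref)."""
    mask = rho > 0
    return np.sum(rho[mask] * np.log(rho[mask] / rho_ref[mask])) * dx

x = np.linspace(-6.0, 6.0, 241)
dx = x[1] - x[0]

# Stationary density for the assumed potential f(x) = x^2 / 2.
rho_ss = np.exp(-0.5 * x**2)
rho_ss /= rho_ss.sum() * dx

# Assumed initial density: a narrow Gaussian centred at x = 2.
rho0 = np.exp(-0.5 * ((x - 2.0) / 0.5) ** 2)
rho0 /= rho0.sum() * dx

dt = 0.2 * dx**2   # small step, chosen to keep the explicit scheme stable
rho_T = simulate_fokker_planck(lambda z: z, rho0, x, dt, n_steps=10_000)

kl_start = kl_divergence(rho0, rho_ss, dx)
kl_end = kl_divergence(rho_T, rho_ss, dx)
print("KL at start:", kl_start)
print("KL at end:  ", kl_end)
```

Under these assumptions the divergence should shrink toward zero as the density relaxes to \(\rho_{ss}\), which is the steepest-descent behaviour the gradient flow describes. The scheme is chosen for brevity, not accuracy; an implicit or structure-preserving discretization would be preferable for stiff potentials.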
By considering the Benamou-Brenier formulation (Benamou and Brenier 2000; Ambrosio et al. 2003) of the Fokker-Planck dynamics we can obtain a better understanding of how the selected geometry (and metric) of the probability space influences the gradient flow dynamics. According to the Benamou-Brenier formalism, the gradient flow dynamics of the Fokker-Planck equation has the following optimal transport interpretation: it describes a search over all possible vector fields \(v_t\) that transport probability mass from \(\rho_0\) to \(\rho_1\), with the Wasserstein distance capturing the minimum possible cost of this transfer. Given two probability distributions \(\rho_0\) and \(\rho_1\), we define the squared distance as the infimum of the time-integrated squared norm of the vector field \(v_t\) \[\begin{equation} d^2_{OT} (\rho_0, \rho_1) = \inf \limits_{\rho_t, v_t} \int_0^1 \| v_t \|^2_{L^2(\rho_t)} dt = \mathcal{W}^2_2 (\rho_0,\rho_1), \end{equation}\] under the constraint that the transient probability distribution \(\rho_t\) fulfils the continuity equation \[\begin{equation} \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0, \end{equation}\] with the endpoints \(\rho_{t=0} = \rho_0\) and \(\rho_{t=1} = \rho_1\) held fixed. This constraint captures how the probability \(\rho_t\) evolves while being pushed along the time-dependent vector field \(v_t\). The squared Wasserstein distance is the minimal energy cost of performing this transformation from \(\rho_0\) to \(\rho_1\). This defines a metric on probability measures, and consequently it induces a geometry on the space of probabilities. (Here, \(v_t\) is the gradient of the local transport map.)
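The transport cost can be made concrete with particles in one dimension, where the optimal coupling for quadratic cost is the monotone one (matching sorted samples). The sketch below is an illustration under assumptions not taken from the text: two Gaussian endpoint distributions with hypothetical parameters, a finite sample size, and straight-line particle paths, for which the velocity of each particle is constant in time. It compares the resulting kinetic energy with the known closed form for one-dimensional Gaussians, \(\mathcal{W}_2^2 = (m_0 - m_1)^2 + (s_0 - s_1)^2\), and with a non-optimal random pairing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed endpoint distributions: rho_0 = N(m0, s0^2), rho_1 = N(m1, s1^2).
m0, s0 = 0.0, 1.0
m1, s1 = 3.0, 2.0
samples0 = rng.normal(m0, s0, n)
samples1 = rng.normal(m1, s1, n)

# Monotone coupling: pair the k-th smallest sample of rho_0 with the
# k-th smallest sample of rho_1 (optimal in 1D for quadratic cost).
x0 = np.sort(samples0)
x1 = np.sort(samples1)

# Straight-line paths x_t = (1 - t) * x0 + t * x1 have velocity v = x1 - x0,
# constant in t, so the time integral over [0, 1] of the mean squared speed
# reduces to the mean of v**2.
v_optimal = x1 - x0
kinetic_energy = np.mean(v_optimal**2)

# A non-optimal transport plan for comparison: pair the samples at random.
v_random = samples1 - samples0
random_pairing_cost = np.mean(v_random**2)

# Closed form for the squared 2-Wasserstein distance between 1D Gaussians.
w2_squared_closed_form = (m0 - m1) ** 2 + (s0 - s1) ** 2

print("Kinetic energy of monotone transport:", kinetic_energy)
print("Closed-form W_2^2:                   ", w2_squared_closed_form)
print("Cost of random pairing:              ", random_pairing_cost)
```

The monotone transport should approximately reproduce the closed-form value, while the random pairing moves the same mass at a higher cost, illustrating why the distance is defined as an infimum over admissible vector fields.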