In quantum physics, accurately simulating and predicting the behavior of particles is a computationally challenging task due to the curse of dimensionality. The computational complexity grows exponentially as the number of particles in the system increases, making it difficult to study large-scale quantum systems using traditional methods.
Enter Deep Stochastic Mechanics (DSM), a novel approach that leverages deep learning to simulate quantum dynamics efficiently. It is a neural network (NN)-based method that samples directly from the probability density of the wave function, bypassing the need to estimate the wave function itself explicitly.
At the heart of quantum mechanics lies the Schrödinger equation (SE) for $0 < t \le T$ and $\forall x\in \mathbb{R}^d$:
\[i \hbar \partial_{t} \psi (x, t) = \Big[-\frac{\hbar^2}{2m} \frac{\partial^2}{\partial x^2} + V(x, t)\Big] \psi(x, t),\]given an initial condition
\[\psi(x, 0) = \psi_{0}(x),\]where $m$ is the particle’s mass, $V(x, t)$ is a potential function that describes the physics of the system, and $\psi(x, t): \mathbb{R}^d \times [0, T]\rightarrow \mathbb{C}$ is the wave function.
The probability density of finding a particle at position $x$ at time $t$ is
\[\rho(x,t) = |\psi (x, t)|^2.\]One possible solution is to solve the SE directly for $\psi (x, t)$ using, for example, finite difference methods. Another approach is Monte Carlo methods, which rely on random sampling and use a variational ansatz (a parametrized wave function) to approximate the true wave function. Existing methods for solving the time-dependent SE face significant challenges:
What if we could sample directly from the density $\vert \psi (x, t)\vert^2$ without estimating the wave function $\psi(x, t)$?
DSM takes a different approach by leveraging Nelson’s stochastic mechanics [Nelson, 1966], which establishes an equivalence between the time-dependent Schrödinger equation and a diffusion process. Assuming $\psi (x, t) = \sqrt{\rho(x, t)}e^{iS(x, t)}$, we define
\[\begin{align*} \text{ current velocity: } v(x, t) &= \frac{\hbar}{m} \nabla S(x, t), \\ \text{ osmotic velocity: } u(x, t) &= \frac{\hbar}{2m} \nabla \log \rho(x, t). \end{align*}\]Our method relies on the following stochastic process:
\[\mathrm{d}{\color{2D9090}X(t)} = \Big( {\color{982715}v} \big( {\color{2D9090}X(t)}, t \big)+ {\color{982715}u} \big({\color{2D9090}X(t)}, t \big) \Big)\mathrm{d}t + \sqrt{\frac{ \hbar}{m} }\mathrm{d} W, \qquad {\color{2D9090}X(0)} \sim \big|\psi_{0}\big|^2,\]which corresponds to sampling from $\rho = \vert \psi (x, t)\vert^2$. Here $u$ is the osmotic velocity, $v$ is the current velocity, and $W$ is a standard (forward) Wiener process. The process $X(t)$ is called the Nelsonian process.
We parametrize the velocities $u, v$ via NNs, yielding a new process ${\color{2D9090}X^\theta(t)} \in \mathbb{R}^d$ that approximates the true process $X(t)$:
\[\mathrm{d}{\color{2D9090}X^\theta(t)} = \Big({\color{982715}v_{\theta}} \big({\color{2D9090}X^\theta(t)}, t \big)+ {\color{982715}u_{\theta} }\big({\color{2D9090}X^\theta(t)}, t \big) \Big)\mathrm{d}t + \sqrt{\frac{ \hbar}{m} }\mathrm{d} {W}.\]Discretizing in time with the Euler–Maruyama scheme, we get
\[{\color{2D9090}X^\theta_{i+1}} = {\color{2D9090}X^\theta_{i}} + \big({\color{982715}v_{\theta}}({\color{2D9090}X^\theta_{i}}, t_{i})+ {\color{982715}u_{\theta}}({\color{2D9090}X^\theta_{i}}, t_{i}) \big)\epsilon + z,\]where $\epsilon > 0$ is the time step size, $0 \le i < \frac{T}{\epsilon}$, and $z \sim \mathcal{N}\big(0, \frac{\hbar}{m} \epsilon I_{d}\big)$.
Given trained velocities $u_\theta, v_\theta$ and the initial condition $X_0 \sim \vert \psi_{0}\vert^2$, we can produce samples from $\rho$.
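The sampling loop can be sketched in a few lines of NumPy. This is only an illustration, not the DSM implementation: the function name `simulate_nelsonian` is hypothetical, and instead of trained networks it plugs in the exact velocities of the 1D harmonic-oscillator ground state ($u = -\omega x$, $v = 0$ with $\hbar = m = 1$), for which the density should stay a Gaussian with variance $\hbar / (2 m \omega)$.

```python
import numpy as np

def simulate_nelsonian(u, v, x0, T, n_steps, hbar=1.0, m=1.0, rng=None):
    """Euler-Maruyama rollout of dX = (v + u) dt + sqrt(hbar / m) dW."""
    rng = np.random.default_rng() if rng is None else rng
    eps = T / n_steps                      # time step size
    x = np.array(x0, dtype=float)          # shape (n_samples, d)
    for i in range(n_steps):
        t = i * eps
        # z ~ N(0, (hbar / m) * eps * I_d)
        z = rng.normal(scale=np.sqrt(hbar / m * eps), size=x.shape)
        x = x + (v(x, t) + u(x, t)) * eps + z
    return x

# Stand-ins for the trained networks (assumption): exact velocities of the
# 1D harmonic-oscillator ground state with hbar = m = 1.
omega = 1.0
u_exact = lambda x, t: -omega * x
v_exact = lambda x, t: np.zeros_like(x)

rng = np.random.default_rng(0)
# X(0) ~ |psi_0|^2, a Gaussian with variance hbar / (2 m omega) = 0.5
x0 = rng.normal(scale=np.sqrt(0.5 / omega), size=(20000, 1))
samples = simulate_nelsonian(u_exact, v_exact, x0, T=1.0, n_steps=1000, rng=rng)
print(samples.var())  # should remain close to 0.5 for this stationary state
```

In practice `u_exact` and `v_exact` would be replaced by the trained $u_\theta$ and $v_\theta$; the loop itself does not change.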
The Schrödinger equation tells us the velocities should satisfy
\[\begin{align} \partial_{t} v_\theta &= -\frac{1}{m} \nabla V + \langle u_\theta, \nabla u_\theta \rangle - \langle v_\theta, \nabla v_\theta \rangle + \frac{\hbar}{2m} \nabla \big(\text{div } u_\theta \big) &&&& \label{eq1}\ \\ \partial_{t} u_\theta &= - \nabla \langle v_\theta, u_\theta\rangle - \frac{\hbar}{2m} \nabla \big(\text{div } v_\theta \big)&&&& \label{eq2} \end{align}\]where $\nabla = \Big(\frac{\partial}{\partial x_{1}} , \ldots,\frac{\partial}{\partial x_{d}} \Big)$ is the gradient, $\langle \cdot , \cdot \rangle$ is the scalar product, and $\text{div } f(x) = \sum_{i=1}^d \frac{\partial}{\partial x_i}f_i(x)$ is the divergence operator.
Additionally, the initial velocities should follow the initial conditions
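\[v_\theta(x, 0) = \frac{\hbar}{m}\nabla S_0(x) \quad \text{and} \quad u_\theta(x, 0) = \frac{\hbar}{2m} \nabla \log \rho_0(x) \label{eq:ic}\]where $\psi_0 = \sqrt{\rho_0}e^{iS_0}$. Writing $v_0$ and $u_0$ for these initial velocities, equations (\ref{eq1}), (\ref{eq2}), and (\ref{eq:ic}) define three loss terms: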
\[\begin{align} \mathcal{L}_1 (v_{\theta}, u_{\theta}) &= \Big\| \partial_{t} v_\theta +\frac{1}{m} \nabla V - \langle u_\theta, \nabla u_\theta\rangle + \langle v_\theta, \nabla v_\theta\rangle - \frac{\hbar}{2m} \nabla \big(\text{div } u_\theta \big) \Big\|_2, \\ \mathcal{L}_2 (v_{\theta}, u_{\theta}) &= \Big \| \partial_{t} u_\theta + \nabla \langle v_\theta, u_\theta\rangle + \frac{\hbar}{2m} \nabla \big(\text{div } v_\theta \big) \Big \|_2,\\ \mathcal{L}_3 (v_{\theta}, u_{\theta}) &= \| u_\theta (x, 0) - u_0(x) \|_2 + \| v_\theta (x, 0) - v_0(x) \|_2 \end{align}\]Then, our loss function to minimize is
\[\mathcal{L} (v_{\theta}, u_{\theta}) = \sum_{i=1}^3 \mathcal{L}_i (v_{\theta}, u_{\theta}).\]Theorem (Strong convergence bound). We have the following bound between the processes $X$ (the Nelsonian process) and $X^\theta$ (its approximation with $u_\theta, v_\theta$):
where the constant $C_T$ depends on the time horizon $T$ and the Lipschitz constants of $u, v, u_\theta, v_\theta$.
This theorem means that minimizing the loss makes the neural process $X^\theta$ converge to the Nelsonian process $X$, and that a lower loss value translates directly into a smaller error between the two processes.
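As a sanity check of the residuals behind $\mathcal{L}_1$ and $\mathcal{L}_2$, here is a minimal 1D sketch. It makes several assumptions that are not part of DSM: derivatives are taken with central finite differences rather than automatic differentiation, the norm is a root mean square over a fixed grid rather than over sampled trajectories, and the "networks" are replaced by the exact velocities of a coherent state in a harmonic potential ($\hbar = m = \omega = 1$). For the exact velocities both residuals should vanish up to discretization error; a perturbed $u$ should not.

```python
import numpy as np

HBAR, M, OMEGA, X0 = 1.0, 1.0, 1.0, 1.0
H = 1e-3  # finite-difference step (assumption; DSM would use autodiff)

def potential(x, t):
    return 0.5 * M * OMEGA**2 * x**2

# Exact velocities of a coherent state centred at X0 * cos(OMEGA * t).
def u_exact(x, t):
    return -OMEGA * (x - X0 * np.cos(OMEGA * t))

def v_exact(x, t):
    return -OMEGA * X0 * np.sin(OMEGA * t) * np.ones_like(x)

def d_dx(f, x, t):
    return (f(x + H, t) - f(x - H, t)) / (2 * H)

def d_dt(f, x, t):
    return (f(x, t + H) - f(x, t - H)) / (2 * H)

def d2_dx2(f, x, t):
    return (f(x + H, t) - 2 * f(x, t) + f(x - H, t)) / H**2

def residuals(u, v, x, t):
    """1D residuals of the two velocity equations (inside L_1 and L_2)."""
    r1 = (d_dt(v, x, t) + d_dx(potential, x, t) / M
          - u(x, t) * d_dx(u, x, t) + v(x, t) * d_dx(v, x, t)
          - HBAR / (2 * M) * d2_dx2(u, x, t))
    uv = lambda x_, t_: u(x_, t_) * v(x_, t_)
    r2 = d_dt(u, x, t) + d_dx(uv, x, t) + HBAR / (2 * M) * d2_dx2(v, x, t)
    return r1, r2

def rms(r):
    return np.sqrt(np.mean(r**2))

x_grid = np.linspace(-2.0, 2.0, 101)
r1, r2 = residuals(u_exact, v_exact, x_grid, t=0.3)
loss_exact = rms(r1) + rms(r2)

u_bad = lambda x, t: 1.1 * u_exact(x, t)   # deliberately wrong osmotic velocity
r1_bad, r2_bad = residuals(u_bad, v_exact, x_grid, t=0.3)
loss_bad = rms(r1_bad) + rms(r2_bad)

print(loss_exact, loss_bad)
```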
Interacting bosons in a harmonic potential:
\[\begin{align*} V(x, t) = \sum_i \frac{1}{2} m \omega^2 x_i^2 + \frac{1}{2} g \sum_{i, j} \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-(x_i - x_j)^2 / 2 \sigma^2}, \end{align*}\]with an initial condition
\[\begin{align*} \psi(x, 0) = e^{-\omega^2x^2/(2\hbar)}, \end{align*}\]where $g$ controls the interaction strength.
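For concreteness, the potential above can be evaluated as follows. This is a sketch under stated assumptions: each entry of `x` is the 1D coordinate of one particle, the double sum runs over all pairs $(i, j)$ exactly as written (so the $i = j$ terms add a constant that does not affect $\nabla V$), and the default parameter values are placeholders rather than the settings used in the experiments.

```python
import numpy as np

def interacting_bosons_potential(x, m=1.0, omega=1.0, g=1.0, sigma=0.1):
    """Harmonic trap plus pairwise Gaussian interaction for 1D particles."""
    x = np.asarray(x, dtype=float)                 # shape (n_particles,)
    harmonic = 0.5 * m * omega**2 * np.sum(x**2)
    diff = x[:, None] - x[None, :]                 # all pairwise x_i - x_j
    gauss = np.exp(-diff**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    interaction = 0.5 * g * np.sum(gauss)          # sum over all (i, j)
    return harmonic + interaction

# With g = 0 only the harmonic part remains: 0.5 * (1^2 + 2^2) = 2.5
v_free = interacting_bosons_potential([1.0, 2.0], g=0.0)
# Bosons are indistinguishable, so V is symmetric under particle exchange
v_a = interacting_bosons_potential([0.3, -0.2, 0.5])
v_b = interacting_bosons_potential([0.5, 0.3, -0.2])
print(v_free, v_a, v_b)
```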
Let’s try to run the simulation for more particles:
There are more experiments, including scaling studies, in our full DSM paper.
Developed a new, efficient computational method for simulating quantum dynamics based on Nelson’s stochastic mechanics
Adaptive to latent low-dimensional support of density
Since our DSM algorithm is a new approach for simulating quantum dynamics (solving the time-dependent Schrödinger equation), which could be an alternative to t-VMC methods, there are still some challenges to resolve. For example:
Nelson, Edward. “Derivation of the Schrödinger equation from Newtonian mechanics.” Physical Review 150.4 (1966): 1079.
Raissi, Maziar, Paris Perdikaris, and George E. Karniadakis. “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.” Journal of Computational Physics 378 (2019): 686-707.
Carleo, Giuseppe, et al. “Unitary dynamics of strongly interacting Bose gases with the time-dependent variational Monte Carlo method in continuous space.” Physical Review X 7.3 (2017): 031026.
One of the key components of natural language processing (NLP) models is the embedding layer, which transforms input words into real vectors. It can be represented as a lookup table (or a matrix). A large vocabulary leads to enormous weight matrices. State-of-the-art NLP networks are large, with millions to billions of parameters. However, computational resources are often limited, which is an essential problem in NLP research. What can we do about that?
The purpose of tensor decompositions is to represent a given tensor as a product of smaller tensors called cores with fewer parameters while preserving important information.
Tensor decompositions, such as Tucker decomposition, canonical decomposition, and Tensor Train (TT) decomposition (1), can be applied for dimensionality reduction in a variety of tasks, for instance signal and data compression or compression of neural network layers. In the latter case, model parameters are factorized into the smaller cores of the corresponding tensor decomposition. For example, TT decomposition was used to compress a linear layer (2), and this idea was extended to compressing a convolutional layer with canonical decomposition (3). The same holds for Tensor Ring (TR) decomposition (4).
Here, I’d like to show how TT and TR decompositions can be used to compress the embedding layer. $\def\uuX{\underline{\bf X}}$ $\def\uuG{\underline{\bf G}}$ $\newcommand\R{\mathbb{R}}$ $\newcommand\bG{\bf G}$ $\newcommand\bX{\bf X}$ $\newcommand\bU{\bf U}$ $\newcommand\bV{\bf V}$
Suppose we have an $N$th-order tensor $\uuX \in \R^{I_1 \times I_2 \times \dots \times I_N}$. The TT representation of $\uuX$ is given as
\[x_{i_1, i_2, \dots, i_N} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \dots \sum_{r_{N-1}=1}^{R_{N-1}} g^{(1)}_{1, i_1, r_1} \cdot g^{(2)}_{r_1, i_2, r_2} \cdot \dots \cdot g^{(N)}_{r_{N-1}, i_N, 1},\]or, equivalently,
\[x_{i_1, i_2, \dots, i_N} = {\bG}^{(1)}_{i_1} \cdot {\bG}^{(2)}_{i_2} \cdot \ldots \cdot {\bG}^{(N)}_{i_N},\]where the slice matrices are defined as \({\bG}_{i_n}^{(n)} =\) $\uuG^{(n)}(:, i_n, :) \in \mathbb{R}^{R_{n-1} \times R_n}$, $i_n = 1, 2, \dots, I_n$, with ${\bG}_{i_n}^{(n)}$ being the $i_n$th lateral slice of the core tensor $\uuG^{(n)} \in \mathbb{R}^{R_{n-1}\times I_n \times R_n}$, $n=1, 2, \dots,N$, and $R_0 = R_N = 1$ by definition.
The key idea of the TT decomposition is demonstrated in the next figure. The minimal values of $\{R_k\}_{k=1}^{N-1}$ for which the TT decomposition exists are called TT-ranks.
The total number of parameters in the TT decomposition is $\sum_{k=1}^N R_{k-1} I_k R_{k}$. Hence, if the core tensors have small ranks, the total number of elements required to represent a given tensor in the TT-format is significantly smaller than the number of elements in the full tensor, $\prod_{k=1}^N I_k$. This makes the TT decomposition appealing for many problems involving extremely large data.
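To make the TT format and its parameter count concrete, here is a small NumPy sketch with random cores. The dimensions and ranks are arbitrary illustrative choices, and the helper names `tt_element` and `tt_full` are hypothetical. It evaluates one entry as a product of slice matrices, checks it against the fully contracted tensor, and compares the two storage costs.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = (4, 5, 6)          # I_1, I_2, I_3
ranks = (1, 3, 2, 1)      # R_0, R_1, R_2, R_3 with R_0 = R_N = 1

# Core n has shape (R_{n-1}, I_n, R_n).
cores = [rng.normal(size=(ranks[n], dims[n], ranks[n + 1]))
         for n in range(len(dims))]

def tt_element(cores, index):
    """Entry x_{i_1..i_N} as a product of slice matrices G^{(n)}_{i_n}."""
    out = np.eye(1)
    for core, i in zip(cores, index):
        out = out @ core[:, i, :]          # (1, R_{n-1}) @ (R_{n-1}, R_n)
    return out[0, 0]

def tt_full(cores):
    """Contract all cores over the shared rank indices into the full tensor."""
    full = cores[0]                        # (1, I_1, R_1)
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full[0, ..., 0]                 # drop the boundary ranks R_0, R_N

full = tt_full(cores)
tt_params = sum(core.size for core in cores)   # sum_k R_{k-1} I_k R_k
full_params = full.size                        # prod_k I_k

print(full.shape, tt_params, full_params)
print(np.isclose(full[1, 2, 3], tt_element(cores, (1, 2, 3))))
```

For tensors this small the saving is modest; it becomes significant when the mode sizes are large and the ranks stay small.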
The tensor ring format of a tensor $\uuX \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is defined as
\[x_{i_1, i_2, \dots, i_N} = \text{Trace}\left( \bG^{(1)}_{i_1} \cdot \ldots \cdot \bG^{(N)}_{i_N} \right),\]or in index-form
\[x_{i_1, i_2, \dots, i_N} = \sum_{r_0 = 1 }^{R_{0}} \cdots \sum_{r_{N-1} = 1 }^{R_{N-1}} g^{(1)}_{r_0, i_1, r_1} \cdot \ldots \cdot g^{(N)}_{r_{N-1}, i_N, r_0},\]where \({\bG}^{(n)}_{i_n}\) is an $i_n$th slice matrix of a tensor $\uuG^{(n)}$ $\in \R^{R_{n-1}\times I_n \times R_n}$. The last latent tensor $\uuG^{(N)}$ is of size $R_{N-1} \times I_N \times R_0$, i.e., $R_{N} = R_0$.
The TR-format can be seen as a natural extension of the TT decomposition, which is recovered as the special case $R_0=R_N=1$. An illustration of the TR-format is given in the next figure.
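The trace formula is easy to check numerically. The sketch below uses random cores with arbitrary illustrative shapes (the helper names are hypothetical). It shows two properties: a cyclic shift of the cores and indices leaves the entry unchanged, because the trace is cyclic, and with $R_0 = R_N = 1$ the trace of a $1 \times 1$ matrix reduces to the TT product.

```python
import numpy as np

def tr_element(cores, index):
    """Entry x_{i_1..i_N} = Trace(G^{(1)}_{i_1} ... G^{(N)}_{i_N})."""
    prod = np.eye(cores[0].shape[0])           # R_0 x R_0 identity
    for core, i in zip(cores, index):
        prod = prod @ core[:, i, :]
    return np.trace(prod)

def random_cores(dims, ranks, rng):
    """Cores of shape (R_{n-1}, I_n, R_n) for n = 1..N."""
    return [rng.normal(size=(ranks[n], dims[n], ranks[n + 1]))
            for n in range(len(dims))]

rng = np.random.default_rng(0)
dims = (4, 5, 6)
index = (1, 2, 3)

# Tensor ring: the last rank closes the ring, R_N = R_0 = 2.
tr_cores = random_cores(dims, (2, 3, 2, 2), rng)
x_tr = tr_element(tr_cores, index)
# Cyclic shift of cores and indices gives the same scalar.
x_shifted = tr_element(tr_cores[1:] + tr_cores[:1], index[1:] + index[:1])

# Tensor train as the special case R_0 = R_N = 1.
tt_cores = random_cores(dims, (1, 3, 2, 1), rng)
x_tt = (tt_cores[0][:, 1, :] @ tt_cores[1][:, 2, :] @ tt_cores[2][:, 3, :])[0, 0]
x_tt_as_tr = tr_element(tt_cores, index)

print(np.isclose(x_tr, x_shifted), np.isclose(x_tt, x_tt_as_tr))
```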
However, the TR-format is known to have theoretical drawbacks compared to TT decomposition (5). For example, it was found that in case of TR decomposition, minimal TR-ranks for a tensor need not be unique (6) (not even up to permutation of the indices $i_1, \dots , i_N$), resulting in problems in their estimation. On the other hand, numerical experiments show that the TR-format leads to lower ranks of the core tensors compared to the TT-format (7), which means higher compression ratios and lower storage costs.
We aim to replace a regular embedding matrix with a more compact, yet powerful and trainable, format which would allow us to efficiently transform input words into vector representations.
Let $\bX \in \mathbb{R}^{I \times J}$ be a matrix of size $I \times J$. The goal is to factor its dimensions into natural numbers, $I = \prod_{n=1}^N I_n$ and $J = \prod_{n=1}^N J_n$, and then reshape the matrix into an $N$th-order tensor $\uuX \in \mathbb{R}^{I_1 J_1 \times I_2 J_2 \times \dots \times I_N J_N}$ whose $n$-th dimension has length $I_n J_n$ and is indexed by the tuple $(i_n , j_n)$. We can also treat this procedure as a bijection that maps rows and columns of the original matrix to $N$-dimensional vector indices. Then the TT decomposition defined above is applied to this tensor to get a compact representation:
\[\uuX((i_1, j_1), (i_2, j_2), \dots, (i_N, j_N)) = \uuG^{(1)}((i_1, j_1), :) \ldots \uuG^{(N)}(:, (i_N, j_N)).\]The described representation of a matrix in the TT-format is called a TT-matrix. The obtained factorizations $(I_1, I_2, \dots I_N ) \times (J_1,J_2, \dots J_N)$ will be treated as the shapes of a TT-matrix, or TT-shapes. The idea of constructing a TT-matrix from a given matrix is shown in the next figure for a 3-dimensional tensor.
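The reshaping step and the index bijection can be sketched with NumPy as follows. The TT-shapes here are small illustrative values, and the row-major (C-order) convention for splitting $i$ into $(i_1, \dots, i_N)$ is an assumption; any fixed bijection would work as long as it is used consistently.

```python
import numpy as np

I_shape, J_shape = (2, 3, 4), (3, 2, 2)        # I = 24, J = 12
N = len(I_shape)
I, J = int(np.prod(I_shape)), int(np.prod(J_shape))
X = np.arange(I * J, dtype=float).reshape(I, J)

# Split rows into (i_1..i_N) and columns into (j_1..j_N), interleave the axes
# so that i_n sits next to j_n, then merge each (i_n, j_n) pair into one mode.
T = X.reshape(*I_shape, *J_shape)
perm = [axis for n in range(N) for axis in (n, N + n)]
T = T.transpose(perm).reshape([I_shape[n] * J_shape[n] for n in range(N)])

def merged_index(i, j):
    """Map matrix position (i, j) to the tensor index ((i_1, j_1), ..., (i_N, j_N))."""
    i_multi = np.unravel_index(i, I_shape)
    j_multi = np.unravel_index(j, J_shape)
    return tuple(int(i_n * J_n + j_n)
                 for i_n, j_n, J_n in zip(i_multi, j_multi, J_shape))

i, j = 17, 7
print(T.shape)                                  # modes of length I_n * J_n
print(X[i, j] == T[merged_index(i, j)])
```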
Similarly, we can define a TR-matrix by reshaping a given matrix $\bX$ into a tensor $\uuX \in \mathbb{R}^{I_1 J_1 \times I_2 J_2 \times \dots \times I_N J_N}$:
\[\uuX((i_1, j_1), (i_2, j_2), \dots, (i_N, j_N)) = \text{Trace}(\uuG^{(1)}(:, (i_1, j_1), :) \ldots \uuG^{(N)}(:, (i_N, j_N), :)).\]The concept of building a TR-matrix from a given matrix is shown in the next figure for a 3-dimensional tensor.
Now we can introduce a concept of a tensorized embedding layer:
A TT/TR-embedding layer is a layer whose trainable parameters are TT/TR-cores; together they represent a TT/TR-matrix, which can be transformed into an embedding matrix $\bX \in \mathbb{R}^{I \times J}$. The ranks must be set in advance to define the core sizes, so they are hyperparameters of the layer. The rank values are crucially important since they determine and control the compression ratio.
To obtain the embedding for a specific word indexed $i$ in the vocabulary, we transform the row index $i$ into an $N$-dimensional vector index $(i_1, \dots, i_N)$ and compute the components of the TT or TR embedding. Note that evaluating all its components amounts to selecting the specific slices and running a sequence of matrix multiplications, which is implemented efficiently in modern linear algebra libraries.
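A minimal sketch of the lookup for the TT case is shown below. It is not the released implementation: the function names are hypothetical, the cores are random rather than trained, each core is stored with the $(i_n, j_n)$ mode kept as two separate axes, i.e. with shape $(R_{n-1}, I_n, J_n, R_n)$, and the row index is split in row-major order. The result is checked against the corresponding row of the fully reconstructed matrix.

```python
import numpy as np

def tt_embedding_row(cores, I_shape, i):
    """Embedding of word i from TT-matrix cores of shape (R_{n-1}, I_n, J_n, R_n)."""
    multi_index = np.unravel_index(i, I_shape)      # i -> (i_1, ..., i_N)
    row = np.ones((1, 1))                           # (J_1 * ... * J_{n-1}, R_{n-1})
    for core, i_n in zip(cores, multi_index):
        slice_n = core[:, i_n, :, :]                # (R_{n-1}, J_n, R_n)
        r_prev, j_n, r_next = slice_n.shape
        row = row @ slice_n.reshape(r_prev, j_n * r_next)
        row = row.reshape(-1, r_next)               # fold J_n into the row axis
    return row[:, 0]                                # R_N = 1, length J

def tt_matrix_full(cores, I_shape, J_shape):
    """Reconstruct the full I x J matrix (for checking only)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    full = full[0, ..., 0]                          # (I_1, J_1, ..., I_N, J_N)
    n_modes = len(I_shape)
    perm = list(range(0, 2 * n_modes, 2)) + list(range(1, 2 * n_modes, 2))
    return full.transpose(perm).reshape(int(np.prod(I_shape)), int(np.prod(J_shape)))

rng = np.random.default_rng(0)
I_shape, J_shape = (2, 3, 4), (3, 2, 2)             # illustrative TT-shapes
ranks = (1, 4, 4, 1)                                # hyperparameters of the layer
cores = [rng.normal(size=(ranks[n], I_shape[n], J_shape[n], ranks[n + 1]))
         for n in range(len(I_shape))]

embedding = tt_embedding_row(cores, I_shape, 17)
reference = tt_matrix_full(cores, I_shape, J_shape)[17]
print(embedding.shape, np.allclose(embedding, reference))
```

The lookup never materializes the full matrix; `tt_matrix_full` is included only to verify the result.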
Let me show results on a simple task: sentiment analysis. Sentiment analysis refers to predicting the polarity of a sentence.
The proposed approach is compared with the following baselines:
We test our approach on popular datasets such as the IMDB dataset with two classes, and the Stanford Sentiment Treebank (SST) with five classes. Our model consists of a standard bidirectional two-layer LSTM with a hidden size of 128 and a dropout rate of 0.5. For the embedding layer, we used the most frequent 25,000 words for IMDB and 17,200 for SST, and transformed them into a J-dimensional space with a regular embedding layer or a TT/TR embedding layer.
The results of our experiments reveal that the models with the compressed embedding layer performed similarly or even better than the models with standard embedding layers. For example, on the IMDB dataset, the TT embedding layer with a rank of 16 and a test accuracy of 89.7% outperformed our baseline model with a test accuracy of 88.6%. Furthermore, the compressed model had significantly fewer parameters than the full model (7.19 million vs less than a million). Similarly, on the SST dataset, the model with the TR-embedding layer outperformed both the model with the regular embedding layer and the TT layer. In the case of matrix low-rank factorization, we would obtain compression ratios $\frac{J}{R} = \frac{256}{8} =32$ or $\frac{256}{16}= 16$ which are definitely worse compared to tensor factorization techniques.
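The gap between matrix and tensor factorization can be illustrated with a quick parameter count. The TT-shapes below are hypothetical (chosen only because $25 \cdot 25 \cdot 40 = 25{,}000$ and $4 \cdot 8 \cdot 8 = 256$) and are not necessarily the ones used in the experiments; the count covers the embedding layer alone, not the whole model.

```python
# Hypothetical TT-shapes for a 25,000 x 256 embedding matrix.
I_shape, J_shape = (25, 25, 40), (4, 8, 8)
I, J, R = 25_000, 256, 16

full_params = I * J                                  # 6,400,000

# TT-matrix: sum_n R_{n-1} * (I_n * J_n) * R_n with R_0 = R_N = 1.
ranks = (1, R, R, 1)
tt_params = sum(ranks[n] * I_shape[n] * J_shape[n] * ranks[n + 1]
                for n in range(len(I_shape)))

# Rank-R matrix factorization X = U V with U: I x R and V: R x J.
low_rank_params = R * (I + J)

tt_ratio = full_params / tt_params
low_rank_ratio = full_params / low_rank_params       # close to J / R when I >> J

print(tt_params, low_rank_params)
print(round(tt_ratio, 1), round(low_rank_ratio, 1))
```

With these assumed shapes the TT-matrix compresses the layer by roughly two orders of magnitude, while the low-rank matrix factorization stays near $J / R = 16$.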
The slightly better test accuracy of the models with tensorized embedding layers suggests that imposing a specific tensorial low-rank structure on the embedding matrix can be viewed as a form of regularization, so the model potentially generalizes better.
To conclude, TT and TR decompositions can be used to compress neural networks. We use them to compress embedding layers in NLP models. This method can be easily integrated into any deep learning framework and trained via backpropagation, while capitalizing on reduced memory requirements and increased training batch size. More details can be found in the paper and code is available here.