Yao Fu
University of Edinburgh
Feb 01 2022
The Structured State Space for Sequence Modeling (S4) model achieves impressive results on the Long Range Arena benchmark, with a substantial margin over previous methods. However, it is written in the language of control theory, ordinary differential equations, function approximation, and matrix decomposition, which is hard to follow for a large portion of researchers and engineers with a computer science background. This post aims to explain the math intuitively, providing an approximate feeling for, and understanding of, the S4 model: Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022
We are interested in the core question:
<aside> 💡 Why is S4 good at modeling long sequences, and where does its magic matrix $A$ come from?
</aside>
Skipping all the math, the most straightforward answer is:
<aside> 💡 Because it uses $A$ to remember the entire history of the sequence.
</aside>
By “remember all the history” (a phrase that appears in multiple places in the paper), we actually mean:
<aside> 💡 Because it encodes the entire history of the sequence into a hidden state (say, a 500-dimensional vector). From this vector alone, we can literally reconstruct the full input sequence (even if the sequence is long, say a length-16000 array).
</aside>
Or to be more concise, we say:
<aside> 💡 With S4, we can use a 500-dimensional hidden state to reconstruct the length-16000 input sequence.
</aside>
Or similarly:
<aside> 💡 With S4, we compress the length-16000 input sequence into a 500-dimensional hidden state.
</aside>
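To make the shapes in this claim concrete, here is a minimal NumPy sketch of the underlying discrete state-space recurrence $x_k = A x_{k-1} + B u_k$. Note the hedges: the dimensions (500 and 16000) are just the illustrative numbers from above, and the random $A$ and $B$ here are placeholders — the actual S4 model uses the structured HiPPO matrix for $A$, which is what makes reconstruction of the history possible. A random $A$ only demonstrates that the whole sequence passes through a single fixed-size vector.

```python
import numpy as np

N, L = 500, 16000  # hidden state size and sequence length (illustrative numbers)
rng = np.random.default_rng(0)

# Placeholder state matrices. S4 uses the structured HiPPO matrix for A;
# a random A (scaled down for stability) only illustrates the shapes.
A = rng.standard_normal((N, N)) / N
B = rng.standard_normal(N)
u = rng.standard_normal(L)  # the length-16000 input sequence

x = np.zeros(N)  # the 500-dimensional hidden state
for k in range(L):
    x = A @ x + B * u[k]  # x_k = A x_{k-1} + B u_k

# Every one of the 16000 input steps has flowed through x;
# the entire history is summarized in this single 500-dim vector.
print(x.shape)
```

The point of the recurrence is that the state never grows with the sequence: whether the input has 16 steps or 16000, everything the model knows about the past lives in the same 500-dimensional vector $x$.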