Yao Fu
University of Edinburgh
Feb 01 2022
The Structured State Space for Sequence Modeling (S4) model achieves impressive results on the Long Range Arena benchmark, with a substantial margin over previous methods. However, it is written in the language of control theory, ordinary differential equations, function approximation, and matrix decomposition, which is hard to follow for a large portion of researchers and engineers with a computer science background. This post aims to explain the math intuitively, providing an approximate feeling for, and understanding of, the S4 model: Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022
We are interested in the core question:
<aside> 💡 Why is S4 good at modeling long sequences, and where does its magic matrix $A$ come from?
</aside>
Skipping all the math, the most straightforward answer is:
<aside> 💡 Because it uses $A$ to remember the entire history of the sequence.
</aside>
By “remember all the history” (a phrase that appears in multiple places in the paper), we actually mean:
<aside> 💡 Because it encodes the entire history of the sequence into a hidden state (say, a 500-dimensional vector). From this vector alone, we can literally reconstruct the full input sequence (even if the sequence is long, say a length-16000 array).
</aside>
Or to be more concise, we say:
<aside> 💡 With S4, we can use a 500-dimensional hidden state to reconstruct the length-16000 input sequence.
</aside>
Or similarly:
<aside> 💡 With S4, we compress the length-16000 input sequence into a 500-dimensional hidden state.
</aside>
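To make the shapes in this claim concrete, here is a minimal NumPy sketch of the underlying discrete state-space recurrence $x_k = A x_{k-1} + B u_k$. Note the hedges: the dimensions (500 and 16000) are just the illustrative numbers from above, and the random $A$ and $B$ here are placeholders — the actual S4 model uses the structured HiPPO matrix for $A$, which is what makes reconstruction of the history possible. A random $A$ only demonstrates that the whole sequence passes through a single fixed-size vector.

```python
import numpy as np

N, L = 500, 16000  # hidden state size and sequence length (illustrative numbers)
rng = np.random.default_rng(0)

# Placeholder state matrices. S4 uses the structured HiPPO matrix for A;
# a random A (scaled down for stability) only illustrates the shapes.
A = rng.standard_normal((N, N)) / N
B = rng.standard_normal(N)
u = rng.standard_normal(L)  # the length-16000 input sequence

x = np.zeros(N)  # the 500-dimensional hidden state
for k in range(L):
    x = A @ x + B * u[k]  # x_k = A x_{k-1} + B u_k

# Every one of the 16000 input steps has flowed through x;
# the entire history is summarized in this single 500-dim vector.
print(x.shape)
```

The point of the recurrence is that the state never grows with the sequence: whether the input has 16 steps or 16000, everything the model knows about the past lives in the same 500-dimensional vector $x$.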