Perverformer Scat !new!

| # | Paper | Year | Core Contribution | Link | |---|-------|------|-------------------|------| | 1 | (Zaheer et al. ) | 2022 | Proposes a block‑sparse + sliding‑window pattern that scales to millions of tokens, with a provable bound on the number of attended positions per token. | https://arxiv.org/abs/2205.14135 | | 2 | Longformer‑SCAT: Combining Longformer’s Dilated Sliding Window with SCAT’s Global Tokens (Beltagy et al. ) – extension | 2023 | Shows how to augment the Longformer pattern with a few global tokens, yielding a hybrid that matches SCAT’s theoretical guarantees while being easy to plug into HuggingFace. | https://arxiv.org/abs/2301.09475 | | 3 | Efficient Transformers via Structured Convolutional Attention (SCAT) (Wang et al. ) | 2024 | Re‑interprets the sparse pattern as a 1‑D convolution , enabling a single CUDA kernel that is 2‑3× faster than vanilla sparse‑attention implementations. | https://arxiv.org/abs/2403.01812 |

def forward(self, x): # 1️⃣ Performer (linear) on the whole sequence x = self.performer(x) + x perverformer scat

Some notable scat singers include:

If you want to prototype right away, the following minimal PyTorch snippet works with the performer-pytorch library and the torch-sparse-attention package (both pip‑installable). | # | Paper | Year | Core