This project was expanded from the final research project of my Machine Learning course at NYU.
ECE-GY 6143 Machine Learning was taught by Fraida Fund, and almost all of the course materials are published online (find them here). Let's say thank you, F.F.!
In this project I created a Colab notebook (as required by the course) that gives a thorough review of MaxViT and most of the background it builds on.
It's worth mentioning that I attached detailed PyTorch implementations of almost everything mentioned. Footnotes and citations are also appended at the end.
I hope this project can serve as a quick tutorial for someone who wants to learn about Convolutional Neural Networks and Vision Transformers.
Find the online Colab notebook below!
Make a copy of your own to run the cells and enjoy the trip!
Below is the TOC of the Colab notebook:
- Overview of Topic
  - MaxViT
    - Multi-axis Attention (see the sketch after this TOC)
      - Block Attention
      - Grid Attention
    - Other techniques used in MaxViT
      - MBConv and SE module
      - Relative Attention
  - Hybrid Vision Models
    - Why Hybrid?
    - Other Hybrid Models
- Prerequisite Knowledge
  - (Vanilla) Transformer
    - From Scaled Dot-Product Attention (SDPA) to MHSA
      - Scaled Dot-Product Attention (see the sketch after this TOC)
      - Multi-Head Attention
      - Multi-Head Self-Attention
    - FFN
    - Residual Connection and LayerNorm
    - Positional Encodings
    - Encoder vs. Decoder
      - Masked (Causal) Self-Attention
  - Vision Transformer (ViT)
    - From Language to Vision
    - Adaptation from Vanilla Transformer
  - Design of CNNs
    - Depthwise Separable Convolution (see the sketch after this TOC)
    - Bottleneck vs. Inverted Bottleneck
    - Macro Design in CNNs
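
As a small taste of what the notebook covers, here are a few minimal PyTorch sketches. First, the partitioning at the heart of MaxViT's multi-axis attention: block attention runs self-attention inside non-overlapping P × P windows (local mixing), while grid attention runs it across a strided G × G grid (sparse global mixing). This sketch shows only the token-grouping step; the helper names and the assumption that H and W divide evenly by P and G are mine, and the notebook contains the detailed version.

```python
import torch

def block_partition(x: torch.Tensor, p: int) -> torch.Tensor:
    # Block attention: group tokens into non-overlapping P x P windows.
    # (B, H, W, C) -> (B * H/P * W/P, P*P, C); self-attention then runs per window.
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)

def grid_partition(x: torch.Tensor, g: int) -> torch.Tensor:
    # Grid attention: group tokens lying on the same strided G x G grid.
    # (B, H, W, C) -> (B * H/G * W/G, G*G, C); self-attention then mixes tokens
    # that are far apart spatially, giving a sparse form of global attention.
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, C)

# Sanity check: an 8x8 feature map with P = G = 4 yields 4 groups of 16 tokens each.
x = torch.randn(1, 8, 8, 32)
assert block_partition(x, 4).shape == (4, 16, 32)
assert grid_partition(x, 4).shape == (4, 16, 32)
```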
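
Next, scaled dot-product attention, the primitive everything in the Transformer section builds on: softmax(QKᵀ / √d_k) · V. Below is a minimal sketch; the optional `mask` argument is my addition, to foreshadow the masked (causal) self-attention subsection.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (..., seq_len, d_k). Scale by sqrt(d_k) so the logits keep
    # unit variance and the softmax does not saturate for large d_k.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # Positions where mask == 0 receive ~zero attention weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Multi-head attention is then just this function applied in parallel to h learned projections of Q, K, and V.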
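
Finally, a depthwise separable convolution, the factorization MBConv builds on: a per-channel (depthwise) convolution handles spatial mixing, and a 1×1 (pointwise) convolution handles channel mixing, cutting the per-pixel cost from about k²·Cin·Cout to k²·Cin + Cin·Cout. A minimal sketch, with an illustrative class name and a 3×3 kernel:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise: groups=in_ch gives each input channel its own 3x3 filter.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: a 1x1 conv mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# A 3x3 separable conv from 32 to 64 channels on a 56x56 feature map.
y = DepthwiseSeparableConv(32, 64)(torch.randn(1, 32, 56, 56))
assert y.shape == (1, 64, 56, 56)
```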