This project was expanded from the final research project of my Machine Learning course at NYU.
ECE-GY 6143 Machine Learning was taught by Fraida Fund, and almost all of the course materials are published online (find them here). Let's say thank you, F.F.!
In this project I created a Colab notebook (as required by the course) that gives a thorough review of MaxViT and most of the background it builds on.
It's worth mentioning that I attached detailed PyTorch implementations of almost everything mentioned. Footnotes and citations are also appended at the end.
I hope this project can serve as a quick tutorial for someone who wants to learn about Convolutional Neural Networks and Vision Transformers.
Find the online Colab notebook below!
Make a copy of your own to run the cells and enjoy the trip!
Below is the TOC of the Colab notebook:
- Overview of Topic
  - MaxViT
    - Multi-axis Attention (see the sketch after this TOC)
      - Block Attention
      - Grid Attention
    - Other techniques used in MaxViT
      - MBConv and SE module
      - Relative Attention
  - Hybrid Vision Models
    - Why Hybrid?
    - Other Hybrid Models
- Prerequisite Knowledge
  - (Vanilla) Transformer
    - From Scaled Dot-Product Attention (SDPA) to MHSA
      - Scaled Dot-Product Attention (see the sketch after this TOC)
      - Multi-Head Attention
      - Multi-Head Self-Attention
    - FFN
    - Residual Connection and LayerNorm
    - Positional Encodings
    - Encoder vs. Decoder
      - Masked (Causal) Self-Attention
  - Vision Transformer (ViT)
    - From Language to Vision
    - Adaptation from Vanilla Transformer
  - Design of CNNs
    - Depthwise Separable Convolution (see the sketch after this TOC)
    - Bottleneck vs. Inverted Bottleneck
    - Macro Design in CNNs
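
As a small taste of what the notebook covers, here are a few minimal PyTorch sketches. First, the partitioning at the heart of MaxViT's multi-axis attention: block attention runs self-attention inside non-overlapping P × P windows (local mixing), while grid attention runs it across a strided G × G grid (sparse global mixing). This sketch shows only the token-grouping step; the helper names and the assumption that H and W divide evenly by P and G are mine, and the notebook contains the detailed version.

```python
import torch

def block_partition(x: torch.Tensor, p: int) -> torch.Tensor:
    # Block attention: group tokens into non-overlapping P x P windows.
    # (B, H, W, C) -> (B * H/P * W/P, P*P, C); self-attention then runs per window.
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)

def grid_partition(x: torch.Tensor, g: int) -> torch.Tensor:
    # Grid attention: group tokens lying on the same strided G x G grid.
    # (B, H, W, C) -> (B * H/G * W/G, G*G, C); self-attention then mixes tokens
    # that are far apart spatially, giving a sparse form of global attention.
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, C)

# Sanity check: an 8x8 feature map with P = G = 4 yields 4 groups of 16 tokens each.
x = torch.randn(1, 8, 8, 32)
assert block_partition(x, 4).shape == (4, 16, 32)
assert grid_partition(x, 4).shape == (4, 16, 32)
```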
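
Next, scaled dot-product attention, the primitive everything in the Transformer section builds on: softmax(QKᵀ / √d_k) · V. Below is a minimal sketch; the optional `mask` argument is my addition, to foreshadow the masked (causal) self-attention subsection.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (..., seq_len, d_k). Scale by sqrt(d_k) so the logits keep
    # unit variance and the softmax does not saturate for large d_k.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # Positions where mask == 0 receive ~zero attention weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Multi-head attention is then just this function applied in parallel to h learned projections of Q, K, and V.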
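
Finally, a depthwise separable convolution, the factorization MBConv builds on: a per-channel (depthwise) convolution handles spatial mixing, and a 1×1 (pointwise) convolution handles channel mixing, cutting the per-pixel cost from about k²·Cin·Cout to k²·Cin + Cin·Cout. A minimal sketch, with an illustrative class name and a 3×3 kernel:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise: groups=in_ch gives each input channel its own 3x3 filter.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: a 1x1 conv mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# A 3x3 separable conv from 32 to 64 channels on a 56x56 feature map.
y = DepthwiseSeparableConv(32, 64)(torch.randn(1, 32, 56, 56))
assert y.shape == (1, 64, 56, 56)
```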