Re-learn CNN and ViT from MaxViT

Published: December 20, 2023
Tags: Python, Deep Learning
This project was expanded from the final research project of my Machine Learning course at NYU.
ECE-GY 6143 ML was taught by Fraida Fund, and almost all of the course materials are published online (find them here). Let's say thank you to F.F.!
 
In this project I created a Colab notebook (as required by the course) that gives a thorough review of MaxViT and almost all of the related background knowledge.
It's worth mentioning that I attached detailed PyTorch implementations of almost everything discussed. Footnotes and citations are also appended at the end.
I hope this project can serve as a quick tutorial for someone who wants to learn about Convolutional Neural Networks and Vision Transformers.
 
Find the online Colab notebook below!
Make a copy of your own to run the cells and have a great trip!
 
Below is the TOC of the Colab notebook (two small PyTorch sketches, for scaled dot-product attention and depthwise separable convolution, follow after the list):
  • Overview of Topic
    • MaxViT
      • Multi-axis Attention
        • Block Attention
        • Grid Attention
      • Other techniques used in MaxViT
        • MBConv and SE module
        • Relative Attention
    • Hybrid Vision Models
      • Why Hybrid?
      • Other Hybrid Models
  • Prerequisite Knowledge
    • (Vanilla) Transformer
      • From Scaled Dot-Product Attention (SDPA) to MHSA
        • Scaled Dot-Product Attention
        • Multi-Head Attention
        • Multi-Head Self-Attention
      • FFN
      • Residual Connection and LayerNorm
      • Positional Encodings
      • Encoder vs. Decoder
        • Masked (Causal) Self-Attention
    • Vision Transformer (ViT)
      • From Language to Vision
      • Adaptation from Vanilla Transformer
    • Design of CNNs
      • Depthwise Separable Convolution
      • Bottleneck vs. Inverted Bottleneck
      • Macro Design in CNNs
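
As a small taste of what the notebook covers, here is a minimal sketch of scaled dot-product attention in PyTorch. This is my own toy version written for this post (the function and variable names are mine, not taken from the notebook), so treat it as an illustration of the formula rather than the notebook's implementation:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    # Similarity score between every query and every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Masked (causal) self-attention: block attention to masked positions
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

# Toy usage: a batch of 2 sequences, 5 tokens each, 16 dims per head
q = k = v = torch.randn(2, 5, 16)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])
```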
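Likewise, here is a minimal sketch of a depthwise separable convolution, again my own toy code (class and argument names are mine); it only illustrates the depthwise-then-pointwise idea, while the notebook's version is more complete:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A per-channel (depthwise) spatial convolution followed by a
    1x1 (pointwise) convolution that mixes information across channels."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # groups=in_ch makes each filter see only its own input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Toy usage: (batch, channels, height, width)
x = torch.randn(1, 32, 56, 56)
layer = DepthwiseSeparableConv(32, 64)
print(layer(x).shape)  # torch.Size([1, 64, 56, 56])
```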