An Image is Worth 16 x 16 Words: Transformers for Image Recognition at Scale | Notion

Transformer를 vision task에 어떻게 적용할 수 있을까

1. Introduction

Transformers

Untitled

transformer는 자연어 처리(NLP)에서 각광받고 있는 model이다.

attention revisited

Untitled

Untitled

In NLP, transformer has become model of the choice

In computer vision however.. CNNs remain dominant

combining CNN-like architecture with self-attention
replacing the convolutions entirely → 이론적으로는 efficient하지만 modern hardware accelerator에 적합하지 않다는 단점이 있음

In this paper:

standard transformer를 image에 direct하게 적용하고자 한다. 이미지를 patch 단위로 자르고 patch에 해당하는 linear embedding을 생성하여 (마치 NLP에서 token(sub-word)에 하는 것처럼) transformer에 input으로 넣어준다.
Image classification task에 대해 supervised fashion으로 model을 train했다.