Intro

BERT is a language model pre-trained on unlabeled text data, namely English Wikipedia (2.5 billion words) and BooksCorpus (800 million words), and it is built from the Transformer architecture covered in the previous presentation.

BERT achieves high performance because a model pre-trained on vast amounts of unlabeled data is then further trained on other, labeled tasks, readjusting its parameters; earlier work had already shown that this approach performs well. This additional training step, which readjusts the pre-trained parameters for another task, is called fine-tuning.
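The pre-train-then-fine-tune idea can be sketched with a toy example: a hypothetical "pre-trained" weight matrix is reused as-is, a small randomly initialized task head is added on top, and both are updated by gradient descent on labeled data. This is a minimal NumPy sketch; all names, shapes, and the single-matrix "encoder" are illustrative, not BERT's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pre-trained" encoder: one linear map standing in for BERT's weights.
W_pretrained = rng.normal(size=(8, 4)) * 0.1    # 8-dim input -> 4-dim representation

# Fine-tuning: reuse the pre-trained weights and add a fresh task-specific head.
W_enc = W_pretrained.copy()                     # initialized from pre-training, not from scratch
w_head = rng.normal(size=4) * 0.1               # new classification head, randomly initialized

X = rng.normal(size=(32, 8))                    # toy labeled downstream task: 32 examples
y = (X.sum(axis=1) > 0).astype(float)           # binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.3
for _ in range(1000):                           # the additional training ("fine-tuning") steps
    h = X @ W_enc                               # encoder representations
    p = sigmoid(h @ w_head)                     # head predictions
    g = p - y                                   # gradient of cross-entropy loss w.r.t. logits
    grad_head = (h.T @ g) / len(y)
    grad_enc = (X.T @ np.outer(g, w_head)) / len(y)
    w_head -= lr * grad_head                    # the new head is trained...
    W_enc -= lr * grad_enc                      # ...and the pre-trained weights are readjusted

accuracy = ((sigmoid(X @ W_enc @ w_head) > 0.5) == y).mean()
```

The key point mirrors fine-tuning: `W_enc` starts from the pre-trained values rather than random ones, and it is updated together with the new head during the downstream task.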

Structure

[Figure: BERT-Base vs. BERT-Large hyperparameters]

Here, BERT-Base has the same hyperparameters as OpenAI GPT-1, which preceded BERT: the BERT researchers designed BERT-Base to match GPT-1 in size so that the two models could be compared directly. BERT-Large, on the other hand, is the model designed to show BERT's maximum performance, and most of the records BERT set were achieved with BERT-Large.
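For reference, the configurations published in the BERT paper (number of layers L, hidden size H, attention heads A, approximate parameter count) can be restated directly:

```python
# Hyperparameters as published in the BERT paper (Devlin et al., 2018).
configs = {
    "BERT-Base":  {"layers": 12, "hidden": 768,  "heads": 12, "params": "110M"},
    "BERT-Large": {"layers": 24, "hidden": 1024, "heads": 16, "params": "340M"},
}

# BERT-Base was sized to match OpenAI GPT-1 (also 12 layers, 768 hidden units).
for name, c in configs.items():
    print(f"{name}: L={c['layers']}, H={c['hidden']}, A={c['heads']}, ~{c['params']} parameters")
```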

Embedding method

[Figure: BERT as a stack of Transformer encoder layers]

BERT-Base is basically a stack of 12 Transformer encoders, so internally each layer performs multi-head self-attention followed by a position-wise feed-forward network.
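One such encoder layer can be sketched in NumPy. The weights below are randomly initialized and purely illustrative (they are not trained BERT parameters), and residual connections and layer normalization are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 16, 2            # toy sizes; BERT-Base uses 768 dims and 12 heads
d_head = d_model // n_heads

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Randomly initialized, illustrative weights (not trained BERT parameters).
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
W1 = rng.normal(size=(d_model, 4 * d_model)) * 0.1   # FFN inner size is 4x d_model, as in BERT
W2 = rng.normal(size=(4 * d_model, d_model)) * 0.1

def encoder_layer(X):
    """One encoder layer: multi-head self-attention, then a position-wise FFN.
    Residuals and layer normalization are omitted for brevity."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)   # (seq_len, seq_len) attention scores
        heads.append(softmax(scores) @ V[:, s])          # each position attends to ALL positions
    attn = np.concatenate(heads, axis=-1) @ Wo
    # Position-wise FFN: the same two-layer MLP applied independently at every position.
    return np.maximum(0.0, attn @ W1) @ W2               # ReLU here; BERT actually uses GELU

X = rng.normal(size=(seq_len, d_model))   # embeddings for ["[CLS]", "I", "love", "you"]
out = encoder_layer(X)                    # same shape as the input: (4, 16)
```

In the full model, 12 (or 24) such layers are applied one after another, each layer's output feeding the next.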

[Figure: input embeddings (left) vs. contextualized output embeddings (right)]

The output embeddings produced by BERT reflect the context of the entire sentence. In the figure on the left above, at input time the [CLS] vector is simply an embedding that has passed through the embedding layer and serves as part of BERT's initial input; after passing through BERT, however, every word vector ([CLS], I, love, you) has referenced all the others and become a vector carrying contextual information.
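That every output vector depends on every input token can be checked with a toy single-head self-attention step: perturbing only the embedding of "love" also changes the output vector at the [CLS] position. The weights here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
X = rng.normal(size=(4, d))                     # input embeddings for [CLS], I, love, you

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.3 for _ in range(3))

def self_attention(X):
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    return softmax(scores) @ (X @ Wv)           # every output row mixes information from all rows

cls_before = self_attention(X)[0]               # output at the [CLS] position

X2 = X.copy()
X2[2] += 1.0                                    # perturb ONLY the embedding of "love"
cls_after = self_attention(X2)[0]

# The [CLS] output changed even though the [CLS] input did not: it is contextualized.
changed = not np.allclose(cls_before, cls_after)
```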

Position embedding instead of sinusoidal positional encoding

[Figure: adding position embeddings to WordPiece embeddings]

The figure above shows how position embeddings are used. First, the WordPiece embeddings are the word embeddings we already know, and they form the actual input. Position information is then added to this input through position embeddings. The idea of position embedding is very simple: use one more embedding layer, this one for positional information, and add its output to the word embedding. Unlike the original Transformer's sinusoidal positional encoding, these position embeddings are learned during training.
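A minimal sketch of this scheme: one learned lookup table for WordPiece tokens and a second one for positions, with the input to the first encoder layer being their element-wise sum. Vocabulary size, dimensions, and token IDs below are made up, and the segment embedding that BERT also adds is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 16, 8       # toy sizes; BERT-Base uses 30522 / 512 / 768

# Two learned lookup tables: one for WordPiece tokens, one for positions.
token_embedding = rng.normal(size=(vocab_size, d_model))
position_embedding = rng.normal(size=(max_len, d_model))   # trained, not sinusoidal

token_ids = np.array([5, 17, 42, 8])            # e.g. [CLS], I, love, you (made-up IDs)
positions = np.arange(len(token_ids))           # 0, 1, 2, 3

# Input to the first encoder layer = word embedding + position embedding.
inputs = token_embedding[token_ids] + position_embedding[positions]
```

Because the position table is just another embedding layer, its rows are updated by backpropagation like any other parameter, rather than being fixed by a formula.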

Pre-training