Introduction
Main contributions:
- Introduce GPT-C, a multi-layer generative transformer model that is a variant of GPT-2
- Propose and deploy a novel end-to-end code sequence completion system together with an efficient client-side caching system
- Introduce MultiGPT-C, a multilingual version of GPT-C
Dataset
- 1.2 billion lines of source code in Python, C#, JavaScript (JS), and TypeScript (TS)
- 4.7 million source files
Approach
Model definition
$h_0 = C W_e + W_p$, $h_l = \text{transformer\_block}(h_{l-1})$ for $l = 1, \ldots, n$, and $P(w) = \mathrm{softmax}(h_n W_e^T)$, where $W_e$ is the input token embedding matrix, $C$ is the context vector of tokens, and $W_p$ is the position embedding matrix.
The input token embedding matrix is reused in the output layer, which removes a large fully connected output layer and reduces the number of parameters by 25%.
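A minimal sketch of this weight tying, assuming PyTorch; the class and method names are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

# Minimal sketch of the tied input/output embedding; layer names and the
# context length are illustrative assumptions.
class TiedEmbeddingLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, max_len: int = 1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)            # W_e
        self.pos_emb = nn.Parameter(torch.zeros(max_len, d_model))  # W_p

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # h_0 = C W_e + W_p
        return self.tok_emb(token_ids) + self.pos_emb[: token_ids.size(-1)]

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # Output projection reuses W_e transposed: softmax(h_n W_e^T),
        # so no separate fully connected output layer is needed.
        return hidden @ self.tok_emb.weight.T
```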
Preprocessing
Structured representations such as ASTs, CSTs, or CFGs are not used in this paper.
The input code is normalized with a custom tokenizer to overcome code-style differences. The token type is also extracted during both training and inference; it is used to guide normalization, to extract the subtoken vocabulary, and to encode the sequences.
Vocabulary
Two tokenization methods:
- BPE
- splitting tokens by casing convention, including camelCase, PascalCase, and snake_case (sketched below)
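An illustrative helper (not the paper's tokenizer) that splits identifiers into subtokens by casing convention:

```python
import re

# Hypothetical subtoken splitter covering snake_case, camelCase, and PascalCase.
def split_identifier(name: str) -> list[str]:
    subtokens = []
    for part in name.split("_"):  # snake_case
        subtokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", part)
    return [t.lower() for t in subtokens]

print(split_identifier("getHTTPResponseCode"))  # ['get', 'http', 'response', 'code']
print(split_identifier("max_retry_count"))      # ['max', 'retry', 'count']
print(split_identifier("PascalCase"))           # ['pascal', 'case']
```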
Several special tokens are added to the vocabulary: <BOF>, <EOF>, <EOL>, <INDENT>, and <DEDENT>.
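A hedged sketch of what such normalization could look like for Python, using the standard-library tokenize module; only the special-token names come from the notes, the rest is illustrative:

```python
import io
import tokenize

# Sketch of source normalization for Python with the stdlib tokenizer.
def normalize(source: str) -> list[str]:
    out = ["<BOF>"]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NEWLINE, tokenize.NL):
            out.append("<EOL>")
        elif tok.type == tokenize.INDENT:
            out.append("<INDENT>")
        elif tok.type == tokenize.DEDENT:
            out.append("<DEDENT>")
        elif tok.type == tokenize.ENDMARKER:
            break
        else:
            out.append(tok.string)  # tok.type also exposes the token type
    return out + ["<EOF>"]

print(normalize("def f(x):\n    return x + 1\n"))
# ['<BOF>', 'def', 'f', '(', 'x', ')', ':', '<EOL>', '<INDENT>',
#  'return', 'x', '+', '1', '<EOL>', '<DEDENT>', '<EOF>']
```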
Sensitive Data Processing
To avoid leaking sensitive data, IntelliCode Compose replaces numeric literals, string literals, and comments with the special tokens <NUM_LIT>, <STR_LIT>, and <COMMENT>, respectively.
Note that not all literals are replaced; a set of frequent literals in each language is kept inside the token. For example, <STR_LIT:__main__> denotes the string literal "__main__".
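A minimal sketch of this literal masking under the same tokenize-based setup; the whitelist of kept literals here is illustrative:

```python
import io
import tokenize

# Sketch of literal masking. The whitelist is an assumption; the paper keeps
# a set of frequent literals per language.
KEPT_STRINGS = {"__main__", "utf-8"}

def mask_literals(source: str) -> list[str]:
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NUMBER:
            out.append("<NUM_LIT>")
        elif tok.type == tokenize.STRING:
            value = tok.string.strip("\"'")
            out.append(f"<STR_LIT:{value}>" if value in KEPT_STRINGS else "<STR_LIT>")
        elif tok.type == tokenize.COMMENT:
            out.append("<COMMENT>")
        elif tok.type == tokenize.ENDMARKER:
            break
        else:
            out.append(tok.string)  # structural tokens kept verbatim for brevity
    return out

print(mask_literals('if __name__ == "__main__":\n    print("hi", 42)\n'))
```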
Model Training
During training, IntelliCode Compose uses a decreasing learning rate with several warm-up periods. Training is scaled out with a synchronous data-parallel distributed training algorithm using local gradient accumulation.
The authors use five Lambda V100 boxes, each with sixteen V100 GPUs (32 GB HBM2 memory each), eight 100 Gb/s InfiniBand links, and one 100 Gb/s Ethernet connection, managed with Kubernetes.
The best model has 24 layers, 16 attention heads, and a BPE vocabulary of 50,000 subtokens.
The hyperparameters are set as follows: base learning rate $6.25 \times 10^{-5}$, batch size 128, and a learning-rate decay of 0.98 per epoch.
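A hedged sketch of this schedule; the warm-up length is an assumption, since the notes do not state it:

```python
# Reported values: base LR 6.25e-5, decay 0.98 per epoch, with warm-up.
BASE_LR = 6.25e-5
EPOCH_DECAY = 0.98
WARMUP_STEPS = 1000  # assumed; not given in the notes

def learning_rate(global_step: int, epoch: int) -> float:
    warmup_scale = min(1.0, global_step / WARMUP_STEPS)  # linear warm-up ramp
    return BASE_LR * warmup_scale * EPOCH_DECAY ** epoch

for step, epoch in [(100, 0), (1000, 0), (5000, 1), (5000, 10)]:
    print(f"step={step:5d} epoch={epoch:2d} lr={learning_rate(step, epoch):.3e}")
```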
Sequence Decoding
Beam search is used for decoding; a hypothesis ends at <EOL> or other language-specific tokens.
Inference is batched, with the batch size set to the beam width $k$; this reduces $L \cdot k$ inference calls to $L$, where $L$ is the sequence length, as sketched below.
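A minimal sketch of batched beam-search decoding; `model_logprobs` is a stand-in for the real network, mapping a batch of prefixes to one token-to-log-probability dict per prefix:

```python
import math

# At every position all k live beam prefixes are scored in one batched model
# call, so decoding a sequence of length L costs L calls instead of L*k.
def beam_search(model_logprobs, prefix, k=5, max_len=20, eol="<EOL>"):
    beams = [(0.0, list(prefix))]  # (cumulative log-prob, tokens)
    for _ in range(max_len):
        next_dists = model_logprobs([tokens for _, tokens in beams])  # one call
        candidates = []
        for (score, tokens), dist in zip(beams, next_dists):
            if tokens[-1] == eol:  # finished beams pass through unchanged
                candidates.append((score, tokens))
                continue
            for token, logp in dist.items():
                candidates.append((score + logp, tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(tokens[-1] == eol for _, tokens in beams):
            break
    return beams

# Toy usage with a constant distribution over a two-token vocabulary.
toy = {"x": math.log(0.6), "<EOL>": math.log(0.4)}
print(beam_search(lambda batch: [toy] * len(batch), ["foo", "="], k=2, max_len=5))
```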
Client-Side Post-Processing
Caching
Suggestion results are cached in a trie whose nodes carry model scores, keyed by the preceding context. As the user continues typing, the trie is pruned and the suggestion is recomputed by traversing the trie with a greedy search.
The trie traversal is terminated early if none of the child nodes has a score larger than the score of its parent multiplied by a length-dependent ratio $R$, where $L$ is the position in the sequence. $\alpha$ and $\kappa$ are two parameters: $\alpha$ controls the overall average suggestion length, and $\kappa$ controls how quickly $R$ decreases. In practice, the authors select $\alpha = 0.8$ and $\kappa = 10$ to balance suggestion length against relevance.
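A sketch of the trie cache and the pruned greedy traversal. These notes do not reproduce the exact formula for $R$, so the exponential decay below is only a placeholder assumption that uses the reported $\alpha$ and $\kappa$:

```python
import math

# Completed suggestions are stored token by token with their scores; as the
# user types, the trie is pruned by descending into the matching child, and
# a greedy walk over the remaining trie yields the suggestion.
ALPHA, KAPPA = 0.8, 10.0

def ratio(L: int) -> float:
    return ALPHA * math.exp(-L / KAPPA)  # ASSUMED shape of R(L), not the paper's

class TrieNode:
    def __init__(self, score: float = 1.0):
        self.score = score  # cumulative sequence probability at this node
        self.children: dict[str, "TrieNode"] = {}

    def insert(self, tokens: list[str], scores: list[float]) -> None:
        node = self
        for tok, s in zip(tokens, scores):
            node = node.children.setdefault(tok, TrieNode(s))

def advance(node: TrieNode, typed: str):
    """Prune the trie as the user types one more token."""
    return node.children.get(typed)

def greedy_suggest(root: TrieNode) -> list[str]:
    out, node, L = [], root, 0
    while node.children:
        tok, child = max(node.children.items(), key=lambda kv: kv[1].score)
        if child.score <= node.score * ratio(L):  # early termination test
            break
        out.append(tok)
        node, L = child, L + 1
    return out
```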
Suggestion Processing
Special tokens are removed or replaced on a case-by-case basis before the suggestion is shown to the user.
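One plausible clean-up policy, sketched below; the concrete mappings are assumptions, since the notes only say handling is case by case:

```python
# ASSUMED policy: structural tokens become whitespace, and the suggestion is
# truncated at masked literals, whose real content is unknown.
REPLACEMENTS = {"<EOL>": "\n", "<INDENT>": "    ", "<DEDENT>": ""}
STOP_AT = {"<NUM_LIT>", "<STR_LIT>", "<COMMENT>", "<EOF>"}

def render(tokens: list[str]) -> str:
    parts = []
    for tok in tokens:
        if tok in STOP_AT:
            break
        parts.append(REPLACEMENTS.get(tok, tok))
    return " ".join(p for p in parts if p)

print(render(["self", ".", "value", "=", "<NUM_LIT>"]))  # -> self . value =
```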
Multi-Lingual Model
The authors test four approaches:
- Language-agnostic baseline
- Language-type embedding
- Language control code
- Double-heads multilingual model, which adds a programming-language classification task during model pretraining
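A hedged sketch of the language-type-embedding and double-heads variants in PyTorch; layer sizes and names are illustrative, not the paper's. The language-control-code variant is not shown: it simply prepends a control token such as <python> to the input sequence.

```python
import torch
import torch.nn as nn

LANG_IDS = {"python": 0, "csharp": 1, "js": 2, "ts": 3}

class DoubleHeadsGPTC(nn.Module):
    """Shared transformer body with (1) the usual next-token LM head and
    (2) a programming-language classification head used during pretraining."""
    def __init__(self, vocab_size=50000, d_model=512, n_langs=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(n_langs, d_model)  # language-type embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.lang_head = nn.Linear(d_model, n_langs)    # second (classification) head

    def forward(self, tokens, lang_id):
        # Language-type embedding: add a per-language vector to every position.
        h = self.tok_emb(tokens) + self.lang_emb(lang_id).unsqueeze(1)
        h = self.body(h)  # causal attention mask omitted for brevity
        lm_logits = h @ self.tok_emb.weight.T        # tied output embedding
        lang_logits = self.lang_head(h.mean(dim=1))  # sequence-level pooling
        return lm_logits, lang_logits
```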