A Self-Attentional Neural Architecture for Code Completion with Multi-Task Learning
Introduction
Current code completion algorithms suffer from 3 major drawbacks:
- Hierarchical structural information of the program is not fully utilized
- In programs, the semantic relationships can be very long
- Underuse of the information from related tasks
In an AST, a node's type and value are two closely related attributes, but existing methods predict them separately
Contributions
- Multi-task learning model for code completion
- Exploit the hierarchical structure information of program
- Introduce Transformer-XL to capture long-distance dependencies
Motivating example
- Predicting break
- Predicting node's type & value
NOTE: borrows the idea from code2vec?
Background
Language modeling
Probability of S: $P(S) = \prod_{t=1}^{N} P(x_t \mid x_1, \dots, x_{t-1})$
N-gram: approximates the history with a fixed window of $n-1$ tokens, $P(x_t \mid x_1, \dots, x_{t-1}) \approx P(x_t \mid x_{t-n+1}, \dots, x_{t-1})$
RNN: capture longer dependencies than N-gram
LSTM/GRU: ease the vanishing gradient problem, but still use only about 200 context words on average
Transformer: limited to a fixed-length context
👍Transformer-XL: captures longer context
Multi-task learning
MTL can improve generalization and reduce the risk of over-fitting
Method: parameter sharing of hidden layers
- soft parameter sharing: each task keeps its own parameters; the distance between the tasks' parameters is regularized to keep them similar
- hard parameter sharing: hidden layers are shared among tasks, output layers are task-specific (sketched below)
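A minimal sketch of hard parameter sharing (PyTorch; module and layer sizes are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class HardSharingModel(nn.Module):
    """Shared trunk (hidden layers) + one output head per task."""
    def __init__(self, d_in, d_hidden, n_types, n_values):
        super().__init__()
        # Hidden layers shared by ALL tasks (hard parameter sharing).
        self.shared = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
        )
        # Task-specific output layers.
        self.type_head = nn.Linear(d_hidden, n_types)
        self.value_head = nn.Linear(d_hidden, n_values)

    def forward(self, x):
        h = self.shared(x)  # one shared representation ...
        return self.type_head(h), self.value_head(h)  # ... two predictions
```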
Proposed model
Task: predict the probability of the next node in the AST
Partial AST encoder: Transformer-XL
Path2root encoder:
MTL: hard parameter sharing
output layers are task-specific
Program representation
program -> AST -> sequence of nodes
- in-order depth-first traversal
Node -> type[value] -> $x = [T; V]$ (concatenation of the type and value embeddings)
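For intuition, a rough sketch of the AST -> node-sequence step using Python's ast module (the paper's exact traversal order and naming scheme may differ; "EMPTY" as the placeholder value is our choice):

```python
import ast

def flatten(tree):
    """Depth-first traversal: yield one (type, value) pair per AST node."""
    seq = []
    def visit(node):
        if isinstance(node, ast.Name):        # identifier leaf
            seq.append((type(node).__name__, node.id))
        elif isinstance(node, ast.Constant):  # literal leaf
            seq.append((type(node).__name__, repr(node.value)))
        else:                                 # non-leaf: placeholder value
            seq.append((type(node).__name__, "EMPTY"))
        for child in ast.iter_child_nodes(node):
            visit(child)
    visit(tree)
    return seq

# flatten(ast.parse("def f(a, b): return a + b"))
# -> [('Module', 'EMPTY'), ('FunctionDef', 'EMPTY'), ..., ('Name', 'a'), ...]
```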
Partial AST encoder
A standard Transformer-XL is used to encode the partial AST, i.e. the node sequence seen so far.
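The key Transformer-XL mechanic, sketched below: hidden states of the previous segment are cached as memory and attended to, but not back-propagated through, so the effective context extends across segments. Relative positional encodings and causal masking are omitted for brevity; this is a schematic, not the paper's code:

```python
import torch
import torch.nn as nn

class SegmentRecurrentLayer(nn.Module):
    """One self-attention layer with Transformer-XL-style segment memory."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x, mem=None):
        # Keys/values cover cached memory + current segment;
        # queries come only from the current segment.
        kv = torch.cat([mem, x], dim=1) if mem is not None else x
        out, _ = self.attn(x, kv, kv, need_weights=False)
        new_mem = x.detach()  # cache current hidden states, no gradient flow
        return out, new_mem
```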
Path2Root encoder
The path from the node being predicted to the root node is extracted. In the above example, to predict NameLoad[b], the extracted path is {BinOp, Return, body, FunctionDef}
A bidirectional LSTM is used as the encoder; the final hidden vectors of the two directions are concatenated.
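A minimal sketch of such a path2root encoder (embedding and hidden sizes are illustrative):

```python
import torch
import torch.nn as nn

class Path2RootEncoder(nn.Module):
    """Bi-LSTM over the node-type path from the predicting node to the root."""
    def __init__(self, n_types, d_emb, d_hidden):
        super().__init__()
        self.emb = nn.Embedding(n_types, d_emb)
        self.lstm = nn.LSTM(d_emb, d_hidden, bidirectional=True, batch_first=True)

    def forward(self, path_ids):                    # (batch, path_len)
        _, (h_n, _) = self.lstm(self.emb(path_ids))
        # h_n: (2, batch, d_hidden); concatenate forward & backward states.
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * d_hidden)
```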
Task-specific output layers
Two tasks: predict next node's type and value
Output layers: the outputs of the partial AST encoder and the path2root encoder are concatenated; a softmax over the respective vocabulary generates each prediction
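Putting the pieces together as described above (a sketch; the paper may add extra projections):

```python
import torch
import torch.nn as nn

class OutputHeads(nn.Module):
    """Fuse the two encoders, then apply one linear + softmax per task."""
    def __init__(self, d_ast, d_path, n_types, n_values):
        super().__init__()
        self.type_out = nn.Linear(d_ast + d_path, n_types)
        self.value_out = nn.Linear(d_ast + d_path, n_values)

    def forward(self, h_ast, h_path):
        h = torch.cat([h_ast, h_path], dim=-1)  # fuse the two encoders
        # In training you would keep the logits and use CrossEntropyLoss;
        # softmax is shown here to match the description.
        return self.type_out(h).softmax(-1), self.value_out(h).softmax(-1)
```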
Training
Loss
Because there are multiple tasks, the final loss is a weighted sum of the task-specific losses
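Formally, with task-specific losses $\mathcal{L}_k$ and weights $\alpha_k$ (here $K = 2$: type and value; the notes don't record the weight values): $\mathcal{L} = \sum_{k=1}^{K} \alpha_k \mathcal{L}_k$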
Experiments
Dataset & Metrics
dataset: Java / Python / JS
Metrics: accuracy
Experimental setup
Vocab size: 50000
embedding size: type[300], value[1200], total[1500]
6-layer Transformer-XL, 6 heads, d_h = 64
Segment length = 50
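For reference, the reported setup collected into one config sketch (field names are ours; values are from the notes):

```python
config = dict(
    vocab_size=50_000,    # per the notes
    type_emb_size=300,    # type embedding
    value_emb_size=1200,  # value embedding
    input_size=1500,      # concatenation: 300 + 1200
    n_layers=6,           # Transformer-XL layers
    n_heads=6,            # attention heads
    head_dim=64,          # d_h
    segment_length=50,
)
```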