Original Transformer paper "Attention is all you need" PDF download
Posted by an anonymous user on 2025-12-01 09:50:35
(If clicking gets no response, refresh a couple of times and it will work!)

[Image 1: preview screenshot of the paper PDF]

 

 

Material contents:

 

1 Introduction

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
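To make that sequential constraint concrete, here is a minimal NumPy sketch (not code from the paper; the function and weight names are invented for illustration) of a vanilla recurrent layer. Because each hidden state h_t is computed from h_{t-1}, the loop over positions cannot be parallelized within a training example:

```python
# Illustrative sketch only: a plain RNN forward pass, showing that the
# per-position loop is inherently sequential (h_t depends on h_{t-1}).
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """x: (seq_len, d_in); returns hidden states of shape (seq_len, d_h)."""
    seq_len, _ = x.shape
    h = np.zeros(W_hh.shape[0])
    states = []
    for t in range(seq_len):                       # strictly sequential over positions
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)  # h_t = f(x_t, h_{t-1})
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))        # toy sequence: 6 positions, 4 input features
W_xh = rng.normal(size=(4, 8))
W_hh = rng.normal(size=(8, 8))
b_h = np.zeros(8)
print(rnn_forward(x, W_xh, W_hh, b_h).shape)       # (6, 8)
```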
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

 

2 Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
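For concreteness, the following is a hedged NumPy sketch (not the paper's code; names and shapes are chosen for the example) of scaled dot-product self-attention. A single matrix product scores every pair of positions, so relating any two positions costs a constant number of operations; Multi-Head Attention, described in section 3.2, runs several such attention functions over different learned projections in parallel and concatenates their outputs:

```python
# Illustrative sketch of scaled dot-product self-attention in NumPy.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k). One matrix product relates every position to
    every other position, so any pair is connected by a constant-length path."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq_len, seq_len) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over key positions
    return weights @ V                                     # attention-weighted average of values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                               # 5 positions, model dimension 16
out = scaled_dot_product_attention(x, x, x)                # self-attention: Q = K = V = x
print(out.shape)                                           # (5, 16)
```

Averaging over value vectors is exactly the "reduced effective resolution" the paragraph mentions; using several heads lets the model attend to different positions in different representation subspaces at once.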
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].

 

3 Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
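As an illustration of this auto-regressive loop, the sketch below uses toy stand-ins (encode, decode_step and greedy_generate are hypothetical names, not the paper's code) to show how the decoder consumes previously generated symbols when producing the next one:

```python
# Hedged sketch of auto-regressive greedy decoding with an encoder-decoder.
import numpy as np

BOS, EOS, VOCAB = 0, 1, 10
rng = np.random.default_rng(0)

def encode(src_tokens):
    """Toy encoder: map input symbols (x_1, ..., x_n) to continuous vectors z."""
    return rng.normal(size=(len(src_tokens), 8))

def decode_step(z, prefix):
    """Toy decoder step: score the next symbol given z and the symbols so far."""
    bias = z.mean() + 0.1 * len(prefix)        # depends on both inputs, purely illustrative
    return rng.normal(size=VOCAB) + bias

def greedy_generate(src_tokens, max_len=10):
    z = encode(src_tokens)
    out = [BOS]
    for _ in range(max_len):                   # one output symbol at a time
        next_tok = int(np.argmax(decode_step(z, out)))
        out.append(next_tok)                   # fed back as additional input next step
        if next_tok == EOS:
            break
    return out[1:]

print(greedy_generate([3, 4, 5]))
```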