Automatic text summarization is a classic research topic in natural language processing. The task is to produce a shorter version of a longer text while preserving as much of the original meaning as possible, striking a balance between text length and retained information. In the last couple of years, deep learning methods have been widely adopted. Automatic summarization helps people browse and extract useful information quickly, so it has broad application prospects in today's era of internet information overload. This page records recent papers on automatic summarization and is updated from time to time.

Automatic Summarization Paper Notes

Sentence summarization

1. A Neural Attention Model for Abstractive Sentence Summarization

Citations: 601
Alexander M. Rush et al., Facebook AI Research/Harvard
EMNLP2015

Abstract

The seq2seq model was proposed in 2014, and this is one of the earliest papers to apply it to abstractive summarization. The same group (with largely the same authors) later published a NAACL2016 paper (Sumit Chopra, Facebook AI Research, "Abstractive Sentence Summarization with Attentive Recurrent Neural Networks") that builds on this work with further improvements and better results. Both papers are classic baselines for seq2seq abstractive summarization.

Objective function: negative log likelihood
Optimization: mini-batch SGD

Three encoders:

  1. Bag-of-words encoder
  2. Convolutional encoder: follows TextCNN, with few other modifications
  3. Attention-based encoder:
    x: the source text
    y_c: the context words (the part of the summary generated so far)
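Below is a minimal sketch of the attention-based encoder as described above: the context embedding attends over the source embeddings, and the encoder output is a weighted sum of locally smoothed source embeddings. The dimensions, the smoothing window Q, and all variable names are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def attention_encoder(x_emb, yc_emb, P, Q=2):
    """x_emb:  (n, d)   embeddings of the source words x
       yc_emb: (C*d,)   flattened embeddings of the context words y_c
       P:      (d, C*d) learned attention matrix
       Q:      half-width of the smoothing window (assumed value)"""
    # attention over source positions
    scores = x_emb @ P @ yc_emb                      # (n,)
    p = F.softmax(scores, dim=0)
    # locally smoothed source embeddings x_bar (window of size 2Q+1)
    x_bar = F.avg_pool1d(x_emb.t().unsqueeze(0), kernel_size=2 * Q + 1,
                         stride=1, padding=Q,
                         count_include_pad=False).squeeze(0).t()  # (n, d)
    # context representation fed into the language model that predicts the next summary word
    return p @ x_bar                                 # (d,)

# toy usage
n, d, C = 30, 64, 5
enc = attention_encoder(torch.randn(n, d), torch.randn(C * d), torch.randn(d, C * d))
print(enc.shape)  # torch.Size([64])
```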

Performance (ABS)

Decoding uses beam search.
The model's results are not particularly satisfying.

| Dataset | Rouge-1 | Rouge-2 | Rouge-L |
| --- | --- | --- | --- |
| DUC-2004 | 26.55 | 7.06 | 22.05 |
| Gigaword | 30.88 | 12.65 | 28.34 |

2. Abstractive Sentence Summarization with Attentive Recurrent Neural Networks

Sumit Chopra et al., Facebook AI Research
NAACL2016

encoder: an attention-based CNN

The word's original embedding (x_i) and a trainable position embedding (l_i) are first summed to form the word's full embedding (a_i):
full embedding a_i = embedding(x_i) + embedding(l_i)
A convolution with window size 5 then produces the aggregate embedding (z_i).

decoder: RNN/LSTM

The encoder input is always the complete sentence. At each output step the decoder passes h_t-1 to the encoder; the encoder computes attention over the sentence conditioned on h_t-1 and returns a context vector c_t, and the decoder then predicts the next word from (y_t-1, h_t-1, c_t). The encoder also updates the position embeddings (l_i).
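A rough sketch of this attentive convolutional encoder: full embeddings a_i are the sum of word and position embeddings, a width-5 convolution gives aggregate embeddings z_i, and attention against the decoder state h_t-1 yields the context c_t. Layer sizes are assumptions, and taking the weighted sum over the full embeddings a_i is a simplification rather than the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAttnEncoder(nn.Module):
    def __init__(self, vocab_size, max_len, d=128, kernel=5):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)   # x_i
        self.pos_emb = nn.Embedding(max_len, d)       # l_i, trainable
        self.conv = nn.Conv1d(d, d, kernel, padding=kernel // 2)

    def forward(self, x, h_prev):
        """x: (n,) source token ids; h_prev: (d,) decoder state h_t-1."""
        pos = torch.arange(x.size(0))
        a = self.word_emb(x) + self.pos_emb(pos)           # full embeddings a_i
        z = self.conv(a.t().unsqueeze(0)).squeeze(0).t()   # aggregate embeddings z_i
        alpha = F.softmax(z @ h_prev, dim=0)               # attention over source positions
        return alpha @ a                                   # context vector c_t for the decoder

# toy usage: one decoding step
enc = ConvAttnEncoder(vocab_size=5000, max_len=100)
c_t = enc(torch.randint(0, 5000, (30,)), torch.randn(128))
print(c_t.shape)  # torch.Size([128])
```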

Performance (RAS-Elman, k=10, where k is the beam size):

| Dataset | Rouge-1 | Rouge-2 | Rouge-L |
| --- | --- | --- | --- |
| DUC-2004 | 28.97 | 8.26 | 24.06 |
| Gigaword | 33.78 | 15.9 | 31.15 |

3. Selective Encoding for Abstractive Sentence Summarization

Qingyu Zhou, Nan Yang, Furu Wei, Ming Zhou; MSRA&HIT
ACL2017

Encoder
BiGRU: produces a 2d-dimensional hidden state h_i for each word x_i.
Selective Mechanism
Each word's h_i is concatenated with the sentence vector s and passed through an MLP to produce the filtered state h'_i (a selective gate; see the sketch below).
s = [h←_1, h→_n]: h←_1 is the backward state after reading the whole sentence right to left, h→_n is the forward state after reading it left to right.
Decoder
s_t = GRU(s_t-1, c_t-1, y_t-1)
r_t = maxout(c_t, s_t, y_t-1) (k=2)
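A minimal sketch of the selective mechanism, under the assumption that the MLP acts as a sigmoid gate applied element-wise to h_i; shapes and names are illustrative only.

```python
import torch
import torch.nn as nn

class SelectiveGate(nn.Module):
    def __init__(self, hidden2d):
        super().__init__()
        # MLP over [h_i ; s] -> gate in (0, 1), applied element-wise to h_i
        self.mlp = nn.Sequential(nn.Linear(2 * hidden2d, hidden2d), nn.Sigmoid())

    def forward(self, H, s):
        """H: (n, 2d) BiGRU states h_i;  s: (2d,) sentence vector [h←_1, h→_n]."""
        gate = self.mlp(torch.cat([H, s.expand_as(H)], dim=-1))  # (n, 2d)
        return H * gate   # h'_i: the selected second-level representation

# toy usage
n, two_d = 20, 256
H = torch.randn(n, two_d)
s = torch.randn(two_d)
print(SelectiveGate(two_d)(H, s).shape)  # torch.Size([20, 256])
```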

Performance: state-of-the-art on all benchmarks at the time

| Dataset | Rouge-1 | Rouge-2 | Rouge-L |
| --- | --- | --- | --- |
| DUC-2004 | 29.21 | 9.56 | 25.51 |
| Gigaword | 36.15 | 17.54 | 33.63 |

Gigaword (ours): Rouge-1 46.86 / Rouge-2 24.58 / Rouge-L 43.53 (surprisingly high)

Text summarization (multi-sentence summaries)

1. Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos; IBM
CoNLL2016

Many tricks:

LVT/sampled softmax: the seq2seq output layer normally computes a softmax over every word in the vocabulary V, which is very expensive. Sampled softmax samples a smaller vocabulary V' for each sentence/document and computes the softmax only over V', greatly reducing the training cost. At test time all words still have to be scored.
Feature-rich encoder: POS, NER, TF, IDF and other text features are concatenated to the word embedding as the encoder input.
Switching Generator-Pointer: addresses the OOV/UNK problem. When an OOV word is encountered, the switch g_i is set to 0 and the model picks a word from the input as the output and as the next step's input. At test time the model decides on its own whether to emit a word from the decoder vocabulary or to pick one from the input (a sketch of the switch follows below).
Hierarchical Attention: the model computes attention over sentences and combines the sentence weights into per-word weights. A position embedding is also concatenated to each sentence's hidden state.
Hierarchical attention did not work as well as expected; the authors also used Temporal Attention (Sankaran et al., 2016, Temporal Attention Model for Neural Machine Translation), which improved results substantially.
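A hedged sketch of the switching generator/pointer idea: a sigmoid switch chooses between a vocabulary softmax and a pointer (attention) distribution over the source. The scoring functions, names, and shapes below are assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneratorPointerSwitch(nn.Module):
    def __init__(self, hidden, vocab_size):
        super().__init__()
        self.switch = nn.Linear(hidden, 1)          # P(generate) = sigmoid(w · state)
        self.gen = nn.Linear(hidden, vocab_size)    # generator softmax over the vocabulary
        self.ptr = nn.Linear(hidden, hidden)        # bilinear pointer scores over the source

    def forward(self, state, enc_states):
        """state: (hidden,) decoder state; enc_states: (n, hidden) encoder states."""
        p_generate = torch.sigmoid(self.switch(state))             # value in (0, 1)
        vocab_dist = F.softmax(self.gen(state), dim=-1)            # (V,)
        ptr_dist = F.softmax(enc_states @ self.ptr(state), dim=0)  # (n,) over source positions
        return p_generate, vocab_dist, ptr_dist

# toy usage; at test time: generate argmax(vocab_dist) if p_generate > 0.5,
# otherwise copy the source word at argmax(ptr_dist)
m = GeneratorPointerSwitch(hidden=64, vocab_size=1000)
p, vd, pd = m(torch.randn(64), torch.randn(20, 64))
```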

Dataset: this paper introduced the CNN/Daily Mail corpus, in which each summary contains multiple sentences (in the earlier DUC-2004 and Gigaword datasets each summary is a single sentence). It has since been widely used for evaluation.

Performance

| Dataset | Rouge-1 | Rouge-2 | Rouge-L |
| --- | --- | --- | --- |
| CNN/Daily | 35.46 | 13.30 | 32.65 |
| Gigaword | 35.30 | 16.64 | 32.62 |

2. Incorporating Copying Mechanism in Sequence-to-Sequence Learning

Citations: 208
Jiatao Gu et al.
ACL2016
Uses the LCSTS dataset
Models
Overview:

This paper combines the two kinds of scores in a single normalization to produce the output: a generate-mode score is computed for every word in the vocabulary V, a copy-mode score is computed for every word in the input X, and the scores are normalized together to give the final output distribution.

Decoder State Update: s_t = f(s_t-1, y_t-1, c_t), which is the standard update, except that here y_t-1 = [e(y_t-1); C(y_t-1)], where e(y_t-1) is the embedding of y_t-1 and C(y_t-1) is a selective read over the input: weights are computed only for input positions whose word equals y_t-1, all other positions are set to 0, and the weights are then normalized.
Code: https://github.com/MultiPath/CopyNet
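A simplified sketch of the combined generate/copy distribution described above: generate-mode scores over the vocabulary and copy-mode scores over the source positions share a single softmax, and each copy probability is added onto the word it would copy. Source OOVs and the selective read are omitted for brevity; names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def copynet_output(gen_scores, copy_scores, src_ids, vocab_size):
    """gen_scores:  (V,) unnormalized generate-mode scores
       copy_scores: (n,) unnormalized copy-mode scores, one per source position
       src_ids:     (n,) vocabulary id of each source word
       returns a distribution over the vocabulary of size V."""
    joint = F.softmax(torch.cat([gen_scores, copy_scores]), dim=0)  # one shared softmax
    p_gen, p_copy = joint[:vocab_size], joint[vocab_size:]
    # add each position's copy probability onto the word it would copy
    return p_gen.scatter_add(0, src_ids, p_copy)

# toy usage
V, n = 1000, 12
src = torch.randint(0, V, (n,))
dist = copynet_output(torch.randn(V), torch.randn(n), src, V)
print(dist.sum())  # ~1.0
```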
Performance:
| Dataset | Rouge-1 | Rouge-2 | Rouge-L |
| --- | --- | --- | --- |
| LCSTS (word level) | 35.0 | 22.3 | 32.0 |

3. Get To The Point: Summarization with Pointer-Generator Networks

ACL2017

In contrast to the previous paper, which adds the two sets of scores together, this paper uses a soft switch to control copying. The figure in the paper shows the third decoder step, predicting the word after "Germany beat": as before, an attention distribution and a vocabulary distribution are computed, but in addition a generation probability p_gen in [0, 1] is computed, the probability of generating a word from the vocabulary; 1 - p_gen is then the probability of copying a word from the input.

Coverage
The attention distributions are accumulated to record which parts of the source have already been summarized, and attending to those parts again is penalized. At each decoder step a coverage vector c_t records the sum of the attention distributions of all previous steps.
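A hedged sketch of the pointer-generator mixture and the coverage update described above; variable names are assumptions and the weighting of the coverage loss in the total objective is omitted.

```python
import torch

def final_distribution(p_gen, vocab_dist, attn_dist, src_ids, vocab_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on positions holding w."""
    copy_dist = torch.zeros(vocab_size).scatter_add(0, src_ids, attn_dist)
    return p_gen * vocab_dist + (1.0 - p_gen) * copy_dist

def coverage_step(coverage, attn_dist):
    """Coverage vector = sum of attention distributions of all previous steps;
       the coverage loss penalizes attending again to already-covered positions."""
    cov_loss = torch.minimum(attn_dist, coverage).sum()
    return coverage + attn_dist, cov_loss

# toy decoding loop
V, n, T = 1000, 12, 4
src_ids = torch.randint(0, V, (n,))
coverage = torch.zeros(n)
for _ in range(T):
    attn = torch.softmax(torch.randn(n), dim=0)
    dist = final_distribution(0.8, torch.softmax(torch.randn(V), dim=0), attn, src_ids, V)
    coverage, cov_loss = coverage_step(coverage, attn)
```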

Performance

| CNN/Daily Mail | Rouge-1 | Rouge-2 | Rouge-L |
| --- | --- | --- | --- |
| pointer-generator | 36.44 | 15.66 | 33.42 |
| pointer-generator + coverage | 39.53 | 17.28 | 36.38 |

4. Sequential Copying Networks

Qingyu Zhou, Nan Yang, Furu Wei, Ming Zhou; HIT & MSRA
AAAI2018
The original CopyNet copies one word at a time; this paper can copy multiple words (a phrase) at once, tagging each copied word to decide whether the copied span ends.

Performance

| Dataset | Rouge-1 | Rouge-2 | Rouge-L |
| --- | --- | --- | --- |
| Gigaword | 35.93 | 17.51 | 33.35 |

5. A Deep Reinforced Model for Abstractive Summarization

Romain Paulus, Caiming Xiong & Richard Socher; Salesforce Research
ICLR2018
Reinforcement Learning

Models (Attention Mechanism)

  1. Intra-Temporal Attention
  2. Intra-Decoder Attention
    Standard attention computes weights between the current decoder state and the encoder states. However, the words the decoder has already emitted also influence the next output word; for example, they can be used to keep the output from falling into repetition loops. This paper therefore adds attention between the decoder's current hidden state and its past hidden states: a score is computed between the current h_t and every past h_t', the scores are normalized into weights, and the weighted sum gives the decoder-side context vector c_t_d (a sketch follows after this list).
  3. Generation/Pointer

    Models (Hybrid Learning Objective)

  4. Teacher forcing. The usual objective is NLL, but even at its optimum the model often does not do best under discrete evaluation metrics (ROUGE, CIDEr, BLEU), for two main reasons: 1. during training the ground-truth word is fed as the next input, so errors do not accumulate, whereas at test time every error propagates to the next step; 2. the word order of the output is flexible, and the discrete metrics account for this flexibility, but NLL does not.
  5. Policy Learning: one of the two common reinforcement-learning approaches (policy gradient and Q-learning). This paper uses the self-critical policy gradient training algorithm (Rennie et al., 2016).
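A minimal sketch of the intra-decoder attention from point 2 above: the current decoder state attends over the decoder's own previous states to produce c_t_d. The bilinear scoring form and all names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def intra_decoder_attention(h_t, past_h, W):
    """h_t:    (d,)   current decoder hidden state
       past_h: (t, d) decoder hidden states from previous steps
       W:      (d, d) learned bilinear matrix
       returns c_t_d, the decoder-side context vector."""
    if past_h.size(0) == 0:                    # first step: no history to attend to
        return torch.zeros_like(h_t)
    scores = past_h @ (W @ h_t)                # (t,) scores against each past state
    alpha = F.softmax(scores, dim=0)           # normalized weights
    return alpha @ past_h                      # weighted sum of past decoder states

# toy usage
d = 64
c_d = intra_decoder_attention(torch.randn(d), torch.randn(5, d), torch.randn(d, d))
```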

Performance

| CNN/Daily Mail | Rouge-1 | Rouge-2 | Rouge-L |
| --- | --- | --- | --- |
| ML only | 38.30 | 14.81 | 35.49 |
| RL only | 41.16 | 15.75 | 39.08 |
| ML+RL | 39.87 | 15.82 | 36.90 |

6. Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting

Yen-Chun Chen, Mohit Bansal; UNC Chapel Hill
ACL2018
Extractor + Abstractor + Reinforcement Learning

decoder: CopyNet-like

Fast: parallel decoding
Reinforcement learning adjusts only the extractor's parameters, not the abstractor's, so that the generated sentences stay readable; it likewise uses the policy gradient algorithm.

| Dataset | Rouge-1 | Rouge-2 | Rouge-L |
| --- | --- | --- | --- |
| CNN/Daily | 40.88 | 17.80 | 38.54 |

7. A Reinforced Topic-Aware Convolutional Sequence-to-Sequence Model for Abstractive Text Summarization

Li Wang (1), Junlin Yao (2), Yunzhe Tao (3), Li Zhong (1), Wei Liu (4), Qiang Du (3)
1 Tencent Data Center of SNG
2 ETH Zurich
3 Columbia University
4 Tencent AI Lab
IJCAI-ECAI2018
Conv Seq2seq + LDA + Reinforcement Learning

Models

Position Embedding: each word's input = word embedding + position embedding.

Topic Embedding: for each topic, the top N words are taken to form a topic vocabulary K, and topic embeddings are pre-trained. For each input word, if it is in K its topic embedding is used, otherwise its word embedding (see the sketch below).
Joint Attention: in the topic-aware convolution, the attention weights are computed not only from the dot product between the current decoder hidden state and each encoder output, but also from the dot product between the current decoder hidden state and each encoder output of the input-embedding channel; the two are summed and normalized into the weights.
Reinforcement Learning: same as in A Deep Reinforced Model for Abstractive Summarization.
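A small sketch of the topic-aware embedding lookup described above: tokens in the topic vocabulary K use a (pre-trained) topic embedding, all others the word embedding. The class, its arguments, and the shapes are hypothetical.

```python
import torch
import torch.nn as nn

class TopicAwareEmbedding(nn.Module):
    def __init__(self, vocab_size, d, topic_word_ids):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)
        self.topic_emb = nn.Embedding(vocab_size, d)   # pre-trained in the paper
        in_topic = torch.zeros(vocab_size, dtype=torch.bool)
        in_topic[topic_word_ids] = True                # membership mask for K
        self.register_buffer("in_topic", in_topic)

    def forward(self, x):
        """x: (n,) token ids -> (n, d) embeddings, topic embedding if the word is in K."""
        mask = self.in_topic[x].unsqueeze(-1)          # (n, 1)
        return torch.where(mask, self.topic_emb(x), self.word_emb(x))

# toy usage
emb = TopicAwareEmbedding(1000, 64, topic_word_ids=torch.tensor([3, 7, 42]))
print(emb(torch.tensor([3, 5, 42])).shape)  # torch.Size([3, 64])
```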

Performance

| Dataset | Rouge-1 | Rouge-2 | Rouge-L |
| --- | --- | --- | --- |
| DUC2004 | 31.15 | 10.85 | 27.68 |
| Gigaword | 36.92 | 18.29 | 34.58 |
| LCSTS (word) | 39.93 | 33.08 | 42.68 |

8. Deep Communicating Agents for Abstractive Summarization

Asli Celikyilmaz et al.; Microsoft Research
NAACL2018
Models
This paper targets summarization of long documents. The text is split into multiple paragraphs; each paragraph is encoded separately with word-level attention computed inside it, producing a paragraph embedding. All paragraphs are then combined with contextual agent attention and fed to the decoder.
The pointer network is extended so that words can be copied from different paragraphs.
Reinforcement learning is also used, but whereas the earlier RL objectives sample a whole output y and use the difference between its ROUGE score and that of the baseline output ŷ as the training signal, this paper computes a reward after every emitted word: the current ROUGE score minus the score at the previous step.
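A hedged sketch of the per-word intermediate reward described above: the reward at step t is the prefix score up to t minus the prefix score up to t-1. The scorer passed in is a placeholder; the paper uses ROUGE.

```python
from typing import Callable, List

def intermediate_rewards(generated: List[str], reference: List[str],
                         score: Callable[[List[str], List[str]], float]) -> List[float]:
    rewards, prev = [], 0.0
    for t in range(1, len(generated) + 1):
        cur = score(generated[:t], reference)   # score of the prefix y_1..y_t
        rewards.append(cur - prev)              # incremental gain from emitting word t
        prev = cur
    return rewards

# toy usage with a trivial unigram-recall "scorer" standing in for ROUGE
score = lambda hyp, ref: len(set(hyp) & set(ref)) / len(set(ref))
print(intermediate_rewards("the cat sat".split(), "the cat sat down".split(), score))
```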


9. Global Encoding for Abstractive Summarization

Junyang Lin, Xu Sun, Shuming Ma, Qi Su; PKU
ACL2018
http://aclweb.org/anthology/P18-2027
Models

In neural abstractive summarization, the conventional sequence-to-sequence (seq2seq) model often suffers from repetition and semantic irrelevance. To tackle the problem, we propose a global encoding framework, which controls the information flow from the encoder to the decoder based on the global information of the source context. It consists of a convolutional gated unit to perform global encoding to improve the representations of the source-side information. Evaluations on the LCSTS and the English Gigaword both demonstrate that our model outperforms the baseline models, and the analysis shows that our model is capable of generating summary of higher quality and reducing repetition.

The problem this paper targets is that the decoder tends to keep repeating words it has already produced. A Deep Reinforced Model for Abstractive Summarization (ICLR2018) already attacked this from the side of the generated output with intra-decoder attention; this paper instead proposes global encoding, which uses the source text itself to address the problem.
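A simplified sketch of the convolutional gated unit idea: a CNN over the encoder outputs produces a sigmoid gate that filters each encoder state with source-global context. The real model also stacks self-attention on top of the convolution; the kernel size and shapes here are assumptions.

```python
import torch
import torch.nn as nn

class ConvGlobalGate(nn.Module):
    def __init__(self, d, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel, padding=kernel // 2)

    def forward(self, H):
        """H: (n, d) encoder outputs -> gated encoder outputs of the same shape."""
        g = torch.sigmoid(self.conv(H.t().unsqueeze(0))).squeeze(0).t()  # (n, d) gate
        return H * g    # filter out noisy, repetition-prone information

# toy usage
H = torch.randn(30, 128)
print(ConvGlobalGate(128)(H).shape)  # torch.Size([30, 128])
```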

10. Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting

ACL2018
https://arxiv.org/pdf/1805.11080.pdf
Inspired by how humans summarize long documents, we propose an accurate and fast summarization model that first selects salient sentences and then rewrites them abstractively (i.e., compresses and paraphrases) to generate a concise overall summary. We use a novel sentence-level policy gradient method to bridge the non-differentiable computation between these two neural networks in a hierarchical way, while maintaining language fluency. Empirically, we achieve the new state-of-the-art on all metrics (including human evaluation) on the CNN/Daily Mail dataset, as well as significantly higher abstractiveness scores. Moreover, by first operating at the sentence-level and then the word-level, we enable parallel decoding of our neural generative model that results in substantially faster (10-20x) inference speed as well as 4x faster training convergence than previous long-paragraph encoder-decoder models. We also demonstrate the generalization of our model on the test-only DUC-2002 dataset, where we achieve higher scores than a state-of-the-art model.

First select the important sentences, then rewrite each of them into a concise, comprehensive sentence; fluency is maintained and decoding is very fast.

11. Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation

Han Guo∗ Ramakanth Pasunuru∗ Mohit Bansal
ACL2018
http://aclweb.org/anthology/P18-1064

An accurate abstractive summary of a document should contain all its salient information and should be logically entailed by the input document. We improve these important aspects of abstractive summarization via multi-task learning with the auxiliary tasks of question generation and entailment generation, where the former teaches the summarization model how to look for salient questioning-worthy details, and the latter teaches the model how to rewrite a summary which is a directed-logical subset of the input document. We also propose novel multi-task architectures with high-level (semantic) layer-specific sharing across multiple encoder and decoder layers of the three tasks, as well as soft-sharing mechanisms (and show performance ablations and analysis examples of each contribution). Overall, we achieve statistically significant improvements over the state-of-the-art on both the CNN/DailyMail and Gigaword datasets, as well as on the DUC-2002 transfer setup. We also present several quantitative and qualitative analysis studies of our model's learned saliency and entailment skills.

Multi-task learning is used to aid summary generation.
1. Question generation
teaches the summarization model to look for salient, question-worthy details.
2. Entailment generation
teaches the model to rewrite a summary which is a directed-logical subset of the input document.

A novel multi-task architecture:
1. high-level (semantic) layer-specific sharing across multiple encoder and decoder layers of the three tasks;
2. soft-sharing mechanisms.

12. Neural Document Summarization by Jointly Learning to Score and Select Sentences

Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, Tiejun Zhao
ACL2018
http://www.aclweb.org/anthology/P18-1061

Sentence scoring and sentence selection are two main steps in extractive document summarization systems. However, previous works treat them as two separated subtasks. In this paper, we present a novel end-to-end neural network framework for extractive document summarization by jointly learning to score and select sentences. It first reads the document sentences with a hierarchical encoder to obtain the representation of sentences. Then it builds the output summary by extracting sentences one by one. Different from previous methods, our approach integrates the selection strategy into the scoring model, which directly predicts the relative importance given previously selected sentences. Experiments on the CNN/Daily Mail dataset show that the proposed framework significantly outperforms the state-of-the-art extractive summarization models.

A model that integrates scoring and selection: sentences that have already been selected are not scored again at the next decoding step, which reduces repetition.

13. A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss

Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, Min Sun
ACL2018
http://aclweb.org/anthology/P18-1013

We propose a unified model combining the strength of extractive and abstractive summarization. On the one hand, a simple extractive model can obtain sentence-level attention with high ROUGE scores but less readable. On the other hand, a more complicated abstractive model can obtain word-level dynamic attention to generate a more readable paragraph. In our model, sentence-level attention is used to modulate the word-level attention such that words in less attended sentences are less likely to be generated. Moreover, a novel inconsistency loss function is introduced to penalize the inconsistency between two levels of attentions. By end-to-end training our model with the inconsistency loss and original losses of extractive and abstractive models, we achieve state-of-the-art ROUGE scores while being the most informative and readable summarization on the CNN/Daily Mail dataset in a solid human evaluation.

Sentence-level extractive summarization can achieve high ROUGE scores but the output is less readable, while word-level abstractive summarization produces more readable sentences. In this model, sentence-level attention is used to modulate word-level attention, so that words in sentences with low sentence-level attention are unlikely to be generated.
An inconsistency loss between the two levels of attention is used as a penalty during training; the resulting summaries are both informative and readable, and also score highly in human evaluation.
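A hedged sketch of the two-level attention interaction: word-level attention is modulated by the sentence-level attention of the containing sentence, and an inconsistency loss penalizes highly attended words that sit in weakly attended sentences. The exact loss form below (negative log of the mean modulated attention over the top-k words) is an approximation, not the paper's precise formulation.

```python
import torch

def modulated_attention(word_attn, sent_attn, word_to_sent):
    """word_attn:    (n,) word-level attention
       sent_attn:    (m,) sentence-level attention
       word_to_sent: (n,) index of the sentence containing each word."""
    scaled = word_attn * sent_attn[word_to_sent]
    return scaled / scaled.sum()          # renormalized word-level distribution

def inconsistency_loss(word_attn, sent_attn, word_to_sent, k=3):
    top = torch.topk(word_attn, k).indices                 # top-k attended words
    joint = word_attn[top] * sent_attn[word_to_sent[top]]  # agreement of the two levels
    return -torch.log(joint.mean() + 1e-12)

# toy usage
word_attn = torch.softmax(torch.randn(10), dim=0)
sent_attn = torch.softmax(torch.randn(3), dim=0)
word_to_sent = torch.tensor([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
print(inconsistency_loss(word_attn, sent_attn, word_to_sent))
```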

14. Retrieve, Rerank and Rewrite: Soft Template Based Neural Summarization

Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei
ACL2018
http://www4.comp.polyu.edu.hk/~cszqcao/data/IRSum.pdf

Most previous seq2seq summarization systems purely depend on the source text to generate summaries, which tends to work unstably. Inspired by the traditional template-based summarization approaches, this paper proposes to use existing summaries as soft templates to guide the seq2seq model. To this end, we use a popular IR platform to Retrieve proper summaries as candidate templates. Then, we extend the seq2seq framework to jointly conduct template Reranking and template-aware summary generation (Rewriting). Experiments show that, in terms of informativeness, our model significantly outperforms the state-of-the-art methods, and even soft templates themselves demonstrate high competitiveness. In addition, the import of high-quality external summaries improves the stability and readability of generated summaries.

15. Extractive Summarization with SWAP-NET: Sentences and Words from Alternating Pointer Networks

Aishwarya Jadhav, Vaibhav Rajan
ACL2018
http://aclweb.org/anthology/P18-1014

We present a new neural sequence-to-sequence model for extractive summarization called SWAP-NET (Sentences and Words from Alternating Pointer Networks). Extractive summaries comprising a salient subset of input sentences, often also contain important key words. Guided by this principle, we design SWAP-NET that models the interaction of key words and salient sentences using a new two-level pointer network based architecture. SWAP-NET identifies both salient sentences and key words in an input document, and then combines them to form the extractive summary. Experiments on large scale benchmark corpora demonstrate the efficacy of SWAP-NET that outperforms state-of-the-art extractive summarizers.

Uses a two-level alternating pointer network, at the sentence level and the word level, to identify salient sentences and key words and model their interaction, and then combines them into the extractive summary.

16. Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization

Shuming Ma, Xu SUN, Junyang Lin, Houfeng WANG
ACL2018

Most of the current abstractive text summarization models are based on the sequence-to-sequence model (Seq2Seq). The source content of social media is long and noisy, so it is difficult for Seq2Seq to learn an accurate semantic representation. Compared with the source content, the annotated summary is short and well written. Moreover, it shares the same meaning as the source content. In this work, we supervise the learning of the representation of the source content with that of the summary. In implementation, we regard a summary autoencoder as an assistant supervisor of Seq2Seq. Following previous work, we evaluate our model on a popular Chinese social media dataset. Experimental results show that our model achieves the state-of-the-art performances on the benchmark dataset.

The source content of social media is long and noisy, so it is hard for seq2seq to learn an accurate semantic representation. Compared with the source content, the annotated summary is short and well written, and it carries the same meaning as the source. The summaries are therefore used to supervise the learning of the source-content representation: in implementation, a summary autoencoder is used as an assistant supervisor of the Seq2Seq model.
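A hedged sketch of the joint objective implied above: the usual seq2seq loss plus a consistency term that pulls the encoder's source representation toward the summary autoencoder's representation. The squared-L2 distance and the weight lam are assumptions for illustration, not necessarily the paper's exact choice.

```python
import torch

def assisted_loss(seq2seq_nll, src_repr, summary_repr, lam=1.0):
    """src_repr:     (d,) encoder representation of the source content
       summary_repr: (d,) autoencoder representation of the gold summary."""
    consistency = torch.sum((src_repr - summary_repr) ** 2)  # supervision from the summary
    return seq2seq_nll + lam * consistency

# toy usage
loss = assisted_loss(torch.tensor(2.3), torch.randn(64), torch.randn(64))
```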

Dataset
Large Scale Chinese Social Media Text Summarization Dataset (LCSTS)

17. Reinforced Extractive Summarization with Question-Focused Rewards

Kristjan Arumae, Fei Liu
ACL2018
http://www.aclweb.org/anthology/P18-3015
We investigate a new training paradigm for extractive summarization. Traditionally, human abstracts are used to derive goldstandard labels for extraction units. However, the labels are often inaccurate, because human abstracts and source documents cannot be easily aligned at the word level. In this paper we convert human abstracts to a set of Cloze-style comprehension questions. System summaries are encouraged to preserve salient source content useful for answering questions and share common words with the abstracts. We use reinforcement learning to explore the space of possible extractive summaries and introduce a question-focused reward function to promote concise, fluent, and informative summaries. Our experiments show that the proposed method is effective. It surpasses state-of-the-art systems on the standard summarization dataset.