
Pretrained Language Models for Text Generation

Paper Name: Pretrained Language Models for Text Generation: A Survey
Type of Paper: Survey Paper
Paper URL
Paper titles for the citations mentioned below can be found at AI Papers with Heading; use the citation code to locate them.

Paper Summary: Pretrained Language Models for Text Generation
#

Paper Outcome
#

  • Define the general text generation task.
  • Describe the mainstream architectures of PLMs for text generation.
  • Explain how to adapt existing PLMs to model different input data and to satisfy special properties in the generated text.
  • Summarize several important fine-tuning strategies for text generation.

Ideas from the Paper
#

Main Ideas
#

  • This paper discusses “major advances achieved in the topic of PLMs for text generation”
  • This survey aims to provide “text generation researchers a synthesis” and pointers to related research.

General Ideas
#

  • Text generation has become one of the most important yet challenging tasks in natural language processing (NLP).
  • Neural generation models are deep learning models.
  • Pretrained language models (PLMs) are a class of neural generation models.

Task Types and Typical Applications
#

  • In most cases, text generation is conditioned on input data such as attributes, text, or structured data, denoted as X. Formally, the text generation task can be described as P(Y | X) = P(y_1, ..., y_j, ..., y_n | X); the standard autoregressive factorization is sketched after this list.
  • If X is not provided, or is a random noise vector z, the task degenerates into language modeling or unconditional generation (generating text without any constraint) Radford2019.
  • If X is a set of discrete attributes (e.g., topic words, sentiment labels), the task becomes topic-to-text generation or attribute-based generation. X plays the role of guiding the text generation. Keskar2019.
  • If X is structured data such as a knowledge graph or a table, the task is considered KG-to-text or table-to-text generation (generating descriptive text about structured data), collectively called data-to-text generation Li2021c.
  • If X is a multimedia input such as an image, the task becomes image captioning Xia2020.
  • If X is a multimedia input such as speech, the task becomes speech recognition Fan2019.
  • If X is a text sequence (the most common form), there are several applications such as machine translation, summarization, and dialogue systems.
  • Machine translation aims to translate text from one language into another language automatically Conneau2019
  • Summarization aims to generate a condensed summary of a long document Zhang2019b.
  • Dialogue systems aim to converse with humans using natural language. Wolf2019
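Although the excerpt above only states the conditional form, PLM-based generators typically factorize this probability autoregressively, predicting one token at a time given the input and the previously generated tokens. A minimal sketch of that standard factorization, in my own notation rather than the paper's:

```latex
P(Y \mid X) = \prod_{j=1}^{n} P\left(y_j \mid y_{<j},\, X\right)
```

Here y_{<j} denotes the tokens generated before position j; unconditional language modeling is the special case where X is empty or a random noise vector z.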

Architectures for Text Generation
#

  • Encoder-decoder Transformer. It consists of two stacks of Transformer blocks: the encoder is fed the input sequence, while the decoder generates the output sequence while attending to the encoder through the encoder-decoder attention mechanism.
  • Decoder-only Transformer. It employs a single stack of Transformer decoder blocks and applies unidirectional self-attention masking so that each token can only attend to previous tokens (see the masking sketch below).
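Below is a minimal sketch, written as an illustration rather than taken from the survey, of the unidirectional self-attention masking used by decoder-only Transformers. The function name and the single-head, weight-free formulation are simplifying assumptions.

```python
# Minimal sketch of unidirectional (causal) self-attention masking, NumPy only.
# Projection weights and multiple heads are omitted for brevity.
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention where token j may only attend to tokens <= j.

    x: array of shape (seq_len, d_model); queries, keys, and values are all x.
    """
    seq_len, d_model = x.shape
    scores = x @ x.T / np.sqrt(d_model)                      # attention logits
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)                  # block future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ x                                       # contextualized outputs

x = np.random.randn(5, 8)
print(causal_self_attention(x).shape)  # (5, 8)
```

The encoder in the encoder-decoder architecture simply skips the mask, so every input token can attend to the whole sequence.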

Modeling Different Data Types from Input
#

Unstructured Input
#

  • Hierarchical BERT to learn interactions between sentences with self-attention for document encoding. [Zhang2019b] and [Xu2020b]
  • To capture inter-sentential relations, DiscoBERT stacks a graph convolutional network (GCN) on top of BERT to model structural discourse graphs. [Xu2020a]
  • Cross-lingual language models (XLMs) for multilingual language understanding. [Conneau2019]
  • Text generation models can obtain effective input word embeddings even in a low-resource language [Wada2018].

Structured Input
#

  • PLMs are designed for sequential text rather than for structured or tabular data.
  • Several works incorporate PLMs for data-to-text generation, especially in few-shot settings. [Chen2020b] and [Gong2020]
  • To adapt to the sequential nature of PLMs, the input knowledge graph (KG) or abstract meaning representation (AMR) graph is linearized into a sequence of triples. [Ribeiro2020] and [Mager2020]
  • Introduced an additional graph encoder to encode the input KG. [Li2021b]
  • Template-based method to serialize the input table into a text sequence (see the linearization sketch after this list). [Gong2020]
    • For example, the attribute-value pair “name: jack reynolds” is serialized as the sentence “name is jack reynolds”. However, direct linearization loses the structural information of the original data, which may lead to generating unfaithful text about the data.
  • Auxiliary reconstruction task for recovering the structural information of input data, which can enhance the capacity of modeling structural information. [Gong2020]
  • The pointer generator mechanism is adopted to copy words from input knowledge data. [See2017] [Chen2020b].
  • Content matching loss for measuring the distance between the information in input data and the output text. [Gong2020]
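As a concrete illustration of the template-based serialization above, here is a minimal sketch in Python; it is an example of the idea rather than Gong et al.'s code, and the helper name and the second attribute-value pair are assumptions.

```python
# Minimal sketch: template-based linearization of attribute-value pairs into a
# text sequence that a PLM can consume, following the "name is jack reynolds"
# example from the survey.
def linearize_table(record):
    """Serialize a flat attribute-value record into a single input sentence."""
    return " ; ".join(f"{attribute} is {value}" for attribute, value in record.items())

# The second pair is a made-up attribute purely for illustration.
record = {"name": "jack reynolds", "occupation": "football player"}
print(linearize_table(record))
# -> "name is jack reynolds ; occupation is football player"
```

Because such a flat sentence drops the table structure, the survey notes that auxiliary reconstruction tasks, pointer-generator copying, and content-matching losses are used to compensate.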

Multimedia Input
#

  • Conducted pretraining for the video captioning task. VideoBERT [Sun2019b] and CBT [Sun2019a]
  • Used a shared multi-layer Transformer network for both encoding and decoding. Unified VLP [Zhou2020]
  • Pretrained the model on two masked language modeling (MLM) tasks, like the cloze tasks designed for sequence-to-sequence LM in UniLM [Dong2019].
  • Cross-modal pretrained model (XGPT) takes images as inputs and uses the image captioning task as the basic generative task in the pretraining stage. Xia2020
  • Image captioning, video captioning, and speech recognition are hungry for human-transcribed supervised data.
  • PLMs can be integrated for weakly supervised learning. For example,
    • An unsupervised approach pretrains an encoder-decoder model with unpaired speech and transcripts. [Fan2019]
    • Two pretraining stages extract acoustic and linguistic information from speech and transcripts, which is useful for the downstream speech recognition task.

Satisfying Special Properties for Output Text
#

  • Generated text should satisfy several key properties, such as relevance, faithfulness, and order-preservation.
  • Relevance. Relevance means that the topics of the output text are highly related to the input text, and generated responses should be relevant to the given condition. RNN-based models still tend to generate irrelevant output text and lack consistency with the input.
    • When applying PLMs to dialogue systems, TransferTransfo and DialoGPT were able to generate more relevant responses than RNN-based models. [Wolf2019] [Zhang2020]
    • Elaborated condition blocks are utilized to incorporate external conditions. BERT is used for both the encoder and decoder, with different input representations and self-attention masks to distinguish the source and target sides of the dialogue. On the target (generation) side, a new attention routing mechanism is adopted to generate context-related words. [Zeng2020]
    • An approach for non-conditioned dialogue. [Bao2020]
  • Faithfulness. Faithfulness means that the content in the generated text should not contradict the facts in the input text.
    • PLMs are potentially beneficial to generate faithful text by utilizing background knowledge.
    • Initialize the encoder and decoder with three outstanding PLMs, i.e., BERT, GPT and RoBERTa. [Rothe2020]
    • With pretraining, the models are more aware of the domain characteristics and less prone to language model vulnerabilities.
    • Decompose the decoder into a contextual network that retrieves relevant parts of the source document and a PLM that incorporates prior knowledge about language generation. [Kryscinski2018]
    • To generate faithful text in different target domains, PLMs are fine-tuned on the target domains through a theme modeling loss. [Yang2020b]
  • Order-preservation. Order-preservation denotes that the order of semantic units (word, phrase, etc.) in both input and output text is consistent.
    • When translating from source language to target language, keeping the order of phrases consistent in source language and target language will ensure the accuracy of the translation.
    • Code-Switching Pre-training (CSP) for machine translation. [Yang2020a]
      • Extracted the word-pair alignment information from the source and target languages.
      • Applied the extracted alignment information to enhance order preservation (see the sketch after this list).
    • Translation across multiple languages is called multilingual machine translation [Conneau2019].
    • mRASP (a technique of randomly aligned substitution), an approach to pretraining a universal multilingual machine translation model. [Lin2020]
    • Aligning word representations of each language makes it possible to keep word order consistent across multiple languages. Wada2018
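The code-switching and randomly-aligned-substitution ideas above can be illustrated with a minimal sketch; this is a toy example, not the CSP or mRASP implementation, and the function name, substitution probability, and alignment dictionary are assumptions.

```python
# Minimal sketch: build a code-switched sentence by randomly substituting source
# words with their aligned target-language translations, exposing the model to
# word-pair alignments (and their order) during pretraining.
import random

def randomly_aligned_substitution(tokens, alignment, swap_prob=0.3, seed=0):
    """tokens: source-language tokens; alignment: source word -> target word."""
    rng = random.Random(seed)
    return [
        alignment[tok] if tok in alignment and rng.random() < swap_prob else tok
        for tok in tokens
    ]

# Toy English -> German alignment for illustration only.
alignment = {"the": "das", "green": "grün", "house": "Haus"}
print(randomly_aligned_substitution("the green house is big".split(), alignment))
```

Because substituted words stay in their original positions, the pretraining signal encourages the model to keep word order consistent across languages.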

Summary from Introduction
#

  • Researchers have developed numerous techniques for a wide range of applications of text generation [Li2021a].
  • Machine translation generates text in a different language based on the source text [Yang2020a];
  • Summarization generates an abridged version of the source text to include salient information [Guan2020].
  • Text generation tasks based on
    • Recurrent neural networks (RNN) [Li2019],
    • Convolutional neural networks (CNN) [Gehring2017],
    • Graph neural networks (GNN) [Li2020],
    • Attention mechanism [Bahdanau2015].
  • One of the advantages of these neural models is that they enable end-to-end learning of semantic mappings from input to output in text generation.
  • Neural models are able to learn low-dimensional, dense vectors to implicitly represent linguistic features of text, which is also useful to alleviate data sparsity.
  • Deep neural networks usually have a large number of parameters to learn, which are likely to overfit on these small datasets and do not generalize well in practice.
  • The idea behind PLMs is to first pretrain the models in large-scale corpus and then finetune these models in various downstream tasks to achieve state-of-the-art results.
  • PLMs can encode a large amount of linguistic knowledge from corpus and induce universal representations of language.
  • PLMs are generally beneficial for downstream tasks and can avoid training a new model from scratch [Brown2020].
  • Prior surveys have provided a synthesis of the research on specific text generation subtasks, e.g., Zaib et al. [2020] and Guan et al. [2020].

Conclusion & Future Recommendations
#

Model Extension.

  • There are discrepancies between pretraining and downstream generation tasks. For example, the “[MASK]” token used in the pretraining stage is not used in the fine-tuning stage, which further aggravates the pretraining-finetuning discrepancy.
  • Design an appropriate pretraining paradigm for text generation.
  • Incorporating external knowledge into PLMs during pretraining has been shown to be effective Zhang2019c

Controllable Generation.

  • Controlling some attributes of the generated text has many useful applications such as generating positive response to patients with depression in dialogue systems.
  • PLMs are usually pretrained on universal corpora, which makes it difficult to control multi-grained attributes of the generated text (e.g., sentiment, topic, and coherence).
  • The control codes proposed by Keskar et al. [2019] are preset and coarse-grained.
  • Future work can explore multi-grained control and develop PLMs that are sufficiently steerable (see the sketch below).
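As a rough illustration of why preset control codes are coarse-grained, here is a minimal sketch of control-code conditioning; it is an illustrative example, not Keskar et al.'s CTRL code, and the code set, prompt format, and helper name are assumptions.

```python
# Minimal sketch: steer generation by prepending a preset control code to the
# prompt before passing it to a language model trained to condition on it.
CONTROL_CODES = {"positive", "negative", "sports", "politics"}  # hypothetical, coarse-grained

def build_controlled_prompt(control_code, prompt):
    """Prefix the prompt with a control code the PLM was trained to condition on."""
    if control_code not in CONTROL_CODES:
        raise ValueError(f"unknown control code: {control_code}")
    return f"<{control_code}> {prompt}"

print(build_controlled_prompt("positive", "The patient asked how the treatment was going, and"))
```

Each code steers only one coarse attribute at a time; fine-grained, multi-attribute control (sentiment plus topic plus coherence) is the open problem the survey highlights.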

Model Compression.

  • PLMs with large-scale parameters are challenging to deploy in resource-constrained environments.
  • It is worth studying how to achieve competitive performance with a small number of parameters.
  • Several methods have been proposed to compress PLMs, such as parameter sharing [Lan2020] - ALBERT (see the sketch after this list)
  • Knowledge distillation [Sanh2019] - DistilBERT
  • Compress PLMs for text generation.
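A minimal PyTorch sketch of the cross-layer parameter-sharing idea behind ALBERT follows; it is an illustration, not the ALBERT implementation, and the class name and layer sizes are assumptions.

```python
# Minimal sketch: reuse a single Transformer encoder layer at every depth so the
# parameter count stays constant no matter how many layers are stacked.
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, d_model=256, n_layers=12):
        super().__init__()
        # One layer instance; applying it repeatedly shares its parameters.
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.n_layers = n_layers

    def forward(self, x):
        for _ in range(self.n_layers):
            x = self.shared_layer(x)
        return x

x = torch.randn(2, 16, 256)           # (batch, seq_len, d_model)
print(SharedLayerEncoder()(x).shape)  # torch.Size([2, 16, 256])
```

Knowledge distillation, as in DistilBERT, is a complementary compression route and is sketched in the fine-tuning section below.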

Fine-tuning Exploration:

  • The direct intention of pretraining is to distill the linguistic knowledge learned in PLMs to downstream generation tasks.
  • Various ways to transfer knowledge from PLMs to downstream models.
  • Chen et al. [2020a] exploited knowledge distillation by adopting BERT as the teacher model and a vanilla RNN generation model as the student model (see the sketch below).
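A minimal sketch of the generic knowledge-distillation recipe follows; it is an assumption about the usual soft-target setup, not Chen et al.'s actual method, and the function name, temperature, and toy tensors are illustrative.

```python
# Minimal sketch: distill soft targets from a large teacher (e.g., BERT) into a
# small student (e.g., an RNN generator) with a temperature-scaled KL loss on
# the output distributions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy usage: a batch of 4 positions over a 10-token vocabulary.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```

In practice this soft-target loss is combined with the usual generation loss on ground-truth tokens.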

Language-agnostic PLMs:

  • PLMs for text generation are mainly based on English. These PLMs will encounter challenges when dealing with non-English generation tasks.
  • Language-agnostic PLMs are worth investigating; they need to capture universal linguistic and semantic features across different languages.
  • An interesting direction is how to reuse existing English-based PLMs for text generation in non-English languages.

Ethical Concern:

  • Currently, PLMs are pretrained on large-scale corpora crawled from the web without fine-grained filtering, potentially causing ethical issues such as generating private content about users. Therefore, researchers should do their best to prevent the misuse of PLMs.
  • Identifying threats and potential impacts and assessing likelihood. Ross [2012]
  • The text generated by PLMs might be prejudiced, which is in line with the bias in training data along the dimensions of gender, race, and religion [Brown2020].

Author
Dr. Hari Thapliyal
dasarpai.com
linkedin.com/in/harithapliyal

