A Survey of Document Understanding Models

The past three years have seen significant interest in applying language models to the task of visual document understanding – integrating spatial, textual, and visual signals to make sense of PDFs, web pages, and scanned documents. We'll trace the development of this niche through the lens of 8 published works.

LAMBERT: Layout-Aware language Modeling using BERT for information extraction by Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Michał Turski, Filip Graliński
BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding by Timo I. Denk, Christian Reisswig
LayoutLM: Pre-training of Text and Layout for Document Image Understanding by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer by Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka
BROS: A pre-trained language model for understanding texts in document by Teakgyu Hong, DongHyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park
VisualMRC: Machine Reading Comprehension on Document Images by Ryota Tanaka, Kyosuke Nishida, Sen Yoshida
DocFormer: End-to-End Transformer for Document Understanding by Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

Model Architecture

Document understanding models are largely BERT-style encoder models. Most models incorporate spatial, textual, and visual information -- although a few exceptions exclude visual features.

Model	Model Type	Data Modalities
LAMBERT	Encoder	Text, Layout
BERTgrid	Encoder	Text, Layout
LayoutLM	Encoder	Text, Layout, Visual
LayoutLMv2	Encoder	Text, Layout, Visual
TILT	Encoder-Decoder	Text, Layout, Visual
BROS	Encoder	Text, Layout
LayoutT5, LayoutBART	Encoder-Decoder	Text, Layout, Visual
DocFormer	Encoder	Text, Layout, Visual

Model Pretraining

Novel pre-training objectives are described in each model's brief summary.

Model	Pre-training Loss	Pre-training Datasets
LAMBERT	BERT-like MLM	Subset of Common Crawl PDFs
BERTgrid	N/A	N/A
LayoutLM	BERT-like MLM, Document Classification	IIT-CDIP
LayoutLMv2	BERT-like MLM, Text-Image Alignment, Text-Image Matching	IIT-CDIP
TILT	T5-like MLM	IIT-CDIP, UCSF Industry Document Library, Subset of Common Crawl PDFs
BROS	BERT-like MLM	IIT-CDIP
LayoutT5, LayoutBART	N/A	VisualMRC
DocFormer	BERT-like MLM, Text-to-Image Reconstruction, "Text Describes Image" Loss	IIT-CDIP

Reproducibility

This table tracks how easy or difficult it would be to reproduce or build upon the results reported in each paper.

Model	OCR	Public Code	Pre-trained Model Available?
LAMBERT	Tesseract 4.1	Yes	No
BERTgrid	Unspecified	No	No
LayoutLM	Tesseract	Yes	Yes
LayoutLMv2	Microsoft Read API	Yes	Yes
TILT	Mix (Textract, Azure OCR)	No	No
BROS	In-house	No	No
LayoutT5, LayoutBART	Tesseract	No	No
DocFormer	Textract	No	No

Trends and Observations

Reading Order:

Several papers (LAMBERT, BROS) endeavor to remove the need for 1D position information (reading order) supplied by an OCR system or heuristic. In all cases, however, ablations report that 1D reading order remains beneficial in addition to 2D position. This is problematic because it makes the model's behavior dependent on the OCR system used: different OCR engines are liable to infer significantly different reading orders for forms and other documents with complex layouts. I'm hopeful that larger pre-training datasets will eventually shrink the gap and remove the need for explicit reading order.
TILT, LayoutT5, and BROS partially resolve reading order issues by relying on auto-regressive or graph-based decoding schemes. In a sequence labeling setting, this means even if the reading order of the source document is incorrectly inferred (e.g. in the context of a table), it's plausible the model could produce an output with the correct reading order.
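
To make the OCR dependence concrete, here is a toy sketch of how two plausible serialization heuristics produce different 1D orders for the same page. The `Word` type, both heuristics, and the `column_split` parameter are hypothetical illustrations of mine, not taken from any OCR engine or from the papers above.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    left: int
    top: int


def row_major(words):
    """Read strictly top-to-bottom, then left-to-right (ignores columns)."""
    return [w.text for w in sorted(words, key=lambda w: (w.top, w.left))]


def column_major(words, column_split):
    """Read everything left of `column_split` first, then the right column."""
    key = lambda w: (w.left >= column_split, w.top, w.left)
    return [w.text for w in sorted(words, key=key)]


# A two-column page: form fields on the left, an unrelated note on the right.
words = [
    Word("Name:", 0, 0), Word("Alice", 60, 0), Word("Terms", 300, 0),
    Word("Date:", 0, 20), Word("2021", 60, 20), Word("apply", 300, 20),
]

print(row_major(words))
# ['Name:', 'Alice', 'Terms', 'Date:', '2021', 'apply']
print(column_major(words, column_split=200))
# ['Name:', 'Alice', 'Date:', '2021', 'Terms', 'apply']
```

A model that leans on 1D position sees "Terms" adjacent to "Alice" under the first heuristic and adjacent to "2021" under the second, even though the 2D layout is identical.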

Visual Features + Fusion

All models except for LAMBERT, BERTgrid, and BROS include some amount of visual information in addition to layout information. Where text, layout, and image information are combined varies significantly: LAMBERT fuses layout information with text at the model input, DocFormer preserves a separate set of tokens for each data modality but permits interaction between modalities at each layer, and the original LayoutLM fuses image information only immediately before the output layer.

Datasets

IIT-CDIP is most commonly used for model pre-training. This is a bit disconcerting if we are aiming to learn general document representation – IIT-CDIP is exclusively concerned with the tobacco industry, and although there is significant variety in document layout the textual signal is more narrow-domain than is desirable. As this subfield is still in its infancy, it's a sure bet that there are easy improvements possible by merely increasing dataset breadth and scale.

Open Source Contributions

Only LayoutLM and LayoutLMv2 currently include public model checkpoints. Compared to the broader NLP field, researchers seem less willing to make document understanding work open source, perhaps because of the proximity to commercial application.

Key Contributions by Paper:

LAMBERT: Layout-Aware language Modeling using BERT for information extraction

by Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Michał Turski, Filip Graliński

Adapts a pre-trained RoBERTa checkpoint to add layout features and relative 2D position embeddings via model surgery.
Strong empirical performance for parameter count.
One of the few models to publish source code, and one of the few document understanding models not to incorporate the IIT-CDIP corpus, instead opting to rely on Common Crawl.

BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding

by Timo I. Denk, Christian Reisswig

Encodes the text of the document using BERT, then stores the resulting contextual embeddings at the spatial location of each word. BERTgrid then takes a semantic-segmentation or bounding-box regression approach to solve downstream tasks.
Builds on earlier work, CharGrid, which used one-hot character embeddings in place of BERT embeddings but otherwise followed the same methodology.
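
The grid construction described above can be sketched in a few lines. This is a minimal illustration under my own assumptions: a tiny grid, integer `(left, top, right, bottom)` boxes in grid-cell coordinates, and hand-written two-dimensional vectors standing in for BERT embeddings. The function name `build_grid` is hypothetical.

```python
def build_grid(words, height, width, dim):
    """Place each word's embedding at every grid cell its bounding box covers.

    `words` is a list of (embedding, (left, top, right, bottom)) pairs;
    cells not covered by any word keep an all-zero vector.
    """
    grid = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for embedding, (left, top, right, bottom) in words:
        for y in range(top, bottom):
            for x in range(left, right):
                grid[y][x] = list(embedding)
    return grid


# Two "words" with toy 2-d embeddings standing in for contextual BERT vectors.
words = [
    ([1.0, 0.0], (0, 0, 2, 1)),  # occupies row 0, columns 0-1
    ([0.0, 1.0], (3, 1, 5, 2)),  # occupies row 1, columns 3-4
]
grid = build_grid(words, height=2, width=5, dim=2)
```

The result is an image-like tensor (height x width x embedding size), which is why segmentation and detection architectures can be applied to it directly.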

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou

Fully open-source (code + model checkpoints), and trained on outputs from an open-source OCR engine (Tesseract).
Learns a separate embedding for the top, left, right, bottom, width, and height attributes of each token bounding box.
Separately computes visual features for each token bounding box using a Faster R-CNN model.
Image features and token features are combined just prior to the task-specific head.
Uses a masked language modeling loss in conjunction with a document-level classification loss (using metadata from the IIT-CDIP dataset).
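
As a rough sketch of the per-attribute layout embeddings described above: each bounding-box attribute indexes its own lookup table, and the results are summed with the token embedding. The table size, the assumption that coordinates are integers in a 0 to 1000 range, the toy embedding width, and the random tables (standing in for learned weights) are all my own choices for illustration.

```python
import random

COORD_RANGE = 1001  # assumption: coordinates are integers in [0, 1000]
DIM = 4             # toy embedding width
ATTRIBUTES = ("left", "top", "right", "bottom", "width", "height")

rng = random.Random(0)


def make_table(rows, dim):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


# One independent table per bounding-box attribute.
layout_tables = {name: make_table(COORD_RANGE, DIM) for name in ATTRIBUTES}


def layout_embedding(token_embedding, box):
    """Sum a token embedding with one embedding per bounding-box attribute."""
    left, top, right, bottom = box
    values = {
        "left": left, "top": top, "right": right, "bottom": bottom,
        "width": right - left, "height": bottom - top,
    }
    out = list(token_embedding)
    for name in ATTRIBUTES:
        row = layout_tables[name][values[name]]
        out = [a + b for a, b in zip(out, row)]
    return out


vec = layout_embedding([0.0] * DIM, box=(100, 200, 180, 220))
```

Because the layout signal is simply added to the token embedding, the rest of the transformer stack needs no architectural changes.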

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou

LayoutLMv2 resolves some key shortcomings of LayoutLM, leveraging visual "tokens" and permitting attention between textual and visual features rather than performing fusion immediately prior to the task-specific head.
In addition, LayoutLMv2 introduces two new pre-training objectives to be used in conjunction with the masked language modeling loss. The first, text-image alignment, involves predicting whether a given token has been masked out in the image of the document, and is introduced to encourage learning the correspondence between textual and visual features. The second, text-image matching, involves predicting whether the document image supplied matches the textual content and can be interpreted as a contrastive loss.
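
To illustrate the text-image alignment objective, here is a small sketch of how per-token labels could be derived once parts of the page image have been blanked out. It assumes rectangular covered regions and `(left, top, right, bottom)` boxes; the function names are hypothetical and this is not the authors' implementation.

```python
def inside(box, region):
    """True if `box` lies entirely within `region` (left, top, right, bottom)."""
    return (
        box[0] >= region[0] and box[1] >= region[1]
        and box[2] <= region[2] and box[3] <= region[3]
    )


def alignment_labels(token_boxes, covered_regions):
    """Return 1 for tokens whose image area was blanked out, else 0."""
    return [
        int(any(inside(box, region) for region in covered_regions))
        for box in token_boxes
    ]


token_boxes = [(0, 0, 40, 10), (50, 0, 90, 10), (0, 20, 40, 30)]
covered_regions = [(0, 0, 100, 10)]  # the first text line is blanked out

labels = alignment_labels(token_boxes, covered_regions)
print(labels)  # [1, 1, 0]
```

The text tokens themselves are left intact, so the model can only answer correctly by checking each token against the corresponding image region.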

Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

by Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka

First document understanding model to employ an encoder-decoder architecture
Uses a T5-like masked-span language modeling loss: contiguous spans of input tokens are replaced with sentinel tokens, and the decoder is trained to generate the missing spans (the objective introduced in "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer").
Uses relative positional biases based on 1D and 2D position.
For downstream question answering and key information extraction tasks, the model directly outputs token sequences (in an order that may be different from the order tokens were fed to the encoder).
TILT also performs affine augmentations on token image inputs, in addition to augmentations that vary the distance between tokens to help TILT be robust to these sorts of transformations.
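
For readers unfamiliar with the T5-style masked-span objective, the sketch below shows how an input/target pair is formed from a token sequence. The function `span_corrupt`, the hand-picked spans, and the `<extra_id_N>` sentinel naming are illustrative assumptions; in real pre-training the spans are sampled randomly, and TILT additionally feeds layout and image features that are omitted here.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (inputs, targets).

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # span is hidden from the encoder
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # decoder must produce the span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, spans=[(2, 4), (8, 9)])

print(" ".join(inputs))
# Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(targets))
# <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```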

BROS: A pre-trained language model for understanding texts in document

by Teakgyu Hong, DongHyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park

The main contribution of the BROS language model is the use of the graph-based "SPADE" decoder during finetuning. For sequence labeling tasks, SPADE permits outputting sequences of tokens from the source document in an order different from the serialization used in the encoder by jointly predicting span starts and a matrix of directed edges.
BROS uses sinusoidal embeddings for 2D position embeddings (similar to LAMBERT) and an MLM loss.
The authors highlight that their model degrades more slowly in the presence of poor OCR reading order than competing methods (e.g. LayoutLM).
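
The idea of decoding from span starts plus a matrix of directed edges can be sketched as follows. This is a simplified stand-in of my own, not the SPADE implementation: it assumes hard 0/1 predictions, takes the first outgoing edge from each token, and the function name `decode_sequences` is hypothetical.

```python
def decode_sequences(tokens, is_start, edges):
    """Follow predicted next-token edges from each predicted start token.

    edges[i][j] == 1 means token j is predicted to follow token i.
    """
    sequences = []
    for start, flag in enumerate(is_start):
        if not flag:
            continue
        sequence, current, seen = [], start, set()
        while current is not None and current not in seen:
            seen.add(current)  # guard against cycles in noisy predictions
            sequence.append(tokens[current])
            successors = [j for j, linked in enumerate(edges[current]) if linked]
            current = successors[0] if successors else None
        sequences.append(sequence)
    return sequences


# The serialized (OCR) order interleaves the contents of two table cells.
tokens = ["Total", "Due", "amount", "date"]
is_start = [1, 1, 0, 0]
edges = [
    [0, 0, 1, 0],  # Total -> amount
    [0, 0, 0, 1],  # Due -> date
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

print(decode_sequences(tokens, is_start, edges))
# [['Total', 'amount'], ['Due', 'date']]
```

Because the output order comes from the edges rather than from token indices, the decoded spans are correct even though the encoder saw the tokens in a scrambled order.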

VisualMRC: Machine Reading Comprehension on Document Images (LayoutT5, LayoutBART)

by Ryota Tanaka, Kyosuke Nishida, Sen Yoshida

The main contribution is a reading comprehension dataset based on Wikipedia that requires layout understanding (VisualMRC).
The models build on pre-trained T5 and BART backbones.
In addition to visual representations of each token generated by a Faster R-CNN model, visual representations of 100 regions of interest are included.
Like TILT, uses an auto-regressive decoding scheme that decouples the model from input reading order.

Architecture for LayoutT5 and LayoutBART models

DocFormer: End-to-End Transformer for Document Understanding

by Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

DocFormer keeps data modalities largely separate throughout the transformer stack and re-injects visual and spatial features at each layer as a kind of residual connection.
Uses only the first 4 layers of a ResNet50 for visual feature representations, and directly integrates with the transformer portion of the model, permitting end-to-end training.
Introduces a multi-modal cross-attention mechanism to permit sharing of information across data modalities.
Introduces an image reconstruction loss where image and text features are used to reconstruct the source document image, along with a binary classification loss that predicts whether text was paired with the correct document image or with a negative sample.
Empirical results indicate that DocFormer punches above its weight class, scoring better than significantly larger models on the RVL-CDIP document classification task.
Performs extensive ablations on architecture decisions, pre-training objectives, and effect of including spatial / visual data modalities.

Hope this was a useful overview of this developing subfield! Feel free to email me at madison@pragmatic.ml if you notice any errata or have suggestions for how I can make surveys like this one more useful.

Madison May
