The past three years have seen significant interest in applying language models to the task of visual document understanding – integrating spatial, textual, and visual signals to make sense of PDFs, web pages, and scanned documents. We'll trace the development of this niche through the lens of 8 published works.
- LAMBERT: Layout-Aware language Modeling using BERT for information extraction by Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Michał Turski, Filip Graliński
- BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding by Timo I. Denk, Christian Reisswig
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou
- LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou
- Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer by Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka
- BROS: A pre-trained language model for understanding texts in document by Teakgyu Hong, DongHyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park
- VisualMRC: Machine Reading Comprehension on Document Images by Ryota Tanaka, Kyosuke Nishida, Sen Yoshida
- DocFormer: End-to-End Transformer for Document Understanding by Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha
Document understanding models are largely BERT-style encoder models. Most models incorporate spatial, textual, and visual information -- although a few exceptions exclude visual features.
|Model||Model Type||Data Modalities|
|LayoutLM||Encoder||Text, Layout, Visual|
|LayoutLMv2||Encoder||Text, Layout, Visual|
|TILT||Encoder-Decoder||Text, Layout, Visual|
|LayoutT5, LayoutBART||Encoder-Decoder||Text, Layout, Visual|
|DocFormer||Encoder||Text, Layout, Visual|
Novel pre-training objectives are described in each model's brief summary.
|Model||Pre-training Loss||Pre-training Datasets|
|LAMBERT||BERT-like MLM||Subset of Common Crawl PDFs|
UCSF Industry Document Library,
Subset of Common Crawl PDFs
"Text Describes Image" Loss
This table tracks how easy or difficult it would be to reproduce or build upon the results reported in each paper.
|Model||OCR||Public Code||Pre-trained Model Available?|
|LayoutLMv2||Microsoft Read API||Yes||Yes|
|TILT||Mix (Textract, Azure OCR)||No||No|
Trends and Observations
- Several papers (LAMBERT, BROS) endeavor to remove the need for 1D-position information (reading order) supplied by an OCR system or heuristic. In all cases ablations report that 1D reading order is beneficial in addition to 2D-position. This is problematic, in that it means the model's behavior is dependent on the OCR system used (as different OCR engines are liable to infer significantly different reading orders for forms and other documents with complex layout). I'm hopeful dataset scale will be sufficient to shrink the gap and supplant the need to include explicit reading order.
- TILT, LayoutT5, and BROS partially resolve reading order issues by relying on auto-regressive or graph-based decoding schemes. In a sequence labeling setting, this means even if the reading order of the source document is incorrectly inferred (e.g. in the context of a table), it's plausible the model could produce an output with the correct reading order.
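To make the reading-order concern concrete, here is a minimal sketch of two plausible serialization heuristics applied to the same two-column form. The word list, the coordinates, and the `line_height` and `column_split` parameters are all hypothetical and are not taken from any of the papers or from a real OCR engine.

```python
# Hypothetical OCR output for a two-column form: (text, x_left, y_top).
words = [
    ("Name:", 10, 10), ("Alice", 60, 10), ("Date:", 300, 12), ("2021", 350, 12),
    ("Role:", 10, 30), ("Editor", 60, 30), ("Page:", 300, 32), ("3", 350, 32),
]

def row_major(words, line_height=20):
    """Read top-to-bottom, then left-to-right, ignoring columns."""
    ordered = sorted(words, key=lambda w: (w[2] // line_height, w[1]))
    return [text for text, _, _ in ordered]

def column_major(words, column_split=200, line_height=20):
    """Read the left column fully before moving to the right column."""
    ordered = sorted(
        words, key=lambda w: (w[1] >= column_split, w[2] // line_height, w[1])
    )
    return [text for text, _, _ in ordered]

order_a = row_major(words)
order_b = column_major(words)

# The 1D position id of the same token depends on the heuristic used.
print(order_a.index("Date:"), order_b.index("Date:"))
```

The same token receives a different 1D position under each heuristic, which is the kind of discrepancy a model trained on one OCR engine's serialization could be sensitive to.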
Visual Features + Fusion
- All models except for LAMBERT and BERTgrid include some amount of visual information in addition to layout information. Where text, layout, and image information are combined varies significantly: LAMBERT fuses layout information with text at model input, DocFormer preserves a separate set of tokens for each data modality but permits interaction between modalities at each layer, and the initial LayoutLM fused image information only immediately before the output layer.
- IIT-CDIP is most commonly used for model pre-training. This is a bit disconcerting if we are aiming to learn general document representation – IIT-CDIP is exclusively concerned with the tobacco industry, and although there is significant variety in document layout the textual signal is more narrow-domain than is desirable. As this subfield is still in its infancy, it's a sure bet that there are easy improvements possible by merely increasing dataset breadth and scale.
Open Source Contributions
- Only LayoutLM and LayoutLMv2 currently include public model checkpoints. Compared to the broader NLP field, researchers seem less willing to make document understanding work open source, perhaps because of the proximity to commercial application.
Key Contributions by Paper:
LAMBERT
- Adapts a pre-trained RoBERTa checkpoint to add layout features and relative 2D position embeddings via model surgery.
- Strong empirical performance for parameter count.
- One of the few models to publish source code, and one of the few document understanding models not to incorporate the IIT-CDIP corpus, instead opting to rely on Common Crawl.
BERTgrid
- Encodes the text of the document using BERT, then stores the resulting contextual embeddings at the spatial location of each word. BERTgrid then takes a semantic-segmentation or bounding-box regression approach to solve downstream tasks.
- Builds on earlier work, CharGrid, which used one-hot character embeddings in place of BERT embeddings but otherwise followed the same methodology.
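The BERTgrid idea of storing each word's contextual embedding at its spatial location can be sketched roughly as follows. This is only an illustration: the toy 2-dimensional vectors stand in for BERT outputs, the boxes are assumed to already be expressed in grid-cell units, and filling every cell covered by a word's box is my assumption about how the grid is populated.

```python
def build_grid(words, grid_height, grid_width, dim):
    """Place each word's embedding at every grid cell covered by its box.

    words: list of (embedding, (x0, y0, x1, y1)) with half-open boxes
    given in grid-cell units. Cells not covered by any word stay zero.
    """
    grid = [[[0.0] * dim for _ in range(grid_width)] for _ in range(grid_height)]
    for embedding, (x0, y0, x1, y1) in words:
        for row in range(y0, y1):
            for col in range(x0, x1):
                grid[row][col] = list(embedding)
    return grid

# Toy stand-ins for contextual embeddings of two words on the first row.
toy_words = [
    ([1.0, 0.0], (0, 0, 2, 1)),
    ([0.0, 1.0], (3, 0, 5, 1)),
]
toy_grid = build_grid(toy_words, grid_height=2, grid_width=6, dim=2)
```

The resulting height x width x dim tensor can then be handed to an image-style model, which is what makes the segmentation and bounding-box regression framing possible.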
LayoutLM
- Fully open-source (code + model checkpoints), and trained on outputs from an open source OCR engine (Tesseract)
- Learns a separate embedding for top, left, right, bottom, width, and height attributes of each token bounding box.
- Separately computes visual features for each token bounding box using a Faster-RCNN model.
- Image features and token features are combined just prior to the task specific head.
- Uses a masked language modeling loss in conjunction with a document-level classification loss (using metadata from the IIT-CDIP dataset)
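As a rough illustration of the per-attribute layout embeddings described for LayoutLM, the sketch below derives the six attributes from a token bounding box and sums one embedding lookup per attribute. The 0 to 1000 coordinate scale, the randomly initialized tables, and the choice to sum the six vectors are assumptions made for the example, not a description of the released implementation.

```python
import random

ATTRIBUTES = ("left", "top", "right", "bottom", "width", "height")

def layout_attributes(box, page_width, page_height, scale=1000):
    """Map a pixel-space box (x0, y0, x1, y1) to six integer attributes."""
    x0, y0, x1, y1 = box
    left = x0 * scale // page_width
    top = y0 * scale // page_height
    right = x1 * scale // page_width
    bottom = y1 * scale // page_height
    return {
        "left": left, "top": top, "right": right, "bottom": bottom,
        "width": right - left, "height": bottom - top,
    }

def make_tables(dim, scale=1000, seed=0):
    """One (hypothetical, randomly initialized) embedding table per attribute."""
    rng = random.Random(seed)
    return {
        name: [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(scale + 1)]
        for name in ATTRIBUTES
    }

def layout_embedding(attrs, tables):
    """Sum the six attribute embeddings into a single layout vector."""
    dim = len(tables["left"][0])
    total = [0.0] * dim
    for name in ATTRIBUTES:
        row = tables[name][attrs[name]]
        total = [t + r for t, r in zip(total, row)]
    return total

attrs = layout_attributes((100, 50, 300, 80), page_width=1000, page_height=500)
tables = make_tables(dim=4)
vector = layout_embedding(attrs, tables)
```

In a full model, a vector like this would be combined with the token's text embedding before the transformer layers; in training, the tables would be learned rather than random.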
LayoutLMv2
- LayoutLMv2 resolves some key shortcomings of LayoutLM, leveraging visual "tokens" and permitting attention between textual and visual features rather than performing fusion immediately prior to the task-specific head.
- In addition, LayoutLMv2 introduces two new pre-training objectives to be used in conjunction with the masked language modeling loss. The first, text-image alignment, involves predicting whether a given token has been masked out in the image of the document, and is introduced to encourage the model to learn the correspondence between textual and visual features. The second, text-image matching, involves predicting whether the document image supplied matches the textual content and can be interpreted as a contrastive loss.
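To show what the labels for the text-image alignment objective might look like, here is a small sketch that picks some text lines to cover in the image and labels each token by whether its line was covered. Covering whole lines, the `cover_fraction` value, and the function name are assumptions for illustration; the actual image masking step is omitted.

```python
import random

def text_image_alignment_labels(token_line_ids, cover_fraction=0.15, seed=0):
    """Choose lines to cover in the image; label tokens 1 if covered, else 0.

    token_line_ids: the text-line index of each token, in token order.
    Returns the set of covered line ids and one binary label per token.
    """
    rng = random.Random(seed)
    lines = sorted(set(token_line_ids))
    n_cover = max(1, round(len(lines) * cover_fraction))
    covered_lines = set(rng.sample(lines, n_cover))
    labels = [1 if line in covered_lines else 0 for line in token_line_ids]
    return covered_lines, labels

# Eight tokens spread across four text lines.
line_ids = [0, 0, 1, 1, 1, 2, 3, 3]
covered, labels = text_image_alignment_labels(line_ids)
```

The text tokens themselves are left untouched here; only the image side would be altered, so the model has to consult the visual features to predict the labels.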
TILT
- First document understanding model to employ an encoder-decoder architecture
- Uses a T5-like masked-span language modeling loss
- Uses relative positional biases based on 1D and 2D position
- For downstream question answering and key information extraction tasks, the model directly outputs token sequences (in an order that may be different from the order tokens were fed to the encoder).
- TILT also applies affine augmentations to token image inputs, along with augmentations that vary the distance between tokens, to make the model robust to these sorts of transformations.
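The T5-like masked-span objective mentioned for TILT can be illustrated on the text side alone. The sketch below replaces chosen spans with sentinel tokens in the encoder input and asks the decoder to emit each sentinel followed by the tokens it hides. The `<extra_id_N>` sentinel naming follows the common T5 convention; the example sentence and span choices are made up, and how TILT treats the layout and image inputs of masked tokens is not modeled here.

```python
def mask_spans(tokens, spans):
    """Build (encoder_input, decoder_target) for span corruption.

    spans: sorted, non-overlapping, half-open (start, end) index pairs.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the unmasked tokens
        inputs.append(sentinel)              # one sentinel replaces the span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # decoder must recover the span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets

tokens = "Invoice total due : 42 USD by Friday".split()
encoder_input, decoder_target = mask_spans(tokens, [(1, 3), (4, 5)])
```

Because the target is a generated sequence rather than a per-token label, the same decoder can later be reused for question answering and extraction outputs.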
BROS
- The main contribution of the BROS language model is the use of the graph-based "SPADE" decoder during finetuning. For sequence labeling tasks, SPADE permits outputting sequences of tokens from the source document in an order different from the serialization used in the encoder by jointly predicting span starts and a matrix of directed edges.
- BROS uses sinusoidal embeddings for 2D position embeddings (similar to LAMBERT) and an MLM loss
- The authors highlight that their model degrades more slowly in the presence of poor OCR reading order than competing methods (e.g. LayoutLM)
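A heavily simplified sketch of the decoding idea behind a graph-based decoder like SPADE: given predicted span starts and a matrix of directed "next token" edges, each output span is read off by following edges from a start token, regardless of the order in which tokens were serialized. The binary predictions, the single-successor assumption, and the example tokens are hypothetical; the real decoder works from model scores and is more involved.

```python
def decode_spans(tokens, is_start, edges):
    """Follow directed edges from each predicted start token.

    is_start[i] is truthy if token i begins a span.
    edges[i][j] is truthy if token j should directly follow token i.
    """
    spans = []
    for start, flag in enumerate(is_start):
        if not flag:
            continue
        span, current, visited = [], start, set()
        while current is not None and current not in visited:
            visited.add(current)  # guard against cycles in the predictions
            span.append(tokens[current])
            successors = [j for j, linked in enumerate(edges[current]) if linked]
            current = successors[0] if successors else None
        spans.append(" ".join(span))
    return spans

# Tokens serialized in a scrambled reading order.
tokens = ["Street", "42", "Main", "Total:", "$10"]
is_start = [0, 1, 0, 0, 0]
edges = [[0] * 5 for _ in range(5)]
edges[1][2] = 1  # "42" -> "Main"
edges[2][0] = 1  # "Main" -> "Street"
addresses = decode_spans(tokens, is_start, edges)
```

Here the address is recovered in the right order even though "Street" came first in the serialized input.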
VisualMRC (LayoutT5, LayoutBART)
- Main contribution is a reading comprehension dataset based on Wikipedia that requires layout understanding (VisualMRC)
- Builds on pre-trained T5 and BART backbones
- In addition to visual representations of each token generated by a Faster-RCNN model, visual representations of 100 regions of interest are included
- Like TILT, uses an auto-regressive decoding scheme that decouples the model from input reading order.
DocFormer
- DocFormer keeps data modalities largely separate throughout the transformer stack and re-injects visual and spatial features at each layer as a kind of residual connection.
- Uses only the first 4 layers of a ResNet50 for visual feature representations, and directly integrates with the transformer portion of the model, permitting end-to-end training.
- Introduces a multi-modal cross-attention mechanism to permit sharing of information across data modalities.
- Introduces an image reconstruction loss where image and text features are used to reconstruct the source document image, along with a binary classification loss that predicts whether text was paired with the correct document image or with a negative sample.
- Empirical results indicate that DocFormer punches above its weight class, scoring better than significantly larger models on the RVL-CDIP document classification task.
- Performs extensive ablations on architecture decisions, pre-training objectives, and effect of including spatial / visual data modalities.
Hope this was a useful overview of this developing subfield! Feel free to email me at email@example.com if you notice any errata or have suggestions for how I can make surveys like this one more useful.