In 2023, semANT brought its first tangible fruit: the TextBite software package. TextBite provides semantic layout analysis on top of plain OCR output. It enhances the PAGE XML description of an analyzed page by introducing title elements, clustering text lines into semantically related parts (chapters, articles, dictionary entries, …), establishing a reading order, and adjusting existing regions as needed. All of this new information is stored in the way prescribed by the PAGE standard, so it remains available for further processing.

Fig.1: Application of TextBite to a new input document. First, Pero OCR is used to obtain the basic layout of the document; TextBite is then applied to extract logical elements.

Technical solution

The core of TextBite is a detector model based on YOLOv8. This detector identifies logical chunks directly in the image of the page. These detections are then merged with the available region and textline information to provide an enhanced page representation.
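To make the "region and textline information" concrete, here is a minimal sketch of reading text-line bounding boxes from a PAGE XML document using only the Python standard library. It is not TextBite's own code: the helper names (`read_text_lines`, `polygon_bbox`) and the tiny sample document are illustrative assumptions, and the sketch reduces each line polygon to an axis-aligned box for simplicity.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written PAGE XML document, used only for illustration.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.jpg" imageWidth="1000" imageHeight="1400">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1-l1">
        <Coords points="110,110 890,110 890,150 110,150"/>
      </TextLine>
      <TextLine id="r1-l2">
        <Coords points="110,160 890,160 890,200 110,200"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip the XML namespace, so any PAGE schema version is accepted."""
    return tag.rsplit("}", 1)[-1]


def polygon_bbox(points):
    """Convert a PAGE 'points' string ("x,y x,y ...") to (x0, y0, x1, y1)."""
    xy = [tuple(int(round(float(v))) for v in p.split(",")) for p in points.split()]
    xs = [x for x, _ in xy]
    ys = [y for _, y in xy]
    return (min(xs), min(ys), max(xs), max(ys))


def read_text_lines(xml_text):
    """Return {line id: bounding box} for every TextLine in the document."""
    root = ET.fromstring(xml_text)
    lines = {}
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        for child in elem:
            if local_name(child.tag) == "Coords":
                lines[elem.get("id")] = polygon_bbox(child.get("points"))
                break
    return lines


if __name__ == "__main__":
    for line_id, box in read_text_lines(SAMPLE).items():
        print(line_id, box)
```

Matching on the local tag name rather than a fixed namespace is a deliberate simplification, since PAGE XML files in the wild use several schema versions.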

To train the detector, we collected a custom dataset of publicly available pages from the Czech Digital Library. To promote diversity of the dataset, pages were specifically sampled to cover periodicals, dictionaries and books as major classes of documents, complemented by completely random pages from the whole collection. These pages were annotated for logical units by volunteers. For the current version of TextBite, ca. 1600 pages were used for training, and ca. 100 pages each were kept for validation and for testing.

Regions identified by the volunteers as logical units were then aligned with textlines detected by the Pero system to provide a precise annotation for training and evaluating the detector model.

When deployed, TextBite operates in five steps: (1) the detector provides rectangular predictions of continuous logical parts; (2) these predictions are aligned with the textlines provided in the corresponding XML; (3) text regions in the XML are refined as needed to match the logical boundaries provided by the detector; (4) the individual parts are conservatively linked together, typically by merging text regions with their preceding titles; (5) the resulting information is stored in the PAGE XML format.
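The alignment of detections with textlines (step 2) can be illustrated with a simple overlap rule. The sketch below is an assumption about how such an alignment might look, not TextBite's actual implementation: each text line is assigned to the detection rectangle that covers the largest share of its area, and lines covered by less than a hypothetical threshold `min_cover` stay unassigned. Boxes are `(x0, y0, x1, y1)` tuples.

```python
def area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); zero if degenerate."""
    return max(box[2] - box[0], 0) * max(box[3] - box[1], 0)


def intersection_area(a, b):
    """Area of the overlap between two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def assign_lines(lines, detections, min_cover=0.5):
    """Map each line id to the index of the best-covering detection, or None.

    lines:      {line id: box}
    detections: list of boxes predicted by the detector
    min_cover:  minimum fraction of the line area that must lie inside
                the detection (an illustrative threshold)
    """
    assignment = {}
    for line_id, line_box in lines.items():
        line_area = area(line_box)
        best_idx, best_cover = None, 0.0
        for idx, det in enumerate(detections):
            cover = intersection_area(line_box, det) / line_area if line_area else 0.0
            if cover > best_cover:
                best_idx, best_cover = idx, cover
        assignment[line_id] = best_idx if best_cover >= min_cover else None
    return assignment


if __name__ == "__main__":
    lines = {
        "l1": (110, 110, 890, 150),
        "l2": (110, 160, 890, 200),
        "l3": (110, 400, 890, 440),
        "l4": (110, 700, 890, 740),  # lies outside every detection
    }
    detections = [(100, 100, 900, 210), (100, 390, 900, 600)]
    print(assign_lines(lines, detections))
```

Measuring coverage relative to the line area, rather than using intersection over union, keeps short lines from being penalized when they sit inside a much larger detection.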

Fig.2: A complex newspaper page with line advertisements. TextBite has correctly identified the individual ads. Note that it tries to stay as faithful as possible to the regions detected in the PAGE XML, including those that turn out not to correspond to actual text.

Try it out

We publish a detection model trained on a diverse mixture of documents (books, periodicals of various layouts, dictionaries) annotated for logical chunks of data. The model is available online. The overall V-measure of TextBite using this particular model is 86 %, measured on validation data.
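For readers unfamiliar with the metric, V-measure is the (weighted) harmonic mean of homogeneity and completeness, two entropy-based scores that compare a predicted clustering with a reference one. The sketch below computes it from scratch. How exactly TextBite applies the metric is not described here; the example assumes that the clustered items are text lines and that the labels are identifiers of logical units, which is an assumption for illustration only.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels), in nats."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """Conditional entropy H(a | b), in nats."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(c / n * log(c / b_counts[y]) for (_, y), c in joint.items())


def v_measure(truth, pred, beta=1.0):
    """V-measure of a predicted clustering against reference labels."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("label lists must be non-empty and of equal length")
    h_truth = entropy(truth)
    h_pred = entropy(pred)
    # Homogeneity: each predicted cluster contains items of one reference class.
    homogeneity = 1.0 if h_truth == 0 else 1.0 - conditional_entropy(truth, pred) / h_truth
    # Completeness: all items of one reference class land in one cluster.
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred, truth) / h_pred
    denom = beta * homogeneity + completeness
    if denom == 0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / denom


if __name__ == "__main__":
    # Six text lines; reference units A, B, C versus a prediction that merges B and C.
    truth = ["A", "A", "B", "B", "C", "C"]
    pred = [0, 0, 1, 1, 1, 1]
    print(round(v_measure(truth, pred), 3))
```

The score does not depend on the label names themselves, only on how items are grouped, so a prediction that renames the units but groups the lines identically scores 1.0. The same metric is also available as `sklearn.metrics.v_measure_score` in scikit-learn.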