Results – semANT

Czech Large Language Model CSMPT-7B

In March 2024, we publicly released csmpt7b, the first Czech-only large language model. The model was trained on a dataset collected from the Czech internet and the Internet Archive, as well as on publicly available historical texts ranging from 1850 to the present. The texts were transcribed using our Pero OCR system. Training […]

TextBite


In 2023, semANT produced its first tangible result: the TextBite software package. TextBite provides semantic layout analysis on top of plain OCR output. It enriches the PAGE XML description of an analyzed page by adding title elements, clustering text lines into semantically related parts (chapters, articles, dictionary entries, …), determining the reading order, and altering existing regions as needed. All of this new information is stored in the way defined by the PAGE standard, allowing for further processing.
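
Because the enriched output stays within the PAGE standard, it can be read with any generic XML tooling. The following is a minimal sketch, not TextBite's own code: it assumes the 2019-07-15 PAGE namespace, assumes titles are marked as `TextRegion` elements with `type="heading"`, and uses a made-up sample document with hypothetical region ids. The actual TextBite output may use a different schema version or encode titles and clusters differently.

```python
import xml.etree.ElementTree as ET

# Assumption: the 2019-07-15 PAGE schema namespace; other versions differ only in the date.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical, hand-written sample; not real TextBite output.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.jpg" imageWidth="1000" imageHeight="1500">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r_title"/>
        <RegionRefIndexed index="1" regionRef="r_body"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r_body" type="paragraph">
      <Coords points="100,300 900,300 900,1400 100,1400"/>
      <TextEquiv><Unicode>Body text</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r_title" type="heading">
      <Coords points="100,100 900,100 900,250 100,250"/>
      <TextEquiv><Unicode>Title</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


def regions_in_reading_order(page_xml: str) -> list[tuple[str, str, str]]:
    """Return (region id, region type, text) tuples sorted by reading order."""
    root = ET.fromstring(page_xml)
    # Index all text regions by id so reading-order references can be resolved.
    regions = {r.get("id"): r for r in root.iterfind(".//pc:TextRegion", NS)}
    # Sort the indexed references by their numeric index attribute.
    refs = sorted(
        root.iterfind(".//pc:ReadingOrder//pc:RegionRefIndexed", NS),
        key=lambda ref: int(ref.get("index")),
    )
    result = []
    for ref in refs:
        region = regions.get(ref.get("regionRef"))
        if region is None:
            continue  # Skip references to regions that are not text regions.
        text = region.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        result.append((region.get("id"), region.get("type", ""), text))
    return result


for region_id, region_type, text in regions_in_reading_order(SAMPLE):
    print(region_id, region_type, text)
```

In the sample, the regions appear in the file in a different order than they should be read, so the sketch resolves the `ReadingOrder` references instead of relying on document order.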
