Results

Czech Large Language Model CSMPT-7B

In March 2024, we publicly released the first Czech-only large language model, csmpt7b. Our language model was trained on a dataset collected from the Czech internet and the Internet Archive, as well as on publicly available historical texts ranging from 1850 to the present. The texts were transcribed using our Pero OCR system. Training…

April 23, 2024
TextBite

In 2023, semANT brought its first tangible fruit: the TextBite software package. TextBite provides semantic layout analysis on top of plain OCR output. It enhances a PAGE XML description of an analyzed page by introducing title elements, clustering text lines into semantically related parts (chapters, articles, dictionary entries, …), reading order and altering already…

March 6, 2024