{"id":7117,"date":"2024-01-31T13:09:46","date_gmt":"2024-01-31T12:09:46","guid":{"rendered":"https:\/\/content-conversion.com\/htr-benchmark\/"},"modified":"2024-04-04T10:19:51","modified_gmt":"2024-04-04T08:19:51","slug":"htr-benchmark","status":"publish","type":"page","link":"https:\/\/content-conversion.com\/de\/htr-benchmark\/","title":{"rendered":"HTR Benchmark"},"content":{"rendered":"<p>[et_pb_section fb_built=&#8220;1&#8243; admin_label=&#8220;section&#8220; _builder_version=&#8220;3.22&#8243;][et_pb_row _builder_version=&#8220;4.9.4&#8243;][et_pb_column type=&#8220;4_4&#8243; _builder_version=&#8220;3.25&#8243; custom_padding=&#8220;|||&#8220; custom_padding__hover=&#8220;|||&#8220;][et_pb_text _builder_version=&#8220;4.9.4&#8243; header_font=&#8220;Open Sans Condensed Light local|300|||||||&#8220; header_text_color=&#8220;#666666&#8243; header_font_size=&#8220;70px&#8220; header_letter_spacing=&#8220;1px&#8220; max_width=&#8220;100%&#8220; animation_style=&#8220;slide&#8220; animation_direction=&#8220;left&#8220; header_font_tablet=&#8220;&#8220; header_font_phone=&#8220;||||||||&#8220; header_font_last_edited=&#8220;on|phone&#8220; header_font_size_phone=&#8220;50px&#8220; locked=&#8220;off&#8220;]<\/p>\n<h1 class=\"wp-block-heading\" style=\"text-align: center;\"><span style=\"color: #000000;\">HTR <\/span>Benchmark<\/h1>\n<h2 class=\"wp-block-heading\" style=\"text-align: center;\">(<span style=\"text-decoration: underline;\">H<\/span>andwriting <span style=\"text-decoration: underline;\">TR<\/span>anscription)<\/h2>\n<p>&nbsp;<\/p>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section fb_built=&#8220;1&#8243; _builder_version=&#8220;4.4.2&#8243; background_color=&#8220;#919191&#8243; background_enable_image=&#8220;off&#8220; parallax=&#8220;on&#8220; min_height=&#8220;194px&#8220; custom_margin=&#8220;||-44px|||&#8220; custom_padding=&#8220;15px||0px||false|false&#8220; animation_style=&#8220;fade&#8220; 
animation_direction=&#8220;right&#8220; background_last_edited=&#8220;off|desktop&#8220; background_enable_color_phone=&#8220;off&#8220; background_blend_phone=&#8220;normal&#8220; border_color_top=&#8220;#1e69ae&#8220; border_color_bottom=&#8220;#1e69ae&#8220; locked=&#8220;off&#8220;][et_pb_row _builder_version=&#8220;4.4.1&#8243;][et_pb_column type=&#8220;4_4&#8243; _builder_version=&#8220;3.25&#8243; custom_padding=&#8220;|||&#8220; custom_padding__hover=&#8220;|||&#8220;][et_pb_text _builder_version=&#8220;4.9.4&#8243; text_font=&#8220;Open Sans Condensed||||||||&#8220; text_text_color=&#8220;#ffffff&#8220; text_font_size=&#8220;26px&#8220; text_letter_spacing=&#8220;1px&#8220; text_line_height=&#8220;1.4em&#8220; text_orientation=&#8220;center&#8220; max_width=&#8220;750px&#8220; module_alignment=&#8220;center&#8220; min_height=&#8220;196.6px&#8220; custom_margin=&#8220;||-73px|||&#8220; custom_padding=&#8220;16px||0px|||&#8220; animation_style=&#8220;slide&#8220; animation_direction=&#8220;right&#8220;]<\/p>\n<p>Transkribus performs best on Cyrillic Handwriting, compared with Tesseract, Calamari and Glyph.<\/p>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section fb_built=&#8220;1&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220; min_height=&#8220;477.8px&#8220;][et_pb_row _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_column type=&#8220;4_4&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_text _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;]<\/p>\n<h2 style=\"text-align: left;\"><span>Introduction<\/span><\/h2>\n<p><span>In over 20 years of experience in content conversion, CCS has collected many datapoints on performance of various OCR engines. From time to time, we conduct evaluations and provide them to our partners. 
After integrating HTR into our product docWizz in 2021, we set out to build a ground-truth data set for training and evaluating HTR systems. In this whitepaper, we share the results of our first evaluation.<\/span><\/p>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][et_pb_row _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_column type=&#8220;4_4&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_text _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220; custom_padding=&#8220;||2px|||&#8220;]<\/p>\n<h2 style=\"text-align: left;\">HTR engines evaluated<\/h2>\n<p><span>We chose four systems to evaluate:<\/span><\/p>\n<p><strong><span>Tesseract<\/span><\/strong><span> version 4, a free open-source (OS) system originally developed by Hewlett-Packard and later sponsored by Google for some years.<\/span><\/p>\n<p><strong><span>Calamari,<\/span><\/strong><span> a free OS system derived from OCRopy and Kraken.<\/span><\/p>\n<p><strong><span>Transkribus<\/span><\/strong><span>, a commercial offering by READ Corporation, Austria. The system was originally developed in the EU-funded projects tranScriptorium and READ (Recognition and Enrichment of Archival Documents).<\/span><\/p>\n<p><strong><span>Glyph<\/span><\/strong><span>, a still-experimental, closed-source system developed by a group of researchers at the Polytechnic University of Bucharest.<\/span><\/p>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][et_pb_row _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_column type=&#8220;4_4&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_text _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220; custom_padding=&#8220;||2px|||&#8220;]<\/p>\n<h2 style=\"text-align: left;\">Aim and method of evaluation<\/h2>\n<p><span>We aim to evaluate <u>HTR Engines<\/u> as a technology. 
We do not evaluate <u>HTR Services<\/u>. Therefore, OCR services by Google, Microsoft or Amazon are not included in the list above. They are certainly great services, but their technology is tied to the models they ship with, and any assessment depends strongly on the nature of the test data used. In our projects, we often face old handwriting in less common languages that requires training specific models. Thus, we are more interested in the technical capabilities of engines to train custom models. <\/span><\/p>\n<h2 style=\"text-align: left;\"><\/h2>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][et_pb_row _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220; min_height=&#8220;47.2px&#8220;][et_pb_column type=&#8220;4_4&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_text _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220; min_height=&#8220;53.6px&#8220; custom_margin=&#8220;||-58px|||&#8220; custom_padding=&#8220;||0px|||&#8220;]<\/p>\n<h2 style=\"text-align: left;\">Ground Truth<\/h2>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][et_pb_row column_structure=&#8220;3_4,1_4&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_column type=&#8220;3_4&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_text _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220; custom_padding=&#8220;||2px|||&#8220;]<\/p>\n<p><span>Our ground-truth set comprises 590k words in the Ukrainian language (Cyrillic script) and was collected in a single project. The original documents date from the 18th and 19th centuries. Models were trained \u201cfrom scratch\u201d; no pre-trained models were used as a starting point. We used 98% of our data for training. A random 2% of the data was excluded from training and used for evaluation. No post-processing, such as dictionary checks, was used in this evaluation. 
<\/span><\/p>\n<p>[\/et_pb_text][\/et_pb_column][et_pb_column type=&#8220;1_4&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_image src=&#8220;https:\/\/content-conversion.com\/wp-content\/uploads\/2024\/01\/Cyrillic-sample.png&#8220; alt=&#8220;Cyrillic Handwriting Sample&#8220; title_text=&#8220;Cyrillic Handwriting Sample&#8220; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][\/et_pb_image][\/et_pb_column][\/et_pb_row][et_pb_row _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_column type=&#8220;4_4&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_text _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;]<\/p>\n<h2 style=\"text-align: left;\">Line Segmentation<\/h2>\n<p><span>HTR engines usually segment images into lines first. Training and recognition are based on single lines. In our experience, the quality of the line segmentation can have a strong impact on the recognition rates. To understand this effect, we did evaluations twice, once with the internal line segmentation provided by the HTR engine and again with an external line segmentation framework. These evaluations include the training of the models. <\/span><\/p>\n<p><span>In the external case, the engines only \u201csee\u201d images with one line of text. For external line segmentation, we used the OCR-D framework, an OS framework funded by the \u201cDeutsche Forschungs\u00adgemein\u00adschaft\u201d. With Transkribus, external line segmentation was not used for technical reasons. Calamari does not have an internal line segmentation. Glyph performed slightly better with its internal segmentation. 
Tesseract benefited significantly from external line segmentation.<\/span><\/p>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][et_pb_row _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220; custom_padding=&#8220;||0px|||&#8220;][et_pb_column type=&#8220;4_4&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_text _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220; min_height=&#8220;23.6px&#8220; custom_padding=&#8220;||0px|||&#8220;]<\/p>\n<h2 style=\"text-align: left;\">Results<\/h2>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][et_pb_row column_structure=&#8220;2_5,3_5&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_column type=&#8220;2_5&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_text _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220; custom_padding=&#8220;||0px|||&#8220;]<\/p>\n<p><span>We use the Levenshtein distance as the metric to count errors. 
The percentage values are calculated at the character level (including blanks).<\/span><\/p>\n<p>[\/et_pb_text][\/et_pb_column][et_pb_column type=&#8220;3_5&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_image src=&#8220;https:\/\/content-conversion.com\/wp-content\/uploads\/2024\/01\/HTR-Benchmark-results-graphic.png&#8220; alt=&#8220;HTR Benchmark results&#8220; title_text=&#8220;HTR Benchmark results graphic&#8220; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][\/et_pb_image][\/et_pb_column][\/et_pb_row][et_pb_row _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_column type=&#8220;4_4&#8243; _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;][et_pb_text _builder_version=&#8220;4.9.4&#8243; _module_preset=&#8220;default&#8220;]<\/p>\n<h2 style=\"text-align: left;\">Conclusion<\/h2>\n<p><span>Overall, we are pleasantly surprised that two engines achieved an error rate below 10% without post-processing. Transkribus not only gives the best results but has also proven to be stable and robust against changes in writing style in other contexts. Glyph performs surprisingly well, especially considering its minimal development budget. Calamari, as a free engine, performs quite well but falls short of Transkribus. Tesseract disappoints and does not currently seem to be an option for handwriting, even though it gives very good results for printed material.<\/span><\/p>\n<p><span>Our data focuses on names, numbers, and places. Although we have seen no indication of it, there is a risk that this characteristic of the training data affects the relative performance of the engines. 
<\/span><\/p>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>HTR Benchmark (Handwriting TRanscription) &nbsp;Transkribus performs best on Cyrillic Handwriting, compared with Tesseract, Calamari and Glyph. Introduction In over 20 years of experience in content conversion, CCS has collected many datapoints on performance of various OCR engines. From time to time, we conduct evaluations and provide them to our partners. After integrating HTR into our product [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":6884,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_et_pb_use_builder":"on","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"class_list":["post-7117","page","type-page","status-publish","has-post-thumbnail","hentry"],"_links":{"self":[{"href":"https:\/\/content-conversion.com\/de\/wp-json\/wp\/v2\/pages\/7117","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/content-conversion.com\/de\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/content-conversion.com\/de\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/content-conversion.com\/de\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/content-conversion.com\/de\/wp-json\/wp\/v2\/comments?post=7117"}],"version-history":[{"count":11,"href":"https:\/\/content-conversion.com\/de\/wp-json\/wp\/v2\/pages\/7117\/revisions"}],"predecessor-version":[{"id":7511,"href":"https:\/\/content-conversion.com\/de\/wp-json\/wp\/v2\/pages\/7117\/revisions\/7511"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/content-conversion.com\/de\/wp-json\/wp\/v2\/media\/6884"}],"wp:attachment":[{"href":"https:\/\/content-conversion.com\/de\/wp-json\/wp\/v2\/media?parent=7117"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}