FTS is a web application for searching digitized books and dissertations. It supports searching not only by bibliographic data but also across the full text of each document. The application processes large text documents, in particular scanned documents converted to text by character recognition (OCR), including low-quality scans of old copies.
The nature of the input data (recognized documents) complicated both indexing and the presentation of search results. Our team developed a preliminary text-filtering algorithm that removes “garbage” characters. Because the filter is built on a set of rules, we worked with the customer’s representatives to compile a set of test queries that checks each rule. This allowed us to remove the noise without affecting the document body, even where a word had been recognized with an error.
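As an illustration of the rule-based filtering described above, here is a minimal Python sketch. It is not the project's actual filter: the rule names, the regular expressions, and the `filter_ocr_text` function are all hypothetical and only show how an ordered set of rules can strip recognition noise while leaving misrecognized words in place.

```python
import re

# Hypothetical rule set (an assumption, not the project's real rules).
# Each rule is (name, compiled pattern, replacement); rules run in order.
RULES = [
    # Drop control characters that OCR output sometimes contains.
    ("control_chars", re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]"), ""),
    # Replace runs of three or more symbols (e.g. "~~^^||") with a space.
    # Note: this also drops "..."; a real rule set would be tuned with test queries.
    ("symbol_runs", re.compile(r"[^\w\s]{3,}"), " "),
    # Replace standalone tokens made only of symbols (e.g. a lone "|").
    ("symbol_only_tokens", re.compile(r"(?<!\S)[^\w\s]+(?!\S)"), " "),
    # Collapse the whitespace left behind by the rules above.
    ("extra_whitespace", re.compile(r"\s+"), " "),
]


def filter_ocr_text(text: str) -> str:
    """Apply every rule in order and return the cleaned text."""
    for _name, pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


if __name__ == "__main__":
    raw = "The dissertati0n ~~^^|| was defended | in 1897."
    # The misrecognized word "dissertati0n" is kept; only the noise is removed.
    print(filter_ocr_text(raw))  # The dissertati0n was defended in 1897.
```

Keeping each rule as a named entry makes it straightforward to pair every rule with its own test queries, so a change to one pattern can be checked without re-validating the whole filter.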
During the project we found several bugs in the open source solutions we used. We implemented patches for some of them to fix the errors in our environment.