Full-text search system FTS

FTS is a web application for searching digitized books and dissertations. It searches not only bibliographic data but also the full text of each document. The application processes large text documents, in particular scanned and recognized (OCR) ones, including low-quality scans of old copies.

What we did

  • Full-text search for large amounts of data.

    Unlike site search, book search deals with documents that are tens or even hundreds of times larger. Besides, not all fields are equally important. Striking the right balance between advanced search capabilities and cutting off everything superfluous allowed us to achieve high performance: 10 requests per second when searching a collection of more than 600,000 documents (more than 300 gigabytes of text) on a single server with a dual-core processor. If necessary, the search can use the resources of multiple servers. Efficient resource usage and readiness to scale made it possible to introduce the search system without expanding the customer’s server infrastructure.

  • Morphology support.

    Russian has many forms of each word. Truncating word endings, an approach that is effective for English, works poorly for Russian. To solve this problem we used an open source solution, ATP (automatic text processing), which reduces words to their normal form. As a result, we got more relevant search results, judged both by a set of test queries and by the subjective impressions of focus groups.

  • Rapid changes in search base data.

    Legal requirements must be met regardless of how the system is built. Sometimes a document has to be added to the search or removed from it. Our indexing system allows adding and deleting documents manually. At the same time, we made sure the search system can run for years without manual intervention.

  • Combining search implementations.

    A technical solution is useful only if it solves non-technical problems. To eliminate the multitude of separate search implementations, we gathered the requirements and implemented an interface (API) that developers of other systems can use. As a result, search began to work consistently, and the servers occupied by other search implementations were freed up. Furthermore, adding search to a new system is fast and cheap.

  • Documentation for developers.

    Since the solution has an API, in addition to the usual set of documents we wrote documentation for developers, including examples in several programming languages.
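
As an illustration of the morphology point above, the sketch below contrasts stripping word endings with looking up a normal form. It is a toy example under stated assumptions: the `ENDINGS` list, the `NORMAL_FORMS` dictionary, and both function names are hypothetical and are not the API of the morphology library used in the project.

```python
# Hypothetical list of endings to strip, longest first.
ENDINGS = ("ами", "ей", "ов", "а", "и", "у", "е")

# Toy normal-form dictionary (illustrative data only; a real morphology
# library ships a full dictionary for the language).
NORMAL_FORMS = {
    "человек": "человек",
    "человека": "человек",
    "люди": "человек",
    "людей": "человек",
}


def strip_ending(word: str) -> str:
    """Naive approach: cut off the first matching ending."""
    for ending in ENDINGS:
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[: -len(ending)]
    return word


def normalize(word: str) -> str:
    """Dictionary approach: map a word form to its normal form.

    Unknown words are returned lowercased and otherwise unchanged.
    """
    word = word.lower()
    return NORMAL_FORMS.get(word, word)


if __name__ == "__main__":
    for form in ("человек", "человека", "люди", "людей"):
        print(form, strip_ending(form), normalize(form))
```

In this toy data, stripping endings leaves the singular and plural of the same word with different stems, so a query for one form misses documents that contain the other, while the dictionary lookup maps all four forms to a single normal form.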

Testing

The project used Mercurial, Redmine, and JMeter. Details are on the “The way we work” page.

More technical details

  • Operating system: Ubuntu x64
  • Search server: Sphinx search
  • Indexer: C#, Mono runtime
  • Web interface: PHP, using the php-fpm process manager
  • Data storage: MySQL as primary storage, Redis for frequently used data

The specifics of the input data (recognized documents) complicated both indexing and the presentation of results. Our team developed a preliminary text filtering algorithm that removes “garbage” characters. Since the filtering was built on a set of rules, we worked with the customer’s representatives to put together a set of test queries to check each rule. This allowed us to remove the excess without affecting the document body, even when a word was recognized with an error.
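
A rule-based prefilter of this kind could look roughly like the sketch below. The three rules, the `RULES` list, and the `clean_text` name are illustrative assumptions, not the project's actual rule set. The point is only that each rule is a small, separately testable substitution that never touches letters or digits, so a misrecognized word still reaches the index.

```python
import re

# Hypothetical rules: (name, compiled pattern, replacement).
# They are applied in order; each one can be checked by its own test queries.
RULES = [
    # Drop control characters left by the recognizer (keeps \n and \t).
    ("control_chars", re.compile(r"[\x00-\x08\x0b-\x1f\x7f]"), ""),
    # Replace runs of 3+ punctuation/symbol characters, typical of scan noise.
    ("symbol_runs", re.compile(r"[^\w\s]{3,}"), " "),
    # Collapse repeated spaces and tabs produced by the rules above.
    ("extra_spaces", re.compile(r"[ \t]{2,}"), " "),
]


def clean_text(text: str) -> str:
    """Apply every rule in order and trim the result."""
    for _name, pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


if __name__ == "__main__":
    print(clean_text("Глава 1 ~~^|| текст\x07 книги"))
```

Because `\w` matches Unicode letters and digits for Python string patterns, a word recognized with an error (for example, a digit in place of a letter) passes through unchanged, and only the surrounding noise is removed.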

During the project, a few errors were found in the open source solutions we used. We implemented patches for some of them to fix the errors in our environment.