Full-text search, basic concepts

Introduction

Unquestionably, one of the most typical problems you have to face when developing a software system is searching. As a matter of fact, I suppose none of us considers searching a complex task, since you typically rely on a storage system that is supposed to provide all the required capabilities, even though you may be asked to create some indexes to boost performance.

Nonetheless, problems begin to appear when you have text, sentences or even long paragraphs. A first approach could involve pattern-based search but, as you know, the issue becomes harder when you notice that your text includes mixed-case words or characters beyond plain English, mainly those with accent marks. To make things worse, pattern-based search is not very efficient when you need to find a word in the middle of a paragraph.

However, you could argue that most storage systems support some kind of full-text capability. You are right, but I wonder if you have ever thought about what happens under the hood. If you are wondering whether this actually matters, let me tell you something: understanding the basics can spark new ideas, innovation is built on prior knowledge, and there are always situations in which standard capabilities are not enough for your use case. Besides, if you are interested in NLP, consider that although those systems go further, they rely on some of the same principles as full-text search.

At Divisa iT we have used this approach to build Proxia's out-of-the-box search engine. In our case I opted for an in-house approach to address two different problems: on the one hand, to provide our clients with a database-agnostic solution and, on the other hand, to avoid third-party systems unless our clients already use them.

Describing the problem

When analyzing the problem, we have to consider that there is no direct correlation between the stored text and what the user looks for. We store long paragraphs, a plethora of sentences and different types of words, verbs, nouns and adjectives to name but a few, yet the user searches for terms, sometimes misspelled and probably using a different derivative. Besides, your text could come in different formats, namely plain text, PDF documents, Word documents or even Excel sheets.

Therefore, as you can see, there are different issues to tackle in order to solve the problem completely:

  • Extracting text from its sources, which means dealing with the nightmare of Word, Excel and PDF formats.
  • Word extraction. As I said, we have paragraphs and sentences, but since the user searches for terms we need to find those terms in our text. That involves, as a first step, finding the words. This may seem an easy task: a word can be delimited by punctuation marks, spaces, tabs and so on. But the devil is in the details, and a language is so complex that simple rules do not always cover every case smoothly.
  • Transforming words into tokens. In order to make matches, we have to convert both the stored words and the user-introduced terms into the same concept. That means, using a linguistic approach, extracting each word's lexeme. Unfortunately, this is quite a complex task, although different algorithms can help solve it.
  • Correcting misspellings. It is unavoidable that users will write words with transposed or omitted letters. We need algorithms and storage techniques that help us address this problem.
  • Supporting multi-term searches. If the user introduces several terms, we have to support different combinations: all words, some of them, or all of them in the same order, to name a few.
  • Sorting results. This is a really tough task, since there are two very different views of the system: the end-user point of view and the information-owner point of view. The former wants the most relevant results; the latter, the most relevant according to their organization's interests. Finding the balance is not easy.

A lot of things to think about, isn't it? There are many more items involved, for example: do all words have the same importance when searching? Do all users have access to all information? Describing all these things properly is a humongous task, and there is a plethora of information on the Internet that can help you gain a better understanding, so if you are really interested go for it; this book has very good points and it helped me a lot. Here, I'll give you just a glimpse of how we are addressing this problem at Divisa iT.

Extracting text and words

My best advice: do not try to reinvent the wheel, there are very good open-source projects that solve this problem properly.

  • PDF is a really awkward document type, in fact more a graphical format than a documental one. Tools like PDFBox can help you achieve your goals but, on the other hand, you will have to develop some heuristics to detect header and footer sections, lines, sentences and all those important things.
  • If you need to manage Office or OpenOffice documents, POI is the tool for you. Quite handy, and ready to use.
  • When extracting words, you have to use some word-tokenizer algorithm; perhaps a simple one, considering only punctuation marks or spaces, is enough for you (a minimal sketch follows after this list). If you need something already built, try the OpenNLP tokenizer. If you need to go deeper, you should perhaps consider acronyms, abbreviations and all those language-related matters.
  • As I said previously, not all words are equally important; for example, a preposition is not very significant, so you should "delete" it from your word list. You need what is called a stop-word list. Find one on the Internet; there are many more than you would expect.
  • Finally, one last step could be detecting sentences; in fact, sentence detection is closely related to detecting word endings. Knowing when a sentence starts or ends is pivotal when contextualizing a literal search.
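
To make the word-extraction step more concrete, here is a minimal Java sketch of the simple approach mentioned above: splitting on anything that is not a letter or a digit, lower-casing, and dropping a tiny hard-coded stop-word list. The class name, the stop-word set and the splitting rule are illustrative assumptions; a real system would rely on a proper tokenizer such as OpenNLP and a full stop-word list per language.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;
    import java.util.Set;

    public class SimpleTokenizer {

        // Tiny illustrative stop-word list; a production system would load a much larger one.
        private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "of", "in", "is", "and");

        // Splits on anything that is not a letter or digit, lower-cases, and drops stop words.
        public static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
                String token = raw.toLowerCase(Locale.ROOT);
                if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                    tokens.add(token);
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Prints: [quick, brown, fox, jumps, over, lazy, dog]
            System.out.println(tokenize("The quick, brown fox jumps over the lazy dog."));
        }
    }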

Tokenization, token relevance and storage

First of all, extracting word lexemes is a tough and complex job. Do not try to do it on your own. Good algorithms that can help you are the following:

  • The Porter stemming algorithm, with implementations for different programming languages and human ones (Spanish, English, etc.). This algorithm does not really extract the proper lexeme but an approximation based on word-formation rules (a toy illustration of that idea follows after this list).
  • Using Hunspell to extract lexemes. This approach uses a word database, so its results are more accurate than Porter's. On the other hand, it is slower and trickier to use, and on the JVM you will have to go through JNI to access its functionality.
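
Just to illustrate the rule-based idea behind Porter-style stemming, here is a toy Java sketch that strips a handful of common English suffixes. It is deliberately not the real Porter algorithm (which also checks the word's measure and applies many more rules); the class name and the suffix list are assumptions made for the example.

    import java.util.List;

    public class ToySuffixStemmer {

        // A handful of English suffixes, tried longest first. This only illustrates
        // the rule-based idea behind Porter stemming, not the real algorithm.
        private static final List<String> SUFFIXES =
                List.of("ational", "ization", "ness", "ment", "ing", "ed", "es", "s");

        public static String stem(String word) {
            for (String suffix : SUFFIXES) {
                // Keep at least three characters of stem before stripping.
                if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                    return word.substring(0, word.length() - suffix.length());
                }
            }
            return word;
        }

        public static void main(String[] args) {
            System.out.println(stem("searching")); // search
            System.out.println(stem("searched"));  // search
            System.out.println(stem("searches"));  // search
        }
    }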

Whichever you choose, once the tokenization step has finished it is time for storage. Before that, each token's weight must be computed to reflect its importance or relevance within the document. You could think this is an easy task, since the more times a token appears the more relevant it is. Of course, you are right, but besides frequency you should consider other aspects, for example:

  • Not all sentences or paragraphs have the same importance. For example, a summary is less important than the title, but more important than the global description.
  • Since we are going to compare different documents, the raw number of token occurrences is not enough. I mean, two occurrences of the word "submit" do not mean the same in a four-word document as in an eighty-word one. Addressing this issue involves, for instance, using a Euclidean-distance-based weight normalization (see the sketch after this list).
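
The following Java sketch shows one way to apply that kind of normalization: raw term frequencies are divided by the Euclidean norm of the document's frequency vector, so the same two occurrences of "submit" weigh more in a short document than in a long one. Class and method names are illustrative, not Proxia's actual code.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TokenWeights {

        // Raw term frequencies divided by the Euclidean norm of the document's
        // frequency vector, so weights from documents of very different lengths
        // become comparable.
        public static Map<String, Double> normalizedWeights(List<String> tokens) {
            Map<String, Integer> counts = new HashMap<>();
            for (String token : tokens) {
                counts.merge(token, 1, Integer::sum);
            }
            double norm = 0.0;
            for (int c : counts.values()) {
                norm += (double) c * c;
            }
            norm = Math.sqrt(norm);

            Map<String, Double> weights = new HashMap<>();
            if (norm == 0.0) {
                return weights; // empty document
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                weights.put(e.getKey(), e.getValue() / norm);
            }
            return weights;
        }

        public static void main(String[] args) {
            // Two occurrences of "submit" weigh more in a four-token document...
            System.out.println(normalizedWeights(List.of("submit", "form", "submit", "now")).get("submit"));
            // ...than the same two occurrences in a much longer one.
            System.out.println(normalizedWeights(List.of("submit", "submit", "a", "b", "c", "d", "e", "f", "g", "h")).get("submit"));
        }
    }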

Finally, if your solution is a database-centric one with its own RBAC subsystem, I would recommend using an index-organized table, since its read performance is much better than a traditional table's, although inserts and updates are penalized. As a matter of fact, there are many other issues to consider, for example:

  • If you want to highlight search results you will need to store word offsets within the document.
  • If you want to resolve misspelling problems, not only will you need a spelling algorithm such as the Levenshtein distance, but you will also have to store bigrams; for instance, the word "submit" is divided into the bigrams "su", "ub", "bm", "mi" and "it". Using these bigrams and the Jaccard index, our system is able to find candidate suggestions (see the sketch after this list).
  • If your intention is to allow your end users to perform "literal queries", you will also have to store sentence n-grams. For example, a sentence like "this is a test" has the following equivalence: "this$is$a", "$is$a$test", "$a$test$", "$test$$".
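
As an illustration of the bigram-plus-Jaccard idea from the list above, the sketch below extracts character bigrams and computes the Jaccard index between two words: a misspelling such as "sumbit" still shares bigrams with "submit", while an unrelated word shares none. Names are assumptions for the example only.

    import java.util.HashSet;
    import java.util.Set;

    public class BigramSuggester {

        // Character bigrams of a word: "submit" -> {su, ub, bm, mi, it}.
        public static Set<String> bigrams(String word) {
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + 1 < word.length(); i++) {
                grams.add(word.substring(i, i + 2));
            }
            return grams;
        }

        // Jaccard index between the bigram sets of two words: |A ∩ B| / |A ∪ B|.
        public static double jaccard(String a, String b) {
            Set<String> ga = bigrams(a);
            Set<String> gb = bigrams(b);
            Set<String> intersection = new HashSet<>(ga);
            intersection.retainAll(gb);
            Set<String> union = new HashSet<>(ga);
            union.addAll(gb);
            return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
        }

        public static void main(String[] args) {
            // The transposition "sumbit" still overlaps with "submit" (0.25 here),
            // so it is a reasonable candidate suggestion; "search" scores 0.0.
            System.out.println(jaccard("sumbit", "submit"));
            System.out.println(jaccard("sumbit", "search"));
        }
    }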

Searching and scoring

The final step is performing queries. As you can suppose, the storage structure explained above makes it easy to know which documents contain a given token. Nevertheless, queries don't usually involve a single token; in fact we have to consider different criteria and different conditions. Actually, executing the query is not tough at all once we take into account how bitwise operations can help us (see the sketch after the list below).

This means that:

  • An AND search is fulfilled if the sum of the bits assigned to each matched token equals the total expected value.
  • An OR search is fulfilled when the sum of the bits assigned to each matched token is different from zero.
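
A minimal Java sketch of that bitwise trick: each query token is assigned its own bit, a document's value is the OR of the bits of the tokens it contains, and the AND/OR conditions become simple comparisons against the full mask and against zero. This illustrates the idea only, not the actual Proxia implementation.

    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class BitwiseQuery {

        // Each query token gets its own bit; a document's mask is the OR of the
        // bits of the query tokens it contains.
        public static int documentMask(List<String> queryTokens, Set<String> documentTokens) {
            int mask = 0;
            for (int i = 0; i < queryTokens.size(); i++) {
                if (documentTokens.contains(queryTokens.get(i))) {
                    mask |= 1 << i;
                }
            }
            return mask;
        }

        public static void main(String[] args) {
            List<String> query = List.of("full", "text", "search");
            int expected = (1 << query.size()) - 1; // all bits set -> 0b111

            Set<String> docA = Set.of("full", "text", "search", "engine");
            Set<String> docB = Set.of("text", "mining");

            int maskA = documentMask(query, docA); // 0b111 -> matches AND and OR
            int maskB = documentMask(query, docB); // 0b010 -> matches OR only

            System.out.println("A matches AND: " + (maskA == expected) + ", OR: " + (maskA != 0));
            System.out.println("B matches AND: " + (maskB == expected) + ", OR: " + (maskB != 0));
        }
    }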

Last but not least, I should talk about how to sort search results. The weight, as computed before, is a good starting point, but only if all documents are equally important. Different algorithms can be applied and all of them depend on your use case and on information-owner and end-user expectations; some examples that we are using include:

  • Give each document type a weight according to its supposed importance.
  • Consider document visits: the more a document is accessed, the more weight it has. But we have to cut down the importance of old accesses, so the older a visit is, the less it counts (a rough sketch follows this list).
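
A rough Java sketch of how these two signals could be combined: a per-document-type boost and visit counts whose contribution decays exponentially with age. The type boosts, the 90-day half-life and the logarithmic damping are assumptions chosen for the example, not Proxia's real configuration.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;
    import java.util.Map;

    public class ResultScorer {

        // Illustrative per-type boosts; values are assumptions for the example.
        private static final Map<String, Double> TYPE_BOOST =
                Map.of("news", 1.5, "page", 1.0, "attachment", 0.7);

        // Half-life of a visit's contribution, so old accesses count less and less.
        private static final double VISIT_HALF_LIFE_DAYS = 90.0;

        public static double score(double tokenWeight, String documentType, List<Instant> visits, Instant now) {
            double visitScore = 0.0;
            for (Instant visit : visits) {
                double ageDays = Duration.between(visit, now).toHours() / 24.0;
                visitScore += Math.pow(0.5, ageDays / VISIT_HALF_LIFE_DAYS);
            }
            double typeBoost = TYPE_BOOST.getOrDefault(documentType, 1.0);
            // Logarithmic damping keeps very popular documents from drowning out the rest.
            return tokenWeight * typeBoost * (1.0 + Math.log1p(visitScore));
        }

        public static void main(String[] args) {
            Instant now = Instant.now();
            List<Instant> recentVisits = List.of(now.minus(Duration.ofDays(1)), now.minus(Duration.ofDays(3)));
            List<Instant> oldVisits = List.of(now.minus(Duration.ofDays(400)), now.minus(Duration.ofDays(500)));

            // Same token weight and type, but recent visits push the first document higher.
            System.out.println(score(0.8, "news", recentVisits, now));
            System.out.println(score(0.8, "news", oldVisits, now));
        }
    }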

Summary

Performing text-based queries is not actually that complex, but there are different issues to consider. In my opinion, just learning about lexemization and the different algorithms, and how they can be used to improve your search systems, even when working with in-memory approaches, is worth the time invested.

On the other hand, text-based search is only one of the different types of queries that can be performed. When you need to find information such as states, towns or user names, to name but a few, there are perhaps better solutions, depending upon your use case, such as Trie structures.