Dates, grammars and natural language

Data is pivotal for any organization and, as has always been the case, the more structured it is, the better. Moreover, a proper structure should allow us, or better still a machine, to know what the data is about; that is, it should provide enough information to be used for searching, indexing, and discovery.

Dates are one of the most challenging data types: a complex type that embeds information such as year, month, day of the month and many other aspects which, from a data analysis perspective, could be of the utmost importance. Besides this inner complexity, managing dates brings problems such as periodicity: an event that repeats itself according to a set of rules. If we want to manage such events properly, search over them, and do everything else that matters, we will probably have to store actual dates. That is, we have to expand the rule, working out when the event will happen in the future, so that our end users can search on it.

So, what is this repetition about? On the one hand, from an end-user point of view, you could think of an event that happens every Monday, or every month or year. On the other hand, if you are a developer or a system administrator, you know for certain that there are more complex scenarios which require crontab expressions. Unfortunately, crontab expressions are not easy to write when requirements get complex, and they may even need rules that are not available on all systems. Let's think about some not-so-atypical demands.

  • An event that happens on the weekend nearest to the 15th of August.
  • An event that happens on the Monday after Easter Monday.
  • An event that happens on the Sunday after Corpus Christi.
  • An event that happens on the last Sunday of April, unless that Sunday is Easter Sunday, in which case it happens on the Sunday after.

The problem

As a matter of fact, creating some of these expressions would require you to introduce some weird eval expressions, or even to develop your own custom code, or copy it from ChatGPT. Be that as it may, the problem is not easily solvable and is certainly out of reach for the average person. You might think these expressions are made up but, in fact, they are real rules used by some Spanish festivals.
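To give an idea of the custom code such rules call for, here is a minimal Python sketch of two of the demands listed earlier. It is only an illustration, not the code we use in our projects: the function names (easter_sunday, last_sunday_of_april, festival_date, monday_after_easter_monday) are made up for this example, and Easter is computed with the well-known anonymous Gregorian algorithm.

```python
from datetime import date, timedelta


def easter_sunday(year: int) -> date:
    """Easter Sunday in the Gregorian calendar (anonymous Gregorian algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def last_sunday_of_april(year: int) -> date:
    """Walk back from 30 April to the closest Sunday (weekday() == 6)."""
    last_day = date(year, 4, 30)
    return last_day - timedelta(days=(last_day.weekday() + 1) % 7)


def festival_date(year: int) -> date:
    """Last Sunday of April, or the Sunday after if that day is Easter Sunday."""
    candidate = last_sunday_of_april(year)
    if candidate == easter_sunday(year):
        return candidate + timedelta(days=7)
    return candidate


def monday_after_easter_monday(year: int) -> date:
    """Easter Monday is Easter Sunday + 1 day, so the next Monday is + 8 days."""
    return easter_sunday(year) + timedelta(days=8)


if __name__ == "__main__":
    print(festival_date(2011))               # 2011-05-01 (Easter fell on 24 April)
    print(festival_date(2024))               # 2024-04-28
    print(monday_after_easter_monday(2024))  # 2024-04-08
```

Even for two rules we already need an Easter algorithm and some weekday arithmetic, and every new festival would mean another hand-written function.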

Therefore, if you are designing a system that deals with these kinds of periodicities, you must provide some mechanism to ensure that the data is properly usable, that is, properly structured. There are several simple solutions, although all of them have their problems.

  • We could let our users guess the date we are talking about, but they will surely not be able to search for it, or to plan their trip for the following year.
  • We could send reminders to system administrators, asking them to update the information each year for the following two or three years. In fact, this task is so repetitive, boring, and error-prone that, sooner or later, the data won't be updated.

Hence, what should we do? Automate it! Throughout this post, I'm going to show you how we are solving this issue at Divisa iT and using the solution in our projects.

Analysing the subject

As you know, when we develop solutions we must think about our users, both system administrators and end users. In this case, end users' demands are clear: they need to know when something is going to happen without checking a calendar or searching on Google. But what do system administrators need?

  • To enter the expression in a normal, natural way, avoiding arcane and esoteric syntax.
  • To have the expression resolved and managed by the system on its own; the less work they need to do, the better.

Tackling this problem efficiently requires us to consider different issues. We probably need to support natural language and, since we are in 2024, we could be tempted to use an AI tool. But sometimes, and this is one of those times, we need a deterministic approach rather than a probabilistic one. Not to mention that there are, and have long been, simpler approaches that can help us achieve this goal. I'm talking about grammars.

As you know, a grammar is a set of rules which define the structure of a language: its syntax and its morphology. In computer science, when we refer to grammars we are talking about lexical and parsing rules which allow us to transform an expression into actual working code. Hence, the solution to our problem requires defining this grammar and then processing it to, eventually, generate actual dates.

Moreover, this is a computer science problem, so we need to work out the conditions we are going to support in order to create a proper grammar and a proper object model. At the same time, we don't want tight coupling between the grammar and what we are really doing, so we need a way for both ends to communicate.

Conditions

In everyday language we tend to use two kinds of expressions when dealing with dates:

  • Adverbs of time: before, at, after and near, to name but a few.
  • Qualifiers, which allow us to talk about a specific day, a weekday, a week, a month, a fortnight, and so on.

Thus, our grammar and our object model should be able to support these expressions. Graphically, we could imagine our date as an object which behaves as follows:

[Figure: date schema]

In summary, our date is transformed by these adverbs and qualifiers, generating a new date that can in turn be transformed, iteratively, until the expression is fully resolved.
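The idea of a date that is transformed step by step can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: the function names (after, before, near) and their signatures are hypothetical and do not come from our actual object model. Each function takes a date and returns a new one, so the calls can be chained.

```python
from datetime import date, timedelta

# Weekday numbers as used by date.weekday(): Monday = 0 ... Sunday = 6
MONDAY, SATURDAY, SUNDAY = 0, 5, 6


def after(anchor: date, weekday: int, count: int = 1) -> date:
    """The count-th given weekday strictly after the anchor date."""
    offset = (weekday - anchor.weekday() - 1) % 7 + 1
    return anchor + timedelta(days=offset + 7 * (count - 1))


def before(anchor: date, weekday: int, count: int = 1) -> date:
    """The count-th given weekday strictly before the anchor date."""
    offset = (anchor.weekday() - weekday - 1) % 7 + 1
    return anchor - timedelta(days=offset + 7 * (count - 1))


def near(anchor: date, weekday: int) -> date:
    """The given weekday closest to the anchor (the anchor itself if it matches)."""
    forward = (weekday - anchor.weekday()) % 7
    backward = (anchor.weekday() - weekday) % 7
    if forward <= backward:
        return anchor + timedelta(days=forward)
    return anchor - timedelta(days=backward)


if __name__ == "__main__":
    start = date(2024, 8, 15)                 # a Thursday
    saturday = near(start, SATURDAY)          # 2024-08-17
    # Each result is a plain date, so it can feed the next transformation.
    print(after(saturday, MONDAY))            # 2024-08-19
    print(after(start, SUNDAY, count=3))      # 2024-09-01
```

Because every step returns an ordinary date, an expression of any depth can be resolved from the inside out.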

The grammar

To define the grammar we have used ANTLR. This tool includes both a lexical analyser and a parser, which allow us to generate a tree (an abstract syntax tree). As I stated before, we want to support natural language, which means that we have to create a set of lexer and parser rules for each language we want to support.

In the following example you can see the lexer and parser rules for Spanish.

[Figure: grammar]

Note that we are following the iterative approach mentioned previously: we have a main date expression (the atom), which is transformed by an at expression and then used by the before, after or near expressions. All of them are embedded in upper-level operations, allowing the desired recursion, as shown in the following example:

primer lunes del domingo de pascua al primer domingo de junio

(first Monday from Easter Sunday to the first Sunday in June)

[Figure: grammar sample]

This expression is certainly a bit strange, but we can also model simpler, more ordinary things:

Tres domingos después del Corpus Christi

(Three Sundays after Corpus Christi)

[Figure: grammar sample 2]

We can even support logical conditions, such as if-else ones.

def christmas: 25 de diciembre

def eve: 31 de diciembre

si christmas es igual ultimo sabado de diciembre entonces eve si no 1 dia despues de eve

(def christmas: December 25th

def eve: December 31st

if christmas equals the last Saturday in December then eve else one day after eve)

[Figure: grammar sample 3]
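To make the meaning of this conditional concrete, here is what it would compute if written by hand in Python. This is my own reading of the expression, not output from our library, and the function names (last_saturday_of_december, resolve) are hypothetical.

```python
from datetime import date, timedelta


def last_saturday_of_december(year: int) -> date:
    """Walk back from 31 December to the closest Saturday (weekday() == 5)."""
    last_day = date(year, 12, 31)
    return last_day - timedelta(days=(last_day.weekday() - 5) % 7)


def resolve(year: int) -> date:
    # def christmas: 25 de diciembre
    christmas = date(year, 12, 25)
    # def eve: 31 de diciembre
    eve = date(year, 12, 31)
    # si christmas es igual ultimo sabado de diciembre entonces eve
    if christmas == last_saturday_of_december(year):
        return eve
    # si no 1 dia despues de eve
    return eve + timedelta(days=1)


if __name__ == "__main__":
    print(resolve(2021))  # 2021-12-31 (25 December 2021 was the last Saturday)
    print(resolve(2024))  # 2025-01-01
```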

The software system

When developing software systems we want to avoid coupling, not only because it lets us use the interfaces we create from different places, but also because it gives us better testing capabilities. Hence, isolating the grammar from the actual code is pretty important. From this point of view, we can imagine our system as a simple piece of code with different input methods:

[Figure: model]

It follows the recursive approach already mentioned, providing a parse method that calculates the date from a natural-language expression. To do so it needs, obviously, the user's locale, to know whether the week starts on Sunday or Monday; the time zone; and whether weeks should be computed as full weeks, that is, from Monday to Sunday (or Sunday to Saturday), or not.
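Why the locale and the time zone matter can be shown with a small Python sketch. The helper names (week_bounds, local_date) are hypothetical and the first weekday is passed in directly rather than derived from a locale; the full-week option is left out because its exact behaviour is not described here.

```python
from datetime import date, datetime, timedelta, timezone

MONDAY, SUNDAY = 0, 6  # as returned by date.weekday()


def week_bounds(day: date, first_weekday: int) -> tuple[date, date]:
    """First and last day of the week containing `day`."""
    start = day - timedelta(days=(day.weekday() - first_weekday) % 7)
    return start, start + timedelta(days=6)


def local_date(instant: datetime, tz: timezone) -> date:
    """Calendar date of an instant as seen from a given time zone."""
    return instant.astimezone(tz).date()


if __name__ == "__main__":
    thursday = date(2024, 8, 15)
    # Same day, different week depending on where the week starts.
    print(week_bounds(thursday, MONDAY))  # 12 August to 18 August
    print(week_bounds(thursday, SUNDAY))  # 11 August to 17 August

    # Same instant, different calendar day depending on the time zone.
    instant = datetime(2024, 12, 31, 23, 30, tzinfo=timezone.utc)
    print(local_date(instant, timezone.utc))                  # 2024-12-31
    print(local_date(instant, timezone(timedelta(hours=1))))  # 2025-01-01
```

A fixed UTC+1 offset stands in for a real time zone only to keep the example free of external time zone data.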

Fitting pieces together

Fitting the grammar and the object model together isn't a hard task. A very simple approach is to implement this interaction as a stack machine, in which data is pushed onto a stack as we process it and popped whenever we need it. Each grammar rule should be treated as an independent function with its own stack, which stores its result on the parent's stack with no need for a return value.

[Figure: stack sample]
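The stack-machine idea can be sketched in Python as follows. This is a simplified illustration, not our implementation: the tree is built by hand as nested tuples instead of coming from ANTLR, and the rule names (rule_date, rule_after) and the tuple layout are assumptions made for the example.

```python
from datetime import date, timedelta


def rule_date(node, parent_stack):
    """Leaf rule ("date", year, month, day): push a literal date."""
    _, year, month, day = node
    parent_stack.append(date(year, month, day))


def rule_after(node, parent_stack):
    """Rule ("after", count, weekday, child): count-th weekday after the child's date."""
    _, count, weekday, child = node
    stack = []                # this rule's own stack
    evaluate(child, stack)    # the child pushes its result here
    anchor = stack.pop()
    offset = (weekday - anchor.weekday() - 1) % 7 + 1
    # No return value: the result goes onto the parent's stack.
    parent_stack.append(anchor + timedelta(days=offset + 7 * (count - 1)))


RULES = {"date": rule_date, "after": rule_after}


def evaluate(node, parent_stack):
    """Dispatch a node to the function that handles its rule."""
    RULES[node[0]](node, parent_stack)


def resolve(tree) -> date:
    root_stack = []
    evaluate(tree, root_stack)
    return root_stack.pop()


if __name__ == "__main__":
    # "the Monday after the third Sunday after 15 August 2024"
    tree = ("after", 1, 0, ("after", 3, 6, ("date", 2024, 8, 15)))
    print(resolve(tree))  # 2024-09-02
```

Since rules only talk to each other through stacks, new rules can be added to the dispatch table without touching the existing ones.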

Using the system, we can resolve expressions in a simple way, ready to be used in a real application.

[Figure: date resolution]

To know more

If you are interested in how this is actually implemented, you can check this GitHub repository (Spanish-only grammar); it is even published in Maven Central.