A bitter gulp of SimpleExpression

March 16, 2014 · OpenSource Dev Regex

This is part three of my series about the SimpleExpression DSL. Head over to the first part for more info on the idea itself, or to the second part for an explanation of the semantics of the DSL.

This project has been on my mind for quite a while now, and many colleagues (thanks Michael, Andreas, Miwi & Schubsi) and friends (thanks Ghusse) had the utter chance (can I call it that?) to help me theorize on the semantics of the language... or to play extensively with the various implementations along the way.

With all their helpful feedback, I tried to make this DSL ever more useful while keeping it as readable as possible, and finally implemented it as such. As I write these lines, I have reached a first milestone, which already lets me draw some partial conclusions.

Syntax

From a grammatical standpoint, the following expressions truly make sense and are very easy to read out loud:

.Letters.AtLeast(3).AtMost(4)
.Text("abcd").Except("a")
.Group.Text("aeiouy").Together.As("SomeLetters")

You get the point of the expression in an instant. Don't you?

Considering the set Letters, the AtLeast and AtMost qualifiers make sense. They define some kind of count. But even if the concept is exactly the same, applying those to groups doesn't work that well from a grammatical standpoint. In order to read the following right, you need to say 'twice' and not 'two':

.Group.Text("aeiouy").Together.Exactly(2)

It's still understandable, but not that convenient.

What about mixing repetition of a group and repetition of elements? The following is what SimpleExpression currently allows:

.Group.AtLeast(4).AtMost(5).Numbers.Exactly(2).Together.One(' ').AtMost(2)

This matches "4 to 5 groups of 2 numbers followed by at most 2 spaces", but reading it aloud one could understand "group at least 4 and at most 5 numbers, twice".
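For readers who think in regular expressions, the intended meaning can be checked against a hand-written pattern. This is a minimal sketch: the pattern below is an assumed equivalent of the sentence "4 to 5 groups of 2 numbers followed by at most 2 spaces", not verified output of SimpleExpression, and the class name GroupRepetitionSketch is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupRepetitionSketch
{
    // Assumed hand-written equivalent (not generated by SimpleExpression):
    // 4 to 5 groups, each made of exactly 2 digits followed by at most 2 spaces.
    public const string Pattern = @"^(?:[0-9]{2} {0,2}){4,5}$";

    public static bool Matches(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    public static void Main()
    {
        Console.WriteLine(Matches("12 34 56 78")); // four groups of two digits
        Console.WriteLine(Matches("12 34"));       // only two groups
    }
}
```

In the regex the group quantifier sits after the closing parenthesis, which is exactly the spot the DSL struggles to express in a readable way.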

The problem we have with this Group.AtMost.X.Exactly.Together.Y.AtMost structure comes from the AtMost.X.Exactly part. We don't know right away which qualifier applies to the X. If we were to put the group repetition after the Together, we would just move the issue one step further: Exactly.Z.AtMost is, again, not clear. This is semantically not easy to solve.

Last but not least, try to read this out loud:

.Group
  .Group
    .Text("abcd")
    .Group
      .Letters.And("-")
    .Together
  .Together
  .Text("cde")
.Together

Yep, not that easy. Encapsulation of loops is no fun for our minds.

SubExpressions

In order to ease that last case, I introduced the Sub() function, which allows you to define an expression and then reuse it in another one:

var se = new SimpleExpression();

var abcd = new SimpleExpression().Text("abcd").Generate();
var efgh = new SimpleExpression().Text("efgh").Generate();

se.Sub(abcd).Or.Sub(efgh).Generate();

This sure helps readability. Using it, we can split up the nested-grouping example from before. The result is much easier to read, but it doesn't solve the inherent problem, it merely hides it:

var innerMostGp = new SimpleExpression()
        .Group.Letters.And("-").Together.Generate();
var innerGp = new SimpleExpression()
        .Group.Text("abcd").Sub(innerMostGp).Together.Generate();
var outerGp = new SimpleExpression()
        .Group.Sub(innerGp).Text("cde").Together.Generate();

Implicit Cardinality

Let's consider the following example:

.EitherOf("a|b|c").AtLeast(2)

What is meant here? Is it "a, b or c, at least two of them" or "twice a or twice b or twice c"? And if it means the first one, then how can I express the second one?
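The two readings correspond to two different regular expressions, which makes the ambiguity easy to demonstrate. The patterns below are assumed hand-written equivalents of each reading, not what SimpleExpression generates, and CardinalitySketch is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CardinalitySketch
{
    // Reading 1: any of a, b or c, at least two characters in total.
    public const string AnyOfThem = @"^(?:a|b|c){2,}$";

    // Reading 2: one and the same letter, repeated at least twice.
    public const string SameLetter = @"^(?:a{2,}|b{2,}|c{2,})$";

    public static void Main()
    {
        foreach (var input in new[] { "ab", "aa", "cba" })
        {
            Console.WriteLine("{0}: reading 1 = {1}, reading 2 = {2}",
                input,
                Regex.IsMatch(input, AnyOfThem),
                Regex.IsMatch(input, SameLetter));
        }
    }
}
```

An input such as "ab" satisfies the first reading only, so the DSL needs two distinct spellings if both are to be expressible.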

Lose the experts

Before we go to the technical side of things ('cause that was not technical here yet, was it?), if you can write regular expressions, you are probably left wondering what some of the functions really translate to.

Does the following produce a character class [], a group () or a non-capturing group (?:)? How do I do one or the other?

se.Letters.Except("aeiou").And("§$%&").AtLeast(2).AtMost(4)

This is the curse of every developer who can already write a few regular expressions. They know what is possible, so they work backwards and try to figure out what the DSL really generates.
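To see why the distinction matters to a regex user, here is a small sketch of how the three constructs differ in .NET. The patterns are illustrative only and say nothing about which one SimpleExpression actually emits; GroupKindsSketch is a hypothetical name. The last two lines use .NET's character class subtraction, one possible way an Except could be expressed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupKindsSketch
{
    // Counts the groups a pattern defines. Group 0 (the whole match)
    // is always present, so a result of 1 means "no capturing group".
    public static int GroupCount(string pattern)
    {
        return new Regex(pattern).GetGroupNumbers().Length;
    }

    public static void Main()
    {
        Console.WriteLine(GroupCount(@"[xyz]{2,4}"));     // character class
        Console.WriteLine(GroupCount(@"(x|y|z){2,4}"));   // capturing group
        Console.WriteLine(GroupCount(@"(?:x|y|z){2,4}")); // non-capturing group

        // .NET character class subtraction: lowercase letters except vowels.
        Console.WriteLine(Regex.IsMatch("b", @"^[a-z-[aeiou]]$"));
        Console.WriteLine(Regex.IsMatch("a", @"^[a-z-[aeiou]]$"));
    }
}
```

All three patterns accept the same inputs, but only the capturing variant reports an extra group, which is precisely the detail the DSL sentence hides.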

What about this one (that's a simple email matcher, by the way)? Are the groups capturing by default? If so, we need another keyword to qualify a non-capturing group, which complicates the sentence a bit more. The end of the command, with its succession of Together and As, is a bit odd and forces you to backtrack in order to fully grasp what belongs where.

string allowedChars = @"!#$%&'*+/=?^_`{|}~-";
se
.Group
  .Letters.And(allowedChars).AtLeast(1)
.Together.As("beforeAt")
.One('@')
.Group
  .Letters.And(allowedChars).AtLeast(1)
  .Group
    .One(".")
    .Alphanumerics.And(allowedChars).AtLeast(1)
  .Together.As("dotAndAfter")
.Together.As("afterAt")
.Generate();

Regular expressions are really concise and precise. Our language becomes really bloated and verbose if we try to match it concept for concept.
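For comparison, here is what a hand-written .NET pattern for the same email example could look like. It is a sketch under assumptions: Letters is taken to mean [a-zA-Z], Alphanumerics to mean [a-zA-Z0-9], and groups are assumed to be named capturing groups. None of this is confirmed output of SimpleExpression, and EmailSketch is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailSketch
{
    // Same extra characters as in the DSL example. The trailing '-' ends up
    // last inside each character class, so it is read as a literal hyphen.
    private const string Allowed = @"!#$%&'*+/=?^_`{|}~-";

    // Assumed mapping: Letters -> a-zA-Z, Alphanumerics -> a-zA-Z0-9.
    public static readonly string Pattern =
        "^(?<beforeAt>[a-zA-Z" + Allowed + "]+)"
        + "@"
        + "(?<afterAt>[a-zA-Z" + Allowed + "]+"
        + "(?<dotAndAfter>\\.[a-zA-Z0-9" + Allowed + "]+))$";

    public static void Main()
    {
        Match m = Regex.Match("john@example.com", Pattern);
        Console.WriteLine(m.Success);
        Console.WriteLine(m.Groups["beforeAt"].Value);
        Console.WriteLine(m.Groups["afterAt"].Value);
        Console.WriteLine(m.Groups["dotAndAfter"].Value);
    }
}
```

The regex fits on a few lines, but reading it requires knowing the syntax by heart, which is the trade-off discussed here.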

In other words, if we were to simplify the regular expressions, we could create a DSL that would be useful to novices, but that experts will throw away.

Dynamic

A dynamic API à la Simple.Data means no IntelliSense support, which in turn means having to learn a new language. This is no biggie for all our friends in the dynamic languages, but in C# that's kind of an alien concept.

Simple.Data's commands are based on the structure of your database, which you (and only you) know, and on a few Simple.Data functions that you still have to learn. With this mixing of concepts, using dynamics really makes sense and provides real added value. It is unfortunately not the case with SimpleExpressions.

In SimpleExpressions all the functions are already known, and the API can just as well be covered by a fluent interface, which would allow IntelliSense support. In this regard, using dynamics is more of a hindrance than a benefit.
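A statically typed fluent interface could look roughly like the sketch below. Every name in it (FluentExpression, Letters, Text, Repeat) is hypothetical and only illustrates the idea; it is not the SimpleExpression or MagicExpression API. Because each method returns the builder itself, the IDE can list the available members after every dot.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical fluent builder, for illustration only.
public sealed class FluentExpression
{
    private readonly StringBuilder pattern = new StringBuilder();

    // Appends a character class for ASCII letters.
    public FluentExpression Letters()
    {
        pattern.Append("[a-zA-Z]");
        return this;
    }

    // Appends an escaped literal, wrapped in a non-capturing group so that
    // a following Repeat applies to the whole text and not its last character.
    public FluentExpression Text(string literal)
    {
        pattern.Append("(?:").Append(Regex.Escape(literal)).Append(")");
        return this;
    }

    // Appends a {min,max} quantifier to the previous element.
    public FluentExpression Repeat(int min, int max)
    {
        pattern.Append("{").Append(min).Append(",").Append(max).Append("}");
        return this;
    }

    public string Generate()
    {
        return pattern.ToString();
    }

    public static void Main()
    {
        string regex = new FluentExpression().Letters().Repeat(3, 4).Generate();
        Console.WriteLine(regex);
    }
}
```

The price is parentheses after every keyword, which is part of the readability the dynamic version was trying to preserve.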

I kind of knew that from the beginning and used this project as an exercise to get my hands dirty with dynamics. But from a product perspective it makes little to no sense to keep using them.

Abstract Syntax Tree

The SimpleExpression semantics cannot be linearly converted into a regular expression. As you can see in the example below, the positions of the AtLeast and As blocks are swapped compared to the positions of their regular expression counterparts:

.Group.AtLeast(3).Text("something").Together.As("theName")
=> (?<theName>something){3,}

To make this task more manageable, I created an Abstract Syntax Tree. I parse the expression and build an alternate representation of the command that is easier to manipulate. It works quite well, but tremendously increases the complexity of the implementation.
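The idea can be sketched with two node types. This is a simplified illustration with hypothetical names (Node, TextNode, GroupNode, AstSketch), not the actual SimpleExpression implementation: the group node stores the repetition and the name regardless of where they appeared in the chain, and only decides where to print them when rendering.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public abstract class Node
{
    public abstract string Render();
}

public sealed class TextNode : Node
{
    private readonly string text;
    public TextNode(string text) { this.text = text; }
    public override string Render() { return Regex.Escape(text); }
}

public sealed class GroupNode : Node
{
    public readonly List<Node> Children = new List<Node>();
    public int? AtLeast; // stated right after Group, at the start of the chain
    public string Name;  // stated by As(...), at the very end of the chain

    public override string Render()
    {
        var sb = new StringBuilder();
        // The name is rendered first, inside the opening parenthesis...
        sb.Append(Name == null ? "(" : "(?<" + Name + ">");
        foreach (Node child in Children) sb.Append(child.Render());
        sb.Append(")");
        // ...and the quantifier last, after the closing parenthesis.
        if (AtLeast.HasValue) sb.Append("{" + AtLeast.Value + ",}");
        return sb.ToString();
    }
}

public static class AstSketch
{
    // Builds the tree in the order the DSL states things:
    // .Group.AtLeast(3).Text("something").Together.As("theName")
    public static string Example()
    {
        var group = new GroupNode();
        group.AtLeast = 3;
        group.Children.Add(new TextNode("something"));
        group.Name = "theName";
        return group.Render();
    }

    public static void Main()
    {
        Console.WriteLine(Example());
    }
}
```

Once the tree exists, the order in which the qualifiers were written no longer matters, which is what makes the swap possible.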

What then?

Those are some of the major concerns I have and don't really know how to solve right now. The SimpleExpression semantics are really nice, but it's like death by a thousand paper cuts for the user. There are so many edge cases where the grammar doesn't quite fit that they tend to pull down the concept as a whole.

There are ways to change the language to make it foolproof. That requires using parentheses again and reordering some elements in a logical rather than grammatical order, thus losing some readability for the sake of precision. I tried to ignore those up to now to see how far I could take the SimpleExpression concept.

I must have bothered Ghusse so much with those ideas that he decided to give it a try, laid the foundations of this second concept and named it "MagicExpression". Instead of rolling out my own again, I joined him and we are making very good progress. That'll be the fourth article of this series!
