Toddler steps with Simple Expressions

March 12, 2014 · OpenSource Dev Regex

In the first part, birth of SimpleExpressions, I described how I came to the idea of writing a DSL to help write regular expressions. In this second part, I'll tell you a bit more about its semantics.

The whole point was to create something that reads well, that (unlike regular expressions) is easy to grasp later on, and that can potentially abstract most of the regular expression talk away.

The basics

To use SimpleExpressions, just install the following NuGet package:

Install-Package SimpleExpressions

In your code, start by instantiating a SimpleExpressions object and store it as a dynamic. You can then chain functions on this object to build a command, and finish the command with .Generate(); to tell the parser that the end of the expression has been reached.
Once your command is fully built, you can access the produced regular expression pattern via .Expression. You can also use the helper functions IsMatch(string) and Matches(string) to apply the produced regular expression directly.

Here is a summary:

//Declaration
dynamic se = new SimpleExpressions();

//The command
SimpleExpressions result = se.Letters.Generate();

//Matching
bool success = result.IsMatch("something");
MatchCollection matchCollection = result.Matches("something");

By this time, you must have noticed that Intellisense is not doing its job. Well, that's the drawback of dynamics. The functions you are calling don't really exist; they are just conventions that you and the SimpleExpressions logic share, hence no Intellisense support. But don't worry, 90% of the functions will be listed in this post so there cannot be that many ;)
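
To make the "conventions" idea more concrete, here is a minimal sketch of how a dynamic object can accept member names that were never declared. This is not the SimpleExpressions source: the RecordingChain class and its Tokens list are hypothetical names made up for the illustration, and only System.Dynamic.DynamicObject is a real framework type.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;

// Hypothetical illustration only: records every member the caller chains.
public class RecordingChain : DynamicObject
{
    public List<string> Tokens { get; } = new List<string>();

    // Called for property-style access such as chain.Letters
    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        Tokens.Add(binder.Name);
        result = this; // return ourselves so the chain can continue
        return true;
    }

    // Called for method-style access such as chain.Text("http")
    public override bool TryInvokeMember(InvokeMemberBinder binder, object[] args, out object result)
    {
        Tokens.Add(binder.Name + "(" + string.Join(", ", args) + ")");
        result = this;
        return true;
    }
}

public static class RecordingChainDemo
{
    public static void Main()
    {
        dynamic chain = new RecordingChain();
        chain.Letters.Text("http").Maybe("s");

        // Cast back to the static type to read the recorded names
        List<string> tokens = ((RecordingChain)chain).Tokens;
        Console.WriteLine(string.Join(" -> ", tokens));
        // Prints: Letters -> Text(http) -> Maybe(s)
    }
}
```

Because the names are only resolved at run time, the compiler (and therefore Intellisense) has nothing to offer while you type, and a typo only shows up when the chain is executed.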

The language

Here comes the interesting part: the semantics of the language itself.

Assuming you have a SimpleExpressions object se that you manipulate as a dynamic, you can form the following commands (note: two functions chained one after the other mean "the first one, then the second one").

se.Letters.Numbers.Alphanumerics
se.Text("a_string_to_match")
se.One('c')
se.OneOf("aeiouy") //One of those letters
se.EitherOf("http|ftp") //One of those strings
se.Text("http").Maybe("s") //Optional "s"

Numeric ranges are one of the painful parts of regular expressions. To express "a number between 0 and 255", for instance, you have to write a rather convoluted expression by hand, which becomes trivial with SimpleExpressions:

se.NumberInRange("0-255")
//Outputs "[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]"
// i.e. a number in 0-9 or 10-99 or 100-199 or 200-249 or 250-255
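
If you want to convince yourself that the pattern quoted above really covers 0 to 255 and nothing else, a quick check with the plain .NET Regex class is enough. This sketch does not use SimpleExpressions at all; the RangeCheck class name is made up, and the ^(?:...)$ wrapper is my addition so that the whole input has to match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeCheck
{
    // Pattern quoted above, wrapped in ^(?:...)$ so it must match the whole input
    public static readonly Regex Range0To255 = new Regex(
        "^(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$");

    public static void Main()
    {
        for (int n = 0; n <= 300; n++)
        {
            bool expected = n <= 255;
            bool actual = Range0To255.IsMatch(n.ToString());
            if (expected != actual)
                throw new Exception("Unexpected result for " + n);
        }
        Console.WriteLine("0-255 accepted, 256-300 rejected");
    }
}
```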

Since we have Letters, Numbers and Alphanumerics as predefined sets to help us, we need functions to add characters to those sets and to remove characters from them.

se.Letters.And("-_~=%$&")
se.Letters.Except("a")
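
For comparison, here is roughly what these two commands could look like as hand-written .NET patterns. This is a sketch under assumptions: I take Letters to mean the ASCII class [a-zA-Z], and the patterns are written by hand, not copied from the library's output. The CharacterSets class name is made up. The second pattern relies on .NET character class subtraction.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterSets
{
    // Assumption: "Letters" is taken to mean the ASCII class [a-zA-Z]
    // Letters plus the extra characters - _ ~ = % $ &
    public static readonly Regex LettersAndSymbols = new Regex(@"^[a-zA-Z\-_~=%$&]+$");

    // .NET character class subtraction: every letter except lowercase 'a'
    public static readonly Regex LettersExceptA = new Regex(@"^[a-zA-Z-[a]]+$");

    public static void Main()
    {
        Console.WriteLine(LettersAndSymbols.IsMatch("key=value&x")); // True
        Console.WriteLine(LettersAndSymbols.IsMatch("key value"));   // False (space is not in the set)
        Console.WriteLine(LettersExceptA.IsMatch("hello"));          // True
        Console.WriteLine(LettersExceptA.IsMatch("banana"));         // False (contains 'a')
    }
}
```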

Grouping elements is a very important part of regexes. Among other things, it allows us to capture named or unnamed groups. In SimpleExpressions it is done via a combination of two or three functions: .Group.X.Together.As(name). The Group and Together keywords act as "(" and ")" in this case, while As(string) gives a name to the captured group:

se.Group
    .Text("http")
    .Maybe("s")
  .Together
  .As("protocol")
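
In plain regex terms, a named group of this kind is usually written (?<name>...). The sketch below shows a hand-written equivalent and how to read the capture back with the standard Regex API. The pattern is my assumption of what such a command stands for, not the library's actual output, and the NamedGroup class and Extract method are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroup
{
    // Hand-written equivalent (an assumption, not the library's output):
    // "http", an optional "s", captured under the name "protocol"
    public static readonly Regex Protocol = new Regex(@"(?<protocol>https?)");

    // Returns the captured protocol, or null when the input does not match
    public static string Extract(string input)
    {
        Match match = Protocol.Match(input);
        return match.Success ? match.Groups["protocol"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Extract("https://example.com")); // https
        Console.WriteLine(Extract("http://example.com"));  // http
    }
}
```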

Repetitions are also easy:

se.Letters.AtLeast(3)
se.Letters.AtLeast(1).AtMost(5)
se.Letters.Exactly(4)

se.Group.Exactly(10)
    .Numbers
  .Together
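
The first three repetition commands correspond to the usual regex quantifiers {n,}, {n,m} and {n}. Here is a small sketch with hand-written patterns, again assuming Letters stands for [a-zA-Z] (the Repetitions class is a made-up name, and the anchors are mine). I leave the grouped example out, since its exact output depends on how the library scopes the quantifier.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Repetitions
{
    // Assumption: "Letters" is taken to mean [a-zA-Z]
    public static readonly Regex AtLeastThree = new Regex("^[a-zA-Z]{3,}$");  // AtLeast(3)
    public static readonly Regex OneToFive = new Regex("^[a-zA-Z]{1,5}$");    // AtLeast(1).AtMost(5)
    public static readonly Regex ExactlyFour = new Regex("^[a-zA-Z]{4}$");    // Exactly(4)

    public static void Main()
    {
        Console.WriteLine(AtLeastThree.IsMatch("ab"));   // False
        Console.WriteLine(AtLeastThree.IsMatch("abc"));  // True
        Console.WriteLine(OneToFive.IsMatch("abcdef"));  // False
        Console.WriteLine(ExactlyFour.IsMatch("abcd"));  // True
    }
}
```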

Last but not least, the Or operator allows us to push the EitherOf idea further. The Or applies to the previous and next elements of the chain:

se.Text("http")
  .Or
    .Text("ftp")
  .Text("://")
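
Limiting Or to its two neighbours matters, because in a raw regex the | operator swallows everything on both sides unless you group it. The sketch below contrasts the two readings with hand-written patterns. Both patterns and the Alternation class name are my own illustration, not output produced by SimpleExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Alternation
{
    // ("http" or "ftp") followed by "://": the reading described above
    public static readonly Regex Grouped = new Regex("^(?:http|ftp)://");

    // Without grouping, | splits the whole pattern: "^http" or "ftp://"
    public static readonly Regex Ungrouped = new Regex("^http|ftp://");

    public static void Main()
    {
        Console.WriteLine(Grouped.IsMatch("ftp://host"));   // True
        Console.WriteLine(Grouped.IsMatch("httpX"));        // False
        Console.WriteLine(Ungrouped.IsMatch("httpX"));      // True, probably not what was intended
    }
}
```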

Wrapping up

As you can see, the SimpleExpressions functions are really simple and tend to produce very readable expressions. Try reading some out loud, you'll see.

Here is an example of how to match everything that comes after the @ in an email address (courtesy of Ghusse):

string allowedChars = @"!#$%&'*+/=?^_`{|}~-";
se.Group
    .Letters.And(allowedChars).AtLeast(1)
    .Group
        .One(".")
        .Alphanumerics.And(allowedChars).AtLeast(1)
    .Together.As("dotAndAfter")
  .Together.As("afterAt")
.Generate();
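
To give an idea of what this command spares you from typing, here is a hand-written pattern with the same two named groups, used through the standard Regex API. It is only my reading of the command above, under the assumptions that Letters means [a-zA-Z] and Alphanumerics means [a-zA-Z0-9]; the leading @ in the pattern and the EmailTail class name are my additions, and the pattern the library really generates may differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailTail
{
    // Hand-written sketch (not the library's output):
    // after the @, one run of allowed characters, then one dot-separated part
    public static readonly Regex AfterAt = new Regex(
        @"@(?<afterAt>[a-zA-Z!#$%&'*+/=?^_`{|}~-]+(?<dotAndAfter>\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+))");

    public static void Main()
    {
        Match match = AfterAt.Match("user@example.com");
        Console.WriteLine(match.Groups["afterAt"].Value);     // example.com
        Console.WriteLine(match.Groups["dotAndAfter"].Value); // .com
    }
}
```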

Unfortunately, the devil is in the details, and some cornerstones of the semantics have major drawbacks that I must acknowledge. That will be the subject of a third article.
