Regexus-Occulto!

March 18, 2014 · OpenSource Dev Regex

This is part four of my series on the creation of a regular expression DSL. If you missed them, here are the other parts:

As I said in the third part of this series, after I had bugged my friend Ghusse day in, day out about SimpleExpression, he finally couldn't hold back any longer, got his hands (really) dirty and rolled out the base of another approach that he named MagicExpression ("Magex"). Since then, we've grown "Magex" into a beautiful open source (MIT license) DSL that you can find on GitHub.

TL;DR: it's looking good!

In short, MagicExpression drops the dynamic part of SimpleExpression and comes back to a fluent API. Magex replaces parts of the commands with their less funky but unambiguous functional equivalents. As a direct consequence, it gets rid of the cumbersome Abstract Syntax Tree and thus of half the complexity of SimpleExpression's implementation. Finally, Magex pushes the DSL much further than SimpleExpression was ever able to, by adding many regular expression concepts to the equation.

The basics

To use MagicExpression, just install the following NuGet package:

Install-Package MagicExpression

In your code, start by instantiating a Magex object via:

var magicWand = Magex.New();

You're good to go. There is no need to add a .Generate() at the end of the chain: the expression is built as you go, and each part of the chain returns a Magex object that you can ask for its .Expression.

"MagicExpression for muggles" (©Ghusse)

Instead of diving into the full extent of MagicExpression, I will concentrate on some important semantic choices and on the differences with SimpleExpression, and mostly illustrate them by example.

For a more thorough explanation of the API, head over to the project's GitHub page, where you will find some of those examples and a list of all the supported functions... or use IntelliSense ;)

Example 1: Matching a floating point number

var magicWand = Magex.New();

magicWand.Character('-').Repeat.AtMostOnce()
     .CharacterIn(Characters.Numeral).Repeat.Any()
     .Character('.')
     .CharacterIn(Characters.Numeral).Repeat.AtLeastOnce();

// Creates a regex corresponding to
// -?[0-9]*\.[0-9]+
// We can now stick it in a regex object and start matching
var floatingPointNumberDetector = new Regex(magicWand.Expression);

// Will match "1.234", "-1.234", "0.0"
// Will not match "0", "1,234", "0x234", "#1a4f66"

As you can see, the MagicExpression syntax is much closer to a programming language than to Shakespearean prose. It is still very easy to read, but more boilerplate code has made its way into the language.

You can already spot some differences (with SimpleExpression):

  1. Element sets are handled via the generic CharacterIn() function, whose behavior can be tailored (among other ways) via an enumeration parameter
  2. A repetition block is triggered via the .Repeat member
  3. Optional blocks are handled via... an .AtMostOnce() repetition (smart!)
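
For comparison, here is a minimal sketch that feeds the raw pattern quoted in the Example 1 comment straight into .NET's Regex class and checks the listed inputs. It does not use MagicExpression at all; the class name FloatPatternSketch and the sample list are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FloatPatternSketch
{
    // The raw pattern quoted in the Example 1 comment, written by hand.
    public static readonly Regex Detector = new Regex(@"-?[0-9]*\.[0-9]+");

    public static void Main()
    {
        string[] samples = { "1.234", "-1.234", "0.0", "0", "1,234", "0x234", "#1a4f66" };
        foreach (var sample in samples)
        {
            // IsMatch looks for the pattern anywhere in the input.
            Console.WriteLine($"{sample}: {Detector.IsMatch(sample)}");
        }
    }
}
```

Note that the pattern is not anchored, so it also finds a number embedded in a longer string such as "abc1.5def".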

Example 2: XML Tag Matching

Grouping is done in two ways: via the Group() function to create a non-capturing group, or via the Capture() and CaptureAs() functions to... capture a group. The BackReference() function can also be used in combination with CaptureAs().

Here is an example using capture and backreference to a previously defined group (neat):

var magicWand = Magex.New()
    .Character('<')
    .CaptureAs("tag", x =>  
        x.CharacterNotIn('>').Repeat.AtLeastOnce())
    .Character('>')
    .Character().Repeat.Any().Lazy()
    .String("</")
    .BackReference("tag")
    .Character('>');

var badHtmlTagDetector = new Regex(magicWand.Expression);

// Matches "<strong>hello world</strong>" & "<h1>A title</h1>"
// Doesn't match "<h1>A tag mismatch</strong>"
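
As a point of reference, the same idea can be written by hand with a .NET named group and a \k<name> backreference. The pattern below is my assumption of a roughly equivalent expression, not necessarily the exact string Magex generates; TagPatternSketch is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatternSketch
{
    // (?<tag>...) captures the tag name; \k<tag> requires the same text again.
    // .*? is the lazy "any character, any number of times" part.
    public static readonly Regex Detector =
        new Regex(@"<(?<tag>[^>]+)>.*?</\k<tag>>");

    public static void Main()
    {
        string[] samples =
        {
            "<strong>hello world</strong>",
            "<h1>A title</h1>",
            "<h1>A tag mismatch</strong>"
        };
        foreach (var sample in samples)
        {
            Console.WriteLine($"{sample}: {Detector.IsMatch(sample)}");
        }
    }
}
```

The backreference is what rejects the mismatched pair: the closing tag has to repeat whatever the named group captured.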

Example 3: URL Matching

To match a URL that can start with http or ftp, I can do the following:

var magicWand = Magex.New();

magicWand.Options = RegexOptions.IgnoreCase;

const string allowedChars = @"!#$%&'*+/=?^_`{|}~-";

magicWand.Alternative(
    Magex.New().String("http"),
    Magex.New().String("ftp"))
.Character('s').Repeat.AtMostOnce()
.String("://")
.Group(Magex.New().String("www."))
    .Repeat.AtMostOnce()
.CharacterIn(Characters.Alphanumeric, allowedChars);

There are (at least) two things to notice here: first, the option set on the Magex itself to ignore case, and second, the two parameters passed to the CharacterIn(params char[]) function.
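
To make the structure of this chain easier to follow, here is a hand-written pattern that mirrors it step by step. It is only a sketch: I am assuming that Characters.Alphanumeric corresponds to letters and digits, and the generated expression may differ in detail. UrlPatternSketch is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlPatternSketch
{
    // (?:http|ftp)  the Alternative(...)
    // s?            the optional 's'
    // ://           the literal separator
    // (?:www\.)?    the optional non-capturing group
    // [...]         one alphanumeric or allowed special character
    public static readonly Regex Detector = new Regex(
        @"(?:http|ftp)s?://(?:www\.)?[a-z0-9!#$%&'*+/=?^_`{|}~-]",
        RegexOptions.IgnoreCase);

    public static void Main()
    {
        string[] samples =
        {
            "https://www.example.com",
            "FTP://files.example.org",
            "mailto:someone@example.com"
        };
        foreach (var sample in samples)
        {
            Console.WriteLine($"{sample}: {Detector.IsMatch(sample)}");
        }
    }
}
```

Because the optional 's' follows the whole alternative, a pattern of this shape accepts "ftps://" as well as "https://".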

Example 4: IP Address

Here's how one can match an IP address:

Magex.New()
    .Builder.NumericRange(1, 255).Character('.')
    .Builder.NumericRange(0, 255).Character('.')
    .Builder.NumericRange(0, 255).Character('.')
    .Builder.NumericRange(0, 255);

As you can see, there is a .Builder property that gives you access to some predefined functions like NumericRange(). You could also use the Literal(string) function and pass in a regular expression, but in this case, why bother when we have covered those nasty number ranges for you?
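
To show why those number ranges are nasty, here is a sketch of what matching 0 to 255 and 1 to 255 looks like when written by hand. This is my own hand-rolled pattern, not the output of NumericRange(), and I have added ^ and $ anchors that the Magex example does not have. IpPatternSketch and its constants are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IpPatternSketch
{
    // 250-255 | 200-249 | 100-199 | 0-99 (no leading zero on two-digit values)
    private const string Octet0To255 = @"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])";

    // Same as above, but the last branch starts at 1 instead of 0.
    private const string Octet1To255 = @"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]?)";

    // Anchored so the whole input has to be an address (an added assumption).
    public static readonly Regex Detector = new Regex(
        "^" + Octet1To255 + @"\." + Octet0To255 + @"\." + Octet0To255 + @"\." + Octet0To255 + "$");

    public static void Main()
    {
        string[] samples = { "192.168.0.1", "10.0.0.255", "256.1.1.1", "0.1.2.3" };
        foreach (var sample in samples)
        {
            Console.WriteLine($"{sample}: {Detector.IsMatch(sample)}");
        }
    }
}
```

A numeric range has to be spelled out digit position by digit position, which is exactly the kind of detail a builder function can hide.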

Example 5: Just for fun

Can you guess what this matches?

Magex.New()
    .Character('#')
    .Alternative(
        Magex.New().CharacterIn(
                Characters.Numeral, "abcdefABCDEF")
            .Repeat.Times(6),
        Magex.New().CharacterIn(
                Characters.Numeral, "abcdefABCDEF")
            .Repeat.Times(3)
            .EndOfLine());

And this one?

Magex.New()
    .Character('0')
    .CharacterIn("xX")
    .CharacterIn(Characters.Numeral, "abcdefABCDEF")
    .Repeat.Times(6);

I'm sure you got it: both match hexadecimal values. The first one matches a hexadecimal color such as #123abc or #1ac, the second one a hexadecimal number in the format 0x123abc.
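
For reference, here is how I would write the two expressions by hand. These patterns are my assumption of rough equivalents (with EndOfLine() read as $), not necessarily the exact strings Magex produces; HexPatternSketch is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexPatternSketch
{
    // '#' followed by either six hex digits, or three hex digits at the end of the line.
    public static readonly Regex Color =
        new Regex(@"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3}$)");

    // '0', then 'x' or 'X', then six hex digits.
    public static readonly Regex HexNumber =
        new Regex(@"0[xX][0-9a-fA-F]{6}");

    public static void Main()
    {
        foreach (var sample in new[] { "#123abc", "#1ac", "#1ac9" })
        {
            Console.WriteLine($"{sample}: {Color.IsMatch(sample)}");
        }
        foreach (var sample in new[] { "0x123abc", "0X12" })
        {
            Console.WriteLine($"{sample}: {HexNumber.IsMatch(sample)}");
        }
    }
}
```

The end-of-line check on the three-digit branch is what keeps a four-digit value like "#1ac9" from being accepted as a short color.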

What now?

Trading a bit of semantic sexiness for a lot of precision, MagicExpression already has a very solid base in both concepts and implementation. There are still some small semantic changes on the way, little things here and there that we don't quite like yet, but for the most part it is already quite a stable product. Do you have other ideas? If so, don't hesitate to hit us with your wishes. And if you want to fork this for another language, we'd be honored!

There is one more major feature we'd like to investigate: reverse engineering existing regular expressions to produce neat MagicExpressions. Wouldn't that be awesome... and magic... and did I say awesome?

As for SimpleExpression? I do not know what will happen to it. We have a few other projects cooking already (including MagicExpression) so it might end up retiring earlier than planned... but only time will tell!
