Automatically Generating Regular Expressions

Computers Cannot Read Your Mind

A lot of people are looking for a program that can automatically generate regular expressions for them. The program would only take examples of valid matches as input, and produce the proper regular expression as output, inferring the user’s idea of “proper” as if by magic. Unfortunately, no computer program will ever be able to generate a meaningful regular expression based purely on a list of valid matches. Let me show you why.

Suppose you provide the examples 111111 and 999999. Which regular expression should the computer generate?

  1. A regex matching exactly those two examples: (?:111111|999999)
  2. A regex matching 6 identical digits: (\d)\1{5}
  3. A regex matching any 6-character mix of ones and nines: [19]{6}
  4. A regex matching any 6 digits: \d{6}
  5. Any of the first four, with word boundaries, e.g. \b\d{6}\b
  6. Any of the first four, not preceded or followed by a digit, e.g. (?<!\d)\d{6}(?!\d)
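
To see how far these candidates diverge, here is a minimal sketch in Python that runs the first four patterns against the two examples and a few other inputs. The `CANDIDATES` table and the `accepted_by` helper are names made up for this illustration. All four patterns accept both examples, yet they disagree on almost everything else.

```python
import re

# The first four candidate patterns from the list above, keyed by a short description.
CANDIDATES = {
    "exactly the two examples": r"(?:111111|999999)",
    "six identical digits": r"(\d)\1{5}",
    "six ones and nines": r"[19]{6}",
    "any six digits": r"\d{6}",
}

def accepted_by(text):
    """Return the descriptions of all candidates that match the whole text."""
    return [name for name, pattern in CANDIDATES.items()
            if re.fullmatch(pattern, text)]

# The two examples are accepted by every candidate; the other inputs are not.
for sample in ("111111", "999999", "191919", "555555", "123456"):
    print(sample, "->", accepted_by(sample))
```

The examples alone give no reason to prefer one pattern over another. Only the inputs that were never listed tell them apart.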

As you can see, there are many ways in which examples can be generalized into a regular expression. The only way for the computer to build a predictable regular expression is to require you to list all possible matches. Then it could generate a search pattern that matches exactly those strings, and nothing else. Usually, providing an exhaustive list of matches is exactly what we’re trying to avoid. And when you do have an exhaustive list of all possible matches, an optimized plain text search processing the whole list at once will be as fast as or faster than a regex search. The plain text search can be optimized to scan the text only once, without the backtracking that most regular expression engines perform.
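
The one predictable kind of generation is easy to sketch. The Python function below, whose name `regex_from_exhaustive_list` is invented for this illustration, turns an exhaustive list into an alternation that matches exactly the listed strings. It is not an optimized multi-string search; it only shows that the resulting regex knows nothing the list did not already say.

```python
import re

def regex_from_exhaustive_list(matches):
    """Build a pattern that matches exactly the listed strings and nothing else."""
    # Longest first, so a shorter alternative never shadows a longer one.
    ordered = sorted(set(matches), key=lambda s: (-len(s), s))
    return "(?:" + "|".join(re.escape(m) for m in ordered) + ")"

EXAMPLES = ["111111", "999999"]
pattern = regex_from_exhaustive_list(EXAMPLES)
print(pattern)  # (?:111111|999999)

# For whole-string checks, a plain set lookup gives the same answers without a regex.
allowed = set(EXAMPLES)
for candidate in ("111111", "555555"):
    assert bool(re.fullmatch(pattern, candidate)) == (candidate in allowed)
```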

If you don’t want to list all possible matches, you need a higher-level description. Instead of providing a long list of 6-digit numbers, you simply tell the program to match “any six digits”. The regular expression syntax itself is one way to provide such a description. Regular expressions are powerful enough that they can describe any text that doesn’t depend on its context. “Any six digits” is written as \d{6} in regular expression syntax.

To make the higher-level description easy to work with, it needs domain knowledge. Matching a date between January 1st and March 31st is much easier if your tool or language knows what a date is. This is where regular expressions fall short. Regular expressions only know about characters. Essentially, a regular expression describes which character comes next, or which characters are allowed next.
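
A rough sketch of the difference, in Python: the regex below only describes the shape of a yyyy-mm-dd string, character by character, while the date range itself is checked by code that knows what a date is. The function name `in_first_quarter` is hypothetical and chosen for this example only.

```python
import re
from datetime import date

# Character-level description: four digits, a hyphen, two digits, a hyphen, two digits.
DATE_SHAPE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def in_first_quarter(text):
    """True if text is a real yyyy-mm-dd date from January 1st to March 31st."""
    m = DATE_SHAPE.fullmatch(text)
    if m is None:
        return False
    year, month, day = map(int, m.groups())
    try:
        d = date(year, month, day)  # domain knowledge: rejects e.g. February 30th
    except ValueError:
        return False
    return date(year, 1, 1) <= d <= date(year, 3, 31)

print(in_first_quarter("2024-02-29"))  # True, 2024 is a leap year
print(in_first_quarter("2023-02-29"))  # False, no such day in 2023
print(in_first_quarter("2024-04-01"))  # False, outside the range
```

Expressing the same rule as a single regex is possible, but the month lengths and leap years have to be spelled out as alternatives of allowed characters.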

      
What RegexMagic Can Do for You

This is where RegexMagic comes in. RegexMagic knows about dates, and a whole host of other patterns. You can tell RegexMagic you want a date between January 1st and March 31st, and that you want it in yyyy-mm-dd format, simply by selecting the “date and time” pattern and setting its options. Once you’ve done that, RegexMagic magically spits out your regular expression.

In practice, most of the regular expressions you want won’t neatly fit into one of RegexMagic’s predefined patterns. If you mark 1.2.12 as a whole, RegexMagic will guess it’s a date (1 February 2012, German style), rather than a product version number. If you want a regex that matches 3 numbers delimited by dots, mark the numbers and the dots separately into 5 fields. Select the integer pattern for the numbers and the literal text pattern for the dots. Then RegexMagic can again magically spit out your regex, even if it didn’t magically read your mind.

Once you have created your higher-level description in RegexMagic, which is called a RegexMagic formula, editing and customizing that description is trivial compared to editing a regex. If you decide that the parts of a version should be restricted to values between 1 and 255, simply set the limits on the integer patterns, and regenerate the regular expression.
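
To illustrate why the regex is harder to edit than the description, here are two hand-written Python patterns for the version number example. They are not RegexMagic’s actual output; the names `ANY_VERSION`, `PART_1_TO_255`, and `LIMITED_VERSION` are assumptions made for this sketch. Going from “any integer” to “an integer between 1 and 255” is one setting in the description, but it rewrites every numeric field of the regex.

```python
import re

# Three integers delimited by dots, with no limit on their values.
ANY_VERSION = r"\d+\.\d+\.\d+"

# One integer from 1 to 255, without leading zeros, spelled out character by character.
PART_1_TO_255 = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]?)"

# The same three fields, each restricted to 1..255.
LIMITED_VERSION = rf"{PART_1_TO_255}\.{PART_1_TO_255}\.{PART_1_TO_255}"

for text in ("1.2.12", "255.255.255", "1.2.256", "0.1.2"):
    print(text,
          bool(re.fullmatch(ANY_VERSION, text)),
          bool(re.fullmatch(LIMITED_VERSION, text)))
```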
