15/02/2024 6 Minutes read

I use regular expressions for mundane tasks as well as complex ones. I personally think they’re some of the most useful tools out there. But these same expressions seem to be hated by a lot of people, with valid reasons for some, less so for others in my opinion. I’ll try to reply to the most common claims, then show why I think you should at least give them a second chance.

Why would I learn how to create one when I have tools that can do that for me?

It’s true that today’s technology (and humanity) allows us to generate regular expressions on demand, through AI RegExp websites and such. The little subtlety you’ll need to answer to, though, is the following: is it correct? Does it cover all your needs?

Let’s say you’re working on an ID string validator. You need a somewhat complex RegExp prompt like « give me an alphanumeric validator, followed by 1 to 4 consecutive underscores. Do not start with an asterisk, treat string boundaries as input boundaries », and you decided to ask an AI for it. It gives you the following result: /^[^*][a-z0-9]+_{1,4}$/ — which is a mostly accurate output for the given prompt.

Is it entirely correct?
Not quite. This regexp fails to match (i.e., doesn’t match any part of your string):

  • If there are no alphanumeric characters (for example ____): you should replace [a-z0-9]+ by [a-z0-9]*. The + needs at least one character in order to capture something (otherwise it interrupts the regex and returns null), while the asterisk does not.
  • If your input is case-insensitive: right now the current regex only matches lowercase letters and will fail with a single uppercase one. You can add a “case-insensitive” flag (/…./i), or specify a broader range of accepted characters in the brackets.
  • If you need to capture letters outside the ASCII charset (for example, if you need to match an input in any language other than English). In JavaScript, you’ll need to use the Unicode flag. And inside the captured letters, you need to specify a letter class with p{Letter}.

This gives us a completed regex of: /^[^*][a-z0-9p{Letter}]*_{1,4}$/iu. In plain English, this regex: doesn’t match asterisks, includes all alphanumeric characters including the ones outside of the ASCII range, makes them optional, and ends with between 1 and 4 underscores. The regex is also case-insensitive.
Finding and fixing these issues is definitely possible, but it’s a skill that needs to be practiced and nurtured first.

Even xkcd agrees: Regexes are awesome (source)

Why spend any time learning Regexes anyway? Those things are too difficult for what they’re worth

Regexes are indeed hard to create. They have a syntax that is not obvious on first glance, and it’s hard to conceptualize how a Regex works when faced with one. But here are a few reasons to at least scratch the Regex surface:

  • Regexes are a part of many languages’ standard libraries. Their specifics might vary, but the syntax’s the same for the most part. So this is something you’ll carry with you whenever you try out new programming languages.
  • grep and find are Unix tools whose purpose is finding and searching for filenames, file contents, etc. You can use Regexes (with the mostly-standard syntax mentioned earlier) to narrow your search query down and build scripts on top of them (see examples below).
  • Regexes are the fundamental mechanism behind password and input validation in forms. Ever had a website tell you that “you’re missing a special character” or “the password doesn’t have a number”? Ever needed a way to validate a form field, so that it only contains alphanumeric characters? That’s where RegExes come in to play. They are, in this scenario, the most efficient way to validate your input before doing anything with it.

There’s one exception to that rule, though, and it’s email validation. Whatever you do, just don’t use regexes and use something else.

Alright, I made my first Regex but I still think they’re difficult to understand. Any tips to improve a Regex? Make it more readable?

First of all, please document and comment your Regex. You’ll be grateful for that gesture when you come back to read it a few weeks later.

There is one feature I’d like to introduce to help you improve your Regex, which is capture groups. Capture groups allow you to retain a subset of an expression. An example use case would be capturing the international phone indicator while making sure what you captured is actually a phone number. This would be a basic example:

const frPhoneNumber = “+33 6 32930129”;
const lbPhoneNumber = “+961 71 34 94 82”;
const indicatorReg = /^(+[0–9]+)/;
const matchedString = frPhoneNumber.match(indicatorReg);
assert(matchedString[0], “+33”);
const lbMatchedString = lbPhoneNumber.match(indicatorReg);
assert(lbMatchedString[0], "+961");

At the start of the line, open a capturing group (with the parentheses), grab a plus sign and any number you can find ; then find as many consecutive numbers as possible.

One little problem with capture groups is that they rely on indices, like in an array, to store their results. This, indeed, poses a problem with maintenance and readability, which is why named capture groups exist. The syntax, to be fair, is a little obtuse. But not only we get to keep what makes a capturing group, a capturing group, but we’ll be able to name it, contextualize it, make it a variable like any other. The RegExp in question will return a « groups » property, and that property is itself a dictionary that maps each group to its matched string, provided anything was matched, obviously.

const frPhoneNumber = "+33 6 32930129";
const lbPhoneNumber = "+961 71 34 94 82";

const indicatorReg = /^(?<indicator>+[0–9]+)/;
const matchedString = indicatorReg.exec(frPhoneNumber);
const lbMatchedString = indicatorReg.exec(lbPhoneNumber);

assert(matchedString.groups.indicator, "+33");
assert(lbMatchedString.groups.indicator, "+961");

The Regex behaves exactly as it did before. But instead of accessing it using a numerical index, you can access it through its capturing group name ( indicator) and the method RegExp.exec (and not String.match from earlier).

A few personal use cases

Batch renaming

I had to convert a workspace that was full of JavaScript / JSX files into their TypeScript equivalents. It was a big back-office web app with a lot of components, sub-components, sub-sub-components, etc.

The most “obvious” approach would be to go through each file, convert its extension to TypeScript, fix its typing issues, then rinse and repeat over the whole workspace. Or, you could create a little script that goes as follows (assuming my code was under ./src):

  • The first step is finding all the files that need replacing. They end with either .js or .jsx, so all I needed was to run find ./src -regex ".js(x|)$" in a Bash shell. This regex simply amounts to: Match the filenames where the final four characters are a dot, j, s, and an x. Also match these filenames where there's no x.
  • So now that I had a list of target files, I could simply run a rename operation from js to ts , and jsx to tsx (basically swap the j for a t). And voilà. What could’ve been a long and painstaking task has been solved in one line or two. Now all I need to do is the type-fixing operation.

And that’s just with simple, 2-step scripts. Regex results can be coupled with any tool that can read and write from standard output (Node, Go, Rust, whatever).

Find & Replace in IDEs

A colleague once had this example JSON:

“field1”: 34,
“field2”: [82, 93, 77],
“field3”: 1,
“field4”: 344,
“field5”: [8212, 19, 7],
“field6”: 1,
“field7”: 34,
“field8”: [2, 9, 777],
“field9”: 1,
“field10”: 3,
“field11”: 8,
“field12”: 1,
“field13”: 34,
“field14”: 56,
“field15”: 21,
...50 more fields

And had to replace all these numbers with their string equivalents (e.g. 77 to “77” ). He was in a hurry, he needed the change as soon as possible. Instead of going through each field and adding the quotation marks, what he could’ve done (and ended up doing) is simply running this RegExp in his IDE: Replace all occurences of /(: [0-9]+)/ by ': "$1"' .

IDEs support capture groups for more precise Find & Replace operations. The reason why I had him add the colon before the number, is to allow the RegExp to distinguish between keys that could have a number, and the value that is a number, so that keys such as field14don’t get caught in the Find operation.

Find & Replace operations match all results by default, so we simply assign all the numbers to a capturing group, and then surround that group by quotation marks when applying the Replace operation.

Some advice to help you like regexes more (or hate them less)

  • Get acquainted with the terminology: terms like “capturing group”, “quantificator”, “lookahead”, “lookbehind”, can and will look intimidating at first. But remember that even the word “variable” seemed alien to you at some point.
  • Start practicing with simple, well-documented use cases, like your country’s phone numbers or basic field value validation.
  • Slowly add complexity as you go, trying to implement more and more advanced features: maybe you could capture a certain number only if it’s followed by two question marks. Maybe you could try to only capture words and not the spaces surrounding them.
  • Get some help! Tools like regex101.com provide sandboxes where not only your regex will be decoded and explained, but you can also test multiple of inputs against that regex in real time.

In defense of regular expressions was originally published in ekino-france on Medium, where people are continuing the conversation by highlighting and responding to this story.