Regular Expressions
Suppose you store all of your friends’ contact information in a text file and are looking for one friend’s email address. You don’t remember exactly what it is, but you do know that it contains his name followed by his birth year. A regex can help you find your friend’s email address with the little information you have. We’ll see how in the next section.
A regular expression (shortened as regex or regexp) is a sequence of characters that defines a search pattern. Such patterns are typically used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation.
You can think of regular expressions as wildcards (also known as glob patterns) on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is /^.*\.txt$/.
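To see that the wildcard and the regex select the same names, here is a small Python sketch. The file names are made up for illustration, and fnmatch is the glob matcher from Python's standard library.

```python
import fnmatch
import re

# Hypothetical file names, used only for illustration.
files = ["notes.txt", "photo.png", "todo.txt.bak", "readme.txt"]

# Wildcard (glob) matching with the pattern *.txt
glob_hits = [name for name in files if fnmatch.fnmatchcase(name, "*.txt")]

# Regex matching with the equivalent pattern ^.*\.txt$
regex_hits = [name for name in files if re.search(r"^.*\.txt$", name)]

print(glob_hits)
print(regex_hits)
```

Both lists should contain only notes.txt and readme.txt, since todo.txt.bak does not end in .txt.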
Basic Example
Remember when you were looking for your friend’s email address? Suppose his name is “Bob” and his birth year is “1992”. You know that his email address contains both his name and birth year.
Suppose the following is a small extract of your text file:
...
notaprguy@outlook.com
kmiller@yahoo.ca
bobkeller1992@gmail.com
sbmrjbr@outlook.com
slaff@yahoo.com
...
This file is huge, and it would take hours to find Bob’s email address by hand. You also can’t use the Search tool because you don’t know his exact address. Your best and fastest chance of finding Bob’s contact information is a regular expression. The following regular expression will match Bob’s email address:
/bob.*1992@.*\.[a-zA-Z]{2,4}/
Throughout this tutorial, we’ll explain what each component of the regex above means.
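As a rough illustration, the search could be run in Python like this. The contacts string stands in for the text file and holds only the five addresses from the extract above; reading a real file is left out.

```python
import re

# Stand-in for the contact file: the five lines from the extract above.
contacts = """notaprguy@outlook.com
kmiller@yahoo.ca
bobkeller1992@gmail.com
sbmrjbr@outlook.com
slaff@yahoo.com"""

# Same pattern as above, without the surrounding slashes.
pattern = re.compile(r"bob.*1992@.*\.[a-zA-Z]{2,4}")

# Keep every line in which the pattern is found.
found = [line for line in contacts.splitlines() if pattern.search(line)]
print(found)  # ['bobkeller1992@gmail.com']
```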
Basic Syntax
A regular expression is just a pattern of characters that we use to perform a search in a text. For example, the regular expression /the/ means: the letter t, followed by the letter h, followed by the letter e. The / character delimits the start and the end of the regular expression.
The regular expression /the/ matches the characters the literally (case sensitive). There is one full match in the sentence.
/the/g > The fat cat sat on the mat.
The same applies to the regex /The/. Note that the two expressions are different because regexes are case sensitive.
/The/g > The fat cat sat on the mat.
JavaScript
In JavaScript, regular expressions are also objects. These patterns are used with the exec() and test() methods of RegExp, and with the match(), matchAll(), replace(), replaceAll(), search(), and split() methods of String.
You construct a regular expression in one of two ways:
- Using a regular expression literal, which consists of a pattern enclosed between slashes, as follows:
  let regex = /The/
  Regular expression literals provide compilation of the regular expression when the script is loaded. If the regular expression remains constant, using this can improve performance.
- Calling the constructor function of the RegExp object, as follows:
  let regex = new RegExp('The')
  Using the constructor function provides runtime compilation of the regular expression. Use the constructor function when you know the regular expression pattern will be changing, or you don’t know the pattern and are getting it from another source, such as user input.
The exec() method executes a search for a match in a string. It returns an array of information, or null on a mismatch. In the following example, the script uses the exec() method to find a match in a string.
let regex = /The/g
let res = regex.exec('The fat cat sat on the mat.')
You can print the res variable in the console using console.log(). The returned array holds the matched text as its first element (along with extra properties such as index and input):
["The"]
Python
Python has a built-in package called re, which can be used to work with regular expressions.
import re
regex = r"The"
test_str = "The fat cat sat on the mat."
res = re.findall(regex, test_str)
The re.findall() function returns all non-overlapping matches of the pattern in the string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, it returns a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
The output of print(res) is:
['The']
More information here: re — Regular expression operations documentation.
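The effect of groups on the result of re.findall() is easy to miss, so here is a small sketch using the same sentence. The patterns are chosen only for illustration.

```python
import re

test_str = "The fat cat sat on the mat."

# No group: findall returns the full matches.
no_group = re.findall(r"\wat", test_str)

# One group: findall returns only what the group captured.
one_group = re.findall(r"(\w)at", test_str)

# Two groups: findall returns a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w)(at)", test_str)

print(no_group)    # ['fat', 'cat', 'sat', 'mat']
print(one_group)   # ['f', 'c', 's', 'm']
print(two_groups)  # [('f', 'at'), ('c', 'at'), ('s', 'at'), ('m', 'at')]
```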
Java
Java provides the java.util.regex package for pattern matching with regular expressions.
A regular expression, specified as a string, must first be compiled into an instance of the Pattern class. The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression. All of the state involved in performing a match resides in the matcher, so many matchers can share the same pattern.
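import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexDemo {
    public static void main(String[] args) {
        String regex = "The";
        String str = "The fat cat sat on the mat.";
        Pattern p = Pattern.compile(regex);  // compile the pattern once
        Matcher m = p.matcher(str);          // bind it to the input string
        while (m.find()) {                   // advance to the next match
            System.out.println(m.group(0));  // group 0 is the whole match
        }
    }
}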
- Pattern.compile() compiles the given regular expression into a pattern.
- p.matcher() creates a matcher that will match the given input against this pattern.
- m.find() attempts to find the next subsequence of the input sequence that matches the pattern.
- m.group() returns the input subsequence captured by the given group during the previous match operation.
This will produce the following output: The.
sed
sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed’s ability to filter text in a pipeline which particularly distinguishes it from other types of editors.
Following is the general syntax for sed:
/pattern/command
Here, pattern is a regular expression, and command is one of the commands defined as follows:
- p prints the line
- d deletes the line
- s/pattern1/pattern2/ substitutes the first occurrence of pattern1 with pattern2
By default sed prints all processed input (except input that has been modified/deleted by commands such as d). Use -n to suppress output, and the p command to print specific lines. While matching patterns, you can use regular expressions.
$ echo "The fat cat sat on the mat." | sed -n '/The/p'
The output is as follows:
The fat cat sat on the mat.
Note that sed prints the entire line containing the match.
Meta Characters
Meta characters are the building blocks of regular expressions. Meta characters do not stand for themselves but instead are interpreted in some special way. Some meta characters take on a different meaning when written inside square brackets. The meta characters are as follows:
Meta character | Description |
---|---|
. | Period matches any single character except a line break. |
[ ] | Character class. Matches any character contained between the square brackets. |
[^ ] | Negated character class. Matches any character that is not contained between the square brackets. |
* | Matches 0 or more repetitions of the preceding symbol. |
+ | Matches 1 or more repetitions of the preceding symbol. |
? | Makes the preceding symbol optional. |
{n,m} | Braces. Matches at least “n” but not more than “m” repetitions of the preceding symbol. |
(xyz) | Character group. Matches the characters xyz in that exact order. |
\| | Alternation. Matches either the characters before or the characters after the symbol. |
\ | Escapes the next character. This allows you to match reserved characters [ ] ( ) { } . * + ? ^ $ \ \| |
^ | Matches the beginning of the input. |
$ | Matches the end of the input. |
The Full Stop
The meta character . matches any single character. It will not match return or newline characters. For example, the regular expression .at matches any character (except for line terminators) followed by the characters at literally (case sensitive).
/.at/g > The fat cat sat on the mat.
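A quick way to check this behavior is the following Python sketch, which uses re.findall() to play the role of the g flag.

```python
import re

sentence = "The fat cat sat on the mat."

# .at : any single character followed by "at"
dot_matches = re.findall(r".at", sentence)
print(dot_matches)  # ['fat', 'cat', 'sat', 'mat']

# By default the dot does not match a newline, so "a\nb" has no match.
across_newline = re.findall(r"a.b", "a\nb")
print(across_newline)  # []
```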
Character Set
Square brackets are used to specify character sets. Use a hyphen inside a character set to specify a range of characters. The order of the characters inside the square brackets doesn’t matter, although a range itself must run from low to high (a-z, not z-a). For example, the regular expression [Tt]he matches a single character from the set [T, t] (case sensitive), followed by the characters he literally (case sensitive).
/[Tt]he/g > The fat cat sat on the mat.
A period inside a character set, however, means a literal period. The regular expression at[.] matches the characters at literally (case sensitive), followed by a literal . character.
/at[.]/g > The fat cat sat on the mat.
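Both character-set examples can be tried with a short Python sketch such as this one.

```python
import re

sentence = "The fat cat sat on the mat."

# [Tt]he : "T" or "t", followed by "he"
the_matches = re.findall(r"[Tt]he", sentence)
print(the_matches)  # ['The', 'the']

# at[.] : inside the brackets the period is a literal period
period_matches = re.findall(r"at[.]", sentence)
print(period_matches)  # ['at.']
```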
Negated Character Set
In general, the caret symbol represents the start of the string, but when it is typed right after the opening square bracket it negates the character set. For example, the regular expression [^c]at matches any single character except c, followed by the characters at literally (case sensitive).
/[^c]at/g > The fat cat sat on the mat.
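In Python, the same negated set can be checked like this (a minimal sketch):

```python
import re

sentence = "The fat cat sat on the mat."

# [^c]at : any character except "c", followed by "at"
not_c = re.findall(r"[^c]at", sentence)
print(not_c)  # ['fat', 'sat', 'mat'] ("cat" is skipped)
```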
Repetitions
The meta characters +, * or ? are used to specify how many times a subpattern can occur.
The Star
The * symbol matches zero or more repetitions of the preceding matcher. The regular expression a* means: zero or more repetitions of the preceding lowercase character a. If it appears after a character set or class, it repeats the whole character set. For example, the regular expression [a-z]* means: any number of lowercase letters in a row.
/[a-z]*/g > The fat cat sat on the mat 1234.
The * symbol can be used with the meta character . to match any string of characters (.*), like we did in the first example to find Bob’s email address.
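Because * also accepts zero repetitions, a pattern like [a-z]* can match the empty string. The Python sketch below filters those empty matches out to show the runs of lowercase letters; the filtering step is a choice made for this illustration.

```python
import re

sentence = "The fat cat sat on the mat 1234."

# findall also returns empty matches (zero repetitions); drop them here.
runs = [m for m in re.findall(r"[a-z]*", sentence) if m]
print(runs)  # ['he', 'fat', 'cat', 'sat', 'on', 'the', 'mat']

# a* accepts zero repetitions, so it matches the empty string too.
matches_empty = re.fullmatch(r"a*", "") is not None
matches_run = re.fullmatch(r"a*", "aaa") is not None
print(matches_empty, matches_run)  # True True
```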
The Plus
The + symbol matches one or more repetitions of the preceding character. For example, the regular expression c.+t means: a lowercase c, followed by at least one character, followed by a lowercase t. Because + is greedy and matches as much as it can, the t that ends the match is the last t in the sentence.
/c.+t/g => The fat cat sat on the mat.
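A short Python sketch makes the extent of the match visible:

```python
import re

sentence = "The fat cat sat on the mat."

# c.+t : "c", then one or more characters, then "t".
# The match runs from the "c" in "cat" to the last "t" in the sentence.
plus_matches = re.findall(r"c.+t", sentence)
print(plus_matches)  # ['cat sat on the mat']
```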
The Question Mark
The meta character ? makes the preceding character optional. This symbol matches zero or one instance of the preceding character. For example, the regular expression [T]?he means: optional uppercase T, followed by a lowercase h, followed by a lowercase e.
/[T]?he/g => The fat cat sat on the mat.
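The optional T can be observed with this small Python sketch:

```python
import re

sentence = "The fat cat sat on the mat."

# [T]?he : an optional "T", then "he"
optional_t = re.findall(r"[T]?he", sentence)
print(optional_t)  # ['The', 'he']
```

The second match is only he: the t of the second "the" is lowercase, so it is not part of the match.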
Braces
Braces (also called quantifiers) are used to specify the number of times that a character or a group of characters can be repeated. For example, the regular expression [0-9]{2,3} means: match at least 2 digits, but not more than 3, ranging from 0 to 9.
/[0-9]{2,3}/g => The number was 9.9997 but we rounded it off to 10.0.
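The sketch below runs the same pattern in Python. Note how 9997 yields 999 and leaves a single 7 that is too short to match.

```python
import re

text = "The number was 9.9997 but we rounded it off to 10.0."

# [0-9]{2,3} : between two and three digits in a row
digit_runs = re.findall(r"[0-9]{2,3}", text)
print(digit_runs)  # ['999', '10']
```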
Capturing Groups
A capturing group is a group of sub-patterns that is written inside parentheses. It is possible to use the alternation | meta character inside a capturing group. For example, the regular expression (c|f|s)at means: a lowercase c, f or s, followed by a, followed by t.
/(c|f|s)at/g => The fat cat sat on the mat.
Non-Capturing Groups
A non-capturing group is a group that matches the characters but does not capture them. A non-capturing group is denoted by a ? followed by a : within parentheses. For example, the regular expression (?:c|f|s)at is similar to (c|f|s)at in that it matches the same characters but will not create a capture group.
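The difference between the two kinds of group shows up in what the regex engine hands back. A minimal Python sketch:

```python
import re

sentence = "The fat cat sat on the mat."

# Capturing group: findall returns what the group captured.
captured = re.findall(r"(c|f|s)at", sentence)
print(captured)  # ['f', 'c', 's']

# Non-capturing group: nothing is captured, so findall returns full matches.
uncaptured = re.findall(r"(?:c|f|s)at", sentence)
print(uncaptured)  # ['fat', 'cat', 'sat']

# groups() lists the captured groups of a single match.
with_group = re.search(r"(c|f|s)at", sentence).groups()
without_group = re.search(r"(?:c|f|s)at", sentence).groups()
print(with_group, without_group)  # ('f',) ()
```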
Alternation
In a regular expression, the vertical bar | is used to define alternation. Alternation is like an OR statement between multiple expressions. Now, you may be thinking that character sets and alternation work the same way. But the big difference between character sets and alternation is that character sets work at the character level but alternation works at the expression level. For example, the regular expression (T|t)he|fat means: either (an uppercase T or a lowercase t, followed by a lowercase h, followed by a lowercase e) OR (a lowercase f, followed by a lowercase a, followed by a lowercase t). Note that I included the parentheses for clarity, to show that either expression in parentheses can be met and it will match.
/(T|t)he|fat/g => The fat cat sat on the mat.
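Since the pattern contains a capturing group, the Python sketch below uses re.finditer() and group(0) to list the full matches rather than the captured letters.

```python
import re

sentence = "The fat cat sat on the mat."

# (T|t)he|fat : either "The"/"the", or "fat"
either = [m.group(0) for m in re.finditer(r"(T|t)he|fat", sentence)]
print(either)  # ['The', 'fat', 'the']
```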
Escaping Special Characters
A backslash \ is used in regular expressions to escape the next character. This allows us to include reserved characters such as { } [ ] / \ + * . $ ^ | ? as matching characters. To use one of these special characters as a matching character, prepend it with \.
For example, an unescaped . matches any character except a newline. To match a literal . in an input string, escape it as \. instead. The regular expression (f|c|m)at\.? means: a lowercase f, c or m, followed by a lowercase a, followed by a lowercase t, followed by an optional . character.
/(f|c|m)at\.?/g => The fat cat sat on the mat.
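Here is how the escaped period behaves in a short Python sketch. A raw string (the r prefix) keeps the backslash intact.

```python
import re

sentence = "The fat cat sat on the mat."

# (f|c|m)at\.? : "fat", "cat" or "mat", optionally followed by a literal "."
escaped = [m.group(0) for m in re.finditer(r"(f|c|m)at\.?", sentence)]
print(escaped)  # ['fat', 'cat', 'mat.']
```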
Anchors
In regular expressions, we use anchors to check whether a match sits at the start or at the end of the input string. There are two types of anchors: the caret ^, which checks whether the matching character is the first character of the input, and the dollar sign $, which checks whether the matching character is the last character of the input string.
The Caret
The caret symbol ^ is used to check if a matching character is the first character of the input string. The regular expression ^(T|t)he means: an uppercase T or a lowercase t must be the first character in the string, followed by a lowercase h, followed by a lowercase e. Applying it gives the following result.
/^(T|t)he/g => The fat cat sat on the mat.
The Dollar Sign
The dollar sign $ is used to check if a matching character is the last character in the string. For example, the regular expression (at\.)$ means: a lowercase a, followed by a lowercase t, followed by a . character, and the match must be at the end of the string.
/(at\.)$/g => The fat cat. sat. on the mat.
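Both anchors can be compared against their unanchored versions in a small Python sketch:

```python
import re

sentence = "The fat cat sat on the mat."

# Without an anchor, both "The" and "the" match.
unanchored = [m.group(0) for m in re.finditer(r"(T|t)he", sentence)]
# With ^, only the occurrence at the very start matches.
anchored = [m.group(0) for m in re.finditer(r"^(T|t)he", sentence)]
print(unanchored, anchored)  # ['The', 'the'] ['The']

end_text = "The fat cat. sat. on the mat."

# "at." occurs three times, but only the last one sits at the end.
occurrences = len(re.findall(r"at\.", end_text))
at_end = re.search(r"(at\.)$", end_text)
print(occurrences, at_end.group(0), at_end.end() == len(end_text))  # 3 at. True
```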
Shorthand Character Sets
There are a number of convenient shorthands for commonly used character sets:
Shorthand | Description |
---|---|
\w | Matches alphanumeric characters: [a-zA-Z0-9_] |
\W | Matches non-alphanumeric characters: [^\w] |
\d | Matches digits: [0-9] |
\D | Matches non-digits: [^\d] |
\s | Matches whitespace characters: [\t\n\f\r\p{Z}] |
\S | Matches non-whitespace characters: [^\s] |
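The shorthands can be tried with the Python sketch below; the sample string is made up. One caveat worth knowing: for ordinary Python strings, \w, \d and \s are Unicode-aware by default, and the re.ASCII flag restricts them to ASCII-only sets like the ones in the table.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Room 42, floor 3"

digits = re.findall(r"\d+", text)          # runs of digits
words = re.findall(r"\w+", text)           # runs of word characters
chunks = re.findall(r"\S+", text)          # runs of non-whitespace
ascii_words = re.findall(r"\w+", "café", re.ASCII)  # "é" is excluded

print(digits)       # ['42', '3']
print(words)        # ['Room', '42', 'floor', '3']
print(chunks)       # ['Room', '42,', 'floor', '3']
print(ascii_words)  # ['caf']
```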
Flags
Flags are also called modifiers because they modify the output of a regular expression. These flags can be used in any order or combination, and are an integral part of the RegExp.
Flag | Description |
---|---|
i | Case insensitive: Match will be case-insensitive. |
g | Global Search: Match all instances, not just the first. |
m | Multiline: Anchor meta characters work on each line. |
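Python spells these flags differently from the /.../gim notation, so the following sketch is only an approximate translation: re.IGNORECASE stands in for i, re.MULTILINE for m, and the choice between re.search() and re.findall() plays the role of g.

```python
import re

sentence = "The fat cat sat on the mat."

# i flag: case-insensitive matching
sensitive = re.findall(r"the", sentence)
insensitive = re.findall(r"the", sentence, re.IGNORECASE)
print(sensitive, insensitive)  # ['the'] ['The', 'the']

# g flag: search() stops at the first match, findall() returns them all
first = re.search(r".at", sentence).group(0)
every = re.findall(r".at", sentence)
print(first, every)  # fat ['fat', 'cat', 'sat', 'mat']

# m flag: ^ matches at the start of every line, not only the input
lines = "The fat\ncat sat\non the mat."
single_line = re.findall(r"^\w+", lines)
multi_line = re.findall(r"^\w+", lines, re.MULTILINE)
print(single_line, multi_line)  # ['The'] ['The', 'cat', 'on']
```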
Lab: Email Validation
Creating a regex to validate an email address may seem simple at first but it turns out to be a complex task.
The following list is a summary of email address standards:
- All email addresses are in 7-bit US ASCII.
- Email addresses consist of a local part, the “@” symbol, and the domain.
- The local part can contain alphabetic and numeric characters and these symbols: !#$%&'*+-/=?^_`{|}~
- The domain can be bracketed or plain.
- The maximum length of the local part is 64 characters.
- The maximum length of a domain is 253 characters.
- The maximum allowable length of an email address is 320 characters.
- The top level domain must be all alphabetic.
These standards are only the tip of the iceberg. If you want to see the full list, check out the official RFC 5322.
Your mission, should you choose to accept it, is to write a regular expression that validates (most) email addresses. Build it step by step. Start off by validating the general pattern:
123@abc.xyz
Simple answer:
/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$/i
Advanced answer:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
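To experiment with the lab, the simple answer can be wrapped in a small Python helper. This is a sketch: the function name looks_like_email and the sample addresses are invented for illustration, and case-insensitive matching is switched on because the pattern itself lists only uppercase letters.

```python
import re

# The simple pattern from above; re.IGNORECASE makes A-Z cover a-z as well.
EMAIL_RE = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$", re.IGNORECASE)


def looks_like_email(address: str) -> bool:
    """Return True if the address fits the simple pattern (hypothetical helper)."""
    return EMAIL_RE.search(address) is not None


# Made-up sample inputs: two plausible addresses and two malformed ones.
samples = ["123@abc.xyz", "bobkeller1992@gmail.com", "no-at-sign.com", "user@localhost"]
results = {address: looks_like_email(address) for address in samples}
print(results)
```

The first two samples should pass and the last two should fail: one has no @ symbol, and the other has no dot followed by a top-level domain.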