Regular Expressions

Raw Strings

Backslashes introduce escape sequences in string literals, so the \n in this path is read as a newline character:

>>> file_name = "C:\projects\nathan"
>>> file_name
'C:\\projects\nathan'
>>> print(file_name)
C:\projects
athan

If we want to use a literal backslash, we need to escape it by putting two backslashes instead:

>>> file_name = "C:\\projects\\nathan"
>>> file_name
'C:\\projects\\nathan'
>>> print(file_name)
C:\projects\nathan

We can turn off character escaping completely by using “raw” strings, which you can make by prefixing your string with an r character:

>>> file_name = r"C:\projects\nathan"
>>> file_name
'C:\\projects\\nathan'
>>> print(file_name)
C:\projects\nathan
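
As a quick check of the three spellings above, here is a minimal sketch that compares them directly. The variable names are illustrative only and do not appear elsewhere in this guide.

```python
# Three spellings of the same Windows-style path (names are illustrative).
escaped = "C:\\projects\\nathan"  # every backslash escaped by hand
raw = r"C:\projects\nathan"       # raw string: backslashes kept as written
broken = "C:\\projects\nathan"    # the "\n" here becomes a newline character

print(escaped == raw)          # True: both hold single literal backslashes
print(len(raw))                # 18
print("\n" in broken)          # True: the path now contains a newline
print(len(r"\n"), len("\n"))   # 2 1
```

Raw strings matter for regular expressions because patterns such as \w and \d rely on the backslash reaching the re module unchanged.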

findall

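The findall function works similarly to search, except that instead of a match object it returns a list of every match in a string. The examples below assume the re module has been imported and that greeting is the string "hello world".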

>>> match = re.findall(r'\w*', greeting)
>>> match
['hello', '', 'world', '']

Note that the space and end of string result in empty strings in this list because we’re searching for zero or more word characters, and they do match zero word characters.

The + quantifier matches the preceding element one or more times. Using it here removes the empty strings, since neither the space nor the end of the string contains one or more word characters.

>>> re.findall(r'\w+', greeting)
['hello', 'world']

More regex syntax

The . character matches any single character except a newline:

>>> re.findall(r'o.', greeting)
['o ', 'or']
>>> re.findall(r'.l.', greeting)
['ell', 'rld']

If we want to match a literal period character we will need to use a backslash to escape the period:

>>> re.findall(r'\.', "hi.")
['.']

We can use square brackets to match any one character from a set of characters:

>>> re.findall(r'[aeiou]', greeting)
['e', 'o', 'o']

We can also use a dash to match characters within ranges of characters:

>>> re.findall(r'[a-z]', "Hi there")
['i', 't', 'h', 'e', 'r', 'e']
>>> re.findall(r'[A-Za-z]', "Hi there")
['H', 'i', 't', 'h', 'e', 'r', 'e']
>>> re.findall(r'[0-9]', "Hi there")
[]

The caret character can be used at the beginning of a character set to denote negation. This will match anything except for lowercase letters:

>>> re.findall(r'[^a-z]', "Hi there")
['H', ' ']

split

Let’s split a string by words:

>>> sentence = "Oh what a day, what a lovely day!"
>>> from collections import Counter
>>> sentence.split()
['Oh', 'what', 'a', 'day,', 'what', 'a', 'lovely', 'day!']
>>> Counter(sentence.split())
Counter({'what': 2, 'a': 2, 'day!': 1, 'day,': 1, 'lovely': 1, 'Oh': 1})

The split method on strings splits based on whitespace characters.

With the split function in the re module, we can split a string by a regular expression.

Let’s use \W+ to split on runs of one or more “non-word” characters:

>>> re.split(r'\W+', sentence)
['Oh', 'what', 'a', 'day', 'what', 'a', 'lovely', 'day', '']
>>> Counter(re.split(r'\W+', sentence))
Counter({'day': 2, 'what': 2, 'a': 2, '': 1, 'lovely': 1, 'Oh': 1})
>>> Counter(filter(None, re.split(r'\W+', sentence)))
Counter({'what': 2, 'day': 2, 'a': 2, 'lovely': 1, 'Oh': 1})

We could accomplish nearly the same thing by using findall to find all words.

>>> re.findall(r'\w+', sentence)
['Oh', 'what', 'a', 'day', 'what', 'a', 'lovely', 'day']
>>> Counter(re.findall(r'\w+', sentence))
Counter({'a': 2, 'day': 2, 'what': 2, 'Oh': 1, 'lovely': 1})
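
The two approaches differ only at the edges of the string: split leaves an empty string wherever the text starts or ends with non-word characters, while findall returns only the words. A small sketch of that comparison, with illustrative variable names:

```python
import re

sentence = "Oh what a day, what a lovely day!"

split_words = re.split(r'\W+', sentence)
found_words = re.findall(r'\w+', sentence)

# The trailing "!" is a separator, so split reports an empty piece after it.
print(split_words[-1] == '')                          # True
# Once empty pieces are dropped, both approaches agree.
print([w for w in split_words if w] == found_words)   # True
```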

match

A caret can be put at the beginning of a regular expression to denote that it should only match starting at the beginning of the target string. Similarly, a dollar sign can be put at the end of a regular expression to denote that the match must end at the end of the target string.

>>> re.search(r'^wo', "hello world")
>>> re.search(r'^wo', "world")
<_sre.SRE_Match object; span=(0, 2), match='wo'>
>>> re.search(r'^wo$', "world")
>>> re.search(r'^wo$', "wo")
<_sre.SRE_Match object; span=(0, 2), match='wo'>

The match function behaves like search with a leading caret, so match requires the pattern to match starting at the beginning of the string.

>>> re.match('what', sentence)
>>> re.match('.*what', sentence)
<_sre.SRE_Match object; span=(0, 19), match='Oh what a day, what'>
>>> re.match('Oh', sentence)
<_sre.SRE_Match object; span=(0, 2), match='Oh'>

The fullmatch function behaves like search with both a leading caret and a trailing dollar sign, so fullmatch requires the pattern to match the entire string.

>>> re.fullmatch(r'what', sentence)
>>> re.fullmatch(r'.*what', sentence)
>>> re.fullmatch(r'.*what.*', sentence)
<_sre.SRE_Match object; span=(0, 33), match='Oh what a day, what a lovely day!'>
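
To see how match and fullmatch relate to an anchored search, we can compare the spans they report. This is a minimal sketch: span_of and anchored_equivalents_agree are hypothetical helper names, and gluing the anchors on with plain string concatenation is only safe for simple patterns like the ones used in this section.

```python
import re

sentence = "Oh what a day, what a lovely day!"

def span_of(match):
    """Return the matched span, or None when there was no match."""
    return match.span() if match else None

def anchored_equivalents_agree(pattern, text):
    """Compare match/fullmatch against search with explicit anchors."""
    match_ok = (span_of(re.match(pattern, text))
                == span_of(re.search('^' + pattern, text)))
    full_ok = (span_of(re.fullmatch(pattern, text))
               == span_of(re.search('^' + pattern + '$', text)))
    return match_ok and full_ok

# Each of the patterns used in this section reports True.
for pattern in ['what', '.*what', 'Oh', '.*what.*']:
    print(pattern, anchored_equivalents_agree(pattern, sentence))
```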

sub

You can use the sub function to replace parts of a string.

Let’s replace all vowels with the letter “x”:

>>> re.sub(r'[aeiou]', r"x", greeting)
'hxllx wxrld'

We can use parentheses to group parts of regular expressions. Groups can be referenced by their number in the replacement string:

>>> re.sub(r'([aeiou])', r"x\1", greeting)
'hxellxo wxorld'
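
Group numbers count opening parentheses from left to right, so a pattern with two groups can rearrange or repeat what it matched. A short sketch, again taking greeting to be "hello world"; the result names are illustrative:

```python
import re

greeting = "hello world"

# \1 is the first word and \2 the second, so this swaps them.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', greeting)
# Referencing the same group twice doubles each vowel.
doubled = re.sub(r'([aeiou])', r'\1\1', greeting)

print(swapped)  # world hello
print(doubled)  # heelloo woorld
```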

We can use \d to match a digit character and \D to match a non-digit character. Curly braces with a number inside repeat the preceding element exactly that many times.
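
Before combining these pieces, it can help to try each one alone. The sketch below uses a made-up sample string, and the variable names are only for illustration.

```python
import re

text = "July 4, 1776"  # illustrative sample text

four_digits = re.findall(r'\d{4}', text)  # exactly four digits in a row
two_digits = re.findall(r'\d{2}', text)   # pairs of digits, left to right
non_digits = re.findall(r'\D+', text)     # runs of non-digit characters

print(four_digits)  # ['1776']
print(two_digits)   # ['17', '76']
print(non_digits)   # ['July ', ', ']
```

Note that the lone 4 is too short to match \d{2}, so it does not appear in the second list.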

Let’s make a regular expression to normalize phone numbers:

>>> re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r"\1-\2-\3", "(202) 456-1111")
'202-456-1111'
>>> re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r"\1-\2-\3", "202 - 456 - 1111")
'202-456-1111'
>>> re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r"\1-\2-\3", "202.456.1111")
'202-456-1111'

compile

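If we need to use the same regular expression multiple times throughout our program, we can pre-compile it. The re module already caches recently used patterns, so the speed-up is usually modest, but a compiled pattern skips the cache lookup on every call and gives the expression a reusable name.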

>>> phone_re = re.compile(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*')
>>> phone_re.sub(r"(\1) \2 - \3", "202.456.1111")
'(202) 456 - 1111'
>>> phone_re.sub(r"\1-\2-\3", "202.456.1111")
'202-456-1111'
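
A common way to use a compiled pattern is to define it once at module level and wrap it in a small function. This is a sketch rather than a prescribed layout; PHONE_RE and normalize_phone are hypothetical names chosen for this example.

```python
import re

# Compiled once when the module loads and reused by every call below.
PHONE_RE = re.compile(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*')

def normalize_phone(text):
    """Rewrite a 10-digit phone number in text as 202-456-1111 style."""
    return PHONE_RE.sub(r'\1-\2-\3', text)

numbers = ["(202) 456-1111", "202 - 456 - 1111", "202.456.1111"]
normalized = [normalize_phone(n) for n in numbers]
print(normalized)
# ['202-456-1111', '202-456-1111', '202-456-1111']
```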

Regular Expression Exercises

Count Punctuation

Make a function count_punctuation that takes a string and returns a count of all punctuation characters in the string.

Punctuation characters are characters which are not word characters and are not whitespace characters.

Hint

You can match punctuation characters with this regular expression: [^\s\w]

>>> count_punctuation("^_^ hello there! @_@")
{'^': 2, '@': 2, '!': 1}
>>> count_punctuation(declaration)
{',': 122, '.': 36, ':': 10, ';': 9, '-': 4, '—': 1, '’': 1}

Count Numbers

Write a function count_numbers that takes a string and returns a count of each number that appears in the string.

Hint

You can match a number by using a regular expression that matches one or more consecutive digits: \d+

>>> count_numbers(declaration)
{'4': 1, '1776': 1}
>>> count_numbers("Why was 6 afraid of 7? Because 7 8 9.")
{'7': 2, '9': 1, '6': 1, '8': 1}