Regular Expressions

Raw Strings

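Backslashes are used to escape characters in strings, so a sequence such as \n in the path below is read as a newline rather than as a backslash followed by the letter n.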

>>> file_name = "C:\projects\nathan"
>>> file_name
'C:\\projects\nathan'
>>> print(file_name)
C:\projects
athan

If we want to use a literal backslash, we need to escape it by putting two backslashes instead:

>>> file_name = "C:\\projects\\nathan"
>>> file_name
'C:\\projects\\nathan'
>>> print(file_name)
C:\projects\nathan

We can turn off character escaping by using “raw” strings, which you can make by prefixing the string with an r character:

>>> file_name = r"C:\projects\nathan"
>>> file_name
'C:\\projects\\nathan'
>>> print(file_name)
C:\projects\nathan
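
As a quick check of the difference, the sketch below compares a regular string with its raw counterpart. The variable names are only illustrative.

```python
# A regular string turns "\n" into a single newline character,
# while a raw string keeps the backslash and the "n" as two characters.
regular = "\n"
raw = r"\n"

print(len(regular))  # 1
print(len(raw))      # 2

# A raw string and a manually escaped string produce the same value.
same = r"C:\projects\nathan" == "C:\\projects\\nathan"
print(same)  # True
```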

search

The regular expression module, re, includes helper tools that allow us to use regular expressions to work with strings.

We can search for one string in another string with the search function.

>>> import re
>>> greeting = "hello world"
>>> re.search("hello", greeting)
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> match = re.search("hello", greeting)
>>> help(match)

>>> match.start()
0
>>> match.end()
5
>>> match.group()
'hello'
>>> greeting[match.start():match.end()]
'hello'

That example isn’t particularly interesting because we could do that with the string find method.
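
For comparison, here is a minimal sketch of the same lookup done both ways. It assumes only the greeting string from the example above; the variable names are illustrative.

```python
import re

greeting = "hello world"

# str.find returns the index of the first occurrence, or -1 if absent.
find_index = greeting.find("world")

# re.search returns a match object, or None if nothing matches.
match = re.search("world", greeting)

print(find_index)     # 6
print(match.start())  # 6

missing_index = greeting.find("bye")
missing_match = re.search("bye", greeting)
print(missing_index)  # -1
print(missing_match)  # None
```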

We can use \w (the w must be lowercase) to search for “word” characters (letters, digits, or the underscore):

>>> match = re.search(r'\w', greeting)
>>> match
<_sre.SRE_Match object; span=(0, 1), match='h'>
>>> match.group()
'h'

Notice that we are using raw strings here so we don’t need to escape our backslashes.

The * character causes the preceding item to be matched zero or more times, matching as many times as possible. Since spaces are not word characters, this match stops just before the first space character.

>>> match = re.search(r'\w*', greeting)
>>> match.group()
'hello'
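
Because * also accepts zero repetitions, search can succeed with an empty match. The short sketch below illustrates this with two made-up input strings.

```python
import re

# The string starts with a space, so zero word characters match at index 0.
empty_match = re.search(r'\w*', " hello")
print(repr(empty_match.group()))  # ''
print(empty_match.span())         # (0, 0)

# Digits and underscores count as word characters too.
name_match = re.search(r'\w*', "user_42 logged in")
print(name_match.group())  # user_42
```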

You can find more information on the regular expression syntax in the re module documentation.

findall

The findall function works similarly to search, except that instead of a match object it returns a list of every match in a string.

>>> match = re.findall(r'\w*', greeting)
>>> match
['hello', '', 'world', '']

Note that the space and end of string result in empty strings in this list because we’re searching for zero or more word characters, and they do match zero word characters.

The + character causes the preceding item to be matched one or more times. This will remove the empty strings since the space and the end of the string do not contain one or more word characters.

>>> re.findall(r'\w+', greeting)
['hello', 'world']
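
One practical difference from search is the return value when nothing matches. The sketch below uses two made-up strings to show it.

```python
import re

# findall returns a list of matched strings; with + every item is non-empty.
words = re.findall(r'\w+', "one, two; three")
print(words)  # ['one', 'two', 'three']

# When nothing matches, the result is an empty list rather than None.
no_words = re.findall(r'\w+', "?!")
print(no_words)  # []
```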

More regex syntax

The . character will match any character except a newline:

>>> re.findall(r'o.', greeting)
['o ', 'or']
>>> re.findall(r'.l.', greeting)
['ell', 'rld']

If we want to match a literal period character we will need to use a backslash to escape the period:

>>> re.findall(r'\.', "hi.")
['.']
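
The following sketch contrasts the unescaped and escaped period on a made-up string, and shows that the period skips newlines by default.

```python
import re

# An unescaped period matches any character, including a literal period.
any_char = re.findall(r'a.c', "abc a.c")
print(any_char)  # ['abc', 'a.c']

# An escaped period matches only a literal period.
literal_dot = re.findall(r'a\.c', "abc a.c")
print(literal_dot)  # ['a.c']

# By default the period does not match a newline character.
no_newline = re.findall(r'.', "a\nb")
print(no_newline)  # ['a', 'b']
```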

We can use square brackets to match any single character in a set:

>>> re.findall(r'[aeiou]', greeting)
['e', 'o', 'o']

We can also use a dash to match characters within ranges of characters:

>>> re.findall(r'[a-z]', "Hi there")
['i', 't', 'h', 'e', 'r', 'e']
>>> re.findall(r'[A-Za-z]', "Hi there")
['H', 'i', 't', 'h', 'e', 'r', 'e']
>>> re.findall(r'[0-9]', "Hi there")
[]

The caret character can be used at the beginning of a character set to denote negation. This will match anything except for lowercase letters:

>>> re.findall(r'[^a-z]', "Hi there")
['H', ' ']
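
Listed characters, ranges, and negation can be combined inside one pair of brackets. The sketch below tries a few combinations on a made-up string.

```python
import re

text = "Hi there, 2 U"

# Listed characters: vowels in either case.
vowels = re.findall(r'[aeiouAEIOU]', text)
print(vowels)  # ['i', 'e', 'e', 'U']

# Two ranges in one set: digits or uppercase letters.
digits_or_upper = re.findall(r'[0-9A-Z]', text)
print(digits_or_upper)  # ['H', '2', 'U']

# Negated set: anything that is not an ASCII letter.
non_letters = re.findall(r'[^A-Za-z]', text)
print(non_letters)  # [' ', ',', ' ', '2', ' ']
```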

split

Let’s split a string by words:

>>> sentence = "Oh what a day, what a lovely day!"
>>> from collections import Counter
>>> sentence.split()
['Oh', 'what', 'a', 'day,', 'what', 'a', 'lovely', 'day!']
>>> Counter(sentence.split())
Counter({'what': 2, 'a': 2, 'day!': 1, 'day,': 1, 'lovely': 1, 'Oh': 1})

When called with no arguments, the split method on strings splits on whitespace characters, so punctuation stays attached to the words.

With the split function in the re module, we can split a string by a regular expression.

Let’s use \W+ to split based on one or more “non-word” characters:

>>> re.split(r'\W+', sentence)
['Oh', 'what', 'a', 'day', 'what', 'a', 'lovely', 'day', '']
>>> Counter(re.split(r'\W+', sentence))
Counter({'day': 2, 'what': 2, 'a': 2, '': 1, 'lovely': 1, 'Oh': 1})
>>> Counter(filter(None, re.split(r'\W+', sentence)))
Counter({'what': 2, 'day': 2, 'a': 2, 'lovely': 1, 'Oh': 1})

We could accomplish nearly the same thing by using findall to find all words.

>>> re.findall(r'\w+', sentence)
['Oh', 'what', 'a', 'day', 'what', 'a', 'lovely', 'day']
>>> Counter(re.findall(r'\w+', sentence))
Counter({'a': 2, 'day': 2, 'what': 2, 'Oh': 1, 'lovely': 1})
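
The two approaches differ only in the empty string that split leaves behind when the text ends with a non-word character. A minimal sketch of that comparison, reusing the sentence from above:

```python
import re

sentence = "Oh what a day, what a lovely day!"

split_words = re.split(r'\W+', sentence)
found_words = re.findall(r'\w+', sentence)

# split leaves a trailing empty string because the sentence ends with "!".
print(split_words[-1] == '')  # True

# Dropping empty strings makes the two results identical.
non_empty = [word for word in split_words if word]
print(non_empty == found_words)  # True
```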

match

A caret can be put at the beginning of a regular expression to denote that it should only match at the beginning of the target string. Similarly, a dollar sign can be put at the end of a regular expression to denote that it should only match at the end of the target string.

>>> re.search(r'^wo', "hello world")
>>> re.search(r'^wo', "world")
<_sre.SRE_Match object; span=(0, 2), match='wo'>
>>> re.search(r'^wo$', "world")
>>> re.search(r'^wo$', "wo")
<_sre.SRE_Match object; span=(0, 2), match='wo'>

The match function behaves like search with a caret at the start of the pattern, so match requires the pattern to match at the beginning of the string.

>>> re.match('what', sentence)
>>> re.match('.*what', sentence)
<_sre.SRE_Match object; span=(0, 19), match='Oh what a day, what'>
>>> re.match('Oh', sentence)
<_sre.SRE_Match object; span=(0, 2), match='Oh'>

The fullmatch function behaves like search with both a caret and a dollar sign, so fullmatch requires the pattern to match the entire string.

>>> re.fullmatch(r'what', sentence)
>>> re.fullmatch(r'.*what', sentence)
>>> re.fullmatch(r'.*what.*', sentence)
<_sre.SRE_Match object; span=(0, 33), match='Oh what a day, what a lovely day!'>
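
A common use for fullmatch is validating that a whole string has a given shape. The sketch below assumes a hypothetical helper, is_all_digits, that is not part of the re module.

```python
import re

def is_all_digits(text):
    """Hypothetical helper: True if text is one or more digits and nothing else."""
    return re.fullmatch(r'[0-9]+', text) is not None

print(is_all_digits("2024"))     # True
print(is_all_digits("2024 AD"))  # False

# match only anchors the start, so trailing text is allowed.
start_only = re.match(r'[0-9]+', "2024 AD")
print(start_only.group())  # 2024

# search with explicit anchors agrees with fullmatch on this input.
anchored = re.search(r'^[0-9]+$', "2024 AD")
print(anchored)  # None
```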

sub

You can use the sub function to replace parts of a string.

Let’s replace all vowels with the letter “x”:

>>> re.sub(r'[aeiou]', r"x", greeting)
'hxllx wxrld'

We can use parentheses to group parts of regular expressions. Groups can be referenced by their number in the replacement string:

>>> re.sub(r'([aeiou])', r"x\1", greeting)
'hxellxo wxorld'
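
With more than one group, the replacement string can also reorder the captured text. A small sketch, using made-up input strings:

```python
import re

# Two groups separated by a space; the replacement swaps their order.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', "hello world")
print(swapped)  # world hello

# The text between the groups (here a comma and a space) is dropped
# because it is matched but not captured.
full_name = re.sub(r'(\w+), (\w+)', r'\2 \1', "Lovelace, Ada")
print(full_name)  # Ada Lovelace
```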

We can use \d to match a digit character and \D to match a non-digit character. Curly braces with a number inside repeat the preceding item exactly that many times, so \d{3} matches three digits in a row.

Let’s make a regular expression to normalize phone numbers:

>>> re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r"\1-\2-\3", "(202) 456-1111")
'202-456-1111'
>>> re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r"\1-\2-\3", "202 - 456 - 1111")
'202-456-1111'
>>> re.sub(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*', r"\1-\2-\3", "202.456.1111")
'202-456-1111'

compile

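If we need to use the same regular expression multiple times throughout our program, it’s a good idea to pre-compile it. The re module caches recently used patterns, so the speedup is often small, but a compiled pattern also gives the expression a reusable name.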

>>> phone_re = re.compile(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*')
>>> phone_re.sub(r"(\1) \2 - \3", "202.456.1111")
'(202) 456 - 1111'
>>> phone_re.sub(r"\1-\2-\3", "202.456.1111")
'202-456-1111'
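
A compiled pattern is convenient to wrap in a function and apply to many inputs. The sketch below assumes a hypothetical helper named normalize_phone; the pattern itself is the one compiled above.

```python
import re

# Same pattern as in the text, compiled once and reused.
phone_re = re.compile(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*')

def normalize_phone(number):
    """Hypothetical helper: rewrite a 10-digit phone number as NNN-NNN-NNNN."""
    return phone_re.sub(r"\1-\2-\3", number)

numbers = ["(202) 456-1111", "202 - 456 - 1111", "202.456.1111"]
normalized = [normalize_phone(n) for n in numbers]
print(normalized)  # ['202-456-1111', '202-456-1111', '202-456-1111']

# Input with fewer than ten digits does not match and is returned unchanged.
too_short = normalize_phone("456-1111")
print(too_short)  # 456-1111
```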

Regular Expression Exercises

Count Punctuation

Make a function count_punctuation that takes a string and returns a count of each punctuation character in the string.

Punctuation characters are characters which are not word characters and are not whitespace characters.

Hint

You can match punctuation characters with this regular expression: [^\s\w]

>>> count_punctuation("^_^ hello there! @_@")
{'^': 2, '@': 2, '!': 1}
>>> count_punctuation(declaration)
{',': 122, '.': 36, ':': 10, ';': 9, '-': 4, '—': 1, '’': 1}

Count Numbers

Write a function count_numbers that takes a string and returns a count of each number in the string.

Hint

You can match a number by using a regular expression that matches one or more consecutive digits: \d+

>>> count_numbers(declaration)
{'4': 1, '1776': 1}
>>> count_numbers("Why was 6 afraid of 7? Because 7 8 9.")
{'7': 2, '9': 1, '6': 1, '8': 1}