Posted January 10, 2013 by Hunter in Python Programming
 
 

How to Use Regular Expressions in Python


Today we talk about regular expressions in Python.

“Up the creek without a paddle.”
“Don’t throw the baby out with the bathwater.”
“Apple of my eye.”
“Can’t swing a cat without hitting a [blank].”
“Not for all the tea in China.”

Like this?  Nope.  Although those are expressions, and they do occur regularly, that is not quite what is meant by regular expressions in the programming world.  A regular expression is a series of characters that, within the syntax of regular expressions, represents a set of strings.*  To give a few quick examples, before we jump into the details of the syntax:

*It is very important that you understand that I am not referring to the technical Python set, such as you would create with S = set(), but instead to the general concept of a set.

r"Katel[yi]n"

is a regular expression expressing “Katelyn” and “Katelin”.

r"Jim .*"

is a regular expression expressing “Jim” followed by a space followed by an arbitrary number of any character.

r"[a-e][0-8]"

is a regular expression expressing a string of two characters, where the first is any lowercase letter a through e, and the second is any digit 0 through 8.

So why would you want to be able to represent a “set” of strings?  Why, for searching within text for strings that belong to a specified set of strings, of course!  Regular expressions are used to search through text and match to strings contained within the set the expression represents.  For example:

# The re module supports regular expressions in Python
import re

text = "Hello, my name is Katelyn!"

# Checks within the string 'text' for a match with either 'Katelyn' or 'Katelin'
match = re.search(r"Katel[yi]n", text)

# If the variable 'match' has a non-None value - that is, if the regular expression was found in the string
if match:
  print("We found her!")
else:
  print("She's not here.")

The search function in the re module, written re.search(regex, string, flags), scans left to right through string for the regular expression regex and returns a match object for the first location where regex is found in string.  If regex is not found, search returns None.  The flags argument is optional, and I will discuss flags later.

The above code produces the output:

We found her!
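
The match object is worth a closer look, since it carries more than a yes or no. Here is a small sketch, assuming Python 3 (the variable names found, where and missing are only for illustration), that pulls out the matched text and its position:

```python
import re

text = "Hello, my name is Katelyn!"

match = re.search(r"Katel[yi]n", text)
found = match.group()   # the text that matched
where = match.span()    # (start, end) positions of the match inside text

# When nothing matches, search hands back None instead of a match object
missing = re.search(r"Katel[yi]n", "Hello, my name is Jim!")

print(found, where, missing)
```

Slicing text with the two numbers from span() gives back the same string as group().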

Another, even more awesome function in the re module is findall.  re.findall(regex, string, flags) returns a list of all the non-overlapping matches of regex in string.  (If regex contains groups, the list holds the text captured by the groups rather than the whole match.)

import re

text = "Hello, my name is Katelyn, and her name is Katelin!"

matchall = re.findall(r"Katel[yi]n", text)

if matchall:
  for name in matchall:
    print(name)
else:
  print("No one here by that name.")

This will print

Katelyn
Katelin

So, for one who understands the syntax of regular expressions, no variantly spelled name, no formatted element of a database, no Jim What’s-His-Face is safe from discovery.

The benefits of such an understanding are plentiful, one of the more common uses being represented in this XKCD comic:

http://imgs.xkcd.com/comics/regular_expressions.png

But fear not, not all uses of regular expressions are so banal as the foiling of a murderer.  Regular expressions can also be used for interesting purposes, such as searching a phone book for James Whosit, or ensuring that the user has entered a valid e-mail address.  But before we can explore that fantastic world, we must, as always, learn some syntax.

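In Python, a regular expression is usually written as a raw string: an r followed by the regular expression enclosed in quotes, like this:

r"regular expression"

The r tells Python to leave the backslashes in the string alone, so that they reach the re module intact.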

For ease of reading, in the rest of this article I will simply be writing the expression itself, without the r or the “”, for it is certainly within the reader’s mental compass to remember the syntax.

A regular expression can contain literal characters and, as I have heard them called, metacharacters (also called control characters).  A literal character is exactly what it sounds like, representing just what it is, no pretending, no masks or makeup, perfectly secure in what it is.  The regular expression

Myself

will match only to the string “Myself”, and no other.

A metacharacter is also exactly what it sounds like, a character that represents characters other than what it literally is, or acts as syntax to define character classes.  For example, the most meta of the metacharacters is “.”, which represents any character.  So,

.im

represents a string of three characters, the final two of which are “im”, the first of which is anything.  Both “Tim” and “Jim” match to this regular expression, as well as “him” and “Him” and “!im” and “~im”.  Following is a list of metacharacters and descriptions of their use.  Depending on your approach to learning, you may want to skip the lists and explanations and jump to the bottom, to the practical examples, and look back up as need be.
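
To see the period at work, here is a quick sketch, assuming Python 3.4 or later for re.fullmatch (which succeeds only when the whole string fits the pattern). The list of candidate words is made up for the example:

```python
import re

candidates = ["Tim", "Jim", "him", "!im", "im", "Trim"]

# True when the entire word is exactly one character followed by "im"
results = {word: bool(re.fullmatch(r".im", word)) for word in candidates}

for word, ok in results.items():
    print(word, ok)
```

"im" fails because the period insists on one character being there, and "Trim" fails because the period stands for only one character, not two.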

Metacharacters So What?
\ Followed by certain characters, used to represent character classes.  Also used before a metacharacter to treat the metacharacter as a literal.
. Any character, except a newline, unless the S flag is set.
* Matches a string consisting of 0 or more repetitions of the preceding character or group.
+ Matches a string consisting of 1 or more repetitions of the preceding character or group.
? Matches a string consisting of 0 or 1 instances of the preceding character or group.
{} {n} matches exactly n instances of the preceding character or group.  {n,} matches n or more instances.  {n,m} matches n to m instances.
^ Matches the beginning of a string.
$ Matches the end of a string.
[] Matches one character within the brackets.
[^] Matches one character not in the brackets, so long as the caret is the first character inside the brackets.
() Creates a group of what is inside the parentheses.
| Or.  a | bc matches a or bc.

Now let’s explore what each of these means, and what they can do.

Explorations

\

The backslash deserves a chart of its own!

Metacharacter Character Class
\w Matches any one alphanumeric character or underscore.  Equivalent to [A-Za-z0-9_].
\W Matches any one character that is not alphanumeric or an underscore.  Equivalent to [^A-Za-z0-9_].
\d Matches any one digit character.  Equivalent to [0-9].
\D Matches any one non-digit character.  Equivalent to [^0-9].
\s Matches any one whitespace character.  Equivalent to [ \n\t\r\v\f].
\S Matches any one non-whitespace character.  Equivalent to [^ \n\t\r\v\f].
\b Matches the empty string at the boundaries of a word.
\B Matches the empty string everywhere but at the boundaries of a word.
\A Matches the start of a string.
\Z Matches only at the very end of a string.
\z The end-of-string anchor in some other regular expression flavors.  In Python, \Z is the one to rely on.
\G Used in some other flavors for the position where the previous match ended.  Python's re module does not support it.
\1 \2 \3 Matches the same text that the nth group matched.

Most of these need no further explanation.  The final – \1, \2, and so forth – are used with groups, which are formed with the (), and will be discussed when we explore ().  To indicate a literal backslash, simply use the meta of one backslash to kill the meta of another:

\\ Matches literal backslash.
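
A few of the classes from the chart can be tried out with a sketch like this one, assuming Python 3. The sample text is invented for the example, and it is written as a raw string so that the backslash in it stays a backslash:

```python
import re

text = r"Room 42, C:\temp, call 555-0199"

digits = re.findall(r"\d+", text)          # runs of digit characters
words = re.findall(r"\b\w+\b", text)       # runs of word characters between boundaries
backslashes = re.findall(r"\\", text)      # literal backslashes

print(digits)   # ['42', '555', '0199']
print(words)    # ['Room', '42', 'C', 'temp', 'call', '555', '0199']
print(len(backslashes))   # 1
```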

.

Matches any character, but only one of any character.  ..... would match a string of five characters, whatever the characters are.  To match an arbitrary number of arbitrary characters, you would need the *.

*

Matches 0 or more instances of the preceding expression, whether a character or a group.  So to match for any number, of any character, you want the expression .*.  That’s a period followed by an asterisk.  Here is a list of examples of * in use:

.*

Matches an arbitrary number of instances of arbitrary characters.

[0-9]*

Matches an arbitrary number of digits, or the empty string.  Equivalent to \d*.

Helloo*

Matches the string “Hello” followed by an arbitrary number of “o”s.

(Mine)*

Matches an arbitrary number of instances of the string “Mine”, or the empty string.

BOOO*M

Matches the string “BOOM” with at least two “O”s, but with no upper limit, before the “M”.
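
If you would like to check these for yourself, here is one way to do it, sketched for Python 3. Each pair in the (made-up) cases list is a pattern and a string to test against the whole of it with re.fullmatch:

```python
import re

cases = [
    (r"Helloo*", "Hell"),       # too short: the first "o" is required
    (r"Helloo*", "Hello"),
    (r"Helloo*", "Helloooo"),
    (r"(Mine)*", ""),           # zero repetitions is allowed
    (r"(Mine)*", "MineMine"),
    (r"BOOO*M", "BOM"),         # only one "O", two are required
    (r"BOOO*M", "BOOM"),
    (r"BOOO*M", "BOOOOOM"),
]

results = [bool(re.fullmatch(pattern, string)) for pattern, string in cases]

for (pattern, string), ok in zip(cases, results):
    print(pattern, repr(string), ok)
```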

+

Easily understood after *.  Where * allows 0 or more instances of the preceding expression, + requires at least 1.  Here are some examples:

[0-9]+

Matches one or more digits.  Equivalent to \d+.

Hello+

Matches the string “Hell” followed by one or more “o”s.

(Mine)+

Matches one or more instances of the string “Mine”.

BOO+M

Matches the string “BOOM” with at least two “O”s, but with no upper limit, before the “M”.
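
The difference between * and + shows up most clearly on an empty string. A short sketch, assuming Python 3 (the names star, plus and shouts are just labels for the example):

```python
import re

star = re.fullmatch(r"[0-9]*", "")   # a match object: zero digits is acceptable to *
plus = re.fullmatch(r"[0-9]+", "")   # None: + insists on at least one digit

# BOO+M needs "BO" and then one or more further "O"s before the "M"
shouts = re.findall(r"BOO+M", "BOM BOOM BOOOOM")

print(star is not None, plus is not None, shouts)
```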

?

Also very similar to the *, but easily explained as having the following use: Making the preceding character or group optional.

Yucky?

Matches “Yuck” or “Yucky”.

Hurr?ah!

Matches “Hurah!” or “Hurrah!”

a?ether

Matches “aether” or “ether”.
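
The three examples above can be run against a line of sample text (invented for this sketch, Python 3 assumed) to see the optional character come and go:

```python
import re

text = "Yuck! Yucky! Hurah! Hurrah! ether aether"

yucks = re.findall(r"Yucky?", text)
cheers = re.findall(r"Hurr?ah!", text)
ethers = re.findall(r"a?ether", text)

print(yucks)    # ['Yuck', 'Yucky']
print(cheers)   # ['Hurah!', 'Hurrah!']
print(ethers)   # ['ether', 'aether']
```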

{}

Curly braces are used to match for a specific number or range of instances of the preceding character or group.  Inside the braces can be a number to specify an exact number of instances, a number followed by a comma to give only a lower limit to the number of instances, or a number, comma, and number, to give a range of instances.

\d{5}

Matches any string of five digits.

\b\w{4}\b

Matches any four-letter word (or, strictly, any run of exactly four letters, digits, or underscores standing on its own).

[A-Z].{8,}

Matches any string beginning with any capital letter and followed by eight or more characters.

\d{2,9}

Matches any string of digits two to nine digits long.
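
Here is a sketch of the three forms of the braces, assuming Python 3. Note that a range is written with a comma between the two numbers. I have wrapped each pattern in \b so that a longer run of digits is not counted as containing a shorter one; the sample text is made up:

```python
import re

text = "7 42 90210 1234567890123"

five = re.findall(r"\b\d{5}\b", text)         # exactly five digits
ranged = re.findall(r"\b\d{2,9}\b", text)     # two to nine digits
open_ended = re.findall(r"\b\d{3,}\b", text)  # three or more digits

four_letter = re.findall(r"\b\w{4}\b", "This is some text")

print(five)         # ['90210']
print(ranged)       # ['42', '90210']
print(open_ended)   # ['90210', '1234567890123']
print(four_letter)  # ['This', 'some', 'text']
```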

 

 

^

Matches the beginning of a string and, if the MULTILINE flag is passed as an argument to the function, also matches the beginning of each line.  It is therefore usually the first character in the regular expression.

^Cory

Matches “Cory” only if “Cory” is at the beginning of the string or line.  For instance, r"^Cory" would not be found in the string “James Cory”, but would in the string “Cory James”.

$

Matches the end of a string and, if the MULTILINE flag is passed as an argument to the function, also matches the end of each line.  It is therefore usually the last character in the regular expression.

Cory$

Matches “Cory” only if “Cory” is at the end of a string or a line.  For instance, r"Cory$" would not be found in the string “Cory James”, but would in the string “James Cory”.
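
To watch the MULTILINE flag change what ^ and $ mean, here is a small sketch, assuming Python 3. The three-line string is made up for the example, and re.M is the short name for re.MULTILINE:

```python
import re

text = "Cory James\nJames Cory\nCory"

start_default = re.findall(r"^Cory", text)        # only the start of the string
start_multi = re.findall(r"^Cory", text, re.M)    # the start of every line
end_default = re.findall(r"Cory$", text)          # only the end of the string
end_multi = re.findall(r"Cory$", text, re.M)      # the end of every line

print(len(start_default), len(start_multi))   # 1 2
print(len(end_default), len(end_multi))       # 1 2
```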

[]

Now, we’ve already seen the brackets in r"Katel[yi]n", where [yi] matches either “y” or “i”, so that the regular expression matches either “Katelyn” or “Katelin”.  If we wanted to match for a range of letters or numbers, we can use the hyphen, as we’ve also already seen:

[a-e]

matches for any lowercase letter from a to e, equivalent to [abcde]

[0-9]

matches for any digit 0 through 9, equivalent to [0123456789]

[A-Dc-f4-8]

matches for any uppercase letter A through D or lowercase letter c through f or digit 4 through 8, equivalent to [ABCDcdef45678]

But as in science most of what we learn is that which is not, perhaps what you need to know is simply what a character is NOT.  For this, you use the bracketed caret.

[^a-e]

matches any character that is not a lowercase letter a through e.

Be forewarned!  The caret must come first, just like the carrot when you are leading a donkey.  If the caret is not the first character in the bracket, then it is treated as a literal.

[a-e^]

matches any character that is a lowercase letter a through e or ^.

The same goes for the hyphen; it must be between the lower and upper limit of your range.  At the very start or end of the brackets, it is treated as a literal.

[-ae]

matches for -, a, or e.

[ae-]

matches for a, e, or -.
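
All of the bracket rules above can be seen side by side with one short test string. This is only a sketch, assuming Python 3, and the string "ace-f^9" is invented so that it contains a caret and a hyphen:

```python
import re

text = "ace-f^9"

in_range = re.findall(r"[a-e]", text)        # letters a through e
not_in_range = re.findall(r"[^a-e]", text)   # everything else
caret_literal = re.findall(r"[a-e^]", text)  # caret not first, so it is a literal
hyphen_first = re.findall(r"[-ae]", text)    # hyphen first, so it is a literal
hyphen_last = re.findall(r"[ae-]", text)     # hyphen last, so it is a literal

print(in_range)        # ['a', 'c', 'e']
print(not_in_range)    # ['-', 'f', '^', '9']
print(caret_literal)   # ['a', 'c', 'e', '^']
print(hyphen_first)    # ['a', 'e', '-']
print(hyphen_last)     # ['a', 'e', '-']
```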

()

Now, groups are fun.  In regular expressions, just like in mathematics, when you enclose a phrase in parentheses, such as (2+5) or (apple), you deal with what is inside as a unit, so that if you have an operator outside, like (2+5)^3 or (apple)+, you apply the operator to the phrase as a whole, giving 343 or “apple” one or more times.  Let’s get into some examples:

ei+o

matches for a string consisting of “e” followed by one or more instances of “i” followed by “o”.

(ei)+o

matches for a string consisting of one or more instances of “ei” followed by “o”.

((abc)+d)+

matches for a string consisting of one or more instances of one or more instances of “abc” followed by “d”.

I said I would come to the \1 \2 deal when we got to groups.  The backslash-number notation indicates that number group.  Groups are numbered by counting the opening parentheses from left to right in the regular expression.  So, in the regular expression

(ha)+(he)*

the first group is (ha) and the second is (he).  The most practical example I can think of to demonstrate the how and the why of using \1 is the regular expression that matches either single or double quotes, and the text between.

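(["']).*?\1

The () put ["'] in a group.  It’s the first group created in this expression, so it is indicated thenceforth within the expression by \1.  But ["'] will match to either a double or a single quote, so what does \1 match to?  \1 will match only to what was matched previously; so if you match a double quote in (["']), then \1 matches a double quote.  If single, then single.  Beautiful, isn’t it?  In between, .*? matches as few characters as possible (the ? after the * makes it stop at the first opportunity), so the match ends at the first closing quote of the same kind.  Note that \1 cannot be placed inside brackets, as in [^\1], because inside brackets it is no longer read as a group reference.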
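
Here is a sketch of the backreference at work, assuming Python 3. It uses a non-greedy .*? between the opening quote and \1, which is a choice made for this example, since Python's re does not read \1 as a group reference inside square brackets. The second pattern, for doubled words, is an extra illustration of my own:

```python
import re

text = """She said "it's fine" and he said 'no "way"' today."""

# finditer yields match objects, so group(0) is the whole quoted phrase
quoted = [m.group(0) for m in re.finditer(r"""(["']).*?\1""", text)]

# \1 must repeat whatever word the first group captured
doubled = re.findall(r"\b(\w+) \1\b", "it was the the best of of times")

print(quoted)
print(doubled)   # ['the', 'of']
```

Notice that the apostrophe in "it's" does not end the first match: the group captured a double quote, so only a double quote will satisfy \1.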

|

The Or operator.  It works just as you would imagine.

hi|bye

matches for either “hi” or “bye”.

hi(ho|yo)

matches for either “hiho” or “hiyo”.
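
A short sketch of both examples, assuming Python 3. One thing to watch for: when the pattern contains a group, findall returns what the group captured rather than the whole match, so finditer is used here to get the whole match back:

```python
import re

greetings = re.findall(r"hi|bye", "hi, bye, hiho")

# With a group in the pattern, findall reports only the group's text
group_only = re.findall(r"hi(ho|yo)", "hiho hiyo hi")

# finditer gives match objects, and group(0) is the entire match
whole = [m.group(0) for m in re.finditer(r"hi(ho|yo)", "hiho hiyo hi")]

print(greetings)    # ['hi', 'bye', 'hi']
print(group_only)   # ['ho', 'yo']
print(whole)        # ['hiho', 'hiyo']
```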

The | character also appears in the flags argument of the search function, where it is Python’s own operator for combining multiple flags, as in re.I | re.M.

Flags

Declared as the third argument in the search or match methods.  Sets a mode in which to match.

Flag Does…
re.I Ignores case.  So “a” will match “A” and “A” will match “a”.
re.M Multiline.  Makes ^ and $ match the beginning and end of a line, respectively, not just of the string.
re.S Makes the . metacharacter match newline.  Without re.S, . will stop matching at the end of a line.

You might set a flag like so:

re.search(r"chee+se", string, re.I)

This would match all of the following: cheese, cheeeeese, CHEESE, CHeeSE, cHeEsE, cheeeeeeeeEEEEEEEEEse.
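
The three flags can be tried out with a sketch like the following, assuming Python 3. The word list and sample strings are invented for the example, and the | between re.I and re.M is how two flags are passed at once:

```python
import re

words = ["cheese", "CHEESE", "cHeEsE", "chese"]
cheesy = [w for w in words if re.fullmatch(r"chee+se", w, re.I)]

# Ignore case AND let ^ match at the start of each line
combined = re.findall(r"^cory", "first line\nCORY was here", re.I | re.M)

# The period refuses to cross a newline unless re.S is given
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.S)

print(cheesy)     # ['cheese', 'CHEESE', 'cHeEsE']
print(combined)   # ['CORY']
print(dot_default is None, dot_all is None)   # True False
```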

Practical Examples

A very common use of regular expressions is ensuring the validity of entries into text fields that need to be formatted in a certain manner, such as e-mails, passwords, addresses, or phone numbers.

For e-mail, something like this is an idea:

[^@]{3,}@[A-Za-z]{3,}\.[A-Za-z]{3,}

Three or more of any character but an @, then an @, then three or more of any letter, then a . , then three or more letters
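
As a sketch of how such a check might be wired up (Python 3 assumed), the pattern below follows the description: three or more non-@ characters, an @, three or more letters, a period, then three or more letters. The function name looks_like_email is hypothetical, and the pattern is only a rough filter, since real addresses come in more shapes than this:

```python
import re

# Hypothetical helper for illustration; not a complete e-mail validator
EMAIL = re.compile(r"[^@]{3,}@[A-Za-z]{3,}\.[A-Za-z]{3,}")

def looks_like_email(entry):
    """Return True if the whole entry fits the rough e-mail pattern."""
    return EMAIL.fullmatch(entry) is not None

print(looks_like_email("hunter@example.com"))   # True
print(looks_like_email("hu@example.com"))       # False: too short before the @
print(looks_like_email("hunter@example"))       # False: no period and ending
```

Using fullmatch rather than search matters here: search would accept an entry that merely contains something e-mail-shaped somewhere inside it.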

For an address, consider something like this:

\d+\s[-A-Za-z]+\s([-A-Za-z]+\s\d+\s)?\w+\s\w+\s\d{5}(-\d{4})?

This checks first for the street number and street name (a single word here), then provides for “apt” or “ste” or the like (written without a period) followed by the apt/ste number, then checks for city name, state name, and zip code, either the normal five-digit code or the expanded nine-digit code.  Between every element a whitespace is needed; practically, that means either a space or a newline.  This is only one way of checking for an address in the USA.  You could, for example, require the state to be only two upper case letters, like so: [A-Z]{2}  You may also have the state and zip code be optional, by simply grouping them and putting a ? after the closing parenthesis.
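
Here is a rough sketch of that address check in Python 3. It uses \w+ for the state so that names longer than one character fit, which is an adjustment made for this example, and the sample addresses and the name ADDRESS are invented. It accepts only one-word street and city names, so treat it as a starting point rather than a finished validator:

```python
import re

# Street number, street name, optional unit and unit number, city, state, zip
ADDRESS = re.compile(
    r"\d+\s[-A-Za-z]+\s([-A-Za-z]+\s\d+\s)?\w+\s\w+\s\d{5}(-\d{4})?"
)

plain = ADDRESS.fullmatch("123 Main Springfield IL 62704")
with_unit = ADDRESS.fullmatch("123 Main Apt 4 Springfield IL 62704-1234")
no_number = ADDRESS.fullmatch("Main Springfield IL 62704")

print(plain is not None, with_unit is not None, no_number is not None)
```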

Or say you wanted to search a database of addresses for zip codes beginning with 27.  Since the zip code always comes last in a US address…

27\d{3}$

This matches strings that begin with “27”, then are followed by three digits, where then the line (as long as you remember to flag re.M) ends.
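
A sketch of that search, assuming Python 3 and a made-up block of one address per line. Leaving off re.M shows why the flag matters:

```python
import re

addresses = (
    "12 Elm Raleigh NC 27601\n"
    "9 Oak Austin TX 73301\n"
    "5 Pine Durham NC 27701"
)

# With re.M, $ matches at the end of every line
zips = re.findall(r"27\d{3}$", addresses, re.M)

# Without it, $ matches only at the end of the whole string
last_only = re.findall(r"27\d{3}$", addresses)

print(zips)        # ['27601', '27701']
print(last_only)   # ['27701']
```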

Or maybe you capture the source HTML of a website using another Python module, and you want to return all instances of JavaScript in the HTML.  Or you want to parse a text file for entry into a database.  Below is a list of useful applications of regular expressions.

Checking validity of formatted input
    email
    address
    phone number
    password

Searching database

Searching text
    for specific formatting
        address
        phone number
        SSN
    for variantly spelled names
    for certain kinds of words

Searching code
    for uses of a certain function
    for uses of certain tags

Parsing text for entry into a database

For some great exercises that’ll surely help anyone get a better grasp of using regular expressions, try the ones found here:

http://www.upriss.org.uk/python/PythonCourse.html

And now for the best part of regular expressions – they are just that.  Regular.  The syntax for regular expressions (the . and * and \d and + and ? and so on) is largely the same across languages, so that once you know how to write regular expressions, you can understand them in almost any language you are reading or writing!


Hunter