Regular Expressions


Table of Contents

1. Introduction
2. Basics
2.1. Single Character
2.2. Any Character: .
2.3. The Escape Character: \
2.4. The Caret: ^
2.5. The Dollar Symbol: $
2.6. The Kleene star: *
2.7. The Kleene plus: +
2.8. Ranges: [ ], [cn-cm] and [^cn-cm]
2.9. Grouping: \( \)
2.10. Alternatives: |
2.11. Repetition: \{n\}, \{,n\}, \{n,\}, \{n,m\}
3. Grep Examples
4. java.util.regex, Java 1.4
5. Emacs Regular Expressions
6. References

Pattern matching is an important topic in Computer Science, it is the process of matching defined patterns to information. Humans use pattern matching everyday to recognise objects and faces, computers use pattern matching everyday to perform the most basic of operations, when you execute a command at the command line, some kind of pattern matching is being employed to determine what your command is asking the computer to do, pattern matching is used in compilers and programming languages.

Regular Expressions are a particular kind of pattern matching located in the Regular Language subclass of pattern matching languages. They are considered the least complex of the pattern matching languages but are very useful. You have probably used regular expressions before, for instance if you have specified that you want to delete *.* at the command line, referring to any basename followed by a dot followed by any extension, then you have used the concepts of regular expressions at least once. Most of you will be aware that the * character, known as a Kleene star or asterisk, means "match anything" and indeed it is used in a very similar manner in the regular expressions we are about to discuss.

There are many programs out there that have some kind of builtin regular expression handling capabilities. The thing is, they all seem to have slight syntactical variation, fortunately the concepts are identical in each case and the differences are often marginal, this text will describe the most common components of a regular expression and will present program specific examples where appropriate.

Regular expressions consist of literal characters and meta characters, literal characters are the actual characters you want to find, meta characters are special characters, like the Kleene star, and are the core concept behind regular expressions hence we will begin this section with a brief introduction to the most common meta characters.

Grep is a tool used to search text using regular expressions, its origins highlight its function, according to http://www.faqs.org/faqs/usenet/faq/part1/section-21.html its origins are as follows:

The original UNIX text editor "ed" has a construct g/re/p, where "re" stands for a regular expression, to Globally search for matches to the Regular Expression and Print the lines containing them. This was so often used that it was packaged up into its own command, thus named "grep". According to Dennis Ritchie, this is the true origin of the command.

I will present a few examples, of which the first two are based on the following text file, mb.txt:

NAME       MAKE     HP YEAR PRICE
NSR250R    Honda    60 1993 £1340 
NSR250R-SP Honda    65 1994 £2000
KR1S       Kawasaki 60 1989 £1250
GSX250     Suzuki   26 1981 £300
GS250T     Suzuki   26 1982 £250
RGV250     Suzuki   60 1993 £1400
RGV250-SP  Suzuki   65 1994 £2400 
      

The examples will use the -E option which specifies that grep should expect syntax in the form of an extended regular expression.

      grep -e Honda mb.txt
    

Lists all the lines that contain the text string "Honda":

NSR250R    Honda    60 1993 £1340 
NSR250R-SP Honda    65 1994 £2000      
    
      grep -e "6. 1" mb.txt | grep -v
    

Output:

KR1S       Kawasaki 60 1989 £1250
RGV250     Suzuki   60 1993 £1400
RGV250-SP  Suzuki   65 1994 £2400 
    

First lists all bikes that are sixty something BHP and then pipes this to another instance of grep which excludes all the lines containing "Honda" with the -v

[Note]Note

Quotes are used to preserve whitespace and are used whenever '\' is used since this is also special within the shell so needs to be hidden from the shell

      ls -l | grep -e "Aristotle\.txt"
    

Output:

      Aristotle.txt
    

Pipes the output from a directory listing to grep which searches filters the lines containing "Aristotle.txt", note the use of the escape character '.' to literally match '\'

      grep -e "\(101\)\1" in.file
    

Matches the string "101101", notice that "101" is first grouped by enclosing within an escaped opening parentheses "\(" and an escaped closing parentheses "\)". The first group is then referenced with "\1". Something like:

      grep -e "1\(0\)\*" in.file
    

Would match "1" followed by zero or more occurrences of "0", whereas:

      grep  -e "1\(0\)\+" in.file
    

Would match "1" followed by at leastone occurrence of "0", this is the same as:

      grep -e "1\(0\)\(\1\)*" in.file
    

'[' followed by ']' can be used to match a range of characters and some special ranges are already defined:

  • [[:alnum:]] matches [0-9a-zA-Z]

  • [[:alpha:]] matches [a-zA-Z]

  • [[:cntrl:]] matches control characters

  • [[:digit:]] matches [0-9]

  • [[:lower:]] matches [a-z]

  • [[:punct:]] matches punctuation characters

  • [[:upper:]] matches [A-Z]

  • [[:space:]] matches any white space

      grep -e "\([[:alpha:]]\)\+[[:digit:]][[:upper:]]"
    

Would match one or more characters in the range [a-zA-Z] followed by one character in the range [0-9] followed by one character in the range [A-Z]. So it would match the string "abc9Z". The number of times a pattern must be matched can be specified after the group:

      grep -e "\(abc\)\{3\}" in.file
    

Would match any lines containing 3 occurrences of the pattern "abc".

      grep -e "^\(abc\)\{22\}" in.file
    

Would match any lines containing 2 occurrences of the pattern "abc", with the restriction that the sequence must start at the beginning of a line, specified by the use of '^', there are similar commands to '^':

  • $ matches the end of a line

  • \ matches the beginning of a word

  • \> matches the end of a word

  • \b matches the empty string at the edge of a word

  • \B matches the empty string provided it is not at the edge of a word.

      grep -e "^\([[:alpha:]][[:alnum:]]*\)=\1"
    

Would match an an alpha character beginning at the start of a line followed by zero or more alphanumeric characters followed by '=' followed by the same sequence of characters that were matched before the '=', so "abc=abc" would be matched but "abc=abd" would not be matched.

Suppose you wanted to match the h1, h2, h3... etc. elements in an HTML file. Assume the text file html.txt:

<h1 blah="cool"<Title1</h1>
<h2>Title2</h2>
<h3>Title3</h3>
<h4>Title4</h4>
<h5>Title5</h5>
<h6>Title5</h6>
<h1>Title2</h2>
    

One could use the following:

      grep -e "^<h[1-6][^>]*>[^<]*</h[1-6]>" html.txt
    

Which says to match, from the start of a line: "<h" then a character from the range [1-6] then anything but the '>' character (so that the opening tag may contain attributes) then anything but a '<' character then "</h" then a character from the range [1-6] then the character '>'. The output from executing this is shown below:

<h1 blah="cool">Title1</h1>
<h2>Title2</h2>
<h3>Title3</h3>
<h4>Title4</h4>
<h5>Title5</h5>
<h6>Title5</h6>
<h1>Title2</h2>
    

Which is everything that the file contained, but if you look carefully, the last line is not a valid header element because it opens with h1 and closes h2 so the correct regular expression would take this into account and only output lines that have matching opening and closing tags, this can be achieved as follows:

      grep -e "^<h\([1-6]\)[^>]*>[^<]*</h\1>" html.txt
    

The expression is the same as before but instead of having two [1-6] sections, the first [1-6] section is enclosed within "\(" and "\)" so that it can be back-referenced. In the closing tag the group is back-referenced using \1 which means that the string matched by the back-referenced group must be matched again, hence the opening and closing header tags must be of the same level. This produces the correct output:

<h1 blah="cool">Title1</h1>
<h2>Title2</h2>
<h3>Title3</h3>
<h4>Title4</h4>
<h5>Title5</h5>
<h6>Title5</h6>
    

java.util.regex provides classes for matching character sequences against regular expressions. The two classes of java.util.regex are Matcher and Pattern. Pattern provides the regular expression in an efficient compiled Java version. Matcher provides the methods needed to match a character sequence against a Pattern. The java.util.regex entry in the Java API 1.4 can be found at http://java.sun.com/j2se/1.4/docs/api/java/util/regex/package-summary.html.

So that Java regex can be illustrated efficiently, a program will be developed that takes a searchPattern and a searchString as arguments and then prints out some information regarding the application of the searchPattern to the searchString. This will promote efficient demonstration of java.util.regex compliant regular expressions by removing the need to re-compile the test class with a new searchPattern and searchString. The process of building the program will illustrate how one can use the features provided by java.util.regex. The program is shown below:

import java.util.regex.*;
public class Regex {
   public static void main(String args[]) {
      String searchString  = "",
             searchPattern = "";

      if(args.length==2) {
         searchPattern = args[0]; 
         searchString  = args[1]; 
      } else {
         output("Usage:");
         output("java regex searchPattern searchString");
         System.exit(0);
      }

      Pattern p = Pattern.compile(searchPattern);
      Matcher m = p.matcher(searchString);
      boolean b = m.find();

      output("\nMatch found   : "+b);
      while(b) {
         output("Match start   : " + m.start());
         output("Match end     : " + m.end());
         output("Match content : " + m.group(0));
         if(m.groupCount()!=0) {
            for(int i=1; i<=m.groupCount(); i++) {
               output("Group " + i + "       : " + m.group(i));
            }
         }
         b = m.find();
         if(b) output("\nMatch found   : "+b);
      }
   }

   private static void output(String s) {
      System.out.println(s);
   }
}
    

The program begins with the importation of the Java regular expression package java.util.regex, the Strings searchString and searchString are declared ready for their use later. The number of command line arguments is checked, if it is not equal to two the usage message is output, if it is equal to two the first command-line argument is assigned to the String variable searchPattern and the second command-line argument is assigned to the String variable searchString.

      Pattern p = Pattern.compile(searchPattern);
      Matcher m = p.matcher(searchString);
      boolean b = m.find();
    

The Pattern is created from the searchPattern using the compile method which compiles the given regular expression into a pattern. A Matcher is created based on the recently created Pattern and the searchString, the Matcher will match instances of the searchPattern within the searchString. A boolean called b is set to the result of m.find() which is the Matcher method which attempts to find the next subsequence of input sequence that matches the Pattern defined by searchPattern. The state of the Matcher is updated upon a successful match to contain information about the match such as where it occurred in the string and the content of marked groups.

      output("\nMatch found   : "+b);
      while(b) {
         output("Match start   : " + m.start());
         output("Match end     : " + m.end());
         output("Match content : " + m.group(0));
         if(m.groupCount()!=0) {
            for(int i=1; i<=m.groupCount(); i++) {
               output("Group " + i + "       : " + m.group(i));
            }
         }
         b = m.find();
         if(b) output("\nMatch found   : "+b);
      }
    

This loop first prints whether or not a match was found, if it was then the start position of the match is output using the Matcher method start(), the end position (+1) of the match is output using the Matcher method end. The portion of searchString that matched the pattern is printed using the Matcher method group(int i) which returns the portion of searchString matched by the i'th bracketed group within the pattern, group(0) returns the portion of searchString that is matched by the whole pattern.

        if(m.groupCount()!=0) {
            for(int i=1; i<=m.groupCount(); i++) {
               output("Group " + i + "       : " + m.group(i));
            }
         }
         b = m.find();
         if(b) output("\nMatch found   : "+b);
    

If any groups were defined in the pattern, this section loops through the groups and prints out there content, group(0) is not printed because it was printed earlier. The boolean b is set to the result of the next call to find() and if it is true, indicating another instance of the pattern has been matched, the program prints that it has found a match. This condition is necessary so that at the end of all the matches, "Match found : false" is not printed.

The program can be downloaded from here: Regex.java and it is executed like this:

      java Regex patternString searchString
    

Here are a few examples:

java Regex "Hello" "Hello World!"

Match found   : true
Match start   : 0
Match end     : 5
Match content : Hello 
    
java Regex "[Hh]ello" "Hello there Peter! Oh hello there James!"

Match found   : true
Match start   : 0
Match end     : 5
Match content : Hello

Match found   : true
Match start   : 22
Match end     : 27
Match content : hello
    
java Regex "(H)(e)(l)(l)(o)" "Hello ello ello!"

Match found   : true
Match start   : 0
Match end     : 5
Match content : Hello
Group 1       : H
Group 2       : e
Group 3       : l
Group 4       : l
Group 5       : o
    
java Regex "H(e(l(l(o))))" "Hello ello ello"

Match found   : true
Match start   : 0
Match end     : 5
Match content : Hello
Group 1       : ello
Group 2       : llo
Group 3       : lo
Group 4       : o
    
java Regex "!*" "0+ !!!"

Match found   : true
Match start   : 0
Match end     : 0
Match content :

Match found   : true
Match start   : 1
Match end     : 1
Match content :

Match found   : true
Match start   : 2
Match end     : 2
Match content :

Match found   : true
Match start   : 3
Match end     : 6
Match content : !!!

Match found   : true
Match start   : 6
Match end     : 6
Match content :
    

The example shown last is quite strange in that it illustrates how each character in the input sequence matches the pattern since the pattern specifies that zero or more '!' characters should be matched, the content is empty however since zero '!' were matched, eventually the three '!'s are matched and then finally the newline produced when the command was entered is matched.

java Regex "S.*t" "Spontaneous combustion"

Match found   : true
Match start   : 0
Match end     : 19
Match content : Spontaneous combust

java Regex "S.*?t" "Spontaneous combustion"

Match found   : true
Match start   : 0
Match end     : 5
Match content : Spont
    

Notice the difference between these two, in that, the first expression uses the greedy version of ".*" which matches as many characters as it can whilst still producing a match where as the second expression uses the reluctant version ".*?" which matches the least amount of characters it can whilst still producing a match.

java Regex "10{3}\b" "10 100 1000 10000"

Match found   : true
Match start   : 7
Match end     : 11
Match content : 1000
    

Matches '1' followed by three '0's. The "\b" matches a word boundary so that this expression does not match "10000".

java Regex "10{2,}\b" "10 100 1000 10000"

Match found   : true
Match start   : 3
Match end     : 6
Match content : 100

Match found   : true
Match start   : 7
Match end     : 11
Match content : 1000

Match found   : true
Match start   : 12
Match end     : 17
Match content : 10000
    

Matches '1' followed by at least two '0's followed by a word boundary

      java Regex "10{1,3}\b" "10 100 1000 10000"

Match found   : true
Match start   : 0
Match end     : 2
Match content : 10

Match found   : true
Match start   : 3
Match end     : 6
Match content : 100

Match found   : true
Match start   : 7
Match end     : 11
Match content : 1000
    

Matches '1' followed by a minimum of one '0' and a maximum of 3 '0's followed by a word boundary.

Emacs has builtin regular expression support. Regular expressions may be used within searches by typing the Emacs command sequence C-M-s, this is CTRL-ALT-s on most computers. An example is shown below:

More useful however is the regular expression search and replace function. It is activated by typing C-M-%, that is CTRL-ALT-% on most computers. When this command sequence is entered, the user is asked to enter an expression to find the text to replace and then an expression to use to replace the text found. The regular expression syntax is shown below:

Regular Expressions
any single character except a newline              .   (dot)

zero or more repeats                               *
one or more repeats                                +
zero or one repeat                                 ?
any character in the set                           [ : : :]
any character not in the set                       [^ : : :]
beginning of line                                  ^
end of line                                        $
quote a special character c                        \c
alternative (\or\)                                 \_
grouping                                           \( : : :\)
nth group                                          \n
beginning of buffer                                \`
end of buffer                                      \'
word break                                         \b
not beginning or end of word                       \B
beginning of word                                  \lt;
end of word                                        \gt;
any word-syntax character                          \w
any non-word-syntax character                      \W
character with syntax c                            \sc
character with syntax not c                        \Sc
    

If you had a HTML file and you wanted to replace every occurrence of "<table>" with "<table border="1">" you could use:

Query replace regexp: <table> with: <table border=\"1\">

You can use back-references too:

Query replace regexp \(this\)\(.*\)\(that\) with \3\2\1

When operated on:

Switch this and that!
Switch this and then switch that!
Take this! and take that too!
    

Produces:

Switch that and this!
Switch that and then switch this!
Take that! and take this too!
    

".*" comes in greedy and non greedy flavours, consider the line:

You are greedy not greedy!

Using the greedy flavour:

Query replace regexp Y.*y with You are

Causes the whole line to be replaced with "You are", where as the non-greedy:

Query replace regexp Y.*?y with You are

Causes only the first part of the line to be replaced with "You are", leaving "You are not greedy!"

As a more useful example, imagine you have a tab separated file like this:

NAME   AGE  SEX
MAN1   32   M
MAN2   23   M
WOMAN1 33   F
WOMAN2 34   F
    

The following emacs regular expression could be used to swap the last two columns around:

^\([^ ]+\)\([ ]+\)\([^ ]\)\([ ]+\)\(.+?\)$

The regular expression begins with '^' to say that the pattern should begin with the beginning of a line. The next block is "\([^ ]+\)" which says to match one or more non tab characters, in the example the tab looks like a space, this is because in emacs one would actually enter the tab character like one would any other character so I have replaced the large gap that a tab makes with a smaller one in this document, in emacs the expression looks like this:

Notice how emacs varies the sizes of some of the tabs, if one did not know they were tabs it would be possible to mistake them for spaces. The "one or more non-tab characters" pattern is enclosed between "/(" and "/)", this is emacs regular expression grouping so that the group can be back-referenced later. The next section in the pattern is "\([ ]+\)" which says to match one or more tab characters and save them in a group. After this is another grouped "one or more non-tab characters" section followed by another grouped "one or more tab characters" section followed finally by a grouped "one or more any characters" section. Terminated with an end of line. This can be summarised to:

      (StartOfLine)   (Non-Tab+)   (Tab+)   (Non-Tab+)   (Tab+)   (Anything+)   (EndOfLine)
                       group 1     group2     group3     group4      group5
    

When asked for the replacement text, this is used:

\1\2\5\4\3

Which replaces the line with the text from the groups, in the order specified, so that the text file is changed to:

NAME   SEX  AGE
MAN1   M    32
MAN2   M    23
WOMAN1 F    33
WOMAN2 F    34
    

The columns sex and age have been swapped.