<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<article id="RegExp">
  <articleinfo>
    <title>Regular Expressions</title>
    <author>
      <firstname>Ashley</firstname>
      <othername>J.S</othername>
      <surname>Mills</surname>
      <affiliation>
        <address><email>ashley@ashleymills.com</email></address>
      </affiliation>
    </author>

    <copyright>
      <year>2005</year>
      <holder role="mailto:ashley@ashleymills.com">The University Of Birmingham</holder>
    </copyright>
  </articleinfo>

  <sect1 id="RegExp-Introduction"><title>Introduction</title>
    <para>
      Pattern matching is an important topic in Computer Science: it is the process of matching defined patterns against information.  Humans use pattern matching every day to recognise objects and faces, and computers use it every day to perform the most basic of operations.  When you execute a command at the command line, some kind of pattern matching is employed to determine what your command is asking the computer to do.  Pattern matching is also used in compilers and programming languages.
    </para>

    <para>
      Regular expressions are a particular kind of pattern matching language, corresponding to the class of regular languages.  They are considered the least complex of the pattern matching languages but are very useful.  You have probably used something like them before: if you have ever asked to delete <emphasis role="strong">*.*</emphasis> at the command line, referring to any basename followed by a dot followed by any extension, then you have used the concepts of regular expressions at least once.  Most of you will be aware that in this setting the <emphasis role="strong">*</emphasis> character, known as a Kleene star or asterisk, means &quot;match anything&quot;.  It is used in a very similar manner in the regular expressions we are about to discuss, where it means &quot;zero or more of the preceding expression&quot;.
    </para>

    <para> 
      There are many programs that have some kind of built-in regular expression handling.  They all seem to have slight syntactic variations; fortunately the concepts are identical in each case and the differences are often marginal.  This text describes the most common components of a regular expression and presents program-specific examples where appropriate.
    </para>
  </sect1>

  <sect1 id="RegExp-Basics"><title>Basics</title>
    <para>
      Regular expressions consist of literal characters and meta characters.  Literal characters are the actual characters you want to find.  Meta characters are special characters, like the Kleene star, and are the core concept behind regular expressions, hence we will begin this section with a brief introduction to the most common meta characters.
    </para>

    <sect2 id="RegExp-Basics-Single"><title>Single Character</title>
      <para>
        A single character such as <emphasis role="strong">Q</emphasis> is a regular expression.  It is the regular expression that matches every string that contains the character <emphasis role="strong">Q</emphasis>, so it would match Quick, Quiet and Quantum but not quick.
      </para>
    </sect2>

    <sect2 id="RegExp-Basics-Period"><title>Any Character: <emphasis role="strong">.</emphasis></title>
      <para>
        The period, or full-stop as we call it in Britain, matches any single character.  For example, <emphasis role="strong">.t.m</emphasis> would match atom, item and stem and probably some other words too.  A fun example of using this character can be found at <ulink url="http://www.oneacross.com/">http://www.oneacross.com/</ulink> where it is used to help people find words for their crosswords; they also use the character <emphasis role="strong">?</emphasis> as an alternative.
      </para>
    </sect2>

    <sect2 id="RegExp-Basics-Escape"><title>The Escape Character: <emphasis role="strong">\</emphasis></title>
      <para>
        <emphasis role="strong">\</emphasis> is used to signify that we want to use a meta character as a literal character; without it the character in question would be interpreted as a meta character.  The character being escaped is the one immediately following the escape character.  For example, &quot;\*&quot; would match the character that has been escaped, that is, it would match the string <emphasis role="strong">*</emphasis> (or any string containing it).
      </para>

      <para>
        The converse can also be true: sometimes <emphasis role="strong">\</emphasis> is used to signify that we want to use a literal character as a meta character, for example, within a double quoted string in an implementation that requires that meta characters are escaped.  You should read the documentation of the particular regular expression implementation you are using to find out which approach it takes.
      </para>
    </sect2>

    <sect2 id="RegExp-Basics-Caret"><title>The Caret: <emphasis role="strong">^</emphasis></title>
      <para>
        <emphasis role="strong">^</emphasis>, known as a caret, is used to match the beginning of a line, so &quot;^CAPITAL&quot; would match &quot;CAPITALS signify emphasised speech, anger or SHOUTING&quot; but would not match &quot;You're such a CAPITAL idiot!&quot;.
      </para>
    </sect2>

    <sect2 id="RegExp-Basics-Dollar"><title>The Dollar Symbol: <emphasis role="strong">$</emphasis></title>
      <para>
        <emphasis role="strong">$</emphasis> is used to match the end of a line, so &quot;here$&quot; would match &quot;I like it here&quot; but would not match &quot;here is a potato&quot;.
      </para>
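      <para>
        As a minimal sketch of the two anchors, the following Java program checks strings in the same spirit as the examples above.  Java is used here only because <emphasis>java.util.regex</emphasis> is covered later in this article; the class name <emphasis>AnchorDemo</emphasis> and the helper <emphasis>found</emphasis> are hypothetical names chosen for illustration.  The helper succeeds if the pattern occurs anywhere in the input, which is roughly how <command>grep</command> tests each line.
      </para>

      <programlisting><![CDATA[
```java
import java.util.regex.Pattern;

class AnchorDemo {
   // True if the pattern occurs anywhere in the line (hypothetical helper).
   static boolean found(String regex, String line) {
      return Pattern.compile(regex).matcher(line).find();
   }

   public static void main(String[] args) {
      // ^ anchors the match to the start of the input
      System.out.println(found("^CAPITAL", "CAPITALS signify SHOUTING"));
      System.out.println(found("^CAPITAL", "Such a CAPITAL idea"));
      // $ anchors the match to the end of the input
      System.out.println(found("here$", "I like it here"));
      System.out.println(found("here$", "here is a potato"));
   }
}
```
]]></programlisting>

      <para>
        The first and third calls should report a match and the other two should not.
      </para>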
    </sect2>

    <sect2 id="RegExp-Basics-KleeneStar"><title>The Kleene star: <emphasis role="strong">*</emphasis></title>
      <para>
        <emphasis role="strong">*</emphasis> is used to match zero or more occurrences of the regular expression immediately preceding the meta character. &quot;10*&quot; would match &quot;1&quot;, &quot;10&quot;, &quot;100&quot;, &quot;1000&quot; and so on. 
      </para>
    </sect2>

    <sect2 id="RegExp-Basics-KleenePlus"><title>The Kleene plus: <emphasis role="strong">+</emphasis></title>
      <para>
        <emphasis role="strong">+</emphasis> is used to match one or more occurrences of the regular expression immediately preceding the meta character. &quot;10+&quot; would match &quot;10&quot;, &quot;100&quot;, &quot;1000&quot; and so on but would not match &quot;1&quot;. 
      </para>

      <note>
        <para>
          <emphasis role="strong">(regular expression)+</emphasis> is the same as <emphasis role="strong">(regular expression)(regular expression)*</emphasis>.
        </para>
      </note>
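      <para>
        The difference between the star and the plus can be sketched in Java as follows.  The class <emphasis>StarPlusDemo</emphasis> and its helper <emphasis>matchesWhole</emphasis> are hypothetical; the helper requires the whole input to match, so &quot;1&quot; is accepted by the star but rejected by the plus.  The third pattern spells out the equivalence given in the note.
      </para>

      <programlisting><![CDATA[
```java
import java.util.regex.Pattern;

class StarPlusDemo {
   // True only if the entire input is matched by the regular expression.
   static boolean matchesWhole(String regex, String input) {
      return Pattern.matches(regex, input);
   }

   public static void main(String[] args) {
      String[] inputs = {"1", "10", "100", "1000"};
      for (String s : inputs) {
         System.out.println(s
            + "  10*=" + matchesWhole("10*", s)          // zero or more '0'
            + "  10+=" + matchesWhole("10+", s)          // one or more '0'
            + "  1(0)(0)*=" + matchesWhole("1(0)(0)*", s)); // same language as 10+
      }
   }
}
```
]]></programlisting>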
    </sect2>

    <sect2 id="RegExp-Basics-Range"><title>Ranges: <emphasis role="strong">[ ]</emphasis>, <emphasis role="strong">[cn-cm]</emphasis> and <emphasis role="strong">[^cn-cm]</emphasis></title>
      <para>
        <emphasis role="strong">[ ]</emphasis> is used to signify that <emphasis>any one</emphasis> of the characters enclosed within the brackets may be matched.  <emphasis role="strong">1[123]512</emphasis> would match &quot;11512&quot;, &quot;12512&quot; and &quot;13512&quot;.
      </para>

      <para>
        <emphasis role="strong">[cn-cm]</emphasis> is used to specify a range of characters (inclusively) that may be matched at this point in the regular expression. &quot;[b-f]oo&quot; would match &quot;boo&quot;, &quot;coo&quot;, &quot;doo&quot;, &quot;eoo&quot; and &quot;foo&quot; but not &quot;goo&quot;.
      </para>

      <para>
        <emphasis role="strong">[^cn-cm]</emphasis> is used to exclude a range of characters from a match.  Notice that the caret has been used again: when it appears immediately after an opening <emphasis role="strong">[</emphasis> it has this special meaning.  If you want to exclude the caret itself then you would escape it: &quot;[^\^]&quot;.  &quot;[^1-8]00&quot; would match &quot;900&quot; but not any of the other three digit hundreds such as &quot;500&quot;.
      </para>
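      <para>
        The three forms can be tried out with a short Java sketch.  The names <emphasis>RangeDemo</emphasis> and <emphasis>matchesWord</emphasis> are hypothetical, and the helper insists that the whole word matches.  The last call assumes an implementation, such as Java's, in which a backslash escapes the caret inside the brackets.
      </para>

      <programlisting><![CDATA[
```java
import java.util.regex.Pattern;

class RangeDemo {
   // True if the regular expression matches the entire word.
   static boolean matchesWord(String regex, String word) {
      return Pattern.matches(regex, word);
   }

   public static void main(String[] args) {
      System.out.println(matchesWord("1[123]512", "12512")); // set of characters
      System.out.println(matchesWord("[b-f]oo", "foo"));     // inclusive range
      System.out.println(matchesWord("[b-f]oo", "goo"));     // 'g' lies outside b-f
      System.out.println(matchesWord("[^1-8]00", "900"));    // negated range
      System.out.println(matchesWord("[^1-8]00", "500"));    // '5' is excluded
      System.out.println(matchesWord("[^\\^]", "^"));        // escaped caret is excluded
   }
}
```
]]></programlisting>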
    </sect2>

    <sect2 id="RegExp-Basics-Groups"><title>Grouping: <emphasis role="strong">\( \)</emphasis></title>
      <para>
        <emphasis role="strong">\( \)</emphasis> is used to treat the regular expression contained within the (escaped in this case) brackets as a group.  This group can then be back referenced later, using <emphasis role="strong">\1</emphasis> to refer to the first group defined.  How this is implemented varies between programs that use regular expressions: some tools do not require you to escape the brackets, and some use different conventions to back reference defined groups.  For instance a program may use &quot;$1&quot; to refer to the first bracketed group instead of &quot;\1&quot;.  There may also be limits on the number of groups that can be referenced in this way; sometimes it is a maximum of nine.  In the program <command>grep</command> &quot;\(a\)b\1&quot; would match &quot;aba&quot;.
      </para>
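      <para>
        Java is one of the tools that does not escape the brackets, and it illustrates both back-reference conventions: &quot;\1&quot; inside the pattern and &quot;$1&quot; inside a replacement string.  The sketch below is illustrative only; <emphasis>GroupDemo</emphasis>, <emphasis>firstGroup</emphasis> and <emphasis>swap</emphasis> are hypothetical names.
      </para>

      <programlisting><![CDATA[
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
   // Returns the text captured by the first group, or null if nothing matches.
   static String firstGroup(String regex, String input) {
      Matcher m = Pattern.compile(regex).matcher(input);
      return m.find() ? m.group(1) : null;
   }

   // Swaps two colon-separated lowercase words using $1 and $2 in the replacement.
   static String swap(String input) {
      return input.replaceAll("([a-z]+):([a-z]+)", "$2:$1");
   }

   public static void main(String[] args) {
      // \1 must repeat exactly what group 1 matched
      System.out.println(Pattern.matches("(a)b\\1", "aba"));
      System.out.println(Pattern.matches("(a)b\\1", "abb"));
      System.out.println(firstGroup("(a)b\\1", "xxabaxx"));
      System.out.println(swap("key:value"));
   }
}
```
]]></programlisting>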
    </sect2>

    <sect2 id="RegExp-Basics-OR"><title>Alternatives: <emphasis role="strong">|</emphasis></title>
      <para>
        <emphasis role="strong">|</emphasis> is the <emphasis role="strong">OR</emphasis> operator; its operands are the regular expressions either side of it, signifying that if either the first expression <emphasis role="strong">OR</emphasis> the second expression matches, then the whole expression will match. For example &quot;^aba\|b$&quot; will match the lines &quot;aba&quot; and &quot;abb&quot; but not &quot;abc&quot;.  Note that the operands extend as far as possible in each direction, so this expression reads as &quot;^aba&quot; <emphasis role="strong">OR</emphasis> &quot;b$&quot;; to restrict the alternatives to a single character, group them: &quot;^ab\(a\|b\)$&quot;.  The <emphasis role="strong">|</emphasis> meta character may or may not need to be escaped depending on the program.
      </para>
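      <para>
        How far the operands of the OR operator reach can be checked with a small Java sketch (Java writes the operator and the grouping brackets unescaped).  The names <emphasis>AlternationDemo</emphasis> and <emphasis>found</emphasis> are hypothetical.  The ungrouped pattern accepts any line that starts with &quot;aba&quot; or ends with &quot;b&quot;, whereas the grouped pattern accepts only the lines &quot;aba&quot; and &quot;abb&quot;.
      </para>

      <programlisting><![CDATA[
```java
import java.util.regex.Pattern;

class AlternationDemo {
   // True if the pattern occurs anywhere in the line.
   static boolean found(String regex, String line) {
      return Pattern.compile(regex).matcher(line).find();
   }

   public static void main(String[] args) {
      String[] lines = {"aba", "abb", "abc", "abaxyz"};
      for (String line : lines) {
         System.out.println(line
            + "  ungrouped=" + found("^aba|b$", line)    // (^aba) OR (b$)
            + "  grouped=" + found("^ab(a|b)$", line));  // ab, then a OR b, then end
      }
   }
}
```
]]></programlisting>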
    </sect2>

    <sect2 id="RegExp-Basics-Repetition"><title>Repetition: <emphasis role="strong">\{n\}</emphasis>, <emphasis role="strong">\{,n\}</emphasis>, <emphasis role="strong">\{n,\}</emphasis>, <emphasis role="strong">\{n,m\}</emphasis></title>

      <para>
        <emphasis role="strong">\{n\}</emphasis> is used to specify that the regular expression immediately preceding must be matched <emphasis>n</emphasis> times exactly. &quot;^10\{3\}$&quot; will match the line &quot;1000&quot; but not &quot;100&quot; or &quot;10000&quot;.
      </para>

      <para>
        <emphasis role="strong">\{,n\}</emphasis> is used to specify that the regular expression immediately preceding may be matched up to a maximum of <emphasis>n</emphasis> times. &quot;^10\{,3\}$&quot; will match the lines &quot;1&quot;, &quot;10&quot;, &quot;100&quot; and &quot;1000&quot; but will not match &quot;10000&quot;.
      </para>

      <para>
        <emphasis role="strong">\{n,\}</emphasis> is used to specify that the regular expression immediately preceding must be matched at least <emphasis>n</emphasis> times. &quot;^10\{3,\}$&quot; will match the lines &quot;1000&quot;, &quot;10000&quot;, &quot;100000&quot; and so on but will not match &quot;100&quot;.
      </para>

      <note>
        <para>
          The <emphasis role="strong">\{n,\}</emphasis> form is an alternative to the Kleene star and the Kleene plus, which may not be supported in your implementation.  &quot;a\{0,\}&quot; is the same as &quot;a*&quot; and &quot;a\{1,\}&quot; is the same as &quot;a+&quot;.
        </para>
      </note>

      <para>
        <emphasis role="strong">\{n,m\}</emphasis> is used to specify that the regular expression immediately preceding must be matched at least <emphasis>n</emphasis> times but may not exceed <emphasis>m</emphasis> matches. &quot;^10\{3,4\}$&quot; will match the lines &quot;1000&quot; and &quot;10000&quot; but not &quot;100&quot; or &quot;100000&quot;.  The necessity to escape the characters may vary. Not all programs support all the types of repetition described.
      </para>
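      <para>
        The four forms can be exercised with the Java sketch below, in which the braces are written unescaped.  The upper-bound-only form is written as {0,3} here, on the assumption that not every implementation accepts {,3}.  <emphasis>RepetitionDemo</emphasis> and <emphasis>matchesLine</emphasis> are hypothetical names; the helper matches the whole line, so the anchors used in the examples above are implied.
      </para>

      <programlisting><![CDATA[
```java
import java.util.regex.Pattern;

class RepetitionDemo {
   // True if the regular expression matches the whole line (so ^ and $ are implied).
   static boolean matchesLine(String regex, String line) {
      return Pattern.matches(regex, line);
   }

   public static void main(String[] args) {
      String[] patterns = {"10{3}", "10{0,3}", "10{3,}", "10{3,4}"};
      String[] lines = {"1", "100", "1000", "10000", "100000"};
      for (String p : patterns) {
         StringBuilder row = new StringBuilder(p + " matches:");
         for (String line : lines) {
            if (matchesLine(p, line)) {
               row.append(' ').append(line);
            }
         }
         System.out.println(row);
      }
   }
}
```
]]></programlisting>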
    </sect2>
  </sect1>

  <sect1 id="RegExp-Grep"><title>Grep Examples</title>
    <para>
      Grep is a tool used to search text using regular expressions.  Its origins highlight its function; according to <ulink url="http://www.faqs.org/faqs/usenet/faq/part1/section-21.html">http://www.faqs.org/faqs/usenet/faq/part1/section-21.html</ulink> they are as follows:
    </para> 

    <blockquote>
      <para>
       The original UNIX text editor &quot;ed&quot; has a construct g/re/p, where &quot;re&quot; stands for a regular expression, to Globally search for matches to the Regular Expression and Print the lines containing them.  This was so often used that it was packaged up into its own command, thus named &quot;grep&quot;.  According to Dennis Ritchie, this is the true origin of the command.
      </para>
    </blockquote>

    <para>I will present a few examples, of which the first two are based on the following text file, <filename>mb.txt</filename>:</para>

      <programlisting>
NAME       MAKE     HP YEAR PRICE
NSR250R    Honda    60 1993 £1340 
NSR250R-SP Honda    65 1994 £2000
KR1S       Kawasaki 60 1989 £1250
GSX250     Suzuki   26 1981 £300
GS250T     Suzuki   26 1982 £250
RGV250     Suzuki   60 1993 £1400
RGV250-SP  Suzuki   65 1994 £2400 
      </programlisting>

    <para>
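      The examples will use the <option>-e</option> option, which specifies the pattern to search for.  With this option <command>grep</command> expects a basic regular expression, which is why the grouping and repetition meta characters in the later examples are escaped; the <option>-E</option> option would select extended regular expression syntax instead.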
    </para>

    <screen>
      <userinput><command>grep</command> <option>-e</option> Honda <filename>mb.txt</filename></userinput>
    </screen>

    <para>Lists all the lines that contain the text string &quot;Honda&quot;:</para>

    <screen>
NSR250R    Honda    60 1993 £1340 
NSR250R-SP Honda    65 1994 £2000      
    </screen>

    <screen>
      <userinput><command>grep</command> <option>-e</option> &quot;6. 1&quot; <filename>mb.txt</filename> | <command>grep</command> <option>-v</option> Honda</userinput>
    </screen>

    <para>Output:</para>

    <screen>
KR1S       Kawasaki 60 1989 £1250
RGV250     Suzuki   60 1993 £1400
RGV250-SP  Suzuki   65 1994 £2400 
    </screen>

    <para>
      This first lists all bikes that are sixty-something BHP and then pipes the result to another instance of <command>grep</command>, which excludes all the lines containing &quot;Honda&quot; by means of the <option>-v</option> option.
    </para>

    <note>
      <para>Quotes are used to preserve whitespace.  They are also used whenever '\' is used, since this character is special within the shell and so needs to be hidden from it.</para>
    </note>
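    <para>
      For readers without <command>grep</command> to hand, the two-stage pipeline can be approximated in Java.  This is only a sketch: <emphasis>GrepPipelineDemo</emphasis> and its <emphasis>filter</emphasis> method are hypothetical names, the rows of <filename>mb.txt</filename> are embedded in the program, and the PRICE column is left out for brevity.
    </para>

    <programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class GrepPipelineDemo {
   // Rows of mb.txt with the PRICE column omitted (an abbreviation made for this sketch).
   static final String[] LINES = {
      "NAME       MAKE     HP YEAR",
      "NSR250R    Honda    60 1993",
      "NSR250R-SP Honda    65 1994",
      "KR1S       Kawasaki 60 1989",
      "GSX250     Suzuki   26 1981",
      "GS250T     Suzuki   26 1982",
      "RGV250     Suzuki   60 1993",
      "RGV250-SP  Suzuki   65 1994"
   };

   // Keeps lines that contain a match for 'keep' and no match for 'drop',
   // in the manner of: grep -e keep | grep -v drop
   static List<String> filter(String[] lines, String keep, String drop) {
      Pattern keepPattern = Pattern.compile(keep);
      Pattern dropPattern = Pattern.compile(drop);
      List<String> result = new ArrayList<String>();
      for (String line : lines) {
         if (keepPattern.matcher(line).find() && !dropPattern.matcher(line).find()) {
            result.add(line);
         }
      }
      return result;
   }

   public static void main(String[] args) {
      for (String line : filter(LINES, "6. 1", "Honda")) {
         System.out.println(line);
      }
   }
}
```
]]></programlisting>

    <para>
      Run as written, this should print the KR1S, RGV250 and RGV250-SP rows.
    </para>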

    <screen>
      <userinput>ls <option>-l</option> | <command>grep</command> <option>-e</option> &quot;Aristotle\.txt&quot;</userinput>
    </screen>

    <para>Output:</para>

    <screen>
      Aristotle.txt
    </screen>

    <para>Pipes the output from a directory listing to <command>grep</command>, which filters the lines containing &quot;Aristotle.txt&quot;.  Note the use of the escape character '\' to literally match '.'.</para>

    <screen>
      <userinput><command>grep</command> <option>-e</option> &quot;\(101\)\1&quot; <filename>in.file</filename></userinput>
    </screen>

    <para>
      Matches the string &quot;101101&quot;.  Notice that &quot;101&quot; is first grouped by enclosing it within an escaped opening parenthesis &quot;\(&quot; and an escaped closing parenthesis &quot;\)&quot;.  The first group is then referenced with &quot;\1&quot;. Something like:
    </para>

    <screen>
      <userinput><command>grep</command> <option>-e</option> &quot;1\(0\)*&quot; <filename>in.file</filename></userinput>
    </screen>

    <para>Would match &quot;1&quot; followed by zero or more occurrences of &quot;0&quot;, whereas:</para>

    <screen>
      <userinput><command>grep</command>  <option>-e</option> &quot;1\(0\)\+&quot; <filename>in.file</filename></userinput>
    </screen>

    <para>Would match &quot;1&quot; followed by <emphasis>at least</emphasis> one occurrence of &quot;0&quot;, which is the same as:</para>

    <screen>
      <userinput><command>grep</command> <option>-e</option> &quot;1\(0\)\(\1\)*&quot; <filename>in.file</filename></userinput>
    </screen>

    <para>
      '[' followed by ']' can be used to match a range of characters and some special ranges are already defined:
    </para>

    <itemizedlist>
      <listitem>
        <para>
          [[:alnum:]] matches [0-9a-zA-Z]
        </para>
      </listitem>
      <listitem>
        <para>
          [[:alpha:]] matches [a-zA-Z]
        </para>
      </listitem>
      <listitem>
        <para>
          [[:cntrl:]] matches control characters
        </para>
      </listitem>
      <listitem>
        <para>
          [[:digit:]] matches [0-9]
        </para>
      </listitem>
      <listitem>
        <para>
          [[:lower:]] matches [a-z]
        </para>
      </listitem>
      <listitem>
        <para>
          [[:punct:]] matches punctuation characters
        </para>
      </listitem>
      <listitem>
        <para>
          [[:upper:]] matches [A-Z]
        </para>
      </listitem>
      <listitem>
        <para>
          [[:space:]] matches any white space
        </para>
      </listitem>
    </itemizedlist>

    <screen>
      <userinput><command>grep</command> <option>-e</option> &quot;\([[:alpha:]]\)\+[[:digit:]][[:upper:]]&quot;</userinput>
    </screen>

    <para>
      Would match one or more characters in the range [a-zA-Z] followed by one character in the range [0-9] followed by one character in the range [A-Z]. So it would match the string &quot;abc9Z&quot;. The number of times a pattern must be matched can be specified after the group:
    </para>

    <screen>
      <userinput><command>grep</command> <option>-e</option> &quot;\(abc\)\{3\}&quot; <filename>in.file</filename></userinput>
    </screen>

    <para>Would match any lines containing three consecutive occurrences of the pattern &quot;abc&quot;.</para>

    <screen>
      <userinput><command>grep</command> <option>-e</option> &quot;^\(abc\)\{2\}&quot; <filename>in.file</filename></userinput>
    </screen>

    <para>
      Would match any lines containing 2 consecutive occurrences of the pattern &quot;abc&quot;, with the restriction that the sequence must start at the beginning of a line, as specified by the use of '^'.  There are similar anchors to '^':
    </para>

    <itemizedlist>
      <listitem>
        <para>
          <emphasis role="strong">$</emphasis> matches the end of a line
        </para>
      </listitem>
      <listitem>
        <para>
          <emphasis role="strong">\&lt;</emphasis> matches the beginning of a word
        </para>
      </listitem>
      <listitem>
        <para>
          <emphasis role="strong">\&gt;</emphasis> matches the end of a word
        </para>
      </listitem>
      <listitem>
        <para>
          <emphasis role="strong">\b</emphasis> matches the empty string at the edge of a word
        </para>
      </listitem>
      <listitem>
        <para>
          <emphasis role="strong">\B</emphasis> matches the empty string provided it is <emphasis>not</emphasis> at the edge of a word.
        </para>
      </listitem>
    </itemizedlist>

    <screen>
      <userinput><command>grep</command> <option>-e</option> &quot;^\([[:alpha:]][[:alnum:]]*\)=\1&quot;</userinput>
    </screen>

    <para>
      Would match an alpha character at the start of a line, followed by zero or more alphanumeric characters, followed by '=', followed by the same sequence of characters that was matched before the '='.  So &quot;abc=abc&quot; would be matched but &quot;abc=abd&quot; would not.
    </para>

    <para>
      Suppose you wanted to match the <emphasis>h1</emphasis>, <emphasis>h2</emphasis>, <emphasis>h3</emphasis>... etc. elements in an <acronym>HTML</acronym> file.  Assume the text file <filename>html.txt</filename>:
    </para>

    <programlisting>
&lt;h1 blah=&quot;cool&quot;&gt;Title1&lt;/h1&gt;
&lt;h2&gt;Title2&lt;/h2&gt;
&lt;h3&gt;Title3&lt;/h3&gt;
&lt;h4&gt;Title4&lt;/h4&gt;
&lt;h5&gt;Title5&lt;/h5&gt;
&lt;h6&gt;Title5&lt;/h6&gt;
&lt;h1&gt;Title2&lt;/h2&gt;
    </programlisting>

    <para>One could use the following:</para>

    <screen>
      <command>grep</command> <option>-e</option> &quot;^&lt;h[1-6][^&gt;]*&gt;[^&lt;]*&lt;/h[1-6]&gt;&quot; <filename>html.txt</filename>
    </screen>

    <para>
      Which says to match, from the start of a line: &quot;&lt;h&quot;, then a character from the range [1-6], then zero or more characters other than '&gt;' (so that the opening tag may contain attributes), then the character '&gt;', then zero or more characters other than '&lt;', then &quot;&lt;/h&quot;, then a character from the range [1-6], then the character '&gt;'. The output from executing this is shown below:
    </para>

    <screen>
&lt;h1 blah=&quot;cool&quot;&gt;Title1&lt;/h1&gt;
&lt;h2&gt;Title2&lt;/h2&gt;
&lt;h3&gt;Title3&lt;/h3&gt;
&lt;h4&gt;Title4&lt;/h4&gt;
&lt;h5&gt;Title5&lt;/h5&gt;
&lt;h6&gt;Title5&lt;/h6&gt;
&lt;h1&gt;Title2&lt;/h2&gt;
    </screen>

    <para>
      Which is everything that the file contained.  If you look carefully, however, the last line is not a valid header element because it opens with <emphasis>h1</emphasis> and closes with <emphasis>h2</emphasis>.  The correct regular expression would take this into account and only output lines that have matching opening and closing tags.  This can be achieved as follows:
    </para>

    <screen>
      <command>grep</command> <option>-e</option> &quot;^&lt;h\([1-6]\)[^&gt;]*&gt;[^&lt;]*&lt;/h\1&gt;&quot; <filename>html.txt</filename>
    </screen>

    <para>
      The expression is the same as before but instead of having two <emphasis role="strong">[1-6]</emphasis> sections, the first <emphasis role="strong">[1-6]</emphasis> section is enclosed within &quot;\(&quot; and &quot;\)&quot; so that it can be back-referenced.  In the closing tag the group is back-referenced using <emphasis role="strong">\1</emphasis> which means that the string matched by the back-referenced group must be matched again, hence the opening and closing header tags must be of the same level.  This produces the correct output:
    </para>

    <screen>
&lt;h1 blah=&quot;cool&quot;&gt;Title1&lt;/h1&gt;
&lt;h2&gt;Title2&lt;/h2&gt;
&lt;h3&gt;Title3&lt;/h3&gt;
&lt;h4&gt;Title4&lt;/h4&gt;
&lt;h5&gt;Title5&lt;/h5&gt;
&lt;h6&gt;Title5&lt;/h6&gt;
    </screen>
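    <para>
      The same back-referenced pattern carries over to other tools with small changes.  As a hedged sketch, here it is in Java, where the grouping brackets are not escaped but the back reference is still written &quot;\1&quot;.  The names <emphasis>HeadingDemo</emphasis> and <emphasis>isHeading</emphasis> are hypothetical.
    </para>

    <programlisting><![CDATA[
```java
import java.util.regex.Pattern;

class HeadingDemo {
   // Java spelling of the grep pattern: ( ) are unescaped, \1 refers to the digit captured.
   static final Pattern HEADING =
      Pattern.compile("^<h([1-6])[^>]*>[^<]*</h\\1>");

   // True if the line starts with a heading whose closing tag has the same level.
   static boolean isHeading(String line) {
      return HEADING.matcher(line).find();
   }

   public static void main(String[] args) {
      System.out.println(isHeading("<h1 blah=\"cool\">Title1</h1>")); // levels agree
      System.out.println(isHeading("<h2>Title2</h2>"));               // levels agree
      System.out.println(isHeading("<h1>Title2</h2>"));               // h1 closed by h2
   }
}
```
]]></programlisting>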
  </sect1>

  <sect1 id="RegExp-Java"><title><emphasis>java.util.regex</emphasis>, Java 1.4</title>
    <para>
      <emphasis>java.util.regex</emphasis> provides classes for matching character sequences against regular expressions. The two main classes of <emphasis>java.util.regex</emphasis> are <emphasis>Matcher</emphasis> and <emphasis>Pattern</emphasis>. Pattern holds the regular expression in an efficient compiled form. Matcher provides the methods needed to match a character sequence against a <emphasis>Pattern</emphasis>. The <emphasis>java.util.regex</emphasis> entry in the Java API 1.4 can be found at <ulink url="http://java.sun.com/j2se/1.4/docs/api/java/util/regex/package-summary.html">http://java.sun.com/j2se/1.4/docs/api/java/util/regex/package-summary.html</ulink>.
    </para>

    <para>
      So that Java regular expressions can be illustrated conveniently, a program will be developed that takes a searchPattern and a searchString as arguments and then prints out some information regarding the application of the searchPattern to the searchString. This removes the need to re-compile the test class for each new searchPattern and searchString. The process of building the program will illustrate how one can use the features provided by <emphasis>java.util.regex</emphasis>. The program is shown below:
    </para>

    <programlisting>
import java.util.regex.*;
public class Regex {
   public static void main(String args[]) {
      String searchString  = &quot;&quot;,
             searchPattern = &quot;&quot;;

      if(args.length==2) {
         searchPattern = args[0]; 
         searchString  = args[1]; 
      } else {
         output(&quot;Usage:&quot;);
         output(&quot;java Regex searchPattern searchString&quot;);
         System.exit(0);
      }

      Pattern p = Pattern.compile(searchPattern);
      Matcher m = p.matcher(searchString);
      boolean b = m.find();

      output(&quot;\nMatch found   : &quot;+b);
      while(b) {
         output(&quot;Match start   : &quot; + m.start());
         output(&quot;Match end     : &quot; + m.end());
         output(&quot;Match content : &quot; + m.group(0));
         if(m.groupCount()!=0) {
            for(int i=1; i&lt;=m.groupCount(); i++) {
               output(&quot;Group &quot; + i + &quot;       : &quot; + m.group(i));
            }
         }
         b = m.find();
         if(b) output(&quot;\nMatch found   : &quot;+b);
      }
   }

   private static void output(String s) {
      System.out.println(s);
   }
}
    </programlisting>

    <para>
      The program begins by importing the Java regular expression package <emphasis>java.util.regex</emphasis>.  The Strings <emphasis>searchString</emphasis> and <emphasis>searchPattern</emphasis> are declared ready for their use later. The number of command line arguments is then checked: if it is not equal to two the usage message is output; if it is equal to two the first command-line argument is assigned to the String variable <emphasis>searchPattern</emphasis> and the second command-line argument is assigned to the String variable <emphasis>searchString</emphasis>.
    </para>

    <programlisting>
      Pattern p = Pattern.compile(searchPattern);
      Matcher m = p.matcher(searchString);
      boolean b = m.find();
    </programlisting>

    <para>
      The Pattern is created from the <emphasis>searchPattern</emphasis> using the <emphasis>compile</emphasis> method, which compiles the given regular expression into a pattern. A Matcher is created based on the recently created Pattern and the <emphasis>searchString</emphasis>; the Matcher will match instances of the <emphasis>searchPattern</emphasis> within the <emphasis>searchString</emphasis>. A boolean called <emphasis role="strong">b</emphasis> is set to the result of <emphasis>m.find()</emphasis>, the Matcher method that attempts to find the next subsequence of the input sequence that matches the Pattern defined by <emphasis>searchPattern</emphasis>. The state of the Matcher is updated upon a successful match to contain information about the match, such as where it occurred in the string and the content of marked groups.
    </para>

    <programlisting>
      output(&quot;\nMatch found   : &quot;+b);
      while(b) {
         output(&quot;Match start   : &quot; + m.start());
         output(&quot;Match end     : &quot; + m.end());
         output(&quot;Match content : &quot; + m.group(0));
         if(m.groupCount()!=0) {
            for(int i=1; i&lt;=m.groupCount(); i++) {
               output(&quot;Group &quot; + i + &quot;       : &quot; + m.group(i));
            }
         }
         b = m.find();
         if(b) output(&quot;\nMatch found   : &quot;+b);
      }
    </programlisting>

    <para>
      This code first prints whether or not a match was found.  If one was, the start position of the match is output using the Matcher method <emphasis>start()</emphasis>, and the offset just after the last matched character is output using the Matcher method <emphasis>end()</emphasis>. The portion of <emphasis>searchString</emphasis> that matched the pattern is printed using the Matcher method <emphasis>group(int i)</emphasis>, which returns the portion of <emphasis>searchString</emphasis> matched by the i'th bracketed group within the pattern; group(0) returns the portion of <emphasis>searchString</emphasis> that is matched by the whole pattern.
    </para>

    <programlisting>
         if(m.groupCount()!=0) {
            for(int i=1; i&lt;=m.groupCount(); i++) {
               output(&quot;Group &quot; + i + &quot;       : &quot; + m.group(i));
            }
         }
         b = m.find();
         if(b) output(&quot;\nMatch found   : &quot;+b);
    </programlisting>

    <para>
      If any groups were defined in the pattern, this section loops through the groups and prints out their content; group(0) is not printed because it was printed earlier. The boolean <emphasis role="strong">b</emphasis> is set to the result of the next call to <emphasis>find()</emphasis> and, if it is true, indicating that another instance of the pattern has been matched, the program prints that it has found a match. This condition is necessary so that at the end of all the matches, &quot;Match found   : false&quot; is not printed.
    </para>

    <para>
      The program can be downloaded from here: <ulink url="files/Regex.java">Regex.java</ulink> and it is executed like this:
    </para>

    <screen>
      <userinput><command>java</command> Regex <replaceable>patternString</replaceable> <replaceable>searchString</replaceable></userinput>
    </screen>

    <para>Here are a few examples:</para>

    <screen>
<userinput><command>java</command> Regex &quot;Hello&quot; &quot;Hello World!&quot;</userinput>

Match found   : true
Match start   : 0
Match end     : 5
Match content : Hello 
    </screen>

    <screen>
<userinput><command>java</command> Regex &quot;[Hh]ello&quot; &quot;Hello there Peter! Oh hello there James!&quot;</userinput>

Match found   : true
Match start   : 0
Match end     : 5
Match content : Hello

Match found   : true
Match start   : 22
Match end     : 27
Match content : hello
    </screen>

    <screen>
<userinput><command>java</command> Regex &quot;(H)(e)(l)(l)(o)&quot; &quot;Hello ello ello!&quot;</userinput>

Match found   : true
Match start   : 0
Match end     : 5
Match content : Hello
Group 1       : H
Group 2       : e
Group 3       : l
Group 4       : l
Group 5       : o
    </screen>

    <screen>
<userinput><command>java</command> Regex &quot;H(e(l(l(o))))&quot; &quot;Hello ello ello&quot;</userinput>

Match found   : true
Match start   : 0
Match end     : 5
Match content : Hello
Group 1       : ello
Group 2       : llo
Group 3       : lo
Group 4       : o
    </screen>

    <screen>
<userinput><command>java</command> Regex &quot;!*&quot; &quot;0+ !!!&quot;</userinput>

Match found   : true
Match start   : 0
Match end     : 0
Match content :

Match found   : true
Match start   : 1
Match end     : 1
Match content :

Match found   : true
Match start   : 2
Match end     : 2
Match content :

Match found   : true
Match start   : 3
Match end     : 6
Match content : !!!

Match found   : true
Match start   : 6
Match end     : 6
Match content :
    </screen>

    <para>
      The last example is quite strange in that the pattern matches at every position in the input sequence, since it specifies that zero or more '!' characters should be matched.  At offsets 0, 1 and 2 the content is empty because zero '!' characters were matched there; then the three '!'s are matched together, and finally an empty match is found at offset 6, the very end of the input.
    </para>
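    <para>
      The offsets can be confirmed with a small variation on the test program.  In this sketch <emphasis>EmptyMatchDemo</emphasis> and <emphasis>spans</emphasis> are hypothetical names; the method collects the start and end offsets of every match reported by <emphasis>find()</emphasis>.  Comparing the last offset with the length of the input shows where the final empty match lies, and replacing the star with a plus removes the empty matches altogether.
    </para>

    <programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
   // Collects "start-end" offsets for every match that find() reports.
   static List<String> spans(String regex, String input) {
      Matcher m = Pattern.compile(regex).matcher(input);
      List<String> result = new ArrayList<String>();
      while (m.find()) {
         result.add(m.start() + "-" + m.end());
      }
      return result;
   }

   public static void main(String[] args) {
      String input = "0+ !!!";
      System.out.println("length: " + input.length());
      System.out.println("!* : " + spans("!*", input)); // includes empty matches
      System.out.println("!+ : " + spans("!+", input)); // at least one '!' required
   }
}
```
]]></programlisting>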

    <screen>
<userinput><command>java</command> Regex &quot;S.*t&quot; &quot;Spontaneous combustion&quot;</userinput>

Match found   : true
Match start   : 0
Match end     : 19
Match content : Spontaneous combust

<userinput><command>java</command> Regex &quot;S.*?t&quot; &quot;Spontaneous combustion&quot;</userinput>

Match found   : true
Match start   : 0
Match end     : 5
Match content : Spont
    </screen>

    <para>
      Notice the difference between these two: the first expression uses the greedy version &quot;.*&quot;, which matches as many characters as it can whilst still producing a match, whereas the second expression uses the reluctant version &quot;.*?&quot;, which matches the fewest characters it can whilst still producing a match.
    </para>

    <screen>
<userinput><command>java</command> Regex &quot;10{3}\b&quot; &quot;10 100 1000 10000&quot;</userinput>

Match found   : true
Match start   : 7
Match end     : 11
Match content : 1000
    </screen>

    <para>
      Matches '1' followed by three '0's. The &quot;\b&quot; matches a word boundary so that this expression does not match &quot;10000&quot;.
    </para>

    <screen>
<userinput><command>java</command> Regex &quot;10{2,}\b&quot; &quot;10 100 1000 10000&quot;</userinput>

Match found   : true
Match start   : 3
Match end     : 6
Match content : 100

Match found   : true
Match start   : 7
Match end     : 11
Match content : 1000

Match found   : true
Match start   : 12
Match end     : 17
Match content : 10000
    </screen>

    <para>Matches '1' followed by <emphasis>at least</emphasis> two '0's followed by a word boundary.</para>

    <screen>
<userinput><command>java</command> Regex &quot;10{1,3}\b&quot; &quot;10 100 1000 10000&quot;</userinput>

Match found   : true
Match start   : 0
Match end     : 2
Match content : 10

Match found   : true
Match start   : 3
Match end     : 6
Match content : 100

Match found   : true
Match start   : 7
Match end     : 11
Match content : 1000
    </screen>

    <para>
      Matches '1' followed by a minimum of one '0' and a maximum of 3 '0's followed by a word boundary.
    </para>
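
    <para>
      The three counted-repetition examples above can be reproduced in one short program. This is a sketch with hypothetical names (<classname>BoundedRepeats</classname>, <function>allMatches</function>). Note that inside a Java string literal the word boundary must be written as &quot;\\b&quot;, because the backslash itself has to be escaped.
    </para>

    <programlisting><![CDATA[import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedRepeats {
    // Collects the text of every match found in the input.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "10 100 1000 10000";
        System.out.println(allMatches("10{3}\\b", input));   // [1000]
        System.out.println(allMatches("10{2,}\\b", input));  // [100, 1000, 10000]
        System.out.println(allMatches("10{1,3}\\b", input)); // [10, 100, 1000]
    }
}]]></programlisting>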
  </sect1>

  <sect1 id="RegExp-Emacs"><title>Emacs Regular Expressions</title>
    <para>
      Emacs has built-in regular expression support. Regular expressions may be used within searches by typing the Emacs command sequence C-M-s, which is CTRL-ALT-s on most computers. An example is shown below:
    </para>

    <figure><title>RegExp search</title>
      <mediaobject>
        <imageobject><imagedata fileref="files/images/emacsregexpsearch.png" format="PNG"/></imageobject>
      </mediaobject>
    </figure>

    <para>
      More useful, however, is the regular expression search and replace function. It is activated by typing C-M-%, which is CTRL-ALT-% on most computers. When this command sequence is entered, the user is asked for an expression that finds the text to replace and then for an expression that produces the replacement text. The regular expression syntax is shown below:
    </para>

    <programlisting>
Regular Expressions
any single character except a newline              .   (dot)

zero or more repeats                               *
one or more repeats                                +
zero or one repeat                                 ?
any character in the set                           [ ... ]
any character not in the set                       [^ ... ]
beginning of line                                  ^
end of line                                        $
quote a special character c                        \c
alternative (&quot;or&quot;)                                 \|
grouping                                           \( ... \)
nth group                                          \n
beginning of buffer                                \`
end of buffer                                      \'
word break                                         \b
not beginning or end of word                       \B
beginning of word                                  \&lt;
end of word                                        \&gt;
any word-syntax character                          \w
any non-word-syntax character                      \W
character with syntax c                            \sc
character with syntax not c                        \Sc
    </programlisting>

    <para>
      If you had an <acronym>HTML</acronym> file and wanted to replace every occurrence of &quot;&lt;table&gt;&quot; with &quot;&lt;table border=&quot;1&quot;&gt;&quot;, you could use:
    </para>

    <programlisting>Query replace regexp: &lt;table&gt; with: &lt;table border=&quot;1&quot;&gt;</programlisting>

    <para>You can use back-references too:</para>

    <programlisting>Query replace regexp \(this\)\(.*\)\(that\) with \3\2\1</programlisting>

    <para>When operated on:</para>

    <programlisting>
Switch this and that!
Switch this and then switch that!
Take this! and take that too!
    </programlisting>

    <para>Produces:</para>

    <programlisting>
Switch that and this!
Switch that and then switch this!
Take that! and take this too!
    </programlisting>
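
    <para>
      For comparison, the same swap can be sketched in Java, the language used for the earlier examples. This is an illustration rather than part of the Emacs workflow: <classname>java.util.regex</classname> writes groups as plain parentheses and refers to them in the replacement text as $1, $2 and $3 instead of \1, \2 and \3. The class <classname>SwapWords</classname> and the method <function>swap</function> are hypothetical names.
    </para>

    <programlisting><![CDATA[public class SwapWords {
    // Exchanges "this" and "that", keeping the text between them in place.
    static String swap(String line) {
        return line.replaceAll("(this)(.*)(that)", "$3$2$1");
    }

    public static void main(String[] args) {
        System.out.println(swap("Switch this and that!"));
        System.out.println(swap("Switch this and then switch that!"));
        System.out.println(swap("Take this! and take that too!"));
    }
}]]></programlisting>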

    <para>The &quot;.*&quot; pattern comes in greedy and non-greedy flavours. Consider the line:</para>

    <programlisting>You are greedy not greedy!</programlisting>

    <para>Using the greedy flavour:</para>

    <programlisting>Query replace regexp Y.*y with You are</programlisting>

    <para>Causes everything from the 'Y' up to the last 'y' to be replaced with &quot;You are&quot;, leaving only &quot;You are!&quot;, whereas the non-greedy:</para>

    <programlisting>Query replace regexp Y.*?y with You are</programlisting>

    <para>
      Causes only the first part of the line to be replaced with &quot;You are&quot;, leaving &quot;You are not greedy!&quot;
    </para>

    <para>As a more useful example, imagine you have a tab-separated file like this:</para>

    <programlisting>
NAME   AGE  SEX
MAN1   32   M
MAN2   23   M
WOMAN1 33   F
WOMAN2 34   F
    </programlisting>

    <para>The following emacs regular expression could be used to swap the last two columns around:</para>

    <programlisting>^\([^ ]+\)\([ ]+\)\([^ ]+\)\([ ]+\)\(.+?\)$</programlisting>

    <para>
      The regular expression begins with '^', which anchors the pattern to the beginning of a line. The next block, &quot;\([^ ]+\)&quot;, matches one or more non-tab characters. In the listing above each tab looks like a single space: in Emacs a tab is entered into the expression like any other character, and the wide gap it produces has been replaced with a narrower one in this document. In Emacs the expression looks like this:
    </para>

    <figure><title>Emacs tabs</title>
      <mediaobject>
        <imageobject><imagedata fileref="files/images/emacstabs.png" format="PNG"/></imageobject>
      </mediaobject>
    </figure>

    <para>
      Notice how Emacs varies the sizes of some of the tabs; without knowing that they are tabs, one could mistake them for spaces. The &quot;one or more non-tab characters&quot; pattern is enclosed between &quot;\(&quot; and &quot;\)&quot;, the Emacs grouping syntax, so that the group can be back-referenced later. The next section of the pattern is &quot;\([ ]+\)&quot;, which matches one or more tab characters and saves them in a group. After this comes another grouped &quot;one or more non-tab characters&quot; section, then another grouped &quot;one or more tab characters&quot; section, and finally a grouped &quot;one or more of any character&quot; section, terminated by the end of the line. This can be summarised as:
    </para>

    <programlisting>
      (StartOfLine)   (Non-Tab+)   (Tab+)   (Non-Tab+)   (Tab+)   (Anything+)   (EndOfLine)
                       group 1     group2     group3     group4      group5
    </programlisting>

    <para>When asked for the replacement text, this is used:</para>

    <programlisting>\1\2\5\4\3</programlisting>

    <para>
      This replaces each line with the text from its groups, in the order specified, so that the text file is changed to:
    </para>

    <programlisting>
NAME   SEX  AGE
MAN1   M    32
MAN2   M    23
WOMAN1 F    33
WOMAN2 F    34
    </programlisting>

    <para>The columns <emphasis>sex</emphasis> and <emphasis>age</emphasis> have been swapped.</para>
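
    <para>
      The same column swap can be sketched in Java for comparison. The assumptions are labelled: the class <classname>SwapColumns</classname> and the method <function>swapLastTwoColumns</function> are hypothetical, tabs are written as &quot;\\t&quot; inside the string literal, groups use plain parentheses, the replacement uses $n instead of \n, and <constant>Pattern.MULTILINE</constant> is set so that '^' and '$' match at every line rather than only at the ends of the whole input.
    </para>

    <programlisting><![CDATA[import java.util.regex.Pattern;

public class SwapColumns {
    // (non-tabs)(tabs)(non-tabs)(tabs)(rest of line), anchored per line
    private static final Pattern ROW = Pattern.compile(
            "^([^\\t]+)(\\t+)([^\\t]+)(\\t+)(.+?)$", Pattern.MULTILINE);

    // Re-emits the groups with the third and fifth exchanged.
    static String swapLastTwoColumns(String text) {
        return ROW.matcher(text).replaceAll("$1$2$5$4$3");
    }

    public static void main(String[] args) {
        String table = "NAME\tAGE\tSEX\n"
                     + "MAN1\t32\tM\n"
                     + "WOMAN1\t33\tF\n";
        // Each row now reads NAME, SEX, AGE
        System.out.print(swapLastTwoColumns(table));
    }
}]]></programlisting>

    <para>
      The separator groups are kept in their original positions, so only the contents of the two columns move.
    </para>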
  </sect1>

  <sect1 id="Regexp-References"><title>References</title>
    <itemizedlist>
      <listitem>
        <para>
          <ulink url="http://www.evolt.org/article/rating/20/22700/">http://www.evolt.org/article/rating/20/22700/</ulink>
        </para>
        <para>Evolt Regular Expression Tutorial</para>
      </listitem>

      <listitem>
        <para><ulink url="http://sitescooper.org/tao_regexps.html">http://sitescooper.org/tao_regexps.html</ulink></para>
        <para>A Tao Of Regular Expressions</para>
      </listitem>

      <listitem>
        <para><ulink url="http://www.zytrax.com/tech/web/regex.htm">http://www.zytrax.com/tech/web/regex.htm</ulink></para>
        <para>Zytrax Regular Expression Tutorial</para>
      </listitem>

      <listitem>
        <para>
          <ulink url="http://etext.lib.virginia.edu/helpsheets/regex.html">http://etext.lib.virginia.edu/helpsheets/regex.html</ulink>
        </para>
        <para>Using Regular Expressions -  Stephen Ramsay</para>
      </listitem>

      <listitem>
        <para><ulink url="http://www.grymoire.com/Unix/Regular.html">http://www.grymoire.com/Unix/Regular.html</ulink></para>
        <para>
          Regular Expressions - Bruce Barnett &amp; General Electric Company
        </para>
      </listitem>

      <listitem>
        <para>
          <ulink url="http://jakarta.apache.org/regexp/">http://jakarta.apache.org/regexp/</ulink>
        </para>
        <para>Jakarta Regexp</para>
      </listitem>

      <listitem>
        <para>
          <ulink url="http://theory.uwinnipeg.ca/localfiles/infofiles/lisp/lispref_Searching_and_Matching.html">http://theory.uwinnipeg.ca/localfiles/infofiles/lisp/lispref_Searching_and_Matching.html</ulink>
        </para>
        <para>XEmacs Lisp Reference Manual - Searching and Matching</para>
      </listitem>
    </itemizedlist>
  </sect1>
</article>

