# Getting Up to Speed With DocBook

## 1. Installation

This section details how to install the tools required to validate and process XML DocBook documents. The tools to be installed are a collection of tools and libraries called 'libxml', Saxon and FOP. The first is used here to validate XML files, and the latter two are used to process XML files into other types of output. Another tool, called a resolver, is installed so that the tools can map references to files held on remote machines onto copies held on the local machine. This allows one to use the tools without an Internet connection and speeds up their execution.

### 1.1. libxml

Within this tutorial the primary purpose of installing the libxml C library is to gain access to the tools that come with it. These tools provide the means to validate and transform XML files. In this tutorial, the program xmllint is used to validate XML DocBook files before processing. The program xsltproc, an XSLT processor, can be used to transform XML files.

#### 1.1.1. Windows

Note: Direct download links for the three required archives (libxml2, libxslt and iconv) may be broken, since it is common practice to remove old versions from a download page when they become obsolete. If so, go to http://www.zlatkovic.com/pub/libxml/ and download the libxml2..., libxslt..., and iconv... files with the highest version numbers. Some older versions are available in the directory oldreleases on that server, should one desire them.

It is not necessary to extract the contents of these zips entirely; only the required files need to be extracted. Create a suitable directory to hold them. For example, on a Windows system I have a directory called c:\tools which contains all the tools I install, and within it a directory called libxml that holds the files I want from these zips.

Extract the following files from the libxml archive into the directory.

• libxml2.dll

• xmllint.exe

Extract the following files from the libxslt archive into the directory.

• libexslt.dll

• libxslt.dll

• xsltproc.exe

Extract the following files from the iconv archive into the directory.

• iconv.dll

• iconv.exe

Append the directory you extracted the files into (\directory\you\just\unzipped\everything\to) to the PATH environment variable.

You might not use all the tools but they are worth having around in case you decide you need them.

#### 1.1.2. Unix/Linux/BSD

These files are probably already installed on your system, as most modern distributions of these operating systems use XML processing for some of their more popular components. If you wish to get the latest versions, go to ftp://xmlsoft.org/ and get the latest libxml2 and libxslt. There are gzipped tars and RPMs available; download whichever you prefer. The latest files at the time of writing are listed below:

• libxml2-2.4.25.tar.gz

• libxml2-2.4.25-1.i386.rpm

• libxml2-2.4.25-1.src.rpm

• libxslt-1.0.21.tar.gz

• libxslt-1.0.21-1.i386.rpm

• libxslt-1.0.21-1.src.rpm

The FTP directory also contains devel versions of the software; these are for people who want to develop with libxml.

### 1.2. FOP

FOP (Formatting Objects Processor) is used to transform FO files into files of other formats. In this tutorial it is used to transform FO output produced by xsltproc into PDF, a well-known format considered by many to be aesthetically pleasing. The Unix and Windows installation paths are very similar; the differences are mentioned where appropriate.

On Windows, append the directory containing fop.bat (/directory/where/you/unzipped/fop) to the PATH environment variable.

On Unix, append the directory containing fop.sh (/directory/where/you/unzipped/fop) to the PATH environment variable.

#### 1.2.1. Install Jimi

On Windows, rename JimiProClasses.zip from the archive to jimi-1.0.jar and place it in the /directory/where/you/unzipped/fop/lib directory.

On Unix, rename JimiProClasses.zip from the archive to JimiProClasses.jar and place it in the /directory/where/you/unzipped/fop/lib directory.

### 1.4. XSL StyleSheets

XSL stylesheets dictate how a document written in XML should be transformed using XSLT to a particular output format. In the case of DocBook, Norman Walsh has already written, and regularly maintains, stylesheets that provide rules for transforming an XML DocBook document into the most commonly desired output formats, such as XHTML and PDF. The installation for Unix and Windows machines is the same.

Download the latest stylesheets from http://sourceforge.net/project/showfiles.php?group_id=21935 and unzip the zip or gzipped tar to some suitable location. If I were running a Windows system I would use c:\lib\docbook-xsl\; if I were using a Unix system I would use c:\lib\docbook\docbook-xsl. The stylesheets are now ready to use.

#### 1.4.1. Custom StyleSheets

The output produced by the stylesheets mentioned above is reasonable, but they are a standard distribution and so, sensibly enough, are designed to cater for the needs of the many rather than for particular tastes. One may modify the stylesheets directly, but more often one creates a customisation layer which imports the standard stylesheets and then overrides specific aspects of them, or adds extra functionality, according to one's tastes. I have created a customisation layer which looks good enough for standard applications and am offering it for download.

Unzip the zip to where you want the customisation layer to be situated; this could be within the stylesheets directory or in a separate directory. If you unzip it to the stylesheets directory, the customisation layer will unzip into the existing directories common, fo and xhtml. If you unzip it to a separate directory, these directories will be created.

Wherever you unzip the zip, it is important to change the import references in the files so that they reflect the state of your system. The files fo/customfo.xsl, xhtml/customxhtml.xsl and xhtml/customchunk.xsl all have references that may need to be modified. For example, the file fo/customfo.xsl has the import line:

  

Change this to point to /where/you/put/the/stylesheets/fo/docbook.xsl

Similarly, change the entry in customchunk.xsl to point to /where/you/put/the/stylesheets/xhtml/chunk.xsl and the entry in customxhtml.xsl to point to /where/you/put/the/stylesheets/xhtml/docbook.xsl. The advantage of unzipping the zip in the same location as the standard stylesheets is that the import references can simply be relative. They can always be relative, provided the stylesheets are on the same machine, but for clarity I make the import references absolute when the customisation stylesheets live in a different directory.
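As a rough sketch, a customisation layer such as fo/customfo.xsl has the following shape. The paper.type override is only an illustration of the kind of setting one might place there, not necessarily what the provided files contain, and the import path is the placeholder used above.

```xml
<?xml version="1.0"?>
<!-- Sketch of a customisation layer; contents are assumed for illustration -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Adjust this reference to wherever the standard stylesheets live -->
  <xsl:import href="/where/you/put/the/stylesheets/fo/docbook.xsl"/>

  <!-- Illustrative override of a standard stylesheet parameter -->
  <xsl:param name="paper.type" select="'A4'"/>

</xsl:stylesheet>
```

Anything declared after the import takes precedence over the imported definition, which is what lets the layer override the standard stylesheets without editing them.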

I have only provided customisations for FO and XHTML. It will become apparent how to use the customisation layer in the section on using the tools later. The provided customisations are listed below:

• fo/customfo.xsl - Use this to generate custom FO

• xhtml/customxhtml.xsl - Use this to generate custom XHTML (segmented)

• xhtml/customchunk.xsl - Use this to generate custom XHTML (chunked)

## 2. Using the tools to validate and transform DocBook documents

### 2.1. Using xmllint to validate an XML DocBook document

To check that a DocBook document conforms to the DocBook DTD, one may use xmllint.

 xmllint --valid --noout in.xml

The --valid option specifies that xmllint should validate the document against the DTD, and the --noout option specifies that the parsed document should not be written to the output. Hence if the document being validated is valid, xmllint will exit silently. If the document is invalid, xmllint will output an error similar to this:

    docbook.xml:1: error: Start tag expected, '<' not found
    ?xml version="1.0" encoding='ISO-8859-15'?>
    ^

This indicates that a start tag is missing on line one.

Note: One can use the --dtdvalid option to specify an external DTD to validate the file against (the --loaddtd option only fetches the external DTD that the document itself references). Also, the --nonet option can be useful to suppress fetching of DTD files from the web if you find that your version does this by default and you don't want it to.

### 2.2. Using xsltproc to generate XHTML (single file) output from an XML DocBook document

 xsltproc file:///path/to/docbook-xsl/xhtml/docbook.xsl in.xml > out.html

This will produce a single XHTML file according to the XSL stylesheet specifications.

### 2.4. Using xsltproc to generate FO output from an XML DocBook document

 xsltproc file:///path/to/docbook-xsl/fo/docbook.xsl in.xml > out.fo

This will produce output as XSL FO (Formatting Objects). This is an intermediate file type that can be used by other programs to generate other types of output, such as PDF.

## 3. Creating an XML DocBook document

For the ultimate reference guide see DocBook: The Definitive Guide. A template for a DocBook article is shown below:

 
2002 Your name Include an abstract of the article's contents here.
Section 1 blah blah blah Section 2 blah blah blah
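As a hedged sketch, an article with the content shown above might be marked up as follows. The DTD version (4.2), the placeholder title, and the choice of articleinfo, copyright and sect1 are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<article>
  <articleinfo>
    <!-- Placeholder title, assumed for illustration -->
    <title>Article title</title>
    <copyright>
      <year>2002</year>
      <holder>Your name</holder>
    </copyright>
    <abstract>
      <para>Include an abstract of the article's contents here.</para>
    </abstract>
  </articleinfo>
  <sect1>
    <title>Section 1</title>
    <para>blah blah blah</para>
  </sect1>
  <sect1>
    <title>Section 2</title>
    <para>blah blah blah</para>
  </sect1>
</article>
```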

Note: The output in the following examples was produced using a customisation of the stylesheets, hence output on systems not implementing the same customisations may differ.

### 3.1. Common DocBook Elements

#### 3.1.1. <para>

The reference page for the para element can be found here: http://www.docbook.org/tdg/en/html/para.html. para is one of the most commonly used DocBook elements. A para can contain block elements such as itemizedlist and mediaobject, and almost all inline elements. There is some debate about whether it is best to keep block elements out of para elements; it is probably better to do so, because some processing systems have problems processing block elements within para elements. An example of a para element containing some inline elements is shown below:

  Behold the superfluous. They are always sick. They vomit their gall and call it a newspaper. - Friedrich Wilhelm Nietzsche, Twilight of the Idols 

Looks like this:

“Behold the superfluous. They are always sick. They vomit their gall and call it a newspaper.” - Friedrich Wilhelm Nietzsche, Twilight of the Idols
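Markup along the following lines would produce output like that shown above. This is a sketch: the use of quote and citetitle as the inline elements is an assumption.

```xml
<!-- Inline element choices (quote, citetitle) are assumed -->
<para>
  <quote>Behold the superfluous. They are always sick. They vomit
  their gall and call it a newspaper.</quote>
  - Friedrich Wilhelm Nietzsche,
  <citetitle>Twilight of the Idols</citetitle>
</para>
```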

#### 3.1.2. <programlisting>

The reference page for the programlisting element can be found here: http://www.docbook.org/tdg/en/html/programlisting.html. The programlisting element is used to display information that should be output verbatim, that is, white space is significant. An example is shown below:

    <programlisting>
    public class HelloWorld {
        public static void main(String args[]) {
            System.out.println(&quot;Hello World!&quot;);
        }
    }
    </programlisting>

Is output as:

    public class HelloWorld {
        public static void main(String args[]) {
            System.out.println("Hello World!");
        }
    }

Notice the use of &quot; to represent the " character. This is known as a character entity and is used to represent a character that has a special meaning in XML markup and so cannot always be used directly in the document. Text in which such characters are taken literally rather than parsed as markup is known as CDATA (character data), as opposed to PCDATA (parsed character data). If one wants to use lots of these characters in a document, one can wrap the text in a CDATA section like this:

    <![CDATA[ One can get away with using lots of &&& """ ''' <<< >>> characters that would normally require being marked up as entities. ]]>

Is displayed as:

  One can get away with using lots of &&& """ ''' <<< >>> characters that would normally have to be marked up as entities. 

#### 3.1.4. <screen>

The reference page for the screen element can be found here: http://www.docbook.org/tdg/en/html/screen.html. Often one wants to illustrate the use of a program or a command line. The screen element is intended to mark content up as text that a user would see on a computer screen. An example is shown below:

  java org.apache.fop.apps.Fop in.fo out.pdf 

Is displayed as:

  java org.apache.fop.apps.Fop in.fo out.pdf 
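In DocBook source, the example above would be wrapped in a screen element. A minimal sketch, with the surrounding markup assumed:

```xml
<!-- White space inside screen is preserved in the output -->
<screen>java org.apache.fop.apps.Fop in.fo out.pdf</screen>
```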

#### 3.1.5. <ulink>

The reference page for the ulink element can be found here: http://www.docbook.org/tdg/en/html/ulink.html. ulink is the DocBook equivalent of HTML's <a href="...">blah blah</a>. An example is shown below:

  http://www.oasis-open.org/committees/docbook/ 
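A minimal sketch of such a link, assuming the link text simply repeats the URL:

```xml
<!-- The url attribute holds the target; the element content is the link text (assumed here) -->
<ulink url="http://www.oasis-open.org/committees/docbook/">http://www.oasis-open.org/committees/docbook/</ulink>
```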

Displays as:

#### 3.1.6. Lists

##### 3.1.6.1. <itemizedlist>

The reference page for itemizedlist is here: http://www.docbook.org/tdg/en/html/itemizedlist.html. Itemized lists are standard bulleted lists and should be used where the order of the items is not significant; ordered lists should be used where the order of the items is significant. An example of an itemized list is shown below:

  Books Donald E. Knuth - The Art Of Computer Programming Nils J. Nilsson - Artificial Intelligence: A New Synthesis Pure Mathematics 2 - Geoff Mannall, Michael Kenwood Games Chess Backgammon Noughts And Crosses 

Which looks like this:

• Books

  • Donald E. Knuth - The Art Of Computer Programming

  • Nils J. Nilsson - Artificial Intelligence: A New Synthesis

  • Pure Mathematics 2 - Geoff Mannall, Michael Kenwood

• Games

  • Chess

  • Backgammon

  • Noughts And Crosses
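A nested list like the one above could be marked up roughly as follows. This is a sketch; the exact nesting is inferred from the list contents.

```xml
<!-- Each listitem holds a para and, where needed, a nested itemizedlist -->
<itemizedlist>
  <listitem>
    <para>Books</para>
    <itemizedlist>
      <listitem><para>Donald E. Knuth - The Art Of Computer Programming</para></listitem>
      <listitem><para>Nils J. Nilsson - Artificial Intelligence: A New Synthesis</para></listitem>
      <listitem><para>Pure Mathematics 2 - Geoff Mannall, Michael Kenwood</para></listitem>
    </itemizedlist>
  </listitem>
  <listitem>
    <para>Games</para>
    <itemizedlist>
      <listitem><para>Chess</para></listitem>
      <listitem><para>Backgammon</para></listitem>
      <listitem><para>Noughts And Crosses</para></listitem>
    </itemizedlist>
  </listitem>
</itemizedlist>
```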

##### 3.1.6.2. <orderedlist>

The reference page for orderedlist is here: http://www.docbook.org/tdg/en/html/orderedlist.html. Ordered lists are used to specify a sequence of steps of which the order of evaluation is significant. The general form of an ordered list is like this:

  Action A Action B 

Which would look like this:

1. Action A

2. Action B
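As a sketch, the two-step list above corresponds to markup of this form:

```xml
<!-- No numeration attribute given, so the stylesheet default applies -->
<orderedlist>
  <listitem><para>Action A</para></listitem>
  <listitem><para>Action B</para></listitem>
</orderedlist>
```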

One may also specify the type of enumeration that the list will display. There are five types of enumeration: arabic, loweralpha, lowerroman, upperalpha and upperroman. The type of enumeration is specified via the numeration attribute like this:

  ... . . . 

The types of enumeration are shown below:

Arabic:

1. arabic

2. arabic

3. arabic

Loweralpha:

a. loweralpha

b. loweralpha

c. loweralpha

Lowerroman:

i. lowerroman

ii. lowerroman

iii. lowerroman

Upperalpha:

A. upperalpha

B. upperalpha

C. upperalpha

Upperroman:

I. upperroman

II. upperroman

III. upperroman
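For instance, a sketch of the markup for one of the lists above, here using lowerroman (any of the five values can be substituted):

```xml
<!-- numeration selects the label style for this list's items -->
<orderedlist numeration="lowerroman">
  <listitem><para>lowerroman</para></listitem>
  <listitem><para>lowerroman</para></listitem>
  <listitem><para>lowerroman</para></listitem>
</orderedlist>
```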

These can be combined to make nested enumeration clearer:

  Preparation Chop tomatoes Peel onions Mash potatoes Cooking Boil water Put tomatoes and onions in Blanch for 5 minutes Cleanup Throw away scraps Clean side Wash hands 

Which looks like this:

1. Preparation

   1. Chop tomatoes

   2. Peel onions

   3. Mash potatoes

2. Cooking

   1. Boil water

   2. Put tomatoes and onions in

   3. Blanch for 5 minutes

3. Cleanup

   1. Throw away scraps

   2. Clean side

   3. Wash hands
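A sketch of how numeration types might be combined in a nested list. The pairing of arabic for the outer list with loweralpha for the inner lists is an assumption; with it, the inner items would be labelled a, b, c. Only the first two steps are shown.

```xml
<!-- Outer/inner numeration values are assumed for illustration -->
<orderedlist numeration="arabic">
  <listitem>
    <para>Preparation</para>
    <orderedlist numeration="loweralpha">
      <listitem><para>Chop tomatoes</para></listitem>
      <listitem><para>Peel onions</para></listitem>
      <listitem><para>Mash potatoes</para></listitem>
    </orderedlist>
  </listitem>
  <listitem>
    <para>Cooking</para>
    <orderedlist numeration="loweralpha">
      <listitem><para>Boil water</para></listitem>
      <listitem><para>Put tomatoes and onions in</para></listitem>
      <listitem><para>Blanch for 5 minutes</para></listitem>
    </orderedlist>
  </listitem>
</orderedlist>
```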

One may also make the enumeration continue at lower nested levels by setting the continuation attribute to continues:

  Do this And this And this And this And this And this And this And this 

Which looks like this:

1. Do this

   1. And this

   2. And this

   3. And this

2. And this

   1. And this

   2. And this

   3. And this
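A sketch of where the continuation attribute goes, assuming it is placed on the second nested list so that its numbering carries on from the first nested list. How a given stylesheet renders this should be checked against its actual output.

```xml
<orderedlist>
  <listitem>
    <para>Do this</para>
    <orderedlist>
      <listitem><para>And this</para></listitem>
      <listitem><para>And this</para></listitem>
      <listitem><para>And this</para></listitem>
    </orderedlist>
  </listitem>
  <listitem>
    <para>And this</para>
    <!-- Assumed placement: continue numbering from the preceding nested list -->
    <orderedlist continuation="continues">
      <listitem><para>And this</para></listitem>
      <listitem><para>And this</para></listitem>
      <listitem><para>And this</para></listitem>
    </orderedlist>
  </listitem>
</orderedlist>
```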

Note: Some stylesheets may define that nested lists are of a different numeration by default.

#### 3.1.7. Some common inline elements

Some common inline elements and their output are shown below:

| Example | Displays as |
| --- | --- |
| <emphasis>Emphasised Text</emphasis> | Emphasised Text |
| <emphasis role="strong">A different type of emphasis</emphasis> | A different type of emphasis |
| <filename>blahblah.txt</filename> | blahblah.txt |
| <acronym>XML</acronym> | XML |
| <quote>blahblahblah</quote> | “blahblahblah” |

### 3.2. Including Images

Images are included in DocBook documents as illustrated below:

  

The overall encapsulating element is figure, the reference page for which can be found at http://www.docbook.org/tdg/en/html/figure.html. The figure contains a mediaobject element, which can occur on its own too and may contain audioobject, caption, imageobject, objectinfo, textobject and videoobject elements. The reference page for mediaobject is at http://www.docbook.org/tdg/en/html/mediaobject.html.

imageobject is the type of mediaobject used to include an image, and its reference page can be found at http://www.docbook.org/tdg/en/html/imageobject.html. The item within the imageobject that handles the image is imagedata; its reference page is at http://www.docbook.org/tdg/en/html/imagedata.html.

The idea behind mediaobject is to provide a way to include media in many formats. It becomes the document processor's job to decide which of the formats specified in the mediaobject to use in the particular output medium chosen. For example, the mediaobject element may contain a PNG format imageobject for HTML output and a TIFF format imageobject for print output. There may also be a textobject providing a description of the image for an output format that cannot display images, for example when the document is output in an audio format for people with sight problems.

One does not have to encapsulate the mediaobject in a figure object but doing so allows one to provide a title and be able to have the figure listed in a list of figures at the beginning of the document.

imagedata may be of the following formats:

The attribute format is thus required along with either fileref or entityref to reference the image:

  A frog 

One could use stylesheets such that, in HTML rendered output, the phrase used in the textobject becomes the alternative text of the image in the HTML. One can use multiple imageobjects for different output formats; for instance, one may have an EPS version of the image so that output can be generated with a processing chain that requires the image to be in this form. One could include a different image format for each of the desired output formats.

The imagedata element has the useful attributes align and valign. align specifies how the image should be aligned horizontally and can be set to center, left or right. valign specifies how the image should be aligned vertically and can be set to bottom, middle or top.
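Putting these pieces together, a figure might be marked up as in the following sketch. The file names frog.png and frog.eps are hypothetical, as is the choice of PNG and EPS as the two formats.

```xml
<figure>
  <title>A frog</title>
  <mediaobject>
    <!-- frog.png is a hypothetical file, e.g. for HTML output -->
    <imageobject>
      <imagedata fileref="frog.png" format="PNG" align="center"/>
    </imageobject>
    <!-- frog.eps is a hypothetical file, e.g. for print output -->
    <imageobject>
      <imagedata fileref="frog.eps" format="EPS" align="center"/>
    </imageobject>
    <!-- Fallback description for output that cannot show images -->
    <textobject>
      <phrase>A frog</phrase>
    </textobject>
  </mediaobject>
</figure>
```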

### 3.3. Tables

There are two elements used for placing tables inside a DocBook document, table and informaltable. The only difference between them is that the former requires a title and the latter does not.

  title . . . or . . . 

The table element has an attribute called frame which specifies how the table should be framed:

  frame="frametype" a1b1c1 a2b2c2a3b3c3 

Where frametype is replaced with one of all, bottom, none, sides, top or topbot:

The output above is PDF. With HTML, all the tables look the same as the one with attribute all, apart from the one with attribute none, which has no frame at all. The attributes colsep and rowsep are used to control whether lines should be drawn between columns and rows respectively:

  ... ... ... ... 

The generic layout for a table is as follows:

  title blah blah blah 

tgroup contains the rest of the table and must contain a tbody element, which holds the rows in the body of the table. The tbody element may be empty with the table being included in thead or tfoot, but this is not the intention. The thead and tfoot elements exist so that the stylesheets can apply different layouts to the header and the footer of the table respectively, so usually the first row would be wrapped in a thead element. tgroup has the mandatory attribute cols, which specifies the number of columns the table has.
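As a sketch of the overall structure, a three-column table with a header row might look like this. The title, header text and the particular frame, colsep and rowsep values are chosen for illustration only.

```xml
<!-- frame, colsep and rowsep values are illustrative -->
<table frame="all">
  <title>Example table</title>
  <tgroup cols="3" colsep="1" rowsep="1">
    <thead>
      <row>
        <entry>Column A</entry>
        <entry>Column B</entry>
        <entry>Column C</entry>
      </row>
    </thead>
    <tbody>
      <row>
        <entry>a1</entry>
        <entry>b1</entry>
        <entry>c1</entry>
      </row>
      <row>
        <entry>a2</entry>
        <entry>b2</entry>
        <entry>c2</entry>
      </row>
    </tbody>
  </tgroup>
</table>
```

Replacing table with informaltable and dropping the title element gives the untitled form.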

tgroup may also specify alignment of content via the align attribute, where alignment is either left, center or right:

  align="alignment" a2b2c2 

A row consists of a number of entry elements, entered in the sequence in which they should appear in the table row. For more information about the row element see http://www.docbook.org/tdg/en/html/row.html.

The entry element has some interesting attributes which allow an entry to span more than one column or row: namest and nameend, and morerows, respectively. The morerows attribute specifies how many more rows the entry it is applied to should span:

  <emphasis>morerows</emphasis> example a1b1c1b2c2b3c3 
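A sketch of a table in which the first entry spans three rows. The layout is inferred from the cell names in the example above.

```xml
<table frame="all">
  <title><emphasis>morerows</emphasis> example</title>
  <tgroup cols="3">
    <tbody>
      <row>
        <!-- a1 occupies its own row plus two more -->
        <entry morerows="2">a1</entry>
        <entry>b1</entry>
        <entry>c1</entry>
      </row>
      <row>
        <entry>b2</entry>
        <entry>c2</entry>
      </row>
      <row>
        <entry>b3</entry>
        <entry>c3</entry>
      </row>
    </tbody>
  </tgroup>
</table>
```

Rows two and three contain only two entries each, since the first column is already taken by a1.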

Unfortunately there is no morecolumns attribute; instead one has to use namest to specify the starting column of the entry and nameend to specify the ending column. The value given to each of these attributes is the name of a column. Columns are named using the colspec element; colspec elements are inserted inside tgroup, before thead, tbody and tfoot:

  column spanning a1a2b1c1a3b2c2 
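A sketch of column spanning, in which the first row holds a single entry stretching across all three columns. The column names a, b and c are hypothetical, and the layout is inferred from the cell names in the example above.

```xml
<table frame="all">
  <title>column spanning</title>
  <tgroup cols="3">
    <!-- Column names are hypothetical; any unique names will do -->
    <colspec colname="a"/>
    <colspec colname="b"/>
    <colspec colname="c"/>
    <tbody>
      <row>
        <!-- Spans from column a through column c -->
        <entry namest="a" nameend="c">a1</entry>
      </row>
      <row>
        <entry>a2</entry>
        <entry>b1</entry>
        <entry>c1</entry>
      </row>
      <row>
        <entry>a3</entry>
        <entry>b2</entry>
        <entry>c2</entry>
      </row>
    </tbody>
  </tgroup>
</table>
```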

More information about the entry element can be found at http://www.docbook.org/tdg/en/html/entry.html. Tables may be nested to a level of one, see http://www.docbook.org/tdg/en/html/entrytbl.html. For the entire source and output pertaining to the examples discussed in this section see Table Examples.