XML, SAX, and DOM

======================================================================

IBM developerWorks XML Tip

July 2, 2002

Vol. 2, Issue 27

 

IBM's resource for developers

http://www-106.ibm.com/developerworks/?nx-722

======================================================================

TIP: STOP A SAX PARSER WHEN YOU HAVE ENOUGH DATA -

Use SAX data without having to parse the entire document

 

 Nicholas Chase (nicholas@nicholaschase.com)

 President, Chase and Chase, Inc.

 

Hello, XML Tip readers,

A SAX parser can be instructed to stop midway through a document

without losing the data already collected. This is one of the most

commonly mentioned advantages of a SAX parser over a DOM parser, which

generally creates an in-memory structure of the entire document. In

this tip, you'll parse a list of recently updated weblogs, stopping

when you've displayed all those within a particular time range.

 

For the rest of the tip, read on below.

 

Until next week,

XML Tip team at IBM developerWorks

dWnews@us.ibm.com

 

Note: This tip uses JAXP. The classes are also part of the Java 2 SDK

1.4, so if you have 1.4 installed, you don't need any additional

software. You can download the source file for this article at

http://www.weblogs.com/changes.xml.

______________________________________________________________________

HOW A SAX PARSER WORKS

 

The Simple API for XML (SAX) is an event-based API. It examines an XML

file, character by character, and translates it into a series of

events, such as startDocument() and endElement(). A ContentHandler

object processes these events, taking appropriate action. An

ErrorHandler object takes care of any warnings or errors that arise

during the parsing. The main application (see Listing 1) assigns these

objects to the XMLReader object:

----------------------------------------------------------------------

Listing 1. The main application

----------------------------------------------------------------------

import org.xml.sax.helpers.XMLReaderFactory;

import org.xml.sax.XMLReader;

import org.xml.sax.SAXException;

import org.xml.sax.InputSource;

import java.io.IOException;

 

public class MainSaxApp {

 

  public static void main (String[] args){

 

   try {

  

     String parserClass = "org.apache.crimson.parser.XMLReaderImpl";

     XMLReader reader = XMLReaderFactory.createXMLReader(parserClass);

 

     WeblogHandler content = new WeblogHandler();

     ErrorProcessor errors = new ErrorProcessor();

 

     reader.setContentHandler(content);

     reader.setErrorHandler(errors);

 

     InputSource file = new InputSource("changes.xml");

     reader.parse(file);

 

   } catch (IOException ioe) {

       System.out.println("IO Exception: "+ioe.getMessage());

   } catch (SAXException se) {

       System.out.println(se.getMessage());

   }

 

  }

 

}

 

The reader's parse() method reads the file and reports each event to

the content object, which then deals with it.
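
To make the event stream concrete, here is a minimal sketch that
records the events reported for a tiny in-memory document. The
EventTrace class and its trace() method are hypothetical names, not
part of this tip's source files, and the sketch assumes the JAXP
SAXParserFactory that ships with the JDK rather than naming the
Crimson parser class directly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class (not part of the tip's source files).
class EventTrace extends DefaultHandler {

    // Event names, in the order the parser reports them.
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        events.add("startElement:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement:" + qName);
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    // Parses an in-memory document and returns the recorded event names.
    static List<String> trace(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EventTrace handler = new EventTrace();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<weblogUpdates><weblog name=\"A\" when=\"28\"/></weblogUpdates>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```

Each callback fires as soon as the parser reaches the corresponding
piece of markup, which is why a handler can act on early elements
before the rest of the file has been read.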

______________________________________________________________________

THE HANDLERS

 

For this application, all of the work will be done by the WeblogHandler

object, which processes the XML file. The changes.xml file itself is

fairly simple, with all of the actual data contained in attributes:

----------------------------------------------------------------------

Listing 2. A portion of the data file

----------------------------------------------------------------------

<?xml version="1.0"?>

<weblogUpdates version="1" updated="Sat, 15 Jun 2002 22:25:06 GMT"

                       count="592697">

   <weblog name="Enigmatic Mermaid"

           url="http://pombostrans.blogspot.com" when="28"/>

   <weblog name="The Vanguard Science Fiction Report"

           url="http://www.vanguardreport.com" when="852"/>

   <weblog name="Flummox.com" url="http://www.flummox.com/"

           when="10713"/>

</weblogUpdates>

 

This is just a snippet of the actual file, but it shows the structure:

Attributes include the name, the URL, and the time since the weblog was

updated, in seconds. The content handler takes some of that information

and outputs it to the window:

----------------------------------------------------------------------

Listing 3. The content handler

----------------------------------------------------------------------

import org.xml.sax.helpers.DefaultHandler;

import org.xml.sax.Attributes;

 

public class WeblogHandler extends DefaultHandler

{

   public WeblogHandler ()

   {

     super();

   }

 

   int numLogs = 0;

   public void startElement (String namespaceUri, String localName,

       String qualifiedName, Attributes attributes) {

  

     if (localName.equals("weblog")) {

         String logName = attributes.getValue("name");

         String secsAgo = attributes.getValue("when");

         numLogs = numLogs + 1;

         System.out.println(numLogs + ") " + logName + " updated "

                                         + secsAgo + " seconds ago.");

     }

   }

 

   public void endDocument(){

       System.out.println();

       System.out.println("All recorded logs displayed.");

       System.out.println("More may have been updated within"

                             + " the appropriate timeframe.");

   }

 

}

 

In this case the error handler is trivial, simply alerting you to the

presence of an error or warning. The source files include the file in

its entirety.
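
Since the error handler is not reproduced in this tip, the following
is only a sketch of what a trivial one might look like, assuming it
implements org.xml.sax.ErrorHandler and just prints each problem. The
describe() helper is a hypothetical addition, and the real
ErrorProcessor in the source files may differ.

```java
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical stand-in for the ErrorProcessor class used in Listing 1;
// the real class is in the tip's source files and may differ.
class ErrorProcessor implements ErrorHandler {

    // Formats the severity, position, and message of a reported problem.
    static String describe(String level, SAXParseException e) {
        return level + " at line " + e.getLineNumber()
                + ", column " + e.getColumnNumber() + ": " + e.getMessage();
    }

    @Override
    public void warning(SAXParseException e) {
        System.out.println(describe("Warning", e));
    }

    @Override
    public void error(SAXParseException e) {
        System.out.println(describe("Error", e));
    }

    // A fatal error means the document is not well-formed, so the
    // exception is rethrown to end the parse.
    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        System.out.println(describe("Fatal error", e));
        throw e;
    }
}
```

Rethrowing from fatalError() is a design choice in this sketch: the
exception then surfaces in the catch block of main(), just like any
other SAXException.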

______________________________________________________________________

RUNNING THE APPLICATION

 

When you actually run MainSaxApp, all of the data in changes.xml is

passed through to content, which outputs the appropriate information,

as seen in Figure 1 in the full text of this tip on the web

(see Links to other good stuff, below).

 

Notice that in Figure 1, the entire file has been parsed, as evidenced

by the execution of the endDocument() method.

______________________________________________________________________

STOPPING THE PARSER

 

As you can see, a significant number of weblogs have been updated in

the three-hour period that changes.xml tracks. Suppose that you want to

allow the user to enter a number of seconds representing the interval

in which he or she is interested. To do that, you'll look at the first

argument on the command line, passing it in to the content object.

(You'll look at the corresponding changes to WeblogHandler.java in

a moment.)

----------------------------------------------------------------------

Listing 4. Changes to MainSaxApp.java

----------------------------------------------------------------------

...

     String parserClass = "org.apache.crimson.parser.XMLReaderImpl";

     XMLReader reader = XMLReaderFactory.createXMLReader(parserClass);

 

     WeblogHandler content = new WeblogHandler();

     int numSecs = Integer.parseInt(args[0]);

     content.setNumSecs(numSecs);

 

     ErrorProcessor errors = new ErrorProcessor();

 

     reader.setContentHandler(content);

     reader.setErrorHandler(errors);

...

 

Of course, these changes won't mean anything unless you change the

WeblogHandler class:

----------------------------------------------------------------------

Listing 5. Changes to WeblogHandler.java

----------------------------------------------------------------------

import org.xml.sax.helpers.DefaultHandler;

import org.xml.sax.Attributes;

import org.xml.sax.SAXException;

 

public class WeblogHandler extends DefaultHandler

{

  public WeblogHandler ()

  { super(); }

 

  //-------------

  //UTILITY METHODS

  //-------------

  int numSecs = 0;

  public void setNumSecs(int arg) {

   numSecs = arg;

  }

 

  //-------------

  //EVENT METHODS

  //-------------

  int numLogs = 0;

  public void startElement (String namespaceUri, String localName,

     String qualifiedName, Attributes attributes)

           throws SAXException {

  

     if (localName.equals("weblog")) {

       String logName = attributes.getValue("name");

       String logURL = attributes.getValue("url");

      

       int secsAgo = Integer.parseInt(attributes.getValue("when"));

       if (secsAgo > numSecs) {

           // Throwing from a handler method aborts the parse.

           throw new SAXException("\nLimit reached after "

                                     + numLogs + " entries.");

       } else {

           numLogs = numLogs + 1;

           System.out.println(numLogs + ") " + logName + " updated "

                                     + secsAgo + " seconds ago.");

       }

     }

   }

 

   public void endDocument(){

       System.out.println();

       System.out.println("All recorded logs displayed.");

       System.out.println("More may have been updated within"

                             + " the appropriate timeframe.");

   }

 

}

 

First, add the setNumSecs() method for the argument. Next, retrieve the

when attribute as an int rather than as a String. Fortunately,

changes.xml is sorted based on the when attribute, so all you have to

do is compare the current secsAgo to numSecs; if secsAgo exceeds

numSecs, you want to stop parsing.

 

In order to stop parsing, you throw a new SAXException, creating it

with a message that includes the number of logs processed so far. So

what happens when you run it?

______________________________________________________________________

RUNNING THE NEW APPLICATION

 

Now, if you run the new application with an argument of, say, five

minutes (for example, using java MainSaxApp 300) you can see the

difference, as shown in Figure 2 in the full text of this tip

on the web (see Links to other good stuff, below).

 

So what is actually happening here? You entered an argument of 300

seconds, so when the first weblog that was updated more than 300

seconds ago is reached, the startElement() method throws the

SAXException. Because startElement() has no try-catch block for that

exception, it propagates to the caller, which is the reader's parse()

method invoked from MainSaxApp. Nothing catches it there either, so it

reaches MainSaxApp's main() method, where the try-catch block prints

the exception's message.

 

The main point is this: Because the application threw the exception,

the parser stopped -- as evidenced by the fact that the endDocument()

method was never executed -- but you still had all of the information

it had already encountered.
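
The behavior described above can be checked without downloading
changes.xml. The sketch below is a condensed, self-contained variant of
the technique, run against three in-memory entries shaped like
Listing 2. The StopEarlyDemo class, its fields, and the run() method
are hypothetical names, and it assumes the JDK's default JAXP parser
instead of Crimson.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names below are not from the source files.
class StopEarlyDemo extends DefaultHandler {

    // Sample data in the shape of Listing 2, sorted by "when".
    static final String SAMPLE =
          "<weblogUpdates>"
        + "<weblog name=\"Enigmatic Mermaid\" when=\"28\"/>"
        + "<weblog name=\"The Vanguard Science Fiction Report\" when=\"852\"/>"
        + "<weblog name=\"Flummox.com\" when=\"10713\"/>"
        + "</weblogUpdates>";

    final int numSecs;
    int numLogs = 0;                // entries accepted before stopping
    boolean sawEndDocument = false; // stays false if the parse is aborted
    String stopMessage = null;      // message of the aborting exception

    StopEarlyDemo(int numSecs) {
        this.numSecs = numSecs;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (!"weblog".equals(qName)) {
            return;
        }
        int secsAgo = Integer.parseInt(atts.getValue("when"));
        if (secsAgo > numSecs) {
            // Throwing from the handler is what stops the parser.
            throw new SAXException(
                "Limit reached after " + numLogs + " entries.");
        }
        numLogs++;
    }

    @Override
    public void endDocument() {
        sawEndDocument = true;
    }

    // Parses xml with the given limit and returns the handler's state.
    static StopEarlyDemo run(String xml, int numSecs)
            throws IOException, ParserConfigurationException {
        StopEarlyDemo handler = new StopEarlyDemo(numSecs);
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException se) {
            handler.stopMessage = se.getMessage();
        }
        return handler;
    }

    public static void main(String[] args) throws Exception {
        StopEarlyDemo result = run(SAMPLE, 1000);
        System.out.println("Entries counted: " + result.numLogs);
        System.out.println("endDocument() called: " + result.sawEndDocument);
        System.out.println("Message: " + result.stopMessage);
    }
}
```

With a limit of 1000 seconds the third entry triggers the exception, so
the count collected so far survives in the handler even though the
parse never reaches the end of the document.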

______________________________________________________________________

NEXT STEPS

 

This tip demonstrates a simple application that includes a SAX parser

that stops when it encounters a particular condition. Here, you have

simply used a generic SAXException, but there's nothing to stop you

from creating your own exceptions for different business conditions

and building their use into your logic. (You'd also want to perform

a lot more error checking when using the command-line argument!)
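
As one possible direction, the sketch below combines both suggestions:
a dedicated exception type for the "limit reached" condition, so that
main() can tell a deliberate stop from a genuine parse error, and a
stricter check of the command-line argument. Every name in it
(NextStepsSketch, LimitReachedException, parseSeconds(), parse()) is
hypothetical, and it assumes the JDK's default JAXP parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; none of these names come from the source files.
class NextStepsSketch {

    // Signals a deliberate stop rather than a problem with the document.
    static class LimitReachedException extends SAXException {
        final int entriesSeen;

        LimitReachedException(int entriesSeen) {
            super("Limit reached after " + entriesSeen + " entries.");
            this.entriesSeen = entriesSeen;
        }
    }

    // Validates the single command-line argument before it is used.
    static int parseSeconds(String[] args) {
        if (args.length != 1) {
            throw new IllegalArgumentException(
                "Usage: java MainSaxApp <seconds>");
        }
        int seconds;
        try {
            seconds = Integer.parseInt(args[0].trim());
        } catch (NumberFormatException nfe) {
            throw new IllegalArgumentException(
                "Not a whole number: " + args[0]);
        }
        if (seconds < 0) {
            throw new IllegalArgumentException(
                "Seconds must not be negative: " + seconds);
        }
        return seconds;
    }

    // Parses xml and reports how the parse ended.
    static String parse(String xml, final int numSecs)
            throws IOException, ParserConfigurationException {
        DefaultHandler handler = new DefaultHandler() {
            int numLogs = 0;

            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes atts) throws SAXException {
                if (!"weblog".equals(qName)) {
                    return;
                }
                if (Integer.parseInt(atts.getValue("when")) > numSecs) {
                    throw new LimitReachedException(numLogs);
                }
                numLogs++;
            }
        };
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return "complete";
        } catch (LimitReachedException lre) {
            // The more specific catch must come before SAXException.
            return "stopped after " + lre.entriesSeen;
        } catch (SAXException se) {
            return "error: " + se.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<weblogUpdates><weblog name=\"A\" when=\"28\"/>"
                   + "<weblog name=\"B\" when=\"852\"/></weblogUpdates>";
        System.out.println(parse(xml, parseSeconds(args)));
    }
}
```

Carrying the count in a field of the exception, rather than only in
its message, lets the caller use the number without parsing text.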

 

======================================================================

LINKS TO OTHER GOOD STUFF

 

::: IBM developerWorks XML Zone :::

http://www-106.ibm.com/developerworks/xml/?nx-722

 

::: Resources related to this tip :::

http://www.ibm.com/developer/library/x-tipsaxstop/index.html#resources

 

::: Full text of this tip on the Web :::

http://www.ibm.com/developer/library/x-tipsaxstop/index.html/?nx-722

 

::: Index of other XML tips :::

http://www-106.ibm.com/developerworks/library/x-tips.html?nx-722

 

::: Most recent issue of the IBM developerWorks newsletter :::

http://www.ibm.com/developerworks/newsletter/dwte062702.html?nx-722

 

======================================================================

ABOUT THIS NEWSLETTER

Created by IBM developerWorks (http://www.ibm.com/developerworks/)

Delivered by Topica (http://www.topica.com/tep/index.html)

======================================================================

Subscribe: http://www-106.ibm.com/developerworks/newsletter/?n-about

Unsubscribe: http://ibm.email-publisher.com/u/?a84vCg.bacXyq

Get help: mailto:customersupport@ibmdw.email-publisher.com

Send comments:

http://www-105.ibm.com/developerworks/newcontent.nsf/dW_feedback/

IBM's privacy policy: http://www.ibm.com/privacy/

IBM's copyright and trademark information:

http://www.ibm.com/legal/copytrade.phtml

 

THIS NEWSLETTER IS FOR INFORMATION ONLY. This newsletter should not

be interpreted to be a commitment on the part of IBM, and, after the

publication date, IBM cannot guarantee the accuracy of any information

presented. You may copy and distribute this newsletter, as long as:

 

1. All text is copied without modification and all pages are included.

2. All copies contain IBM's copyright notice and any other notices

provided therein.

3. This document is not distributed for profit.

 

 


 Copyright © 2000-2004 Jean-Marc Gulliet. All Rights Reserved.
Updated on Wednesday, July 10, 2002 @ 04:51:27 PM