Metacircular thoughts

February 4, 2007

Scala makes XML processing easy

Filed under: Scala — metacircular @ 2:20 pm

Ruby and Python are great for a lot of data munging tasks, but one thing they, along with most other languages, suck at is XML. XML itself is not complicated, but the DOM is, and so is SAX to a lesser degree, at least compared to what I’m about to show you.

The basic thing to realize is that Scala has XML baked into the syntax. So we can fire up our interpreter and have XML be first-class values (and call methods on it), like so:


scala> val data = <data id="1">this is test data</data>
data: scala.xml.Elem = <data id="1">this is test data</data>

scala> data.text
line1: java.lang.String = this is test data

scala> data \\ "@id"
line2: scala.xml.NodeSeq = 1

scala> (data \\ "@id").text
line3: java.lang.String = 1

scala> <data>words here <child>more words</child></data> \\ "child"
line4: scala.xml.NodeSeq = <child>more words</child>

In the first line entered, we see that if we just type raw XML, Scala infers its type to be an XML element. The class Elem has a method text which retrieves the text content between the tags, as you can see in our REPL session. It also has the methods \ and \\ (like Lisp, Scala is very liberal in what you can name a method; many non-alphanumeric characters can be used for method names, and in fact the actors library, which implements Erlang-style concurrency as a library, has methods called ! and !?). The \ method retrieves the direct children of an element that match a name, while \\, which is what the session above uses, searches the element itself and all of its descendants. Giving either method an argument that starts with @ makes it retrieve attributes instead. If more than one node matches, we get back a sequence we can iterate over using sequence comprehensions (Scala’s way of doing foreach from languages like Ruby, Python, C#, and Java).
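To make the difference between \ and \\ concrete, here is a minimal sketch. The sample document and the names SelectorSketch, direct, deep, and id are made up for illustration, and it assumes the scala-xml library is on the classpath (on current Scala versions it ships as a separate module).

```scala
import scala.xml._

object SelectorSketch {
  // Hypothetical sample document, not taken from the post.
  val doc: Elem =
    <data id="1">
      <child>first</child>
      <group><child>nested</child></group>
      <child>second</child>
    </data>

  // \ looks only at the direct children of the element.
  val direct: List[String] = (doc \ "child").map(_.text).toList

  // \\ searches the element itself and all of its descendants.
  val deep: List[String] = (doc \\ "child").map(_.text).toList

  // A name starting with "@" selects an attribute instead of a child.
  val id: String = (doc \ "@id").text

  def main(args: Array[String]): Unit = {
    // The result is an ordinary sequence, so a for-comprehension works on it.
    for (c <- doc \ "child") println(c.text)
    println(deep.mkString(", "))
    println(id)
  }
}
```

Here direct only picks up the two top-level child elements, while deep also finds the one nested inside group.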

Hopefully you agree that this is a saner way of accessing XML than using the DOM or SAX; by comparison, the DOM looks way too design-y. One of the Scala people wrote a paper about this stuff which shows a verbose DOM example and then the equivalent Scala code; the difference is striking.

So, OK, let’s get a real example here which is still pretty easy to digest. Weblogs.com publishes a list of recently updated blogs. It so happens that about 80% of them are spam blogs, but that’s irrelevant for our purposes here. The file they publish looks like this:


<?xml version="1.0" encoding="UTF-8"?>
<weblogUpdates version="2" updated="Tue, 06 Feb 2007 04:31:00 GMT" count="2136465">
<weblog name="name" url="http://example.com" when="n" />
<weblog name="name2" url="http://example2.com" when="n" />
...
</weblogUpdates>

You can find the pings in the last hour here; it’s about a 10-15 MB file so you should probably right click, save as rather than loading it in your browser. The one I downloaded had about 160,000 entries and was about 16.5 MB in size. That’s an appreciable size.

So, let’s just do a simple processing example where we iterate over the pings (again, 80% of them are going to be from splogs), retrieving the URL each one refers to and incrementing a counter. A more realistic example would be to send the data to a database, but that would obscure the example too much. So here is some code to process the changes.xml file.


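import scala.xml._

// Updated to compile on current Scala, where scala.xml is a separate
// module (scala-xml) and the old Application trait and the
// "for (val x <- ...)" syntax no longer exist.
object processblogpings {
  def main(args: Array[String]): Unit = {
    val start = System.currentTimeMillis

    // read and parse the whole file into an in-memory tree
    val data = XML.loadFile("changes.xml")
    Console.println("Updated: " + (data \\ "@updated").text)

    var cnt = 0
    for (entry <- data \\ "weblog") {
      // extract the URL field but don't do anything with it
      val url = (entry \\ "@url").text
      cnt = cnt + 1
    }

    Console.println("Found " + cnt + " entries")
    val end = System.currentTimeMillis
    Console.println("Took " + (end - start) / 1000.0 + "s")
  }
}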

Well, that's really not too bad. We read in and parse the XML file in one line with a call to loadFile, then loop over the weblog elements and retrieve their attributes using the \\ method.
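The same traversal can also be written without the mutable counter. This is only a sketch: it runs against a tiny inline stand-in for changes.xml built from the two sample entries shown earlier, the names PingSummary, sample, urls, and count are hypothetical, and it assumes scala-xml is on the classpath.

```scala
import scala.xml._

object PingSummary {
  // Tiny stand-in for changes.xml, reusing the sample entries from the post.
  val sample: Elem =
    <weblogUpdates version="2" updated="Tue, 06 Feb 2007 04:31:00 GMT" count="2136465">
      <weblog name="name" url="http://example.com" when="n" />
      <weblog name="name2" url="http://example2.com" when="n" />
    </weblogUpdates>

  // Map each weblog element to the text of its url attribute.
  val urls: Seq[String] = (sample \ "weblog").map(e => (e \ "@url").text)

  // The number of entries is just the length of that sequence.
  val count: Int = urls.size

  def main(args: Array[String]): Unit = {
    urls.foreach(println)
    println("Found " + count + " entries")
  }
}
```

For the real file, sample would be replaced by the result of XML.loadFile("changes.xml").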

The above code took 4.2 seconds to process about 163,000 entries on my Pentium 4 2.8 GHz. That's pretty fast, especially considering the brevity we achieved, in my opinion. Now, because loadFile builds the whole document as a tree in memory, this approach might not work for files larger than what can fit in system memory, but that's a pretty rare situation. For most situations, we can take advantage of Scala's XML savvy to make life easier. Scala is, as far as I can tell, the ultimate XML processing language. There, I said it. It's powerful because we can use XML data in conjunction with all the other constructs of a modern programming language: closures, pattern matching, objects, and so on.
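For the rare file that really is too big for memory, one option is a streaming parser that hands you one event at a time. The sketch below swaps in the JDK's StAX API (javax.xml.stream) rather than anything from scala.xml, so it is a different technique from the one in this post. The names StreamingSketch, countWeblogs, sample, and sampleCount are hypothetical.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object StreamingSketch {
  // Count <weblog> elements that carry a url attribute, reading one
  // parser event at a time instead of building a tree in memory.
  def countWeblogs(in: InputStream): Int = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(in)
    var count = 0
    try {
      while (reader.hasNext) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT &&
            reader.getLocalName == "weblog") {
          // A null namespace means the namespace is not checked.
          val url = reader.getAttributeValue(null, "url")
          if (url != null) count += 1
        }
      }
    } finally reader.close()
    count
  }

  // Hypothetical stand-in for changes.xml.
  val sample: String =
    """<weblogUpdates version="2">
      |  <weblog name="name" url="http://example.com" when="n" />
      |  <weblog name="name2" url="http://example2.com" when="n" />
      |</weblogUpdates>""".stripMargin

  def sampleCount: Int =
    countWeblogs(new ByteArrayInputStream(sample.getBytes("UTF-8")))

  def main(args: Array[String]): Unit =
    println("Found " + sampleCount + " entries")
}
```

For the real file you would pass something like new java.io.FileInputStream("changes.xml") to countWeblogs. The price is that you lose the \ and \\ selectors and have to track the document structure yourself.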

I didn't even show any examples of doing pattern matching on XML (!). However, this should be enough to get you interested in looking further; the best source is probably the draft book on XML in Scala. Note that we can actually handle Unicode data, unlike some other functional languages which claim to process XML, say, OCaml or a variety of Common Lisp/Scheme distributions out there. Happy hacking!
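To give a taste of the pattern matching mentioned above, here is a minimal sketch, assuming Scala 2.13 with scala-xml on the classpath. The element names title and entry and the names PatternSketch, describe, a, and b are invented for the example.

```scala
import scala.xml._

object PatternSketch {
  // Match on the shape of a node: its label and its children.
  def describe(node: Node): String = node match {
    // an element named title with exactly one child node, bound to t
    case <title>{t}</title> => "title: " + t.text
    // an element named entry with any number of children, bound as a sequence
    case <entry>{children @ _*}</entry> =>
      "entry with " + children.count(_.isInstanceOf[Elem]) + " child elements"
    case _ => "something else"
  }

  val a: String = describe(<title>Hello</title>)
  val b: String = describe(<entry><a/><b/></entry>)

  def main(args: Array[String]): Unit = {
    println(a)
    println(b)
  }
}
```

XML patterns look at the label and the children only; to test an attribute you would add a guard using the \ method with an @ name.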

3 Comments »

  1. I don’t think Scala is the ultimate XML processing language, even though it has very nice features. XML is often used as a configuration/customization language, and most (if not all) XML processors/representation lack precise source location information for good error reporting. I wrote a blog entry regarding this issue: http://theschemeway.blogspot.com/2007/01/why-i-hate-xml-for-dsls.html

    Comment by schemeway — February 6, 2007 @ 11:01 am | Reply

  3. [...] queries in Scala, which probably makes my pattern matching on EvElemStart unnecessarily verbose. (Here’s a blog post on the xpath technique) Also, there was no particular reason for me to use pull parsing – push parsing might have [...]

    Pingback by Monomorphic — First steps with Scala: XML pull parsing — August 13, 2009 @ 9:41 pm | Reply

