Update: The Scala mailing list has interesting suggestions on any number of ways to improve this code. It is left as an exercise to the reader to evaluate the merits of each approach.
I’d like to revisit an old post on using the Jakarta Commons HTTP client. Specifically, I’d like to show how to do something approaching polite HTTP retrieval in Scala.
What does “polite” retrieval imply? It means using conditional GET and gzip compression to minimize the amount of bandwidth used when retrieving a resource. This is especially relevant if you’re going to write an RSS aggregator. A widely read post, Conditional GET for RSS Hackers, explains the rationale of conditional GET. When we retrieve something that changes from time to time, we keep track of the ETag and Last-Modified response header values and send them back on later requests in the If-None-Match and If-Modified-Since headers of the HTTP GET. If the resource hasn’t changed, we get a response code of 304 (Not Modified) and no body.
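As a rough sketch of the request side of conditional GET, the snippet below builds the two request headers from whatever validators we saved last time. It is written in current Scala syntax rather than the 2.3-era syntax used in the rest of this post, and the names ConditionalGet, Validators and conditionalHeaders are made up for illustration; they are not part of any library.

```scala
object ConditionalGet {
  // Hypothetical holder for the validators saved from the previous response.
  final case class Validators(etag: Option[String], lastModified: Option[String])

  // Build the conditional request headers; a validator we never received
  // simply produces no header.
  def conditionalHeaders(v: Validators): Map[String, String] = {
    val ifNoneMatch = v.etag.map(e => "If-None-Match" -> e).toList
    val ifModifiedSince = v.lastModified.map(d => "If-Modified-Since" -> d).toList
    (ifNoneMatch ++ ifModifiedSince).toMap
  }
}
```

On the very first request both fields are None, so no conditional headers are sent and the server answers with a full 200 response.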
Gzip compression has the server transfer the resource in compressed, gzipped form, which substantially reduces the bandwidth we use. We indicate that we can accept it by adding the header Accept-Encoding: gzip to our request; a server that compresses the body says so with Content-Encoding: gzip in its response.
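To get a feel for what gzip does to feed-like text, here is a small stand-alone round trip using only the JDK's java.util.zip classes. It is a sketch in current Scala syntax, not the code from this post; the object name GzipSketch and the choice of UTF-8 are my assumptions.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object GzipSketch {
  // Compress a string the way a server would before sending it
  // with Content-Encoding: gzip (UTF-8 is assumed here).
  def gzip(text: String): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(bytes)
    try out.write(text.getBytes(UTF_8)) finally out.close()
    bytes.toByteArray
  }

  // Decompress on the client side by wrapping the raw stream.
  def gunzip(compressed: Array[Byte]): String = {
    val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    try {
      val out = new ByteArrayOutputStream()
      val buf = new Array[Byte](4096)
      var n = in.read(buf)
      while (n != -1) {
        out.write(buf, 0, n)
        n = in.read(buf)
      }
      new String(out.toByteArray, UTF_8)
    } finally in.close()
  }
}
```

Repetitive markup such as an RSS feed compresses well, which is where the bandwidth saving comes from.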
Furthermore, it’s desirable to be able to set a custom timeout and a custom number of retries.
I haven’t seen any code samples out there that put all these techniques together for the Jakarta HTTP client, or even any showing how to request and decompress gzipped content over HTTP, so I’m going to go through the code to do that in Scala. This would carry over easily to Java. In fact, I have to warn you that the code is going to be ugly and Java-ish.
We start out by having a function httpget that takes in our parameters and returns (1) the response code (2) the body, if any (3) the ETag response header if any and (4) the Last-Modified response header if any.
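// imports needed by the code in this post
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.GZIPInputStream
import org.apache.commons.httpclient.{DefaultHttpMethodRetryHandler, HttpClient}
import org.apache.commons.httpclient.methods.GetMethod
import org.apache.commons.httpclient.params.HttpMethodParams

def httpget(uri: String, timeout: int, retry_cnt: int, etag: String,
            last_modified: String): {int, String, String, String} = {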
We declare an HttpClient instance, set the connection timeout as per our parameter, and set the retry count as well. We tell HttpClient to follow redirects and, in a key step, tell it that we will accept gzip-encoded content.
  val client = new HttpClient
  // our parameter is in seconds, HttpClient uses milliseconds
  client.getHttpConnectionManager.getParams
    .setConnectionTimeout(timeout*1000)
  val m = new GetMethod(uri)
  m.getParams.setParameter(HttpMethodParams.RETRY_HANDLER,
    new DefaultHttpMethodRetryHandler(retry_cnt, false))
  m.addRequestHeader("Accept-Encoding", "gzip")
  m setFollowRedirects true
Then we give it our ETag and Last-Modified data if we got any.
  if(etag != "")
    m.addRequestHeader("If-None-Match", etag)
  if(last_modified != "")
    m.addRequestHeader("If-Modified-Since", last_modified)
Next, we actually do the request and get the ETag/Last-Modified response headers.
  var code = 0
  try { code = client executeMethod m }
  catch { case e: Exception => Console.println("Error: " + e) }
  // XXX this is awkward
  var new_etag = ""
  if(m.getResponseHeader("ETag") != null)
    new_etag = m.getResponseHeader("ETag").getValue
  var new_lm = ""
  if(m.getResponseHeader("Last-Modified") != null)
    new_lm = m.getResponseHeader("Last-Modified").getValue
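One way to smooth out the null checks above, sketched in current Scala (Option's null-aware constructor is not available in Scala 2.3), is to wrap the possibly-null header in Option. FakeHeader below is a hypothetical stand-in for HttpClient's header object so that the snippet runs on its own; only its getValue member matters.

```scala
object HeaderSketch {
  // Hypothetical stand-in for the header object returned by getResponseHeader.
  final class FakeHeader(val getValue: String)

  // Option(h) is None when h is null, so the null check and the
  // empty-string default collapse into a single expression.
  def valueOrEmpty(h: FakeHeader): String =
    Option(h).map(_.getValue).getOrElse("")
}
```

Applied to the listing, and under the same assumption about the Scala version, the ETag lookup would read Option(m.getResponseHeader("ETag")).map(_.getValue).getOrElse("").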
Now we can finally deal with the response. If the server sent the body gzip-encoded, we wrap the response body stream in a Java class, GZIPInputStream. We then in turn wrap that in a BufferedReader (by way of an InputStreamReader) and read from the BufferedReader one line at a time in a loop. It’s kind of lousy but it gets the job done. Finally, we release the connection and return our data using the new tuple syntax in Scala 2.3.3.
  var resp = ""
  if(code == 200) {
    var str = m.getResponseBodyAsStream
    if(str != null) {
      val enc_hdr = m.getResponseHeader("Content-Encoding")
      // wrap the stream we already hold rather than fetching it again
      if (enc_hdr != null && enc_hdr.getValue.equalsIgnoreCase("gzip"))
        str = new GZIPInputStream(str)
      var data = new StringBuffer
      val buf = new BufferedReader(new InputStreamReader(str))
      // XXX this is also awkward
      var line = "ignore"
      while(line != null) {
        line = buf.readLine
        if(line != null) data.append(line)
      }
      str.close
      resp = data.toString
    }
  }
  m.releaseConnection
  {code, resp, new_etag, new_lm}
}
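A side effect of the readLine loop is that line terminators are dropped from the body, and the InputStreamReader uses the platform default charset. As a possible alternative, here is a sketch in current Scala that reads the whole stream in one go with scala.io.Source. The name readAll and the UTF-8 default are my assumptions; a real feed may declare a different encoding.

```scala
import java.io.InputStream
import scala.io.Source

object ReadAllSketch {
  // Read every character from the stream, line breaks included,
  // decoding with the given charset (UTF-8 is an assumed default).
  def readAll(in: InputStream, charset: String = "UTF-8"): String = {
    val src = Source.fromInputStream(in, charset)
    try src.mkString finally src.close()
  }
}
```

The stream passed in can be the plain response stream or the GZIPInputStream wrapper; readAll does not care which.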
Something about this code smells but I’m not sure what to do with it; I’m pretty new to the JVM. If any old-time Java programmers want to get some Bileblog-esque Java frustration out and tell me exactly where I’m going wrong, I’d appreciate it.
OK, let’s write a driver function to call this code so we can experiment with it. This should be self-explanatory.
def main(args: Array[String]) {
  if(args.length < 3) {
    System.err.println("Usage: scala -Dlog4j.ignoreTCL politehttpget <URL> <ETag> <Last-Modified>")
    exit(1)
  }
  Console.println("Retrieving " + args(0))
  val uri = args(0)
  val site_etag = args(1)
  val site_lm = args(2)
  Console.println("Using ETag: " + site_etag)
  Console.println("Using Last-Modified: " + site_lm)
  // 10 second timeout, up to 3 retries
  httpget(uri, 10, 3, site_etag, site_lm) match {
    case {n, data, etag, lm} =>
      Console.println("Got response code " + n)
      Console.println("Read data of length " + data.length)
      if(etag.length > 0)
        Console.println("ETag: " + etag)
      if(lm.length > 0)
        Console.println("Last modified: " + lm)
  }
}
The -Dlog4j.ignoreTCL is to make log4j not output annoying, unintelligible errors. I don’t know why I have to do that and I don’t care, I just know that doing that makes it shut the fuck up. Here are some examples of calling the program.
$ scala -Dlog4j.ignoreTCL politehttpget http://metacircular.wordpress.com/feed "" ""
Retrieving http://metacircular.wordpress.com/feed
Using ETag:
Using Last-Modified:
Got response code 200
Read data of length 42749
ETag: "0c13d35331869cd6df1d3add0b25f3c3"
Last modified: Tue, 06 Feb 2007 23:01:32 GMT
$ scala -Dlog4j.ignoreTCL politehttpget http://metacircular.wordpress.com/feed "\"0c13d35331869cd6df1d3add0b25f3c3\"" "Tue, 06 Feb 2007 23:01:32 GMT"
Retrieving http://metacircular.wordpress.com/feed
Using ETag: "0c13d35331869cd6df1d3add0b25f3c3"
Using Last-Modified: Tue, 06 Feb 2007 23:01:32 GMT
Got response code 304
Read data of length 0
ETag: "0c13d35331869cd6df1d3add0b25f3c3"
Last modified: Tue, 06 Feb 2007 23:01:32 GMT
Notice how when we first retrieve the feed for this blog, we get back an ETag and a Last-Modified. When we make the same request again but supply that data to the WordPress server, we get a 304 response code, indicating that, unsurprisingly, the feed hasn’t changed since we made the first request 15 seconds earlier.
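In an aggregator you would do this on every poll: remember the body and validators from the last 200, and keep them when a 304 comes back. A minimal sketch of that bookkeeping follows, in current Scala syntax with parenthesized tuples. Cached and update are hypothetical names, and the 4-tuple simply mirrors what httpget returns.

```scala
object FeedCacheSketch {
  // Hypothetical record of what we last saw for one feed.
  final case class Cached(body: String, etag: String, lastModified: String)

  // result mirrors httpget's return value: (code, body, etag, lastModified).
  def update(prev: Cached, result: (Int, String, String, String)): Cached =
    result match {
      // Fresh content: replace the body and remember the new validators.
      case (200, body, etag, lm) => Cached(body, etag, lm)
      // 304 Not Modified, or any error: keep what we already have.
      case _ => prev
    }
}
```

The etag and lastModified fields of the returned value are what get passed back into the next httpget call.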
Hopefully this was useful to you. Happy hacking.