validating XHTML+RDFa markup with Ruby
August 17, 2009
I’m working on a project right now where I’m running Cucumber tests. As part of my Cucumber features I’m checking the validity of the XHTML markup with a special step definition sprinkled throughout my scenarios. The markup_validity gem performs the actual validation. When I started adding RDFa to my pages, the markup of course no longer validated. I had the choice of either forgoing validation as part of my Cucumber tests and testing markup validity some other way, or adding RDFa support to markup_validity. It was simple enough to fork the markup_validity GitHub project, add support for XHTML+RDFa, and get my tests passing again. After a few corrections, Aaron Patterson (of Nokogiri fame) accepted my changes into trunk, re-released the gem and added me as a collaborator on the GitHub project.
If you need to validate your XHTML+RDFa markup as part of your test/unit or rspec tests, please give it a try and let me know if it works for you:
gem install markup_validity
Internet Archive and just my timing
May 26, 2008
After Jonathan Rochkind’s post about the Internet Archive not providing an API, I spent part of the weekend writing a screen scraper to get at what we want from the Internet Archive for the Umlaut. Basically it uses the OpenURL metadata the Umlaut takes in to search Internet Archive by ISBN, and failing that, fall back on a title-author search.
And it was a real pain. For a user interface the Internet Archive site does some nice things like highlighting query terms. When it comes to screen scraping the page it is not so nice. I had to strip out a lot of spans used for these terms.
Now the spans that could have been helpful weren’t there. For instance a span with a good id around the creator and description would have been nice. Instead I had to determine if the item had a creator (many don’t). All that separated the creator from the description was a line break tag. Hpricot was fine for this, and luckily the documentation seemed to have improved since I last used this library (or I just understand more now). This lack of good spans with ids isn’t just bad for screen scraping but for user defined CSS styles as well. (So in this way it could be an accessibility problem for some users?)
The thing that really got me was that I couldn’t reliably search by ISBN. Their advanced search had an “isbn” field, but I couldn’t find an ISBN that could be found that way. If there was no way to test a search I didn’t see much sense in writing something that relied on it. Instead I rigged it to just use a simple keyword query to catch ISBNs. This would pick up ISBNs in fields like the description. The problem was that the ISBNs were not normalized. Sometimes ISBNs would show up without dashes (0860133923) and sometimes with (2-9527671-0-6). The method to insert dashes had to handle both 10 and 13 digit ISBNs. So my service tried searching for the ISBN both with and without dashes before falling back on the title-author query.
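The search order described above can be sketched roughly as follows. This is a minimal sketch and not the actual Umlaut service code: all of the method names are made up for illustration. It only strips dashes and verifies the check digit. Inserting dashes correctly needs the ISBN range tables, so that step is left out.

```ruby
# Hypothetical helpers for illustration, not the real Umlaut service code.

# Strip dashes and whitespace so "2-9527671-0-6" becomes "2952767106".
def strip_isbn(raw)
  raw.to_s.gsub(/[\s-]/, "").upcase
end

# Verify the check digit of an already stripped ISBN-10 or ISBN-13.
def valid_isbn?(isbn)
  case isbn
  when /\A\d{9}[\dX]\z/
    # ISBN-10: weights 10 down to 1, total must be divisible by 11.
    sum = isbn.chars.each_with_index.sum do |char, i|
      (char == "X" ? 10 : char.to_i) * (10 - i)
    end
    (sum % 11).zero?
  when /\A\d{13}\z/
    # ISBN-13: weights alternate 1 and 3, total must be divisible by 10.
    sum = isbn.chars.each_with_index.sum do |char, i|
      char.to_i * (i.even? ? 1 : 3)
    end
    (sum % 10).zero?
  else
    false
  end
end

# Queries to try, in order: the ISBN as given, the ISBN without dashes,
# then the title-author fallback.
def search_queries(isbn:, title:, author:)
  queries = []
  stripped = strip_isbn(isbn)
  queries.push(isbn, stripped) if valid_isbn?(stripped)
  queries << "#{title} #{author}".strip
  queries.reject(&:empty?).uniq
end

p search_queries(isbn: "2-9527671-0-6", title: "Some Title", author: "Some Author")
```

Skipping the ISBN queries when the check digit fails avoids wasting searches on mistyped identifiers and goes straight to the title-author fallback.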
In the end it was good enough and I was happy that I was able to write my first Umlaut service. Writing the service I only had to create one file and edit two simple config files to enable the new service. Because of how the Umlaut is architected I could return my results to the view without writing any new controller or view code. I just had to provide a couple standard service methods. Then I had to learn how to add my service response and tell a listener that the service had completed whether it returned any results or not. And because of the use of background services, I could make several searches (actually multiple ISBN searches with a fallback on several title-author searches) and not worry too much about how long it took. When it finished, any results would show up in the view via some AJAX magic.
One other thing I learned made writing this very worthwhile. I have been using Netbeans for some time now on my Ruby projects. I finally learned how to use the integrated Ruby debugger. Saved me so much time in this case, since puts debugging certainly wouldn’t work well in a Rails project.
And then…
Today I was cleaning out some email and decided to try a link I was given to where you used to be able to query Internet Archive and get an XML response. I thought I would try this link again even though I’d tried it on Friday. Well, to my surprise it’s back.
They call it an Advanced XML Search (for Admins and Curators). A couple nice things about this. It simply exposes their Solr index. I can have Solr do my relevancy ranking and sorting. And I can use solr-ruby to make connection and query creation super easy. It can return much more metadata about each object found.
I used the web interface to create this query that returns pretty printed XML, but if you’re familiar with Solr the link should look very familiar.
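For anyone who hasn’t seen a Solr query link, here is a rough sketch of how one is put together. The endpoint below is a placeholder (the real Internet Archive URL isn’t reproduced here) and the field names are assumptions. Only q, fl, rows, sort and indent are standard Solr parameters.

```ruby
require "uri"

# Placeholder endpoint: stands in for the real search URL.
SEARCH_ENDPOINT = "http://example.org/solr/select"

# Build a Solr-style query URL. The default field names are assumptions.
def solr_query_url(query, fields: ["identifier", "title"], rows: 10, sort: nil)
  params = {
    "q"      => query,            # the search itself, e.g. a fielded query
    "fl"     => fields.join(","), # which stored fields to return
    "rows"   => rows,             # how many results to return
    "indent" => "on"              # ask Solr for pretty printed output
  }
  params["sort"] = sort if sort
  "#{SEARCH_ENDPOINT}?#{URI.encode_www_form(params)}"
end

puts solr_query_url("isbn:0860133923", rows: 5)
```

Letting URI.encode_www_form do the escaping keeps colons, quotes and spaces in the query from breaking the link.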
So it looks like I’ll be going back to the IDE to rewrite the Internet Archive service using this Solr interface. Ought to make things a lot less fragile than screen scraping. Even still, screen scraping is less fragile than a disappearing API, so I’ll likely just commit the service as written in case there are any other surprises from the Internet Archive.
ye olde booke catalogue
January 25, 2008
I recently launched my first web application. Handlist takes a file of MARC records and turns them into an alphabetically sorted book catalog of titles, authors and subject headings.
It was mostly an excuse to use merb and learn some web programming, but the gem that I wrote behind it was useful for me in a part-time job, so maybe it will be useful for someone else.
If you have a file of some MARC records handy (and less than 2MB) could you give it a test? What could I have done differently? Since I’m new to this I’m all ears. Hopefully soon I’ll post about some of the things I learned while working on this project.
rubygems without sources gem
November 20, 2007
If you uninstall the sources gem that rubygems relies on you’ll get an awful error if you try to run gem. And since sources-0.0.1 is probably best installed as a gem well… you get the picture.
If you accidentally uninstall the sources gem, create a file named sources.rb and place the following inside:
# sources.rb: stand-in for the missing sources gem
module Gem
  @sources = ["http://gems.rubyforge.org"]

  def self.sources
    @sources
  end
end
I was then able to reinstall the sources gem and be back in business.
little testing machines
August 30, 2007
I’ve been updating the xisbn rubygem to use the new api. I’ve got a little project where I could use some of the metadata that can now be returned. Ed Summers let me go off on my own experimental branch (and I see now, added me to the project–thanks, Ed).
I like what I’ve done with the first draft. I don’t know if it is what most rubyists would call beautiful code, but it works. And it doesn’t just work: it passes tests. This is the first time I’ve written unit tests as I’ve gone along. I’ve written tests for already existing code and that was good, but the cycle of write a method, test, repeat probably made a better initial product.
I’m pleasantly surprised at how much faster I was able to get this done because I wrote tests. No guessing that something works or writing ad hoc little programs to test it.
Often I’d write a test and inadvertently write it wrong, so the test failed. For instance, in one case I needed an array to be passed in and instead I used a string. OK, go back and write some duck typing and raise an exception for that case, then write a test to make sure it works. I wrote the method and still got it wrong, so certainly someone approaching the library fresh will make similar mistakes, and it’s good in this case that there’s a helpful error to direct them. So every mistake I make can be an opportunity for another test. Makes me more calm about making mistakes.
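As a rough illustration of that kind of check, here is a sketch of a method that duck-types its argument and raises a helpful error when handed a string. The method name is made up for this example and is not part of the xisbn gem.

```ruby
# Hypothetical method, not the xisbn gem's API.
# Accepts anything list-like (anything that responds to #map) and
# raises a descriptive error for everything else, such as a bare String.
def strip_hyphens(isbns)
  unless isbns.respond_to?(:map)
    raise ArgumentError, "expected a list of ISBNs, got #{isbns.class}"
  end
  isbns.map { |isbn| isbn.to_s.delete("-") }
end

# Passing a String by mistake now fails loudly instead of misbehaving.
begin
  strip_hyphens("0860133923")
rescue ArgumentError => e
  puts e.message
end
```

Checking respond_to?(:map) rather than the class means arrays, sets and other collections all work, while a lone string gets the error.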
There’s more work I need to do on xisbn for sure and some little convenience methods I’d like to add as well as documentation, but I’m happy with what I’ve done so far and what I learned about testing.
I can also imagine that this won’t be the last change OCLC makes to their xisbn service, so having tests in place should be able to tell me right away what needs to be rewritten.
This was also a test of sorts for me. xisbn is the first little api I’ve taken a look at and tried to translate into code. I think the documentation was a great help especially in writing the tests.
zcc 0.1.0
July 24, 2007
I just released version 0.1.0 of zcc, my Ruby copy cataloging script. Lots of improvements and new features. Find out more here:
http://zcc.rubyforge.org/zcc.html
I’m really happy with the progress I’ve made with this release. Finally I think I’m starting to get Object Oriented programming. I rewrote much of the code to create reusable classes in the hopes that I could play with some of the same code with a simple Rails based application.
Rube Goldberg machines
July 13, 2007
I was trying to tackle a little problem. There is the right way to do it, and then a way that might just work.
ruby-marc doesn’t do character conversion from MARC-8 to UTF-8. This is a problem as most MARC records are still trapped in MARC-8 character encoding. It’s particularly a problem for something I’d like to add to my copy cataloging script.
Right now I use ruby-zoom to grab records from Z39.50 targets. These z-targets, even if they store their information as UTF-8, still usually present their records to the world as MARC-8 encoded records. In normal workflow this isn’t a problem. As a C-binding to ZOOM, ruby-zoom does character conversion through YAZ. I can grab a record and, besides checking leader byte 9 to double check the current character set, ruby-zoom does the work of conversion.
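The leader check mentioned above is small enough to sketch. In MARC 21, position 9 of the leader is the character coding scheme: “a” means UCS/Unicode and a blank means MARC-8. The method name and the sample leader here are made up for illustration. With ruby-marc the leader string would come from the record’s leader.

```ruby
# Hypothetical helper: report the character coding scheme named in a
# MARC 21 leader. Position 9 is "a" for UCS/Unicode, blank for MARC-8.
def marc_encoding(leader)
  case leader.to_s[9, 1]
  when "a" then :utf8
  when " " then :marc8
  else :unknown # leader too short or an unexpected value
  end
end

# Made-up 24 character leader with "a" in position 9.
p marc_encoding("00714cam a2200205 a 4500")
```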
But what if I want to read in a MARC-8 encoded record from file using ruby-marc and convert it to UTF-8? I’m outta luck. And as part of my script I’d like to be able to do just that. I’d then be able to maintain the same workflow: accept records, run through a macro, edit subfields, and create a csv file for labelmaking. And there’s no straightforward way I’ve found to move a ruby-marc record object into a ruby-zoom record object to do the conversion.
So I thought what if I could get the record from file into ruby-zoom first, do the conversion and then accept the record as a marc object? So in order to mock something up really quick I created a fake SRU server with WEBrick that always responded with the contents of the file supplied on the command line. The file was in MARC-8 and to work with SRU it was converted to MARCXML–except with the MARC-8 character encoding still. Very bad idea I’m sure to have MARC-8 XML.
I wanted to put the server into a thread that would stop when the program terminated and then in the same script slurp up the record using zoom, convert to UTF-8 and then move the record into a ruby-marc record object. For some reason I couldn’t do both things from the same script.
So then I wrote another script that could use SRU in zoom to grab the awful MARC-8 MARCXML record, do character conversion and move it over to a marc record. Even if I did a system call from the first script to this other script it failed. So I opened up another terminal to run the slurping script.
In the end I got it working. Who knows if the character encoding on it would have been any good. The main problem was how slow it was and how I needed to use two different scripts to get it to work. I learned a lot about SRU and WEBrick, but in the end it was a Rube Goldberg machine.
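For what it’s worth, the one-script arrangement I was after (a server in a background thread, a client in the main thread) can be sketched like this. To be clear about the assumptions: this uses a bare TCPServer from Ruby’s standard library rather than WEBrick, it speaks plain HTTP rather than SRU, it fetches with Net::HTTP rather than ruby-zoom, and the method name is made up. It shows only the threading part, not the character conversion.

```ruby
require "socket"
require "net/http"

# Serve `body` to a single HTTP request from a background thread, fetch it
# from the main thread, and return what the client received.
def fetch_via_local_server(body)
  server = TCPServer.new("127.0.0.1", 0) # port 0: let the OS pick a free port
  port = server.addr[1]

  server_thread = Thread.new do
    client = server.accept
    # Read and discard the request line and headers.
    loop do
      line = client.gets
      break if line.nil? || line == "\r\n"
    end
    client.write("HTTP/1.1 200 OK\r\n" \
                 "Content-Type: text/xml\r\n" \
                 "Content-Length: #{body.bytesize}\r\n" \
                 "Connection: close\r\n\r\n")
    client.write(body)
    client.close
  end

  # Passing nil as the proxy address skips any proxy set in the environment.
  response = Net::HTTP.new("127.0.0.1", port, nil).get("/")
  server_thread.join
  response.body
ensure
  server.close if server && !server.closed?
end

puts fetch_via_local_server("<record>placeholder</record>")
```

The socket is opened before the thread starts, so the client cannot connect too early, and joining the thread before returning means nothing is left running when the script ends.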
Soon I hope to share my current idea that’s forming on how to solve this problem for ZCC. Let’s hope it doesn’t end up the same mess as this one.