Archive for the ‘ruby’ Category

validating XHTML+RDFa markup with Ruby

August 17, 2009

I’m working on a project right now where I’m running Cucumber tests. As part of my Cucumber features I’m checking the validity of the XHTML markup with a special step definition sprinkled throughout my scenarios. That step definition uses the markup_validity gem to perform the actual validation. When I started adding RDFa to my pages the markup of course no longer validated. I had the choice of either giving up on validation as part of my Cucumber tests and testing markup validity some other way, or adding RDFa support to markup_validity. It was simple enough to fork the markup_validity GitHub project, add support for XHTML+RDFa, and get my tests passing again. After a few corrections, Aaron Patterson (of Nokogiri fame) accepted my changes into trunk, re-released the gem and added me as a collaborator on the GitHub project.

If you need to validate your XHTML+RDFa markup as part of your test/unit or rspec tests, please give it a try and let me know if it works for you:

gem install markup_validity

Internet Archive and just my timing

May 26, 2008

After Jonathan Rochkind’s post about the Internet Archive not providing an API, I spent part of the weekend writing a screen scraper to get at what we want from the Internet Archive for the Umlaut. Basically it uses the OpenURL metadata the Umlaut takes in to search the Internet Archive by ISBN and, failing that, falls back on a title-author search.

And it was a real pain. For a user interface the Internet Archive site does some nice things like highlighting query terms. When it comes to screen scraping the page it is not so nice. I had to strip out a lot of spans used for these terms.

The spans that could have been helpful, on the other hand, weren’t there. For instance a span with a good id around the creator and description would have been nice. Instead I had to determine if the item had a creator (many don’t). All that separated the creator from the description was a line break tag. Hpricot was fine for this, and luckily the documentation seemed to have improved since I last used this library (or I just understand more now). This lack of good spans with ids isn’t just bad for screen scraping but for user-defined CSS styles as well. (So in this way it could be an accessibility problem for some users?)

The thing that really got me was that I couldn’t reliably search by ISBN. Their advanced search had an “isbn” field, but I couldn’t find a single ISBN that could be found that way. If there was no way to test a search I didn’t see much sense in writing something that relied on it. Instead I rigged it to just use a simple keyword query to catch ISBNs. This would pick up ISBNs in fields like the description. The problem was that the ISBNs were not normalized. Sometimes they would show up without dashes (0860133923) and sometimes with (2-9527671-0-6). The method to insert dashes had to handle both 10 and 13 digit ISBNs. So my service tried searching the ISBN both with and without dashes before falling back on the title-author query.
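The with-and-without-dashes lookup can be sketched in a few lines of Ruby. This is only a rough illustration, not the Umlaut service itself: the method names are made up here, and it strips dashes and verifies the check digit rather than inserting dashes, since correct hyphenation needs the ISBN range tables.

```ruby
# Hypothetical helpers for illustration; not part of the Umlaut.

# Keep only digits and the X check character.
def normalize_isbn(raw)
  raw.to_s.upcase.gsub(/[^0-9X]/, "")
end

# ISBN-10: weighted sum (weights 10 down to 1) must be divisible by 11.
def valid_isbn10?(isbn)
  return false unless isbn =~ /\A\d{9}[\dX]\z/
  sum = isbn.chars.each_with_index.sum do |ch, i|
    (ch == "X" ? 10 : ch.to_i) * (10 - i)
  end
  (sum % 11).zero?
end

# ISBN-13: alternating weights 1 and 3, sum must be divisible by 10.
def valid_isbn13?(isbn)
  return false unless isbn =~ /\A\d{13}\z/
  sum = isbn.chars.each_with_index.sum do |ch, i|
    ch.to_i * (i.even? ? 1 : 3)
  end
  (sum % 10).zero?
end

# The forms worth searching for: the ISBN as given and the bare form.
def isbn_search_terms(raw)
  [raw.to_s.strip, normalize_isbn(raw)].uniq
end
```

A service could try each entry of `isbn_search_terms("2-9527671-0-6")` in turn and only fall back on the title-author query when none of them returns a hit.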

In the end it was good enough and I was happy that I was able to write my first Umlaut service. To write the service I only had to create one file and edit two simple config files to enable it. Because of how the Umlaut is architected I could return my results to the view without writing any new controller or view code. I just had to provide a couple of standard service methods. Then I had to learn how to add my service response and tell a listener that the service had completed, whether it returned any results or not. And because of the use of background services, I could make several searches (actually multiple ISBN searches with a fallback on several title-author searches) and not worry too much about how long it took. When it finished, any results would show up in the view via some AJAX magic.

One other thing I learned made writing this very worthwhile. I have been using NetBeans for some time now on my Ruby projects. I finally learned how to use the integrated Ruby debugger. It saved me so much time in this case, since puts debugging certainly wouldn’t work well in a Rails project.

And then…

Today I was cleaning out some email and decided to try a link I had been given that used to let you query the Internet Archive and get an XML response. I thought I would try this link again even though I’d tried it on Friday. Well, to my surprise it’s back.

They call it an Advanced XML Search (for Admins and Curators). A couple nice things about this. It simply exposes their Solr index. I can have Solr do my relevancy ranking and sorting. And I can use solr-ruby to make connection and query creation super easy. It can return much more metadata about each object found.

I used the web interface to create this query that returns pretty printed XML, but if you’re familiar with Solr the link should look very familiar.

http://homeserver7.us.archive.org:8983/solr/select?q=title%3A%22golden+fleece%22%3B&qin=title%3A%22golden+fleece%22&fl=collection%2Cidentifier%2Cmediatype%2Cpublisher%2Csubject%2Ctitle&wt=xml&rows=50&indent=yes
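For anyone who would rather build that kind of link in code than through the web form, here is a minimal sketch using only Ruby’s standard library. The method name is made up for illustration, the host is simply the one from the link above, and the sketch leaves out the `qin` parameter and the trailing semicolon that the web form adds.

```ruby
require "uri"

# Host and path taken from the link above; it may not stay available.
SOLR_SELECT = "http://homeserver7.us.archive.org:8983/solr/select"

# Hypothetical helper: build a Solr select URL for a phrase search on title.
# A title containing a double quote would need escaping first.
def title_search_url(title, rows: 50)
  params = {
    q: %(title:"#{title}"),
    fl: "collection,identifier,mediatype,publisher,subject,title",
    wt: "xml",
    rows: rows,
    indent: "yes"
  }
  # encode_www_form percent-encodes values and turns spaces into "+".
  "#{SOLR_SELECT}?#{URI.encode_www_form(params)}"
end

puts title_search_url("golden fleece")
```

Changing `fl` is the easy way to ask for more or fewer metadata fields per result.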

So it looks like I’ll be going back to the IDE to rewrite the Internet Archive service using this Solr interface. Ought to make things a lot less fragile than screen scraping. Even so, screen scraping is less fragile than a disappearing API, so I’ll likely just commit the service as written in case there are any other surprises from the Internet Archive.

rubygems without sources gem

November 20, 2007

If you uninstall the sources gem that rubygems relies on you’ll get an awful error if you try to run gem. And since sources-0.0.1 is probably best installed as a gem, well… you get the picture.
If you accidentally uninstall the sources gem, create a file named sources.rb and place the following inside:

    # sources.rb: minimal stand-in for the missing sources gem
    module Gem
      @sources = ["http://gems.rubyforge.org"]

      def self.sources
        @sources
      end
    end

I was then able to reinstall the sources gem and be back in business.

WorldCat Identities Change

September 25, 2007

I meant to get this out a week or two ago. It seems there’s been a small but possibly significant change in the way WorldCat Identities works. Before, you could get directly to a WCID record by knowing the proper base link and the NACO normalization of the name (1XX). The NACO normalization seems to be what some FRBRizations use to match identities, just by matching normalized strings. This value was marked up in the XML of the Identities records as the ‘pnkey.’ For most names you could very easily normalize the name and get directly to the record you wanted, especially if you had a MARC record that already had authority work done on it. The resulting link would look something like: http://orlabs.oclc.org/SRW/search/Identities?query=local.pnkey+exact+%22fforde,%20jasper%22

I was hoping to use WCID as a way to easily grab MARCXML authority records from the Linked Authority File (see the bottom of any WCID record’s page for the link). I thought this would be a nice feature to add to my little Ruby copy cataloging script. It would have been nice to be able to grab authority records at the same time as cataloging and maybe enrich the bibliographic record with such things as a link to Wikipedia or back to the WCID record.

To make sure I was able to generate the links properly I was working on a little Ruby library to do the NACO normalizations correctly. I invested a lot of time in learning how to pack a hex number representing the UTF-16 value for a letter into the UTF-8 version. The Python example for NACO normalizations used the UTF-16 values, so instead of looking up and calculating all the UTF-8 octals I wanted to just use these values directly. I spent time with Iconv for Ruby trying to get diacritic-stripping transliteration to work before realizing that it wasn’t very portable, since it relied on system locales. In fact I could get Iconv to do proper transliteration from UTF-8 to ASCII in irb, but as soon as I tried it in a script it failed, replacing every character that had a diacritic with a ‘?’. I took a look at the ICU4R library but couldn’t get it to compile on my system. Finally I fell back on the older Unicode gem to do decomposing normalizations and then strip out every byte above 127, which would include all those diacritics. Of course there’s an OCLC web service for NACO normalization but I wanted to learn some of this stuff anyway.
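The decompose-then-strip approach can be sketched as follows. Note the assumptions: this uses `String#unicode_normalize` from Ruby’s standard library (Ruby 2.2 and later) in place of the Unicode gem, the method names are invented for illustration, and it covers only the diacritic stripping and lowercasing, not the full NACO normalization rules.

```ruby
# Hypothetical helpers for illustration; not a complete NACO normalizer.

# Pack a Unicode code point (the hex value found in character tables)
# into a UTF-8 encoded string.
def codepoint_to_utf8(codepoint)
  [codepoint].pack("U")
end

# Decompose accented letters into base letter + combining mark (NFD),
# then drop everything outside ASCII, which removes the combining marks.
# Letters with no decomposition (such as U+00F8) are dropped entirely.
def strip_diacritics(text)
  text.unicode_normalize(:nfd).gsub(/[^\x00-\x7F]/, "")
end

# Rough matching key: ASCII only, lowercase, single spaces.
def rough_name_key(name)
  strip_diacritics(name).downcase.squeeze(" ").strip
end

puts rough_name_key("Bront\u00EB, Charlotte")
```

Because nothing here depends on the system locale, it should behave the same in irb and in a script.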

I think I had it mostly working when I wanted to take a look at some of the WCID pnkeys to make sure I had all the spacing correct for a good match. The site seemed to be down, since a direct link to one of the records I often looked at threw an ugly error. Since the site has been in beta I’ve often seen these types of errors, so I went to the homepage instead. The homepage came up. Searching a name also threw an error so I clicked on the tag cloud.

I discovered that WCID has now changed the pnkey to be the 010$a from the LC authority record with “lccn-” tacked onto the front. The links look nicer (more RESTful?), for example http://orlabs.oclc.org/Identities/key/lccn-n79-21164, but you’re no longer able to get directly to the record unless you already know the LC control number for the authority record.

It remains to be seen what other kinds of valid links you might be able to create with WCID, but for now it’s not looking good for what I had hoped to do. I can’t get to an authority record I don’t have if I need information from that record to get there. They’ve cut off my one simple route to the information I want. It may be that I’ll have to do a two step process: search first and then have a way to pick the proper record. More cumbersome than I had hoped, for sure. It looks to me as if OCLC may have made WCID less generally useful. It’s a beautiful project, but if I can’t have an API to it, or have to wade through a multi-step process and parse all the XML myself, then it’s not really very useful.

It’ll be interesting to see what happens, but I don’t know that I’ll be touching anything from OCLC in the future until it’s out of beta. Lesson learned.

Update: Searching WCID now works. It seems that for entities without authority records they still use a normalized form of the name as the pnkey.

little testing machines

August 30, 2007

I’ve been updating the xisbn rubygem to use the new API. I’ve got a little project where I could use some of the metadata that can now be returned. Ed Summers let me go off on my own experimental branch (and I see now, added me to the project–thanks, Ed).

I like what I’ve done with the first draft. I don’t know if it is what most rubyists would call beautiful code but it works. And it doesn’t just work–it passes tests. This is the first time I’ve written unit tests as I’ve gone along. I’ve written tests for already existing code and that was good, but the cycle of write a method, test, repeat probably made a better initial product.

I’m pleasantly surprised at how much faster I was able to get this done because I wrote tests. No guessing that something works or writing ad hoc little programs to test it.

Often I’d write a test and inadvertently write it wrong, so the test failed. For instance in one case the method needs an array to be passed in and instead I used a string. OK, go back and write some duck typing and raise an exception for that case, then write a test to make sure it works. I wrote the method and still got the call wrong, so certainly someone approaching the library fresh will make similar mistakes, and it’s good in this case that there’s a helpful error to direct them. So every mistake I make can be an opportunity for another test. Makes me more calm about making mistakes.
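That pattern (duck-type the argument, raise a helpful error, then pin the behavior down with a test) might look something like the following. This is only a sketch under stated assumptions: `clean_isbns` is a made-up method, not anything from the xisbn gem, and it uses Minitest rather than whatever test framework the gem actually uses.

```ruby
require "minitest/autorun"

# Hypothetical method for illustration; not part of the xisbn gem.
# Accepts anything list-like (responds to map) and strips dashes and spaces.
def clean_isbns(isbns)
  unless isbns.respond_to?(:map)
    raise ArgumentError, "expected a list of ISBNs, got #{isbns.class}"
  end
  isbns.map { |isbn| isbn.to_s.gsub(/[\s\-]/, "") }
end

class CleanIsbnsTest < Minitest::Test
  def test_strips_dashes_from_each_isbn
    assert_equal ["2952767106"], clean_isbns(["2-9527671-0-6"])
  end

  # The mistake described above: passing a bare string instead of an array.
  def test_bare_string_raises_helpful_error
    error = assert_raises(ArgumentError) { clean_isbns("2-9527671-0-6") }
    assert_match(/expected a list of ISBNs/, error.message)
  end
end
```

Checking `respond_to?(:map)` instead of `is_a?(Array)` keeps the method open to other list-like objects while still rejecting a plain string.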

There’s more work I need to do on xisbn for sure, and some little convenience methods I’d like to add as well as documentation, but I’m happy with what I’ve done so far and what I learned about testing.

I can also imagine that this won’t be the last change OCLC makes to their xisbn service, so having tests in place should tell me right away what needs to be rewritten.

This was also a test of sorts for me. xisbn is the first little API I’ve taken a look at and tried to translate into code. I think the documentation was a great help, especially in writing the tests.