Author Archive

This Blog Has Moved

January 28, 2012

I’ve changed over to using Octopress. I like having my content under version control. The workflow is simpler, and I can use a simple text editor and Markdown, which is nice. You can find Preliminary Inventory of Digital Collections at its new home.

Categories: Uncategorized

microdata, Rails, and microdatajs

June 11, 2011 2 comments

With the recent release of and its use of microdata, I wanted to try to add some microdata to my pages. In development it would be nice to confirm what metadata can be extracted, but the tools I found were lacking.

I have found few tools so far that let you see the microdata on a page. There’s the Rich Snippets Testing Tool from Google, but you have to enter a URL, which does not work in development before a site is up on a public server. The other current problem with the Rich Snippets tool is that it has yet to be updated for some of the item types. While the microdata parsing works and displays properly, the possible rich snippets do not show up.

I found an extension for Chrome that displays an icon when microdata is detected on a page and lets you explore down through the tree, but its JavaScript chokes on my page for some reason, even though I have no reason to believe the markup is invalid. On other pages it seems to work well enough, if it looks a bit crude.

The mida gem allows for parsing microdata, but only works under Ruby 1.9, and my current project still relies on Ruby 1.8.7. It was easy enough with rvm to set up a separate gemset under Ruby 1.9 and do something like the following:

$ irb
> require 'mida'
> require 'pp'
> require 'open-uri'
> doc = nil
> url = "http://localhost:3000/"
> open(url) {|f| doc = Mida::Document.new(f, url)}; pp doc.items; nil

That last nil ensures that the items aren’t displayed a second time without pretty printing, so you only see the pretty printed version. The problem is the “pretty” version isn’t so easy to read. It was also annoying to have to move back and forth between my code editor and a terminal just to run this.

The Dive into HTML5 book presents some good links at the end of the chapter on microdata. One link is to Live Microdata, which lets you enter whole pages or snippets containing microdata and parses them into JSON and Turtle. You can download the source code for this, load up the live microdata page locally, and enter snippets.

What’s better is that microdatajs includes two JavaScript implementations of a microdata parser. I thought I could include this within my project and display the result at the bottom of the page. Since I use livereload, any time I make a change the browser would reload and the microdata parsing would also be triggered. I decided to use the non-jQuery version because it meant fewer files to include in my project and the README says that this version “mimics as closely as possible the syntax and behavior of the API defined in the HTML specification.”

Here’s what I did very quickly to do this in Rails. First, grab microdata.js and microdata.json.js and place them in /public/javascripts. Include the following snippet in the head of your application.html.erb template:

<%= render :partial => '/layouts/microdata' if Rails.env == 'development' %>

Within /app/views/layouts/_microdata.html.erb place the following:

<%= javascript_include_tag 'microdata', 'microdata.json' %>
<script type="text/javascript">
  $(document).ready(function() {
    var items = getJSON(document.getItems());
    $('body').append('<pre><code>' + items + '</code></pre>');
  });
</script>

The script does rely on jQuery to wait for the document ready, but that could be rewritten to work without it.

Now when you reload your page with embedded microdata you should see some pretty printed JSON at the bottom of the page with the microdata items from that page. I added prettify for JSON highlighting, but it doesn’t add much benefit so I left it out of the above.

Hopefully more and better tools will be coming to help develop microdata-rich sites. If you know of a good tool or browser extension for developing sites with microdata, I would be very interested in hearing about it.

Categories: javascript, microdata, rails, ruby

validating XHTML+RDFa markup with Ruby

August 17, 2009

I’m working on a project right now where I’m running Cucumber tests. As part of my Cucumber features I’m checking the validity of the XHTML markup with a special step definition sprinkled throughout my scenarios. The markup_validity gem performs the actual validation. When I started adding RDFa to my pages the markup of course no longer validated. I had the choice of either giving up validation as part of my Cucumber tests and testing markup validity some other way, or adding RDFa support to markup_validity. It was simple enough to fork the markup_validity GitHub project, add support for XHTML+RDFa, and get my tests passing again. After a few corrections, Aaron Patterson (of Nokogiri fame) accepted my changes into trunk, re-released the gem and added me as a collaborator on the GitHub project.

If you need to validate your XHTML+RDFa markup as part of your test/unit or rspec tests, please give it a try and let me know if it works for you:

gem install markup_validity

Categories: ruby, rubygems, testing

Internet Archive and just my timing

May 26, 2008 1 comment

After Jonathan Rochkind’s post about the Internet Archive not providing an API, I spent part of the weekend writing a screen scraper to get at what we want from the Internet Archive for the Umlaut. Basically it uses the OpenURL metadata the Umlaut takes in to search the Internet Archive by ISBN and, failing that, falls back on a title-author search.

And it was a real pain. For a user interface the Internet Archive site does some nice things like highlighting query terms. When it comes to screen scraping the page it is not so nice. I had to strip out a lot of spans used for these terms.

Now the spans that could have been helpful weren’t there. For instance, a span with a good id around the creator and description would have been nice. Instead I had to determine if the item had a creator (many don’t). All that separated the creator from the description was a line break tag. Hpricot was fine for this, and luckily the documentation seemed to have improved since I last used this library (or I just understand more now). This lack of good spans with ids isn’t just bad for screen scraping but for user-defined CSS styles as well. (So in this way it could be an accessibility problem for some users?)
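
To give a rough idea of the cleanup involved, here is a minimal sketch that uses only plain string handling rather than Hpricot. The markup in SAMPLE_HIT, the searchTerm class name, and both helper names are made up for illustration; the real Internet Archive result pages certainly differed in detail.

```ruby
# Hypothetical search-result fragment: highlight spans wrapped around query
# terms, and nothing but <br/> tags between title, creator and description.
SAMPLE_HIT = '<span class="searchTerm">Moby</span> Dick<br/>' \
             'Melville, Herman<br/>' \
             'A novel about a <span class="searchTerm">whale</span>.'

# Remove the highlighting spans but keep the text inside them.
def strip_highlight_spans(html)
  html.gsub(%r{</?span[^>]*>}, '')
end

# Split a hit on line break tags. With three parts we assume
# title / creator / description; with two parts the creator is missing.
def parse_hit(html)
  parts = strip_highlight_spans(html).split(%r{<br\s*/?>}).map(&:strip)
  title = parts.shift
  description = parts.pop
  creator = parts.first # nil when the item has no creator
  { title: title, creator: creator, description: description }
end
```

Regular expressions are a fragile way to handle HTML in general, which is part of why a parser such as Hpricot is the better tool for the real pages.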

The thing that really got me was that I couldn’t reliably search by ISBN. Their advanced search had an “isbn” field, but I couldn’t find an ISBN that could be found that way. If there was no way to test a search, I didn’t see much sense in writing something that relied on it. Instead I rigged it to use a simple keyword query to catch ISBNs. This would pick up ISBNs in fields like the description. The problem was that the ISBNs were not normalized, so sometimes they would show up without dashes (0860133923) and sometimes with (2-9527671-0-6). The method to insert dashes had to handle both 10 and 13 digit ISBNs. So my service tried searching for the ISBN both with and without dashes before falling back on the title-author query.
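
Going from dashed to undashed is the easy direction, and the check digits at least let you tell a real ISBN from a stray number in a description. A minimal sketch of that part follows. The method names are made up for illustration and are not taken from the Umlaut service, and inserting dashes correctly is left out because it needs the ISBN range tables.

```ruby
# Strip dashes and spaces so "2-9527671-0-6" and "2952767106" compare equal.
def normalize_isbn(isbn)
  isbn.to_s.upcase.gsub(/[^0-9X]/, '')
end

# ISBN-10 check: weights run from 10 down to 1 and the weighted sum must be
# divisible by 11. A final "X" stands for the value 10.
def valid_isbn10?(isbn)
  digits = normalize_isbn(isbn)
  return false unless digits =~ /\A\d{9}[\dX]\z/
  sum = digits.chars.each_with_index.sum do |char, i|
    (char == 'X' ? 10 : char.to_i) * (10 - i)
  end
  (sum % 11).zero?
end

# ISBN-13 check: weights alternate 1 and 3 and the sum must be divisible by 10.
def valid_isbn13?(isbn)
  digits = normalize_isbn(isbn)
  return false unless digits =~ /\A\d{13}\z/
  sum = digits.chars.each_with_index.sum { |char, i| char.to_i * (i.even? ? 1 : 3) }
  (sum % 10).zero?
end

# Convert an ISBN-10 to ISBN-13: prefix 978, then recompute the check digit.
def isbn10_to_isbn13(isbn)
  body = '978' + normalize_isbn(isbn)[0, 9]
  sum = body.chars.each_with_index.sum { |char, i| char.to_i * (i.even? ? 1 : 3) }
  body + ((10 - sum % 10) % 10).to_s
end

# Every distinct form worth trying in a keyword search, as given first.
def isbn_search_terms(isbn)
  [isbn.strip, normalize_isbn(isbn)].uniq
end
```

Both ISBNs quoted above pass valid_isbn10?, and isbn_search_terms returns a single term when the input has no dashes to begin with, which saves a redundant search.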

In the end it was good enough and I was happy that I was able to write my first Umlaut service. Writing the service I only had to create one file and edit two simple config files to enable the new service. Because of how the Umlaut is architected I could return my results to the view without writing any new controller or view code. I just had to provide a couple of standard service methods. Then I had to learn how to add my service response and tell a listener that the service had completed whether it returned any results or not. And because of the use of background services, I could make several searches (actually multiple ISBN searches with a fallback on several title-author searches) and not worry too much about how long it took. When it finished, any results would show up in the view via some AJAX magic.

One other thing I learned made writing this very worthwhile. I have been using NetBeans for some time now on my Ruby projects. I finally learned how to use the integrated Ruby debugger. Saved me so much time in this case, since puts debugging certainly wouldn’t work well in a Rails project.

And then…

Today I was cleaning out some email and decided to try a link I was given to where you used to be able to query Internet Archive and get an XML response. I thought I would try this link again even though I’d tried it on Friday. Well, to my surprise it’s back.

They call it an Advanced XML Search (for Admins and Curators). A couple nice things about this. It simply exposes their Solr index. I can have Solr do my relevancy ranking and sorting. And I can use solr-ruby to make connection and query creation super easy. It can return much more metadata about each object found.

I used the web interface to create this query that returns pretty printed XML, but if you’re familiar with Solr the link should look very familiar.
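
For anyone who has not seen a Solr query URL, here is a minimal sketch of how one might be assembled in Ruby with nothing but the standard library. The endpoint, the field names, and the ia_search_url helper are assumptions for illustration, so check them against the Internet Archive’s own documentation before relying on them.

```ruby
require 'uri'

# Assumed endpoint for the Advanced XML Search (an assumption, not confirmed
# by this post).
IA_SEARCH_ENDPOINT = 'http://www.archive.org/advancedsearch.php'

# Build a Solr-style query URL. Each requested field becomes its own
# fl[] parameter; rows limits the number of results.
def ia_search_url(query, fields: %w[identifier title creator], rows: 10)
  params = [['q', query]]
  fields.each { |field| params << ['fl[]', field] }
  params << ['rows', rows.to_s] << ['output', 'xml']
  "#{IA_SEARCH_ENDPOINT}?#{URI.encode_www_form(params)}"
end

puts ia_search_url('title:(moby dick) AND creator:(melville)')
```

Building the parameter list as pairs keeps the repeated fl[] entries in order and leaves all the escaping to URI.encode_www_form.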

So it looks like I’ll be going back to the IDE to rewrite the Internet Archive service using this Solr interface. Ought to make things a lot less fragile than screen scraping. Even still, screen scraping is less fragile than a disappearing API, so I’ll likely just commit the service as written in case there are any other surprises from the Internet Archive.

ye olde booke catalogue

January 25, 2008

I recently launched my first web application. Handlist takes a file of MARC records and turns them into an alphabetically sorted book catalog of titles, authors and subject headings.

It was mostly an excuse to use Merb and learn some web programming, but the gem that I wrote behind it was useful for me in a part-time job, so maybe it will be useful for someone else.

If you have a file of some MARC records handy (and less than 2MB) could you give it a test? What could I have done differently? Since I’m new to this I’m all ears. Hopefully soon I’ll post about some of the things I learned while working on this project.

Categories: Uncategorized

rubygems without sources gem

November 20, 2007 1 comment

If you uninstall the sources gem that RubyGems relies on, you’ll get an awful error if you try to run gem. And since sources-0.0.1 is probably best installed as a gem, well… you get the picture.

If you accidentally uninstall the sources gem, create a file named sources.rb and place the following inside:

module Gem
  @sources = [""]
  def self.sources
    @sources
  end
end

I was then able to reinstall the sources gem and be back in business.

Categories: ruby, rubygems

WorldCat Identities Change

September 25, 2007 3 comments

I meant to get this out a week or two ago. There seems to have been a small but possibly significant change in the way WorldCat Identities works. Before, you could get directly to a WCID record by knowing the proper base link and the NACO normalization of the name (1XX). The NACO normalization seems to be what some FRBRizations use to match identities: just matching normalized strings. This value was marked up in the XML of the Identities records as the ‘pnkey.’ For most names you could very easily normalize the name and get directly to the record you wanted, especially if you had a MARC record that already had authority work done on it. The resulting link would look something like:,%20jasper%22

I was hoping to use WCID as a way to easily grab MARCXML authority records from the Linked Authority File (see the bottom of any WCID record’s page for the link). I thought this would be a nice feature to add into my little Ruby copy cataloging script. It would have been nice to have the ability to grab authority records at the same time as cataloging and maybe enrich the bibliographic record with such things as a link to Wikipedia or back to the WCID record.

To make sure I was able to generate the links properly I was working on a little Ruby library to do the NACO normalizations correctly. I invested a lot of time in learning how to pack a hex number representing the UTF-16 value for a letter into the UTF-8 version. The Python example for NACO normalizations used the UTF-16 value, so instead of looking up and calculating all the UTF-8 octals I wanted to just use these values directly. I spent time with Iconv for Ruby trying to get diacritic stripping transliteration to work before realizing that it wasn’t very portable since it relied on system locales. In fact I could get Iconv to do proper transliteration to ASCII from UTF-8 in irb, but as soon as I tried it in a script it failed, replacing every character with a diacritic with a ‘?’. I took a look at the ICU4R library but couldn’t get it to compile on my system. Finally I fell back on the older Unicode gem to do decomposing normalizations and then strip out every byte above 127, which would include all those diacritics. Of course there’s an OCLC web service for NACO normalization but I wanted to learn some of this stuff anyway.
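
For comparison, here is a minimal sketch of the same decompose-then-strip idea in a newer Ruby, where String#unicode_normalize (Ruby 2.2 and later) stands in for the Unicode gem. It is a much simplified stand-in and not the full NACO rules, and the method names are made up for illustration.

```ruby
# Pack a code point (the hex number from a Unicode chart) into the
# corresponding UTF-8 string.
def codepoint_to_utf8(codepoint)
  [codepoint].pack('U')
end

# Decompose accented letters into base letter plus combining mark, then drop
# everything outside ASCII, which removes the combining marks.
def strip_diacritics(text)
  text.unicode_normalize(:nfd).gsub(/[^\x00-\x7F]/, '')
end

# Simplified name key (an assumption, not real NACO normalization):
# fold to ASCII, lowercase, and collapse runs of whitespace.
def rough_name_key(name)
  strip_diacritics(name).downcase.gsub(/\s+/, ' ').strip
end
```

One caveat: letters that have no decomposition, such as ø or ł, are dropped entirely by this approach, whereas the NACO rules map them to specific replacements.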

I think I had it mostly working when I wanted to take a look at some of the WCID pnkeys to make sure I had all the spacing correct for a good match. The site seemed down since a link directly to one of the records I often took a look at threw an ugly error. Since the site has been in beta I’ve often seen these types of errors, so I went to the homepage instead. The homepage came up. Searching a name also threw an error so I clicked on the tag cloud.

I discovered that now WCID has changed the pnkey to be the 010$a from the LC authority record with “lccn-” tacked onto the front. The links look nicer (more RESTful?): But you’re no longer able to get directly to the record unless you already know the LC control number for the authority record.

It remains to be seen what other kinds of valid links you might be able to create with WCID, but for now it’s not looking good for what I had hoped to do. I can’t get to an authority record I don’t have if I need information from that record to get there. They’ve cut off my one simple route to the information I want. It may be that I’ll have to do a two-step process: search first and then have a way to pick the proper record. More cumbersome than I had hoped, for sure. It looks to me as if OCLC may have made WCID less generally useful. It’s a beautiful project, but if I can’t have an API to it or have to wade through a two-step process and parse all the XML myself, then it’s not really very useful.

It’ll be interesting to see what happens but I don’t know that I’ll be touching anything from OCLC in the future until it’s out of beta. Lesson learned.

Update: Searching WCID now works. It seems for entities without authority records they still use a normalized form of the name as their pnkey.