Bibliographic Wilderness

Gone to Croatoan

flot in a hidden div

Thu, 22/07/2010 - 20:52

I’m using the insanely awesome Flot JQuery plotting/charting package for the soon-to-be-released range limit plugin for Blacklight.

So one problem I ran into. The place I’d like to put my Flot chart is in a div on screen that is often initially hidden, and only shown when the user expands it by clicking on a heading.

There are at least two problems with that. One is that Flot requires an explicit width and height to be set. But I’d like my plugin to be ‘liquid’ in its display of Flot. Flot is fine if you set the width and height with JavaScript, as long as you do it before you draw the chart. Okay, so I figure I can look up the width with jQuery’s width() and compute the height using a good ratio. Except you can’t look up the width of a hidden div; it doesn’t have one.

The other, more obvious problem is that Flot simply won’t draw in a hidden div, even if you do explicitly set the width and height. It does all sorts of wild calculations to figure out the best places to put labels and such, and it can’t do that unless its placeholder container is actually laid out in the page, not hidden.

So, I thought, okay, it needs to be shown, but what if it’s shown, but off screen (absolutely positioned somewhere way off the monitor). Does that work?  Well, sort of, sometimes. If I took only the plot placeholder div and moved it off screen, Flot would be willing to draw, but when I later moved it back on-screen to view it, flot’s labels and tick marks and such were all over the place, in the wrong places.

But. If I took the parent div to the flot placeholder, the one that in my page is actually being hidden and shown, and moved it off-screen… everything worked.

So here’s what I do to draw a Flot chart in a “hidden” div without it really being hidden: show the parent div, calculate the width/height, move the parent div off screen, have Flot draw itself, re-hide the parent div, and put it back on screen. It all happens quickly enough that it’s as if Flot were drawn in a hidden div.

Working in the four browsers. It may not exactly be a general purpose solution, because it may depend to some extent on the surrounding DOM, but it works in my DOM.

Here’s a nice little wrapper routine I wrote that, at least in my case, does the job. (using a javascript closure to wrap the actual drawing).

    // example use:
    wrapPrepareForFlot(
      $(placeholder_div),
      $(parent_that_might_be_hidden),
      desired_width_to_height_ratio,
      function(placeholder) {
        // code to actually draw Flot goes here
      });

    // definition:
    /* Set up dom for flot rendering: flot needs to render in a non-hidden div
       with explicitly set width and height. The non-hidden thing is annoying to us,
       since it might be in a hidden facet limit. Can we get away with moving it
       off-screen? Not JUST the flot container, or it will render weird. But the
       whole parent limit content, testing reveals we can. */
    function wrapPrepareForFlot(container, parent_section, widthToHeight, call_block) {
      var parent_originally_hidden = $(parent_section).css("display") == "none";
      if (parent_originally_hidden) {
        $(parent_section).show();
      }
      $(container).width( $(parent_section).width() );
      $(container).height( $(parent_section).width() * widthToHeight );
      if (parent_originally_hidden) {
        $(parent_section).addClass("ui-helper-hidden-accessible");
      }
      call_block(container);
      if (parent_originally_hidden) {
        $(parent_section).removeClass("ui-helper-hidden-accessible");
        $(parent_section).hide();
      }
    }

There’s a different approach you could take too, which I might revisit later and which would have its own tricks: instead of trying to pre-render the plot in a ‘hidden’ div, don’t load it until the user actually shows the div, then start loading it. Because JS/jQuery doesn’t have a built-in “onshow” event, this would take some tricks too, but should also be doable. I wonder if there’s a jQuery plugin that provides a general-purpose on-show event somehow?


Filed under: General
Categories: LIS, foreign

Getting publication date out of Marc

Wed, 21/07/2010 - 20:25

The SolrMarc example/default configuration tries to get a publication date out of 260$c.

This is a tricky thing to do, because you’re trying to parse data that is not entirely coded. And on top of that, I just discovered that dates in other calendar systems can legally appear in 260$c, if that’s how they appear on the title page. A title page has Hebrew calendar year 5750 on it? That’ll be in the 260$c. Oops.

So it’s probably better to try to get dates out of the 008 fixed field. One problem here is that it’s a lot more confusing: you’ve got to get ASCII decimal digits out of fixed byte positions (machine readable what?), and you really need to talk to a cataloger to get to the bottom of “date1” and “date2”, as well as the “date types” and what they mean.

Beware of date type “q”, for “questionable date”, meaning that the publication date is somewhere in the range of date1 and date2. (These would seem, by examples in the OCLC documentation, to be inclusive boundaries, although the documentation doesn’t actually explicitly say that.)

On top of that, dates in date1 and date2 can show up with “u”s in them for unknown digits. “19uu” means sometime in the 20th century.

And as the final note in the “this is really meant to be machine readable?” column, let’s say you know something was published in the 19th or 20th century. You might think you’d use the “q” date type and put date1=1800 and date2=1999; that would certainly express what you know. But no, the OCLC examples say to put this in as “q” date type, with date1=18uu and date2=19uu. Huh?
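As a rough illustration of reading those fixed positions, here is a minimal Ruby sketch. It assumes the MARC 21 layout of the 008 (byte 06 is the date type, bytes 07-10 are date1, bytes 11-14 are date2) and treats “q” ranges as inclusive, as the OCLC examples seem to suggest. The method names are made up for this example and have nothing to do with SolrMarc’s own code.

```ruby
# Sketch: turn the date bytes of a MARC 008 into a range of possible years.
# Assumed layout (MARC 21): 06 = type of date, 07-10 = date1, 11-14 = date2.
# Method names and the return shape are illustrative assumptions.

# Expand a 4-character date such as "19uu" into its lowest or highest
# possible year by replacing unknown digits ("u") with the fill digit.
# Returns nil when the value is not four characters of digits and "u".
def expand_unknown_digits(date, fill)
  return nil unless date.to_s.match?(/\A[0-9u]{4}\z/)
  date.tr("u", fill).to_i
end

# Returns a Range of years the item could have been published in,
# or nil when date1 is blank or otherwise unusable.
def publication_year_range(field_008)
  date_type = field_008[6]
  date1 = field_008[7, 4]
  date2 = field_008[11, 4]

  low = expand_unknown_digits(date1, "0")
  return nil if low.nil?

  high =
    if date_type == "q"
      # Questionable date: somewhere between date1 and date2 (treated as inclusive).
      expand_unknown_digits(date2, "9") || expand_unknown_digits(date1, "9")
    else
      # Other date types give date2 other meanings; this sketch only uses date1.
      expand_unknown_digits(date1, "9")
    end

  (low..high)
end
```

With this sketch, the odd-looking “q” record with date1=18uu and date2=19uu comes out as the range 1800..1999, which is the same thing date1=1800 and date2=1999 would have expressed. A real indexer would still need to decide what to do with the other date types.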

The other problem with getting dates out of 008 fixed bytes is that, since so many of our traditional ILSs completely ignore them, it’s not clear to me how correct they’ll be; a mistake didn’t matter much before. But in a testament to years of catalogers entering correct data even though their systems did nothing with it, the data seems at first analysis to be pretty good. I think it’s going to be better than trying to get a date from 260$c, especially with the “Hebrew date” issue.



Umlaut infomercial: ‘stitching’ costs

Mon, 19/07/2010 - 02:54

Lorcan Dempsey has a blog post about ‘stitching costs’ that gives me some useful language to plug Umlaut some more.

Libraries are also familiar with high ‘integration’ costs: perhaps these might be called stitching costs. This means that it may be costly developing higher level services based on integration of various lower level services.

Umlaut is in part intended to deal with ‘stitching costs’ for one class of services: “known item services”. It also deals with where ‘stitching’ and ‘switching’ intersect: once you’ve integrated a bunch of your services in a Rube Goldberg-esque contraption, part of the cost of switching one of the underlying services out becomes the ‘stitching’ of the new one into your overall aggregate system. This is an issue more and more libraries will probably start running into, as more have developed such ball-and-twine systems of integration over the past few years, possibly in such a way that the first time one of the underlying components needs to be switched, it’s going to be painful.

The idea of Umlaut is that it’s a platform that you can easily write plugins for to consult various internal and external sources of data on ‘known items’ (catalog, link resolver knowledge base, WorldCat, Scopus, whatever). The platform is there for you, so you can just focus on the source-specific logic in the plugin. Then Umlaut vends the aggregated information it has collected through both an HTML web page and several APIs designed to be as easy to use as possible, so you can embed the aggregated information in whatever other services or web pages you want.

There is a full API (delivering XML or JSON), as well as an API that delivers a collection of HTML snippet sections ready to be dropped in as-is on a foreign web page. Both APIs provide incremental results, with information on which services are still fetching data, for polling. (Fetching data from various external services can end up being slow.)

So theoretically, once set up, this should decrease both the ‘switching’ and ‘stitching’ costs of individual elements in your infrastructure.  Switch out a source of known item data? No problem, just write an Umlaut plugin for the new one. Switch out a service that consumes known item info from Umlaut and delivers it to users? No problem, the new service just needs to access Umlaut’s easy to use APIs.

Now, in practice, it is admittedly not necessarily quite so simple as I make it sound, as both writing an Umlaut plugin and making a new product access and deliver info from Umlaut’s APIs can be significant tasks, but I’m convinced that this architecture significantly lowers stitching costs, and switching costs due to stitching costs, in the long run.

As a chief example, it should theoretically allow us to consider ‘link resolvers’ just in terms of quality of knowledge base, without worrying about the link resolver’s own interface. Since Umlaut accesses the link resolver knowledge base via API and provides its own (HTML and API) interfaces, if we decide that there’s a better product for its knowledge base, we should be able to write an Umlaut plugin for the better one (as long as it has an API) and switch it out. Not only will our user experience be mostly unaffected, but all existing ‘stitched’ services will not have to be touched a bit; they’re still talking to Umlaut, which just has a new plugin data source. This depends on the desired new link resolver having an API which is sufficiently robust and performant for Umlaut’s needs, but not on any of the desired link resolver’s own presentation features.



Why a known item service infrastructure?

Fri, 16/07/2010 - 19:20

It occurred to me a while ago that Umlaut isn’t just a ‘link resolver front end’, or an ‘improved link resolver’. It is those things, but when you improve a link resolver enough, and pay attention to all forms/genres (not just journals), what you get is what I’m clunkily calling a Known Item Service Provider, an additional piece of library infrastructure.

I’ve come to think that this is in fact an essential tool that most library digital infrastructure is missing. As an infrastructural tool, it’s not necessarily designed to answer just one question for one very particular use case; it’s designed to answer the general question (for people and machine access): “What can you tell me or do for me about item X?”

Andy Powell brings up a specific question/use case that’s a subset of this: if I know a print book I’m interested in, and likely even know its ISBN, does the library have a licensed ebook version? And secondarily, is there an ebook version in existence whether or not the library licenses it?

This is definitely something in Umlaut’s domain. How well does Umlaut do at answering it?  Currently, the second one ‘does an ebook exist whether or not we license it’, not very well, but if external sources of data (with APIs) could be identified to answer it (as Andy begins doing), plugins to Umlaut could be written to grab those data and make Umlaut’s answer better for this specific use (and perhaps improve other unexpected uses too, since you’ve improved the infrastructural tool).

The first one, does the library have an ebook version, Umlaut does better at, at least at our library.

This works because our library has endeavored to list most ebooks we have in our catalog, and Umlaut tries to do searches of the catalog. But its success depends on:

  • We have a record in the catalog OR in our link resolver knowledge base for the ebook. (Umlaut tries to combine both sources of information).
  • Umlaut successfully finds it, which is somewhat trickier than it sounds, since Umlaut uses some heuristic algorithms to try and balance precision (minimize false positives) with recall (minimize false negatives), as well as avoiding duplicate information when data exists in both the catalog and the link resolver.
    • Sometimes the ebook record in our catalog has the print ISBN on it too. This will make Umlaut’s job easier. Not sure if the SFX knowledge base puts print ISBNs on ebook records.
    • Sometimes Umlaut will do a title-author search of our catalog, but whether it does or not is related to complicated heuristics, which could be tuned for this use case and our data if we put some time into it.

But in fact, it does a reasonably good job anyway. Here are some example Umlaut URLs which take a print ISBN and tell you “what can the library do or provide for this item”, and the result includes licensed ebooks. I’ll include a few title-author inputs too, to show that’s feasible as well.

It’s definitely far from perfect. I showed you some successful positives; finding false negatives would take more time, but I’m sure they’re in there. (We generally tune Umlaut to avoid false positives, so those are less likely, but there are surely a few.)

Umlaut doesn’t use xISBN or any other “work set expander” service right now; that’d be one obvious improvement I’d hope to make sometime. Although ideally not before collecting some kind of evidence on how often Umlaut fails for certain tasks in ways that would be improved by a “work set expander”. There are other data sources and other tunings to Umlaut’s heuristics that could be tried as well.

But I think it shows itself pretty admirably anyway. The point is that Umlaut, as an attempted platform serving as “Known Item Service provider”, is a general purpose tool that can handle this specific use case among many others. The beauty of a general purpose tool is that when you improve it for a certain use case, you get unintended benefits to other use cases you hadn’t yet considered, instead of just having a very specific tool for very specific use cases. I propose that a Known Item Service provider like Umlaut ought to in fact be a key part of an academic library’s infrastructure.



deals

Tue, 13/07/2010 - 01:18

In every deal you get some things and don’t get (or give up) others, that’s what makes it a deal.

HathiTrust recently announced a welcome expansion of access to public domain texts, within some significant limits if you aren’t a HathiTrust institution.

  • All users can now download full PDFs of public domain volumes that were not digitized by Google. This currently includes nearly 100,000 Internet Archive-digitized volumes that were contributed by the University of California and thousands of volumes digitized locally by the University of Michigan.
  • Authenticated users can now download full PDFs of ALL public domain volumes.

http://mblog.lib.umich.edu/blt/archives/2010/07/hathitrust_digi.html

Seems safe to assume that a contract/license with Google is what prevents them from sharing public domain PDFs digitized by Google with unauthenticated and unaffiliated users.

In fact, it’s kind of surprisingly nice that they can apparently share digitized-by-Google public domain texts with the users of HathiTrust institutions that are not umich, and that may not even be Google partners (you don’t need to be a Google partner to be a HathiTrust partner, do you?).

Thanks, HathiTrust, for actually giving everyone the maximum access you safely can by contract and copyright law (and additionally, not interpreting ‘safely can’ in the absolute most conservative way possible), instead of just saying “Ah, forget it, that’s too hard, let’s only let umich users/HathiTrust partners/Google partners access any of it.”

And HathiTrust is still useful even without full text; the ability to search text even without being able to see more than snippets (or in some cases page numbers) is still useful. And I’m glad to have a non-profit library consortium in the sector, not just a Google monopoly. If HT wouldn’t really have been possible to start without a jump start from Google, well, that comes with limitations, but it’s a really good platform to start from and the HT folks are doing a good job with it.



who owns cooperative cataloging?

Wed, 07/07/2010 - 07:51

Did you think anyone did? Are we served when an institution does?

A significant change for MSU in the new SkyRiver environment is the inability to contribute to PCC cataloging. MSU was (and still is, in name) a CONSER and NACO library, but the move from OCLC to SkyRiver prevented participation in these activities. SkyRiver has been denied any mechanism for contributing records to the CONSER database (which is embedded in OCLC), and has no mechanism for doing NACO authority work (though as of this writing, the Library of Congress and SkyRiver are in communication about a possible arrangement for NACO).

http://www.ala.org/ala/mgrps/divs/alcts/resources/ano/v21/n2/feat/system.cfm (Thanks to Ed Corrado for the link).

Back when there were multiple regional “bibliographic utilities”, did CONSER and NACO exist yet, or does the OCLC monopoly pre-date them?  I do not know.  Certainly if those formal cooperative cataloging initiatives had existed when there was more than one traditional ‘bibliographic utility’, a library wouldn’t have been required to participate in a certain ‘utility’ in order to participate in the cooperative work. Are the goals of cooperative initiatives like these served by turning down qualified participants willing to contribute cataloging resources unless they subscribe to the monopoly provider?



History of videos, and first sale doctrine (a

Wed, 16/06/2010 - 18:44

Josh Greenberg provides an interesting brief history of video store circulation.

So here’s one interesting thing about that story to me.

“since a VCR owner could watch a purchased movie countless times, individual cassettes were priced at dozens of times the going rate for a box-office ticket.”

Then, as Greenberg recounts, the rental business model took over anyway. Or perhaps because of the high retail prices, instead of ‘anyway’. At any rate, the copyright holders never succeeded in getting much of a market for those high-priced retail purchases, and a rental market developed instead, with the rental stores paying the high prices and then recouping through renting.

As a result of the rental model, the price for actually BUYING a movie dropped substantially, videotapes (and now DVDs) are NOT any longer sold to consumers at dozens of times the price of a box-office ticket, but at maybe 2-4 times, typically.  Presumably because once the rental model took over, it was clear that those higher prices were unrealistic for consumer purchase (really, they always were).

Now here’s one thing I’ve never understood the legal basis of. Video rental stores (and Netflix too?) still pay a MUCH higher price for copies than the consumer price, closer to that dozens-of-times-the-price.

While this might seem “fair”, since many people are going to watch that copy — I don’t know how (or if?) the publishers are able to _require_ rental stores to do this.  I thought the “first sale doctrine” applied to videos, and I thought the first sale doctrine said once you buy something, you are allowed to rent it out without a license.  (Printing additional copies yourself  like Greenberg says Netflix does still requires a license, which the copyright holders can charge what they want for).

One reason this is interesting to libraries is that the first sale doctrine is exactly what lets us libraries buy one copy of a book and then loan it out without a license to do that. (While some libraries pay special higher “library prices” for books, it’s never been clear to me if this is legally enforceable either; the first sale doctrine should protect our ability to buy a standard consumer copy and lend it out).

Now the first sale doctrine does NOT apply to software, typically.  Which is one reason why libraries are having so much trouble figuring out how to provide e-books to patrons at any kind of reasonable cost. But I know it applies to books, and I thought it applies to videos, which is why I’m confused about why libraries are willing to pay special higher “library prices” for certain books, and why video rental stores are willing to pay much higher rental store prices for purchase of videos.

The Wikipedia page suggests that indeed the first sale doctrine applies to videos:

No special new copyright protection was given to movies on video and DVD by the two above amendments, and consequently buyers of retail DVDs in the United States are free to sell or exchange them, and rent and lend them to others.

The Wikipedia page also confirms that it does NOT apply, or at least not without confusing restrictions, to “phonorecords” (and audio recordings in general, I think), or computer software. This is mostly because of specific congressional legislation that exempted these categories of things from the first sale doctrine. But not videos. So, still confused.

Very interestingly, the Wikipedia article suggests that libraries are exempted from that exemption, and still have first sale doctrine rights for audio and software. So, wait, maybe libraries could legally buy an e-book (or any other software) and lend it out? Assuming DRM doesn’t get in the way, because violating DRM is illegal in a different way. Phew, these things are confusing; clearly you need a lawyer (which I am not).

I wonder if any libraries are actually having legal counsel investigate this from an aggressive posture.  Probably not, most libraries don’t like to take aggressive legal postures, preferring to just believe whatever a vendor tells them. (Like, that they legally have to pay a high “library price” for a copy?)



Umlaut in Blacklight: Software designed for re-use and extension

Tue, 15/06/2010 - 20:47

One of the goals of Umlaut, for a while now, has been to serve as a piece of back-end library infrastructure, a provider of “known item services” in other web applications.

We’ve had Umlaut integrated in our Horizon OPAC, and Xerxes metasearch, for a while now for those purposes.

Now it’s implemented in our demo (prototype, unfinished, very rough around the edges) Blacklight too.

Here’s an example. Yes, it loads in via AJAX, a little bit slowly. Once it loads in, the parts that come from Umlaut are: the book cover; the “limited excerpts” links (yes, these need some styling); everything over in the left column; and the Amazon “summary”. (Assuming we haven’t redesigned the Umlaut page before you read this.)

Umlaut has, for a while,  provided some javascript libraries to make it easy to embed Umlaut content for a known item on a page. But for the Blacklight integration, I wrote a new one based on JQuery which is, if I say so un-humbly, very slick.

The new helper should make it awfully easy to embed Umlaut content via javascript on any page which: 1) has an OpenURL on it for your known item, which you can target with a JQuery selector, and 2) for which you can add an external Javascript file or three.

For instance, here’s the Javascript file that loads external Umlaut content onto the Blacklight page via AJAX. You have complete control of where each section of Umlaut content goes, with the full power of JQuery selectors.  And lots of callbacks; for instance here I add a spinner to the page, but take it away when Umlaut load is complete.

The Blacklight end

So how hard was it to get that in Blacklight? Not very. Thanks to lots of hard work, Blacklight is becoming better architected for flexible extension. Still a lot to do, but it’s getting there.

First, I had to get an OpenURL link on the page. (This could be in a hidden element or something, but I wanted to show it).   So how do you get an OpenURL from a Blacklight document?  BL comes with a very very basic OpenURL context-object output from Marc already; but it’s really really basic, just barely good enough for a COinS for Zotero, but not good enough for much else, like this.

So I needed a better Marc-to-OpenURL translator. So first I wrote one (not yet released publicly, part of my overall Marc mapping/display plugin that is also not yet released publicly, but if you want it, I could share it; it’s just not documented and test-covered yet, and I’m still trying it by fire).
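To give a feel for what such a translator has to produce, here is a minimal Ruby sketch that builds an OpenURL 1.0 context object in KEV (key/encoded-value) form for a book. This is not the MarcDisplay plugin’s code: the helper name and the input hash keys are invented for illustration, and a real Marc mapping has to cope with far messier data. The KEV keys themselves (ctx_ver, rft_val_fmt, rft.btitle, rft.au, rft.isbn, rft.date, rfr_id) are the standard ones from the OpenURL book format.

```ruby
require "uri"

# Hypothetical helper: build a KEV context object for a book from a plain hash.
# The hash keys (:title, :author, :isbn, :date) are assumptions for this sketch.
def book_openurl_kev(fields, rfr_id: nil)
  pairs = [
    ["ctx_ver", "Z39.88-2004"],
    ["rft_val_fmt", "info:ofi/fmt:kev:mtx:book"]
  ]
  pairs << ["rfr_id", rfr_id] if rfr_id

  # Map our made-up field names onto the standard referent keys,
  # skipping anything that is missing or empty.
  {
    title:  "rft.btitle",
    author: "rft.au",
    isbn:   "rft.isbn",
    date:   "rft.date"
  }.each do |key, kev_key|
    value = fields[key]
    pairs << [kev_key, value] unless value.nil? || value.to_s.empty?
  end

  # Percent-encode keys and values and join them with "&".
  URI.encode_www_form(pairs)
end
```

The result is a query string that can be appended to a resolver base URL, for example `base_url + "/resolve?" + book_openurl_kev({ isbn: "0262510871" })`, where `base_url` stands in for wherever your resolver lives.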

Then adding it in to Blacklight is easy:

    # Setup and register extension to provide better Marc to OpenURL mapping,
    # from the MarcDisplay plugin
    SolrDocument.extension_parameters[:rfr_id] = "info:sid:library.jhu.edu/blacklight"
    SolrDocument.extension_parameters[:self_uri_prefix] = "http://catalog.library.jhu.edu/bib/"
    SolrDocument.use_extension(MarcDisplay::Blacklight::MarcToOpenUrlExtension) do |document|
      document.respond_to?(:to_marc)
    end

Since I’ve hooked it into Blacklight in a standard way, now even the COinS for Zotero will use it to get a better OpenURL context object.

And I can easily use it in my custom view to stick an OpenURL link to Umlaut on the page, using some local config I set in a JHConfig singleton object in an initializer.   (Many BL deployers already use a custom ‘show’ view).

    <% @sidebar_items << capture do %>
      <% if( JHConfig.params[:umlaut_base_url] && document.export_formats.keys.include?(:openurl_ctx_kev)) %>
        <% link_to(JHConfig.params[:umlaut_base_url] + "/resolve?#{document.export_as_openurl_ctx_kev}", :rel=>"nofollow", :class=>"findit_link") do %>
          <img src="<%= JHConfig.params[:umlaut_base_url] %>/images/jhu_findit.gif" alt="Find It @ JH" />
        <% end %>
      <% end %>
    <% end %>

If we later index things in Blacklight that are not Marc, and thus don’t (yet) have an OpenURL export, no problem. No exceptions will be raised, the OpenURL link simply won’t show up on the page, and the subsequent Umlaut AJAX code won’t be triggered. If you want it, all you’ve got to do is write some code to translate the new format to OpenURL, package it in a Blacklight document extension, and add the extension in your initializers. Beautiful!

Once this OpenURL link is on the page, the rest is javascript. Include a few standard support JS files from Umlaut, and include the Javascript file I linked above for loading and mapping from Umlaut to your page DOM. That js file will have to be written for a local app, since it’s about mappings to the DOM that are going to be different for every app, since people like to re-write  their ‘show’ views.  But the JQuery helper makes it pretty straightforward to write the mapping, I think.

Here’s my code in an initializer which ensures the proper JavaScript is injected into the Blacklight page.

    if (JHConfig.params[:umlaut_base_url])
      setup_umlaut = lambda do |controller|
        # include Umlaut JQuery object for updating page with umlaut stuff
        controller.extra_head_content << "<script src=\"#{JHConfig.params[:umlaut_base_url]}/javascripts/jquery/umlaut/update_html.js\" type=\"text/javascript\"></script>"
        # Include Umlaut js object for loading Umlaut js behaviors
        controller.extra_head_content << "<script type=\"text/javascript\" src=\"#{JHConfig.params[:umlaut_base_url]}/js_helper/loader\"></script>"
        # Include our JS to actually update page with jquery stuff
        controller.javascript_includes << "umlaut_include"
        # local CSS for embedded Umlaut content
        controller.stylesheet_links << "umlaut_content"
      end
      CatalogController.before_filter setup_umlaut, :only => :show
    end

(If I had to do it over again, I think I’d put the worker method in an actual method that gets “include’d” into the CatalogController instead of an anonymous lambda, but it works.)

One more part

In our Horizon OPAC, Umlaut-OPAC integration is a two-way street. Not only is Umlaut content embedded in the OPAC (done in Blacklight), but Umlaut _queries_ the OPAC for full text links and physical holdings to put on the Umlaut page. (And in fact it’s kind of circular: the OPAC asks Umlaut for content, Umlaut queries the OPAC and delivers content back to the OPAC, including some from external sources and some actually from the OPAC, possibly from other records.)

That part isn’t done yet; Umlaut needs a good API to query Blacklight. Half of that is the super neato Atom response I added to Blacklight, with alternate format (like Marc) discovery in the feed. The other half is that an external client, like Umlaut, needs a way to specify queries; the queries Umlaut normally uses for that are too complex to fit in BL’s ordinary interface (or even in the prototype Blacklight ‘advanced search’), like “isbn = X OR (author = Y and title = T)”.
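A query of that shape is easy to write down as a CQL string. Here is a minimal Ruby sketch of how a client might assemble one; the helper names are invented for this example, and the index names (isbn, author, title) are simply the ones from the example query above, not necessarily what a real Blacklight CQL configuration would expose.

```ruby
# Quote a term for CQL: wrap it in double quotes, backslash-escaping any
# embedded double quotes or backslashes.
def cql_quote(term)
  '"' + term.to_s.gsub(/["\\]/) { |ch| "\\" + ch } + '"'
end

# Hypothetical helper: build "isbn = X or (author = Y and title = T)".
# The ISBN clause is included only when an ISBN is given; the author/title
# clause only when both are given.
def known_item_cql(isbn: nil, author: nil, title: nil)
  clauses = []
  clauses << "isbn = #{cql_quote(isbn)}" if isbn
  if author && title
    clauses << "(author = #{cql_quote(author)} and title = #{cql_quote(title)})"
  end
  clauses.join(" or ")
end
```

The point of the parentheses is that the boolean grouping survives the trip: the server has to parse the nesting rather than receive a flat list of field/value pairs, which is what an ordinary search form interface can’t express.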

So for that half, I’m writing a CQL plugin for Blacklight that will let CQL queries be given to Blacklight, and result in HTML or Atom or what-have-you results.

Work in progress.



new version of cql-ruby

Tue, 15/06/2010 - 17:00

cql-ruby is a ruby gem for parsing CQL, and serializing parse trees back to CQL, to xCQL, or to a solr query.

A new version has been released, 0.8.0, available from gem update/install. “gem install cql-ruby”.

The new version improves greatly on the #to_solr serialization as a solr query, providing support for translation from more CQL relations than previously, fixing a couple bugs, and making #to_solr raise appropriate exceptions if you try to convert CQL that is not supported for #to_solr. See:
http://cql-ruby.rubyforge.org/svn/trunk/lib/cql_ruby/cql_to_solr.rb

That’s the only change from the previous version, improved #to_solr.

I wrote the improved #to_solr, Chick Markley wrote the original cql-ruby gem, which was a port of the Java CQL parsing code by Mike Taylor. Ain’t open source grand?

The reason I’m working on this is to provide cql query input to Blacklight, so I can have an “api” to Blacklight (cql in, atom out) that can be used by Umlaut. Making that work well in Blacklight takes a few steps beyond what cql-ruby to_solr does, but I’m nearly done with that too, more news as warranted.



LCCN assignment error?

Tue, 15/06/2010 - 01:12

I can’t give you a “deep link” to search results, but go to catalog.loc.gov, choose “guided search”, select “LCCN number” and search for: 48006847

You get two results with the same LCCN, no? Is this an accidental LCCN collision? Is there something I’m not understanding?

The fact that one of them has an ‘a’ on the front does not prevent this from being a collision, because under LCCN normalization rules, those two strings end up being the same LCCN.

Any explanation? Something I’m missing? Anyone know any way to report this to LC?



note to self: more ideas for browse search in solr

Sat, 05/06/2010 - 06:34

Mostly as a note to myself, but I’m sharing it in case it makes any sense to anyone else.

In the back of my mind, I’m continually thinking of how to implement a traditional opac ‘browse search’ in solr. Solr isn’t really quite designed for this. Mostly the back of my mind has been trying to figure out how to do this with the solr features already there.

But late tonight now, I figured, eh, maybe I understand Solr enough to try and dive into the solr code, and get the back of my mind thinking about how to actually hack the feature into solr directly.

Traditional browse search lets you do a ‘starts with’ query on a list of “headings”. Those ‘headings’ generally end up as facet values in most people’s solr implementation.

Ideally, it would improve upon traditional browse search, in letting you do a browse search with “filters”, ie searching through the headings only including headings attached to bibs that have been filtered (bibs in a certain physical library, say).

So there are _several_ logic paths solr can take to do faceting, depending on which facet.method you choose, whether the field is multi-valued or single-valued, possibly your facet.sort, and maybe some other factors.

I figured I’d focus on the path I actually need: facet ‘fc’ method, on multi-valued fields, doing a facet.sort=index, and with a facet.limit set to a positive integer. (And NO facet.prefix set).

The outcome I want? Well, start with the idea of the built-in facet.offset. I want to do something that’s kind of like that, but I don’t know the offset I want yet; I want solr to figure it out for me based on a prefix. Instead of facet.offset, I’m going to give, well, I’m making it up, so let’s call it facet.offset_from_prefix. For facet.offset_from_prefix=X, I want solr to figure out the offset that would put the FIRST facet value beginning with X as the first value in the facet set, or if there is no facet value beginning with X, then whatever facet value is closest to X in alphabetic sort order. Then I want to continue as if this was actually specified as a facet.offset, returning facet values starting from there. AND I want the eventual solr response to the client to _include_ this calculated offset (so the client can page forward and back if it wants).
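A rough sketch of that behaviour in plain Ruby, not Solr’s Java. The HEADINGS list and the offset_from_prefix and browse helpers are made up for illustration, and it assumes “closest to X” means the first value that sorts at or after X:

```ruby
# Hypothetical stand-in for one field's facet values, already in index (sorted) order.
HEADINGS = [
  "Adams, John",
  "Austen, Jane",
  "Borges, Jorge Luis",
  "Calvino, Italo",
  "Twain, Mark"
].freeze

# Offset of the first heading that sorts at or after the prefix.
# Assumption: if nothing sorts at or after it, the offset is the list length.
def offset_from_prefix(sorted_headings, prefix)
  sorted_headings.bsearch_index { |heading| heading >= prefix } || sorted_headings.length
end

# Behaves like facet.offset + facet.limit, but also echoes the calculated
# offset back so a client could page forward and backward from it.
def browse(sorted_headings, prefix, limit)
  offset = offset_from_prefix(sorted_headings, prefix)
  { offset: offset, values: sorted_headings[offset, limit] || [] }
end

puts browse(HEADINGS, "B", 2).inspect
```

Unlike a facet.prefix-style filter, the full list stays intact here: a client could still ask for the page before the returned offset and see the “A” headings.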

For the conditions we set above, I think the control-flow path will lead us to SimpleFacets#getTermCounts, which will get an UninvertedField for our facet field, and then call UninvertedField#getCounts on it.

If we look at UninvertedField#getCounts , an interesting part is the logic for handling facet.prefix. Now, facet.prefix is not what we want, because it changes the overall set of facet values returned. We don’t want to change the overall set, we just want to find the correct _offset_ for a prefix, within the overall unchanged set.

Okay, but look at what facet.prefix does: It FIRST _does_ find exactly the offset we want, by using NumberedTermEnum#skipTo/getTermNumber. Aha, this just showed us how to do what we want to do in solr. (We just don’t want to do the NEXT part of what the facet.prefix handling logic does, which is to reset the overall facet value list’s “0” offset to this found offset).

So we just need to get UninvertedField#getCounts to accept a facet.offset_from_prefix param (and change everything up its calling chain so that’s passed to it from the url params). And then, when such a thing is present, use that NumberedTermEnum logic to get the offset we want, and SET the variable that holds an explicit offset that would have been passed in by the user to this found offset. That’s it, now let the rest of the Solr logic continue as normal. (Perhaps raise an exception of some kind if conflicting params were passed in; for instance, this facet.offset_from_prefix is kind of incompatible with an ordinary facet.prefix.)

Now the facet values returned will be right for our spec. The only thing that remains is figuring out how to _echo back_ the looked-up offset to the client, in the solr response. I have no idea how to do that, but trust there should be a not too hard way to modify SimpleFacets to include an extra xml element or attribute in its responses, which is I think what would need to be done.

So… I totally don’t actually understand what I’m talking about… but I still think I’ve figured out a decent plan.

If anyone actually has any idea what I’m talking about (the intersection between people who understand the solr code, and people who read my blog, may be 0; and on top of that, talking about code in narrative is inevitably confusing, and I’m not sure if this post is actually comprehensible by anyone)…

Does this actually sound like it just might work?

Is there an obvious reason the performance of this will be crap? By basing it on logic already used by SimpleFacets depending on your arguments, I figure it should perform just as well as, well, the equivalent facet.prefix and/or facet.offset queries already would. But if someone who actually understands Solr sees an obvious performance problem, let me know.

While the amount of code that has to be changed is actually fairly minimal, it might affect a buncha classes, since I need to get my new parameters passed all the way down the call chain to the right place, and then get the calculated offset passed all the way back up to make it into a response. Is this going to be a big pain in the butt custom fork/patch version that will be hard to maintain in parity with continuing Solr developments? (Certainly if the implementation of SimpleFacets#getTermCounts or UninvertedField#getCounts ever changes significantly, the patch would have to be entirely rewritten).

Assuming it actually does work, I wonder if there’s any chance of getting a patch like this into mainstream solr.



missing rails api: render_with_format

Thu, 20/05/2010 - 20:15

So in newish versions of Rails, if you say:

render(:partial => "foo")

And your current rendering format is “html”, then rails will look for a template called “_foo.html.[erb|builder]”, and failing that look for “_foo.[erb|builder]”.  If you’re in some other format, then it’ll look for that in place of ‘html’.

So what if you are rendering an ‘atom’ format feed (or really any other XML format), and you’re going to put some html in the atom:summary (for example), and you want to render a partial to do it?  render(:partial => "foo") is not going to find _foo.html.erb, because it’s going to look for _foo.xml.erb, and then _foo.erb, and then throw an exception when it finds nothing.

So one solution is to be explicit in your render call: render(:partial => "foo.html.erb").

That works. But what if _foo.html.erb itself calls OTHER partials, and just uses the shortcut name to call them? Now THOSE calls will raise exceptions.

So you could use the full .html.erb version every single time you use a render anywhere, just in case (never can predict what you someday might want to call from XML). But that’s kind of ugly, and what if you’re writing framework/plugin/library code (like for Blacklight atom generation) you want to be easily callable by everyone else without having to put weird restrictions like that on them?

It seems that Rails render really needs a :format option to force a certain format for that render call. But it sadly does not have one.

But in a few lines of Rails code using internal Rails API (that, yeah, could break in a future version), you can give it one.  With the info on the internal Rails API from James A. Rosen found via google, there are two ways to do it. James’s more flexible way:

    def with_format(format, &block)
      old_format = @template_format
      @template_format = format
      result = block.call
      @template_format = old_format
      return result
    end

An alternate, more constrained-to-the-render-use-case way:

    def render_with_format(hash)
      format = hash.delete(:format)
      original_format = @template_format
      @template_format = format
      begin
        render(hash)
      ensure
        @template_format = original_format
      end
    end
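
To see what the format swap buys you without booting Rails, here is a toy stand-in. FakeView is entirely hypothetical: it only models the @template_format bookkeeping and records the first template name the lookup described above would try. The :nested option is also made up, to imitate a partial that calls another partial by its short name.

```ruby
# Hypothetical stand-in for a Rails view; NOT real Rails code.
class FakeView
  attr_reader :lookups

  def initialize(format)
    @template_format = format
    @lookups = []
  end

  # Records the format-specific template name that would be tried first.
  def render(options)
    @lookups << "_#{options[:partial]}.#{@template_format}.erb"
    # Imitate a partial that renders another partial by short name.
    render(:partial => options[:nested]) if options[:nested]
  end

  # Same shape as the render_with_format idea above.
  def render_with_format(hash)
    format = hash.delete(:format)
    original_format = @template_format
    @template_format = format
    begin
      render(hash)
    ensure
      @template_format = original_format
    end
  end
end

view = FakeView.new(:atom)
view.render_with_format(:partial => "foo", :nested => "bar", :format => :html)
view.render(:partial => "baz")
puts view.lookups.inspect
```

The nested “bar” lookup inherits the forced html format, and the later “baz” lookup is back to atom because the ensure block restored the original format.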

atom syndication spec contradicts itself?

Wed, 19/05/2010 - 22:41

Am I right, or am I misunderstanding? I can’t be the first to have noticed this.

4.1.1.1 … It is advisable that each atom:entry element contain a non-empty atom:title element, a non-empty atom:content element when that element is present, and a non-empty atom:summary element when the entry contains no atom:content element. However, the absence of atom:summary is not an error, and Atom Processors MUST NOT fail to function correctly as a consequence of such an absence.

4.1.2 atom:entry elements MUST contain an atom:summary element in either of the following cases:

So which is it, summary is required when there’s no text or html content, or summary is just recommended but never required?

I hope just recommended, because I’m not sure I have a good summary available.



idle thoughts: timeline visualization in a catalog

Wed, 19/05/2010 - 03:53

So I’ve been thinking for a while about visualizing time distribution in an OPAC view. Things in our catalog generally have a year they were published, or a range of years for a serial; and sometimes are about a particular time period too.

The MIT Simile timeline widget is one way to do this, and the way I’ve heard people think about using. But I can’t figure out how the timeline widget could scale very well to a set of thousands or hundreds of thousands or millions of ‘points’, either visually or technologically. And I’m not sure how flexible its javascript api is for tweaking the way we’d want to customize things for our use case. Simile seems to have a lot of cool features for when you have points in time that are very granular (days or even seconds), which isn’t really our use case here. (Although they do have an example of a somewhat less granular data set. I find that example somewhat klunkier than their front page example though, and it still doesn’t have all that many data points on it.)  But Simile is certainly one option.

Then I noticed that the new google interface has made more prominent its own timeline visualization. This one is a bit more suited for low-granularity data like years (although it will also display high-granularity time data). But…. it’s really pretty clunky. You can click on a division from that timeline to ‘drill down’, but it gets kind of confusing, seems to me, when you do that.  I’m honestly kind of surprised that Google couldn’t/didn’t do better. (Maybe they were trying to absolutely minimize the javascript required?).

(Also, Google seems to mix together date of web page publishing as data point with dates mentioned in the web page as data points, which seems kind of an odd choice, but that’s a different topic, here I’m mostly thinking about interfaces for visualizing a timeline of dates, not how you choose what dates to put on the timeline).

But thinking of how I might duplicate the google-style timeline visualization, I went searching for JQuery plugins (or other javascript libraries) for timeline visualization, that could achieve what google does, more or less.

What I wound up finding was flot.  Which is not for timeline visualization specifically, it’s a general purpose data visualization jQuery plugin. And man is it super neat! Incredibly powerful and flexible, but with a very simple, concise and easy-to-use API, and incredibly slick looking visualizations too. It’s super neat!  (I think a good principle of any kind of API design (or really any kind of system design at all) is that simple things should be simple to do; more complicated things can be more complicated to do, although should still be as simple as you can make them. Flot does well here).

Imagine this type of visualization (seriously, click on that link, it’s pretty sweet)  of catalog timeline data. I like the two linked charts (overview, and zoom-in; similar to the Simile version and what Google kind of sort of klunkily does), and you can make selections in either one (click and drag to make a selection; also drag-panning). And view source to see how amazingly few and simple lines of JS were required to draw that, wow!

Just add some labelled vertical lines (which flot is quite capable of).  Now, when you make a selection, you could get an immediately changed list of bib results in another part of the screen (bottom or side).  And/or, when you mouseover (or click) on a particular year (or range, depending on zoom level), you could get a pop-up window listing the bibs in the time you clicked on.

Totally do-able with flot. Wow, flot is neat!

It’s not entirely clear to me how you’d deal with items that have a range of dates instead of one particular date in that visualization though. (Like a serial, or a book about the 18th century). An ‘item’ with a range instead of a fixed date is one thing that the Simile widget is set up for, but neither the Google version nor any of the flot examples show. But if you can think of how to do it visually, I bet flot is probably flexible enough to let you do it.

Maybe some day I’ll get to play around with that. No day any time soon I don’t think, sadly.  Sometimes I feel like I am continually building the basic boring parts of my systems to a bare level of competence, and just when I think I’ve got that done and can finally start doing some really cool stuff on the platform I’ve built, nope, there’s a different system that I’ve got to work on getting to the level of basic robust competent platform. Oh well, some day.



Federated Search: Users might actually like it

Thu, 13/05/2010 - 22:40

There is LOTS of skepticism toward federated search from librarians and library staff.  And indeed I agree that even the best library-oriented federated search solutions I’ve seen are awfully kludgey in many ways. (By “library-oriented” I mean oriented toward finding citations to (generally scholarly) publications, mostly articles.)

However, I believe that some form of inter-mediated meta-search is necessary to meet certain patron needs we have, and I’ll explain why. But first, some anecdotal verification for my belief.

We’ve deployed Xerxes here at my place of work, a much better interface on top of the Metalib broadcast federated search engine.  The actual Metalib search engine is unchanged, you still get the same results you would from Metalib, no better. But they are presented in a much more usable interface.

Despite these improvements, many librarians here are still highly skeptical of our JHSearch federated search service, and reluctant to show it to users.

But Christina Pikas, despite her reservations, decided to at least mention its existence at a recent library orientation she did for a particular disciplinary unit of researchers.

And shockingly, a few users tried it out, and liked it enough that they, without prompting or solicitation, sent her rave reviews.  One user went so far as to send Christina a screenshot demonstrating how JHSearch found the article he wanted, and got him to fulltext. Another user said, get ready for it, “Much better than google.”

Much better than google? I don’t know about that, but in some contexts, depending on what you’re looking for and what you want to do with it, sure, definitely.

An important point here is that, while librarians might want users to use only native platform interfaces from our licensed databases, they are not going to. They are not going to learn dozens of different (often clunky and confusing) vendor interfaces, and perform multiple searches (on multiple platforms) for every query.  Even sophisticated faculty searchers.  They might learn one (or maybe two) native vendor platforms, that’s typically about it.

So they’re going to go to Google.  Which often works, but has some problems as well. Google (and even Google Scholar) aren’t that great at getting users to licensed fulltext, even when your library does license fulltext for an article the user finds on Google (or Scholar).   Google Scholar (and especially Google) are kind of grab bags of content; they have a lot, but for scholarly research, depending on your search and needs, the kind of things you’re actually looking for may be drowned out by noise, and there may be lots of content (much of which the library has licensed fulltext for) which are not in there at all. It’s hard to say exactly what’s in there, we have no control over or really much information about what’s in Google, and no service agreements with them.

But Google is fast, and easy to use. Then we have our licensed vendor platforms which in some cases are fast and easy to use, in some cases aren’t, but typically offer more powerful searching tools than Google (or broadcast federated search like JHSearch).  But are also multitudinous, requiring a researcher to do multiple searches in multiple interfaces if they want to take full advantage, and they aren’t going to do that.

Then we have the library-provided broadcast federated search. It’s (even in the best implementations I’ve seen) slower and klunkier than Google, but (if you make the interface as good as you can, like with Xerxes), easier to use than the aggregate collection of our multitudinous vendor platforms.  It probably doesn’t cover as much content as if you were to search every single licensed vendor platform (I have not seen any academic federated search deployment that does), but for many (not necessarily all) searches it will offer a better, both more complete and more focused, collection than Google.  And it’s the only option of these three that the library actually has control over, to improve the interface to try to meet local user needs.

Each of these options has pros and cons for the user. I wish we didn’t have to present the user with so many options, and could just give the user a tool that would work in a variety of contexts and needs, but the technological and business environment just doesn’t make that possible right now.  I continue to be of the opinion that the library providing some form of “multi-vendor content search” like broadcast federated search is a crucial tool for us to supply for our users’ search toolboxes.

Now, I continue to be very interested in the “aggregated index” solutions like SerialSolutions Summon and Ex Libris PrimoCentral that are appearing in the academic/scholarly research market.  I think they have a lot of promise to hit most of the benefits of broadcast federated search solutions while reducing a lot of the problems with broadcast federated search solutions.

These aggregated index solutions could very well become a better option than broadcast federated search for meeting this space in the middle of licensed vendor platforms and Google:  An interface under library control, crossing publisher and aggregator vendor boundaries in a single search, but more focused/targeted content for scholarly search than Google, and with better connections to licensed fulltext and other library services (like ILL).

I haven’t had a chance to investigate either of these aggregated index solutions exhaustively, and I’m not sure how they’d realistically stack up against broadcast federated search for an academic institution, but the concept definitely has promise. They are still not going to be able to offer search tools as sophisticated as licensed vendor platforms. Nevertheless, one way or another we need to meet this “middle ground” need, and they have the promise to meet it while improving on the user experience of broadcast federated search like Metalib. We will see.

http://scienceblogs.com/christinaslisrant/

Unicode normalization forms

Thu, 13/05/2010 - 22:16

So I didn’t even know anything about Unicode normalization before I had to learn it to debug my RefWorks issues, but it ends up mattering for a whole bunch of other things not related to RefWorks. An esoteric issue which it actually does pay to know about.

You can check out the official Unicode documentation on Unicode Normalization Forms.

Basically, in any given unicode encoding, say UTF-8 (but equally true for any encoding), there can be several ways to encode any given glyph on the screen.

For instance, a lowercase e with an acute accent can be encoded as a single unicode codepoint for the lowercase e with an acute accent, or can be encoded as two unicode codepoints, a lowercase e followed by a combining diacritic acute accent. It gets even more complicated than this, when you recall that a single latin character can theoretically have multiple diacritics applied to it, there can in fact be more than two ways to encode some glyphs. And then we get to non-Latin alphabets, which have their own “composed” (as few unicode codepoints as possible) or “decomposed” (multiple codepoints in a row) alternatives, which I didn’t even know about until I read the report above.
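
The e-with-acute case is easy to poke at in Ruby. This sketch uses the String#unicode_normalize method built into Ruby 2.2 and later (a different tool from the unicode gem mentioned later in this post); the variable names are just for illustration.

```ruby
composed   = "\u00E9"   # one codepoint: U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # two codepoints: "e" plus U+0301 COMBINING ACUTE ACCENT

# Print each string as a list of codepoints.
[composed, decomposed].each do |string|
  puts string.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
end

# The two look the same on screen but are not equal as strings...
puts composed == decomposed
# ...until both are put into the same normalization form.
puts composed == decomposed.unicode_normalize(:nfc)
```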

Why does it matter?

So anyway, the most obvious place this matters is when you’re comparing two unicode strings to see if they are “the same”.  And the Unicode Normalization Form report above is written mainly in terms of that use case. And that use case matters, for instance, if you are indexing unicode in Solr, and you want a string with one unicode encoding to match in the index a string that is really the ‘same thing’ in another encoding. And there are a variety of possible approaches to do that.

But that’s not what I’m going to talk about.

It turns out that unicode normalization forms seem to matter for display too.  I have found that both Firefox and IE on Windows (at least) will end up displaying decomposed unicode, well, screwily.  For instance, many decomposed forms, if you try to put them in a browser title bar with html <title>, seemed to end up being displayed just as blocks, rather than their proper characters.  In the browser window itself, decomposed unicode forms fared better, but still often seemed to be displayed in a variety of kind of screwy messy ways (diacritics not lining up properly with the letters they applied to, etc.).

Making sure all the unicode was in NFC (“composed”) form before displaying it in the browser seemed to result in significantly better display.

One example

Here’s an example in FF3 on Windows, in some particular font; yeah, it might be different at different fonts and sizes. I have no idea if this is in fact a correct way to write this word according to any system, but it’s how it is in my database: it’s got an “i” which is supposed to have both a horizontal bar AND an acute accent over it, somehow.

NFC form:

Non-normalized decomposed form:

Actually, here they are right in the browser as text, how does your browser display these, is one better than the other?

NFC form: Sharīʻat and ambiguity in South Asian Islam

Non-normalized decomposed: Sharīʻat and ambiguity in South Asian Islam

(Sometimes it gets worse than this too, this is just an example I had at hand. Also in this case, BOTH ways do NOT display correctly in the Firefox title bar, when I try to put the UTF8 in an HTML title, although they display wrong in different ways! Oh well. I guess window title bars have additional limitations or bugs, perhaps OS-level? I have seen other cases where NFC displays correctly in title bar, but non-normalized decomposed does not. And in this case, BOTH display fine in firefox tab title, even though not in the browser window title bar! Go figure. )

W3C Recommendation

And indeed that Unicode report above suggests that:

The W3C Character Model for the World Wide Web, Part II: Normalization [CharNorm] and other W3C Specifications (such as XML 1.0 5th Edition) recommend using Normalization Form C for all content, because this form avoids potential interoperability problems arising from the use of canonically equivalent, yet different, character sequences in document formats on the Web. See the W3C Requirements for String Identity, Matching, and String Indexing [CharReq] for more background.

When to do it?

If you are starting from Marc records in Marc8, and using Marc4J to convert them to UTF-8, they will NOT wind up in NFC by default, they’ll wind up in a decomposed form (which may or may not be Normalization Form D, I’m not sure if it always is, but it generally is).   If there’s any other Marc8 to UTF8 converter around other than the Java one Marc4J uses, I wouldn’t be surprised if it does similar, this is the most obvious (and only reliably round-trippable) way to convert from Marc8 to UTF8, since Marc8’s method of representing non-ascii characters is analogous to “decomposed” unicode.

So, in a Solr discovery layer type application, there are a variety of places you could do the normalization. You could do it at indexing time, before anything is added to the index, so everything that goes into the index is in NFC. Or you could do it in your app, after you pull things out of the index, but before you send them across the HTTP wire to a browser. And there are probably a few different control points in your application you could do this, at different levels.

I decided just doing it as early in the data chain as possible made sense, just get it done at the root, don’t worry about it again. So that’s at the indexing stage.

There’s probably some way to get Solr itself to do this, regardless of what unicode you throw at it. But you’d probably have to make sure you configure every single Solr field in your schema to do that, and you might want to do it differently for indexed vs stored fields (maybe NFKC for indexed vs NFC for stored), and I haven’t quite figured out how that works at the Solr end yet.

How to do it?

Or you can just have your indexing application do it before it feeds things to Solr. If you’re using SolrMarc, then the SolrMarc 2.1.1 release (currently a tag in the svn repo, but not yet a downloadable binary release) offers this option:

marc.unicode_normalize = C

In my case, my incoming Marc is in Marc8, and I’m having SolrMarc translate to UTF8 (via Marc4J), and this flag tells it when it does the translation also apply NFC normalization rules. I’m not entirely sure if that config would still be used by SolrMarc if your incoming Marc were in UTF8 to begin with, but you still wanted to make sure to NFC normalize it before adding to the index.

If you find yourself having to write your own code to do this normalization, it can be done pretty easily in most languages. In modern Java versions (Java 6 and later), there is a built-in class, java.text.Normalizer.  If you are stuck in Java 1.4 as I am for a certain application, there’s icu4j, which I used no problem in my Java 1.4 app. (Also C/C++ libraries available there). In ruby, there’s the ruby unicode gem (which is a C-compiled gem, not sure if it’s based on the icu libraries or not), which I am also using no problem in a Rails app. (For some reason the simple methods I’m using don’t show up in the unicode api docs: Unicode.normalize_C, Unicode.normalize_KC, etc.).
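
For what it’s worth, a minimal sketch of the “NFC for stored, maybe NFKC for indexed” idea in plain Ruby. It assumes Ruby 2.2 or later, where String#unicode_normalize is built in, rather than the unicode gem described above; for_display and for_index are hypothetical helper names, not part of SolrMarc or Solr.

```ruby
# Hypothetical helpers: one form for what gets shown, another for what gets matched.
def for_display(text)
  text.unicode_normalize(:nfc)   # canonical composition only
end

def for_index(text)
  text.unicode_normalize(:nfkc)  # also folds compatibility characters
end

# Decomposed input, the way a Marc8 conversion tends to produce it:
# "i" followed by U+0304 COMBINING MACRON, then U+02BB.
title = "Shari\u0304\u02BBat"

puts title.codepoints.length                    # decomposed length
puts for_display(title).codepoints.length       # one codepoint shorter after NFC
puts for_display(title).unicode_normalized?(:nfc)

# NFKC goes further than NFC: U+FB01 (the "fi" ligature) becomes plain "fi".
puts for_index("\uFB01n")
```

NFKC throws away distinctions like ligatures, which is why it is plausible for matching but not something you would want applied to the text you display.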



More refworks diacritics

Tue, 11/05/2010 - 21:12

I previously reported that even though it’s not documented anywhere and RefWorks support couldn’t tell me it, RefWorks had a problem importing UTF-8 including “decomposed” characters, and the solution was to apply Unicode Normalized Form C to your data before sending to RefWorks.

But it gets trickier. I can’t even begin to imagine how this can possibly be so, but my experiments seem to indicate…. while Refworks is normally fine with having an import callback URL be an https url….   for certain UTF-8 data (but not all UTF-8 data), if you normalize it first with NFC, then send it to Refworks import… it works if your callback URL is http, but not if your callback URL is https.

I seriously can’t even begin to imagine why this should make any difference to the RefWorks software.  I am not happy with RefWorks right now.

Here’s the letter I just sent to RefWorks:

I determined, as you recall, that taking my UTF-8 data and applying Unicode Normalizing Form C to it generally makes it import properly.

Also, in general Refworks import can normally accept a ‘callback url’ that is https.

However, for some reason _certain_ (but not all UTF-8) data, even when put in NFC form, still causes Refworks to error — but only when the callback is https, not when it’s http.

I can think of no good reason https vs http ought to make any difference at all on your end.

Compare:

Works fine:

http://www.refworks.com/express/expressimport.asp?vendor=JH%20Libraries&filter=MARC%20Format&encoding=65001&url=http%3A%2F%2Fblacklight.mse.jhu.edu%2Fsamplemarc%2Frefworks-error-c.txt

Produces Refworks error:

http://www.refworks.com/express/expressimport.asp?vendor=JH%20Libraries&filter=MARC%20Format&encoding=65001&url=https%3A%2F%2Fblacklight.mse.jhu.edu%2Fsamplemarc%2Frefworks-error-c.txt

Other imports without diacritics work fine.

I know you guys in support can do nothing about this, but these undocumented RefWorks bugs are becoming increasingly frustrating and time-consuming for me, this is really unfortunate, and significantly increases our Total Cost of Ownership for Refworks.
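
In case anyone wants to reproduce the comparison, here is a small sketch of how the two import URLs in the letter above can be built so that the callback URL is fully percent-encoded. The base URL and parameter values are copied from those URLs; the refworks_import_url helper itself is made up for illustration and is not anything RefWorks provides.

```ruby
require "uri"

# Copied from the import URLs quoted in the letter.
REFWORKS_IMPORT = "http://www.refworks.com/express/expressimport.asp"

# Hypothetical helper: builds an import URL for a given callback URL.
def refworks_import_url(callback_url)
  query = URI.encode_www_form(
    "vendor"   => "JH Libraries",
    "filter"   => "MARC Format",
    "encoding" => "65001",
    "url"      => callback_url
  )
  # encode_www_form writes spaces as "+"; switch to %20 to match the URLs above.
  # (A literal "+" in a value would already have been encoded as %2B.)
  "#{REFWORKS_IMPORT}?#{query.gsub('+', '%20')}"
end

sample = "blacklight.mse.jhu.edu/samplemarc/refworks-error-c.txt"
puts refworks_import_url("http://#{sample}")
puts refworks_import_url("https://#{sample}")
```

The only difference between the two generated URLs is the scheme inside the encoded url parameter, which is the whole point of the test case.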



google accounts driving me crazy

Thu, 06/05/2010 - 02:25

My google account(s) have been driving me crazy for a while. I think I found a hint as to why:

Why am I seeing the message “Oops. A calendar already exists…” when I access Google Calendar?

If you’ve recently signed up for Google Apps with the same email address associated with your Google Calendar, you’ll be redirected to an error message page when accessing http://www.google.com/calendar

It’s not possible to have a Google Apps Account and a non-hosted Google Account both using the same email address. Don’t fret though, your non-hosted Google Calendar hasn’t disappeared forever. You’ll just need to make some changes in order to access your Google Calendar.

To gain access to your non-hosted Google Calendar, go to the Google Apps for Administrators Help Center here and follow the steps to change the email address associated with your Google Account to something not containing your domain name or an existing Google Account. Once you’ve changed the email address, you should be able to access both your Google Apps Calendar and your non-hosted Google Calendar normally.

Okay, the problem statement matches me. But I don’t understand their solution at all. They want me to change the email address of…. which of my Google Accounts? I think the trick is I’ve somehow ended up with TWO Google Accounts, but both with the same email addr username. Which one I’m logged in to at any given time is anyone’s guess.  The help center link just links to the top-level help page, I have no idea what instructions they mean in particular.

Man, Google, you are awful sometimes.




Notes on the concept of “preferred access point”

Tue, 04/05/2010 - 15:33

On the NGC4Lib listserv, Neil Godfrey writes, and I respond:

Apols if I have missed any previous explanation of this, but I am wondering what reason/s lie behind RDA continuing the concept of “a preferred acces point” in cases of multiple authors for a work.

Is the reason primarily to accommodate the contingencies of MARC-based cataloguing? Are there other reasons such as data exchange and identification or other?

What difference/s does “a preferred access point” make in online databases and user interfaces?

This is just my interpretation….

I think “preferred access point” is a really bad choice of term. What they should have called this is “Citation Heading” or something like that, something involving the word “citation” or perhaps “reference”.

Its purpose is so you can “cite” or “reference” a different record in (for example) a 700 name-title. In order to do that traditionally, you would use the “main entry” heading. You need to be able to put a certain string in that field that can unambiguously identify a referenced/cited record. This is the purpose of the “preferred access point”, and the only purpose I can see.

You could think of it almost like a modern “foreign key” to relate one record to another.

Now, in 2010, the _better_ way to do this kind of “citation” or “reference” is with an actual controlled identifier (an accession number, a URI, etc). (You know, more like the way “foreign keys” actually work).

I wish that RDA made it clear that this is _preferable_, and allowed you to use _only_ an identifier when available. I am not sure if it does. But even if it did, I think it is a good idea to — as our legacy practices always have — allow this kind of reference/citation using a controlled “heading” instead of an actual modern identifier, for backwards compatibility purposes if nothing else, but I’m not sure it doesn’t have other utility as well.



problems importing diacritics into RefWorks

Wed, 28/04/2010 - 15:59

So I am in an ongoing war with RefWorks (the software, and a little bit the people) to get non-ascii chars (in this case, simply latin alphabet with diacritics) imported successfully into Refworks.

I am using the RefWorks import filters, and using the RefWorks “marc” filter, which doesn’t actually take marc, but takes a weird unique-to-refworks marc-in-plain-text format.  But which seems to work — until you get to diacritics.  I have this set up to export from my catalog to RefWorks.

UPDATE: Solution/answer at bottom of post.

I am curious:

  1. Has anyone else had a problem with diacritics in export to RW?
  2. Has anyone found a solution, or discovered anything more about the problem than RefWorks Support is able to tell me (which is pretty much nothing)?

Our story so far

So for at least a year my users have been complaining about this. And sometimes I could reproduce their problem and sometimes I couldn’t. And when I could, sometimes I’d report it to RefWorks.

And RefWorks would tell me “Your data is not in UTF-8, we only support UTF-8”.

I was suspicious of this — I thought my data was in UTF-8. But you know how confusing debugging char encodings is, I didn’t really have time for it, so I let it be.

The chase is on

But my users were getting more and more restless about this; it is a serious problem for them. And recently they were able to provide me with two clear reproducible test cases: one in which diacritics are messed up by RefWorks upon import (they become detached and free-floating, instead of being above the chars they should be above); and another in which RefWorks refused to do the import at all, producing an error message instead.

Example one: RefWorks imports diacritics improperly

My marc-in-plaintext file which I believe to be in UTF-8:

https://catalog.library.jhu.edu/mods/?format=marc&bib=2663421

The RefWorks import URL referencing this URL:

http://www.refworks.com/express/ExpressImport.asp?vendor=Johns%20Hopkins%20University&filter=MARC%20Format&encoding=60051&url=https://catalog.library.jhu.edu/mods/?format=marc%26bib=2663421

Example two: RefWorks produces error

My marc-in-plaintext file which I believe to be in UTF-8:

https://catalog.library.jhu.edu/mods/?format=marc&bib=1144347

The RefWorks import URL referencing this URL:

http://www.refworks.com/express/ExpressImport.asp?vendor=Johns%20Hopkins%20University&filter=MARC%20Format&encoding=60051&url=https://catalog.library.jhu.edu/mods/?format=marc%26bib=1144347
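For anyone trying to reproduce these, here is a rough sketch of how import URLs like the two above can be assembled in Python. The query parameters (vendor, filter, encoding, url) are copied from the URLs in this post; the function name and the choice of which characters to leave unescaped in the nested URL are my own assumptions, inferred from the examples rather than from any RefWorks documentation.

```python
from urllib.parse import quote

# Base endpoint as it appears in the example URLs above.
REFWORKS_IMPORT = "http://www.refworks.com/express/ExpressImport.asp"

def refworks_import_url(source_url,
                        vendor="Johns Hopkins University",
                        filter_name="MARC Format",
                        encoding="60051"):
    """Build an import URL pointing RefWorks at source_url (hypothetical helper)."""
    # The nested URL keeps ':', '/', '?' and '=' as-is, but its '&' must become
    # '%26' so it is not read as a separator in the outer query string.
    nested = quote(source_url, safe=":/?=")
    return (f"{REFWORKS_IMPORT}?vendor={quote(vendor)}"
            f"&filter={quote(filter_name)}"
            f"&encoding={encoding}&url={nested}")
```

The important detail is the `%26`: without it, the `bib=` parameter would be swallowed by the outer URL and never reach my catalog.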

Investigation

So I spent some time (with the invaluable help of dbs, gmcharlt and others on the #code4lib IRC) investigating these source files byte by byte, to make sure they were valid UTF-8 representing what they should represent.

And they seem to be to me — as far as I can tell, these are perfectly valid UTF-8 representing what it should represent.

Now, they do use unicode combining diacritics — they are in ‘decomposed’ form. More on that later. But that’s perfectly legal UTF-8.
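The kind of byte-by-byte check described above can be approximated with a few lines of Python. This is only a sketch of the idea, not the exact procedure used on IRC; the function name is made up. It decodes strictly (so invalid UTF-8 raises an error) and then lists every non-ASCII code point with its bytes and whether it is a combining mark.

```python
import unicodedata

def inspect_utf8(raw: bytes):
    """List non-ASCII code points in raw, which must be valid UTF-8."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    report = []
    for ch in text:
        if ord(ch) > 0x7F:
            report.append((
                f"U+{ord(ch):04X}",                 # code point
                unicodedata.name(ch, "UNKNOWN"),    # Unicode character name
                ch.encode("utf-8").hex(),           # the bytes for this code point
                unicodedata.combining(ch) != 0,     # True for combining marks
            ))
    return report
```

If the last column is True for the diacritics, the text is using combining characters, that is, it is at least partly in decomposed form.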

If anyone reading this can look at these source files and find any problem with them, let me know.

Those files are also returned by my server with proper HTTP headers indicating they are UTF-8 (this  is right, right?):

Content-Type: text/plain; charset=UTF-8
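As a sanity check on that header, Python's standard library can parse the charset parameter out of a Content-Type value. This is just a sketch with a made-up helper name; it parses a header string you already have and does not fetch anything from the server.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the declared charset (lowercased), or None if none is declared."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()
```

For the header above this gives "utf-8", which is what a well-behaved client should be using to decode the body.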

Additionally, I checked in our demo blacklight instance, which has completely different logic implemented by different code for going from our Marc8-encoded Marc to UTF-8 encoded refworks marc-as-plain-text.  And RefWorks has the identical problem with the export from our demo blacklight.

Sally forth

So I prepared another email to RefWorks support, this time insisting that my files were UTF-8. My email included hexadecimal representations of bytes, and the Unicode code points those bytes mapped to in UTF-8.   My email was fairly concise, but I wanted to provide them with technical details they couldn’t simply dismiss with “your data is not UTF-8”.

And at first it worked: RefWorks support “escalated” my issue, and eventually gave me an answer. But it became clear they hadn’t tested or tried that answer themselves at all; they were just pulling answers out of a hat.

They told me that instead of using the RefWorks “Marc Format” import filter, I should use the RefWorks “Marc Format (UTF-8)” import filter. Which at first made a certain amount of sense, except for the fact that the RefWorks import URL already included “&encoding=60051”, which is documented to mean UTF-8 in the first place. And the fact that for a year they’d been telling me the “Marc Format” filter already (and only) supported UTF-8.

But still, of course, I tried it.  It did not help. The record that produced a RefWorks error message still produced an error message. The record with messed up diacritics still had messed up diacritics, but now also had the wrong information in the RefWorks fields. (I suspect the “Marc Format (UTF-8)”  filter assumes some European Marc format, rather than Marc21 — something I asked them before trying it, but they insisted it used the same marc-field-to-refworks-field mapping as “Marc Format”, which turned out not to be true.)

So I reported back to RefWorks that this didn’t work.

Guess what their response was?  They went back to telling me my data was not UTF-8.  Now they are suggesting my data is really in ISO 8859-1, and I need to convert it to UTF-8 if I want it to work with their software.

I’d be happy to convert it to UTF-8, except as far as I can tell, it already is! It is not ISO 8859-1.   If there is a problem with its UTF-8 encoding that I have not figured out (which is quite possible, and if you see one please let me know), they need to actually tell me what it is, not just keep insisting my data is not in UTF-8 and needs to be. I have spent quite a bit of time trying to confirm that my data really is UTF-8, and believe I have done so. As far as I can tell, they have spent little time doing anything but suggesting solutions to me they didn’t even try themselves first and just pulled out of a hat, and repeating their default “your data is not UTF-8” claim.
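The ISO 8859-1 claim is easy to test in principle, because accented characters encoded as ISO 8859-1 are almost never valid UTF-8 byte sequences. A minimal sketch (the helper name is made up, and the sample strings are illustrative, not taken from my records):

```python
def looks_like_utf8(raw: bytes) -> bool:
    """True if raw decodes cleanly under a strict UTF-8 decoder."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

latin1_bytes = "\u00e9".encode("iso-8859-1")   # b'\xe9', a lone high byte
utf8_composed = "\u00e9".encode("utf-8")       # b'\xc3\xa9'
utf8_decomposed = "e\u0301".encode("utf-8")    # b'e\xcc\x81'
```

A strict decode succeeding does not prove the text says what you meant, and pure ASCII passes either way, but data with diacritics that decodes cleanly is very unlikely to be ISO 8859-1 in disguise.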

My suspicion

Now, here’s my suspicion. I believe my data is valid UTF-8.  But it uses combining diacritics, it’s in “decomposed” form.  I have a hunch that the RefWorks software can’t handle this, it requires composed normalized form. The particular way the diacritics are messed up kind of suggests this (the diacritics on import become free-standing punctuation AFTER the char they are supposed to be over).
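To show what I mean by composed versus decomposed, here is a small Python sketch using the standard `unicodedata` module. The sample word is just an illustration, and the "naive" function is my guess at the kind of code-point-at-a-time handling that would produce the symptom, not anything I know about RefWorks internals.

```python
import unicodedata

# "México" with the accent as a separate combining character (decomposed).
decomposed = "Me\u0301xico"
# The same text with e + combining acute folded into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

def naive_chars(text):
    """Treat every code point as its own character, as naive software might."""
    return list(text)
```

Both strings should render identically, but the decomposed one has one more code point, and the naive view sees the accent as a separate item that comes AFTER the letter it belongs to.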

Even if this hunch is true, I have no idea if that would also fix the problem with the record RefWorks simply produces an error message for.

The thing is, for local weird reasons, it’s harder to change my software to do this than it ought to be. (It’s open source software in Java written by someone at another institution that I inherited when I started my job here; I don’t believe I have a copy of the source, just the .jar.  The source is probably floating around out there, cause others have used it, but doesn’t appear to be on the public web currently).

So I really don’t want to embark on that non-trivial task until I get confirmation from RefWorks of what their specifications are, so I can meet them. That’s all I ask. But it’s pretty clear RefWorks does not know the specifications/requirements of their software. Okay, so, I believe, they now have to figure them out. It’s what we pay them for, right?

My frustration

Character encoding issues are really complicated to deal with.  Char encoding debugging is definitely the most challenging, frustrating, brain-twisting sort of debugging I ever have to do.  But that’s how it goes, it still has to be done sometimes.

So I don’t blame RefWorks for finding them confusing too. My frustration is that RefWorks doesn’t seem to agree it’s their responsibility to figure them out. If they’re so confusing, then they need to give us customers clear specifications/requirements, so we can work on meeting them — instead of leaving each customer individually to “reinvent the wheel” of trying to reverse engineer RefWorks to figure out their specifications without having access to the source.   This is what we pay RefWorks for, providing support, right?

Of course, I guess they think they’ve done this, and their specs are “UTF-8”. The problem is, I have data I’ve spent significant time analyzing to be sure it’s UTF-8, and I am as sure as I can be, and they just keep insisting it’s not. The ball is in their court.

Next steps

So in addition to waiting to see what RW says next (I am not optimistic), I  might try individually translating those two files to UTF-8 normalized composed form, and seeing if it fixes the issues with one or both of them. And if it does, I guess I have to attack the non-trivial task of recompiling my software to do this normalization.  But it would be frustrating because I still won’t know if my software meets their software’s requirements, because they can’t tell me what those are, there might be other problems waiting to arise too.
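Translating an individual file to composed form does not need much code. Here is a sketch of one way to do it in Python; the function name and file paths are hypothetical, and it assumes the input really is valid UTF-8 (it will raise an error otherwise).

```python
import unicodedata
from pathlib import Path

def normalize_file(src, dst, form="NFC"):
    """Read UTF-8 text from src, normalize it, and write UTF-8 to dst."""
    # Work on bytes so no platform-default encoding or newline translation sneaks in.
    text = Path(src).read_bytes().decode("utf-8")
    normalized = unicodedata.normalize(form, text)
    Path(dst).write_bytes(normalized.encode("utf-8"))
```

Usage would be something like `normalize_file("record.txt", "record-nfc.txt")`, then pointing the import at the normalized copy.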

Solution!!

Updated noon EST.  I just sent this email to RefWorks support:

####

Okay, I think I’ve actually figured this out.

My data was indeed legal and valid UTF-8. However, there are a variety of forms UTF-8 may come in. (See http://unicode.org/reports/tr15/).

It looks like RefWorks can only handle UTF-8 in “KC” normalized form.   When I manually translated my two test files to UTF-8 KC normalized form, RefWorks handles them properly:

http://www.refworks.com/express/ExpressImport.asp?vendor=Johns%20Hopkins%20University&filter=MARC%20Format&encoding=60051&url=http://testjr.mse.jhu.edu/refworks-error-kc.txt

http://www.refworks.com/express/ExpressImport.asp?vendor=Johns%20Hopkins%20University&filter=MARC%20Format&encoding=60051&url=http://testjr.mse.jhu.edu/refworks-improper-kc.txt

Those were translated manually; fixing my software to do this automatically for all RefWorks exports will be more work. But at least now I know what I need to do.

I strongly encourage you to actually document this RefWorks requirement, and let other people know about it when they report oddities in RefWorks UTF-8 imports.   I have spent quite a few hours confusingly figuring this out since I first reported the issue over a year ago — would be nice to save others the time and just tell them.

####

(After reading up more on unicode normalization, I suspect “C” normalization might make RefWorks happy too, and be less invasive/lossy than KC normalization. I’ll try to test that soon too.)

Final(?) update: I have confirmed that just “C” normalization keeps RefWorks happy, at least for my two test records; no need for possibly lossy “KC” normalization. Of course, it may be that for my particular test records at present, KC and C are identical. But I’ve spent enough time on this for now. “C” seems a better bet, unless we have evidence or specs from RefWorks (ha!) to the contrary.
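Why “KC” is the lossier of the two can be seen with a couple of characters. This is a sketch using Python's `unicodedata`; the sample strings are illustrative and not from my test records, and the helper name is made up.

```python
import unicodedata

def compare_forms(text):
    """Return (NFC, NFKC) normalizations of text."""
    return (unicodedata.normalize("NFC", text),
            unicodedata.normalize("NFKC", text))

# A plain letter + combining accent: both forms compose it the same way.
accent = compare_forms("e\u0301")
# The 'fi' ligature (U+FB01): NFC keeps it, NFKC replaces it with "fi".
ligature = compare_forms("\ufb01")
# Superscript two (U+00B2): NFC keeps it, NFKC flattens it to a plain "2".
superscript = compare_forms("x\u00b2")
```

For ordinary Latin letters with diacritics the two forms come out the same, which fits with my two test records behaving identically. The difference only shows up for "compatibility" characters like ligatures and superscripts, which KC rewrites and C leaves alone.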

It would make a lot more sense if RefWorks would accept any UTF-8, but do “C” or “KC” normalization itself on the receiving end if it needs it, but I do not expect sense.


Filed under: General