LIS, stranieri
Notes FRBR WEMI entities, physicality, interchangeability, merging
Originally coming out of a discussion on Code4lib and RDA-L between me, Karen Coyle, and others (too many others). Response rewritten for this forum.
Please keep in mind that of the Work, Expression, Manifestation, Item entity model, it’s really only Item that is an actual physical thing. All the others are abstract things, which I continue to believe are most easily thought of as sets of the things “below” them.
In traditional library cataloging, Items of the same Manifestation are considered interchangeable for our patrons. This is why we generally do not catalog below the Manifestation level, at the Item level. On the other hand, in some circumstances for rare books catalogers, Items of the same Manifestation are NOT interchangeable for their users, and this is why (I am led to believe) rare books catalogers sometimes DO catalog at the Item level.
Compare to an Amazon book page. When you look at an Amazon web page for “a book”, they REALLY mean they have dozens, hundreds or thousands of actual physical books in some warehouses somewhere they can sell you. But there isn’t a page for each Item. Amazon too considers Items of the same Manifestation generally interchangeable for its users; you don’t generally get to pick WHICH Item sitting in a warehouse you get. The typical Amazon page basically represents a Manifestation, as the typical traditional library catalog page does.
(But if you ordered a book on Amazon and they sent you the book translated into a different language, or in digital form on an Apple II 5.25″ floppy disk, or in the 1986 first edition when you ordered the 2009 fifth, you’d be pissed! All those things might be in the same Work or Expression, but for Amazon customers and for library users, Items from the same Work or Expression are not necessarily interchangeable.)
So you can kind of get away with considering a Manifestation to be a real thing and talking about something “being” a manifestation, but it’s always good to remember that even a Manifestation is really an abstract set composed of a bunch of items.
But if you start thinking that an item in your hand can “be” a Work or Expression without “being” a Manifestation (and Item!) too, you are setting yourself up for a lot of confusion. You can’t have a Work or Expression (or, technically, a Manifestation) without having all the things below it, all the way down to Item.
Okay, so maybe there ARE a few edge case exceptions. Someone in a listserv conversation once suggested that if there’s a movie that is in pre-production, but nothing’s been filmed yet, maybe even a script hasn’t been written yet, maybe that Work exists, even though no Expressions, Manifestations, or Items of it do yet. Maybe an IMDB page for a pre-production movie represents a Work for which there are no E M or I. Could be useful to model things that way, sure, why not. But bear in mind that even in that weird case, you don’t have any item in your hand! The movie that hasn’t been made yet is a purely conceptual non-physical thing, so okay maybe it’s “just” a Work. As soon as you have something in your hands (a script, a daily DVD), you’ve got an Item. Which belongs to a Manifestation set, which belongs to an Expression set, which belongs to a Work Set.
That’s the way the FRBR WEMI entity model is intended, and that’s the only useful way I can figure out to think about it. If you forget this in your modelling, I think you wind up with poorly modelled unuseful data.
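To make the "sets of the things below them" reading concrete, here is a minimal Python sketch. The class names mirror the four WEMI entities, but everything else (the attribute names, the `wemi` and `items_of` helpers, the sample titles and barcodes) is invented for illustration and is not taken from FRBR or from any real system.

```python
class Work:
    def __init__(self, title):
        self.title = title
        self.expressions = []          # the set of Expressions "below" this Work


class Expression:
    def __init__(self, work, label):
        self.work = work
        self.label = label
        self.manifestations = []
        work.expressions.append(self)  # creating an Expression adds it to its Work


class Manifestation:
    def __init__(self, expression, label):
        self.expression = expression
        self.label = label
        self.items = []
        expression.manifestations.append(self)


class Item:
    """The only entity that stands for a physical thing in your hand."""

    def __init__(self, manifestation, barcode):
        # An Item cannot be created without a Manifestation, which in turn
        # cannot be created without an Expression and a Work.
        self.manifestation = manifestation
        self.barcode = barcode
        manifestation.items.append(self)

    def wemi(self):
        """Return the full (Work, Expression, Manifestation, Item) chain."""
        m = self.manifestation
        return (m.expression.work, m.expression, m, self)


def items_of(work):
    """Flatten a Work down to every physical Item that belongs to it."""
    return [item
            for e in work.expressions
            for m in e.manifestations
            for item in m.items]


# Hypothetical sample data.
novel = Work("A hypothetical novel")
english = Expression(novel, "English text")
fifth = Manifestation(english, "2009 fifth edition")
copy1 = Item(fifth, "barcode-0001")
copy2 = Item(fifth, "barcode-0002")

# The pre-production movie edge case: a Work with nothing below it yet.
unmade_film = Work("A hypothetical film in pre-production")
```

The point of the constructor signatures is that you can build a Work with nothing under it (the pre-production movie), but you cannot build an Item without the whole chain above it.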
In a conversation with Karen Coyle on the lists, I think perhaps she was forgetting or getting confused about this, and in the process accurately recognizing that if you DO get confused about this, you wind up with data that’s tricky to use and reconcile and merge with other data. My response, edited for this forum.
Karen Coyle wrote:
I think this becomes a question of how we express WEMI — you can always link from/to any WEMI using “contains” or “contained in” — so you can always link to all of the Works in an aggregate. What I would like to achieve is for different decisions (like one community calling the aggregate a Work/Expression and another focusing on the individual works and linking those to a Manifestation) to not create incompatible data.
Keep in mind that EVERY item-in-hand MUST be a Manifestation. At least this is my interpretation of FRBR.
If you have a bound volume that’s an “aggregate”, it HAS to be a manifestation. (as I argued above in this blog post)
So there’s no way to “call an aggregate a Work/Expression” instead of a manifestation, if that aggregate is an actual physical item in your hand.
You’ve got a manifestation whether you like it or not. The question is how much “authority work” are you going to do on identifying the Expression and Work it belongs to. If you don’t do much because it doesn’t make sense for you to do so, maybe it starts out modelled as a manifestation just belonging to a “dummy” Expression/Work that contains only that Manifestation. Some other cataloger somewhere else does the “authority” work to flesh out an Expression and/or Work that maybe contains multiple manifestations or maybe doesn’t. Is your data incompatible? Not really, it can be merged simply by recognizing that your “dummy” Expression/Work can be merged into their more fleshed out one.
There’s also a question of how much “authority work” you want to do on the _contents_ of the aggregate. Maybe you don’t want to spend any time on that “analytical” task at all, and your record does not reveal that the item in your hand IS an aggregate; it does not actually expose relationships to the other Works/Expressions contained within. It might have a transcribed table of contents as an attribute only, not as a relationship to other entities. Later some other cataloger fleshes that out. Here too, that other cataloger’s extra work can be (conceptually at least) easily “merged in” to your record; there is no incompatibility.
[If two different catalogers/communities decide that two different Works contain _different_ manifestations, and violently disagree, then THAT's an incompatibility that's harder to resolve and is a legitimate concern. But that's not what we have in this example, which is quite straightforward.]
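A rough Python sketch of that merge, under some loud assumptions: each catalog is just a dict from a Work identifier to the set of Manifestation identifiers it contains, and a placeholder Work is marked by a made-up `dummy:` prefix. Neither convention comes from FRBR or any real system. A dummy Work folds into whichever fleshed-out Work already claims its Manifestation; two fleshed-out Works claiming the same Manifestation get reported as the harder kind of conflict.

```python
def merge_works(mine, theirs):
    """Merge two catalogs of the form {work_id: set_of_manifestation_ids}.

    Returns (merged, conflicts). Work ids starting with "dummy:" are
    placeholders that carry no authority work (a hypothetical convention).
    """
    merged = {work: set(manifs) for work, manifs in theirs.items()}
    conflicts = []
    for work_id, manifs in mine.items():
        # Which of their Works already claim any of these Manifestations?
        homes = {w for w, ms in theirs.items() if ms & manifs}
        if work_id.startswith("dummy:"):
            if len(homes) == 1:
                merged[homes.pop()] |= manifs       # fold the dummy into theirs
            elif not homes:
                merged[work_id] = set(manifs)       # nobody has done the work yet
            else:
                conflicts.append((work_id, sorted(homes)))
        else:
            rivals = sorted(homes - {work_id})
            if rivals:
                # Two real Works disagree about membership: needs a human.
                conflicts.append((work_id, rivals))
            else:
                merged.setdefault(work_id, set()).update(manifs)
    return merged, conflicts


# Hypothetical data: my minimal record versus someone else's fuller one.
easy = merge_works({"dummy:1": {"m1"}}, {"work:A": {"m1", "m2"}})
hard = merge_works({"work:B": {"m2"}}, {"work:A": {"m1", "m2"}})
```

In the first call the dummy simply disappears into `work:A`; in the second, `work:B` and `work:A` both claim `m2`, which is the legitimate incompatibility.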
I’ve had this ill-formed notion for a while that we shouldn’t actually be creating WEMI as “things” — that to do so locks us into a record model and we get right back into some of the problems that we have today in terms of exchanging records with anyone who doesn’t do things exactly our way. WEMI to me should be relationships, not structures. If one community wants to gather them together for a particular display, that shouldn’t require that their data only express that structure. I’m not sure FRBR supports this.
sound vague? it is — I wish I could provide something more concrete, but that’s what I’m struggling with.
While to some extent I sympathize with your inchoate thoughts about modelling WEMI being a mistake, and we’ve talked about that before — ultimately I still disagree. It is appropriate to use an entity-relation-attribute model to come up with the kind of explicit and formal model of our data that we both agree we need. It’s a conventional, mature, and well-tested modelling approach (I wouldn’t want to pin all our eggs to RDF experimentation that at least arguably does not rely on an entity model).
You can’t have an entity model without entities. The FRBR WMI (and more debatably E) entities are the ones that clearly come out of a formalization of our 100 year tradition of cataloging, meaning there’s probably something to them AND that using them makes retroactively applying the model to our 100 years’ worth of legacy data more feasible (and BOTH of those facts are totally legitimate grounds for decision making. And the decision has already been made too, although in the case of FRAD I’d still be reluctant to accept it as a “done deal”, but in the case of FRBR, it is much better done, a much more useful and accurate abstraction of our cataloging tradition).
But you’re right that neither Work, Expression or Manifestation are “things” if you mean physical things. They are abstract things, they are sets of physical things, that it is useful for us to model so we can say things about them (including but not limited to which physical things are a member of them). It’s often useful to say things about things that aren’t physical things you can hold in your hand too.
If ALL you have are assertions about Manifestations (or worse, Items!), then you’re going to end up duplicating a lot of assertions (see, I’m avoiding talking about records!) to assert something about every Manifestation that belongs to the same Work, when your assertion is REALLY about the Work. A certain movie is a film adaptation of a certain Work. Do you really need to make a bunch of RDF triples asserting that it’s a film adaptation of EVERY Manifestation (or every single Item, every copy on someone’s shelves!) that exists of that Work? No, and it’s not even true: it’s not necessarily an adaptation of any particular edition/Manifestation (or if it is, you might know which one), it’s an adaptation of the Work.
We model the Work as an entity so we can make assertions about it, whether in records or in free-floating RDF assertion fantasy land. We can assert once that a film is an adaptation of a work, and we can assert that a bunch of manifestations/editions are all manifestations of that work (belong to that work-set), and then we can know that all of those Manifestations belong to the Work that was adapted into that film.
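Here is a small Python sketch of that "assert once, infer for all" idea. The triples are plain tuples, and the identifiers and the predicate names (`adaptationOf`, `manifestationOf`) are made up for the example; they are not real RDF vocabulary terms. The Expression level is also skipped to keep it short.

```python
# Hypothetical assertions: one statement about the Work, one per Manifestation.
triples = {
    ("film:1", "adaptationOf", "work:1"),
    ("manif:a", "manifestationOf", "work:1"),
    ("manif:b", "manifestationOf", "work:1"),
}


def films_adapted_from_work_of(manifestation, triples):
    """Find films adapted from the Work that a given Manifestation belongs to."""
    works = {o for s, p, o in triples
             if s == manifestation and p == "manifestationOf"}
    return sorted(s for s, p, o in triples
                  if p == "adaptationOf" and o in works)


adaptations = films_adapted_from_work_of("manif:b", triples)
```

The adaptation is stated exactly once, against `work:1`, and a new edition only needs its own single `manifestationOf` statement to pick it up.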
Filed under: General
Under-utilized marc field hall of fame: 043
Every once in a while I am reminded of the 043 marc field, and fantasize about using it in an interface some day.
It includes coded (controlled) information about the geographical topic of the item cataloged.
It appears in a surprisingly large number of records in many of our corpuses, even though hardly any of our systems do anything at all with it; seems like it could potentially be really useful, yeah?
Hey, the code list for marc geographic codes is actually (in rare form) provided in machine readable XML, even.
But, wow, it looks like the relationships in the marc geographic codes are just as odd as the infamous relationships in LCSH.
<gac>
  <uri>info:lc/vocabulary/gacs/ff</uri>
  <name authorized="yes">Africa, North</name>
  <code>ff</code>
  <uf>
    <name authorized="yes">Africa, Northwest</name>
    <uf> <name>Northwest Africa</name> </uf>
  </uf>
  <uf>
    <name authorized="yes">Islamic Empire</name>
  </uf>
  <uf>
    <name authorized="yes">Rome</name>
    <uf> <name>Roman Empire</name> </uf>
  </uf>
  <uf> <name>North Africa</name> </uf>
</gac>

“UF” is a thesaural abbreviation for “Used for”. Normally it indicates a non-authorized “lead in” term, but here some of them are labelled “authorized”? That’s the first weird thing.
But more importantly, am I misunderstanding things, or did that just tell me that “Roman Empire” and “Africa, North” are synonyms? That doesn’t seem right. The geographic area “Africa, North” may overlap with the geographic area “Roman Empire”, it may even be entirely subsumed by it, but surely they aren’t synonyms.
Follow the chain further, and, if “UF” is a transitive property (and I can’t think of any meaning of “used for” under which it would not be), we seem to be told that “Rome” is a synonym for “Africa, North.” I’m pretty sure that Rome is a city and doesn’t overlap with “Africa, North” at all.
Apparently “UF” doesn’t mean at all what one would assume it does. The question remains whether UF means anything that’s actually useable at all.
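For anyone who wants to poke at the file themselves, here is a minimal Python sketch that walks the nested `<uf>` elements of the entry quoted above and prints how deep each name sits and whether it carries `authorized="yes"`. It uses only the standard library's `xml.etree.ElementTree`; the `uf_chain` helper is my own invention, and treating nesting depth as "links in the UF chain" is just one reading of an undocumented format.

```python
import xml.etree.ElementTree as ET

# The entry for code "ff", copied from the code list quoted above.
GAC_XML = """
<gac>
  <uri>info:lc/vocabulary/gacs/ff</uri>
  <name authorized="yes">Africa, North</name>
  <code>ff</code>
  <uf>
    <name authorized="yes">Africa, Northwest</name>
    <uf> <name>Northwest Africa</name> </uf>
  </uf>
  <uf>
    <name authorized="yes">Islamic Empire</name>
  </uf>
  <uf>
    <name authorized="yes">Rome</name>
    <uf> <name>Roman Empire</name> </uf>
  </uf>
  <uf> <name>North Africa</name> </uf>
</gac>
"""


def uf_chain(element, depth=1):
    """Yield (depth, name, is_authorized) for every nested <uf> under element."""
    for uf in element.findall("uf"):        # direct children only
        name = uf.find("name")
        yield depth, name.text, name.get("authorized") == "yes"
        yield from uf_chain(uf, depth + 1)  # follow the chain further


gac = ET.fromstring(GAC_XML)
rows = list(uf_chain(gac))
for depth, name, authorized in rows:
    flag = "authorized" if authorized else "lead-in"
    print("  " * depth + f"UF {name} ({flag})")
```

Run against the whole code list, something like this would at least show how common the "authorized UF" pattern is, which might hint at what it is supposed to mean.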
Note: Providing XML is good, but you’ve got to also provide some documentation of what the heck the XML means, whether by an XML schema or even just good narration.
We met, we tweeted, we archived... then what?
We're all getting increasingly used to using Twitter as a back-channel at events. Indeed, it is now relatively uncommon to turn up for an event at which there isn't both a pre-announced hashtag and an active circle of twitterers already in attendance.
We also recognise that Twitter doesn't leave our tweets lying around for very long in the Twitter search engine and that if we want some kind of a more persistent and accessible record of Twitter activity at an event then we need to arrange for a copy of all the tweets to be archived somewhere. Normally, in my experience at least, TwapperKeeper is currently used to create that archive.
So far, so good... but then what? Offering a vanilla view of a few thousand tweets is potentially useful for those who want to delve into the detail, but it hardly provides an easy to grasp summary of the event. How can we present a view of the Twitter archive such that a summary is offered without the need to read every tweet?
There are some obvious simple things that can be done with the RSS feed of tweets offered by TwapperKeeper, and I've knocked together a quick demonstrator to show the possibilities...
Firstly, we can count up the total number of tweets, twitterers, hashtags and URLs tweeted during the event. That gives us an overall feel for how 'significant' the use of Twitter was.
Secondly, we can display a list of the people who tweeted and were @replied the most (in Twitter parlance, an @reply is a tweet that directly mentions another Twitter user). We can also see who was involved in most 'conversations' (exchanges of @replies between any two Twitter users). That gives us a feel for who was tweeting the 'loudest'.
Thirdly, we can look at what hashtags and URLs were tweeted the most. That gives us a feel for the topics and resources most related to the topic of the event.
And finally, we can unpick the individual words used in the Twitter archive, providing a kind of 'word cloud' for the event.
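The four kinds of summary above can be sketched in a few lines of Python. This is not the code behind the demonstrator: it assumes the archive has already been reduced to a list of `(author, text)` pairs (fetching and parsing the TwapperKeeper RSS feed is left out), and the regular expressions and the "words longer than three letters" filter are crude choices made up for the example.

```python
import re
from collections import Counter


def summarise(tweets):
    """Summarise a list of (author, text) pairs from an event archive."""
    authors, replied, hashtags, urls, words = (Counter() for _ in range(5))
    directed = Counter()                      # (from_user, to_user) mentions
    for author, text in tweets:
        authors[author.lower()] += 1
        for mention in re.findall(r"@(\w+)", text):
            replied[mention.lower()] += 1
            directed[(author.lower(), mention.lower())] += 1
        hashtags.update(h.lower() for h in re.findall(r"#(\w+)", text))
        urls.update(re.findall(r"https?://\S+", text))
        # Drop URLs, @names and #tags before counting ordinary words.
        plain = re.sub(r"https?://\S+|[@#]\w+", " ", text)
        words.update(w for w in re.findall(r"[a-z']+", plain.lower())
                     if len(w) > 3)
    # A 'conversation' is a pair of users who @replied in both directions.
    conversations = Counter()
    for (a, b), count in directed.items():
        if a < b and (b, a) in directed:
            conversations[(a, b)] = count + directed[(b, a)]
    return {
        "tweets": len(tweets), "twitterers": len(authors),
        "authors": authors, "replied": replied, "hashtags": hashtags,
        "urls": urls, "words": words, "conversations": conversations,
    }


# Hypothetical sample archive.
sample = [
    ("alice", "Great talk by @bob #cloud http://example.org/slides"),
    ("bob", "@alice thanks! #cloud #repositories"),
    ("carol", "Reading http://example.org/slides now"),
]
summary = summarise(sample)
```

Each `Counter` has a `most_common(n)` method, which is all that is needed to turn these into "top ten" lists or to size the words in a cloud.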
None of which is rocket science... but it is potentially useful nonetheless. Here are such summaries for the Repositories and the Cloud meeting that we recently organised with the JISC, for the JISC Dev8D event, and for the National Digital Inclusion 2010 conference (based on the associated TwapperKeeper archives for each of the events).
In a follow-up post to the NDI10 event, After the event, and a subsequent message to the UK Government Data Developers Google Group, Alex Coley suggests going further:
I wondered if a flash based tool could be used to map sentiment by session/topic by giving positive/negative meanings to words and applying this to tweet traffic. Perhaps some real meaning and value could come out of conferences that anyone can access and use.
Sounds interesting, though I have no idea how to implement it!
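For what it's worth, the crudest possible version of "giving positive/negative meanings to words" might look like the Python sketch below. It is not Flash, and it is not a serious sentiment method: the two word lists are invented placeholders, and it assumes each tweet has already been assigned to a session somehow (by timestamp, say).

```python
import re
from collections import defaultdict

# Tiny made-up word lists; a real attempt would need a proper lexicon.
POSITIVE = {"great", "useful", "interesting", "love"}
NEGATIVE = {"boring", "broken", "confusing", "slow"}


def sentiment_by_session(tweets):
    """Score each session: +1 per positive word, -1 per negative word."""
    scores = defaultdict(int)
    for session, text in tweets:
        scores[session] += 0                  # make sure every session appears
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in POSITIVE:
                scores[session] += 1
            elif word in NEGATIVE:
                scores[session] -= 1
    return dict(scores)


# Hypothetical (session, text) pairs.
scores = sentiment_by_session([
    ("keynote", "Great and useful talk"),
    ("keynote", "wifi is slow"),
    ("panel", "a bit boring"),
])
```

Word counting like this will happily misread sarcasm and negation ("not boring"), so treat any chart built on it with suspicion.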
Dave Challis of the Southampton ECS Web Team has also written up a couple of blog posts following Dev8D, A first look at the dev8d twitter network and Dev8D twitter network, part 2, in which he discusses the analysis of Twitter to see how people's social networks evolve during an event. Fascinating stuff!
article on born-digital preservation in NYT
http://www.nytimes.com/2010/03/16/books/16archive.html
Even if those storage media do survive, the relentless march of technology can mean that the older equipment and software that can make sense of all those 0’s and 1’s simply don’t exist anymore. Imagine having a record but no record player.
Actually, it’s of course worse! If I needed to in the apocalypse, I could listen to a vinyl record with a sewing needle and a cone of paper. (You could improve upon that design with some sort of suspended weighted arm for the needle. After the apocalypse, we’ll have plenty of time to perfect our technique). Although I won’t get stereo for an LP. But I can’t do much of anything with an 8″ floppy.
Leslie Morris, a curator at the Houghton Library, said, “We don’t really have any methodology as of yet” to process born-digital material. “We just store the disks in our climate-controlled stacks, and we’re hoping for some kind of universal Harvard guidelines,” she added.
I’m thinking Ms. Morris is regretting being quoted sounding so reactionary instead of innovative. Shouldn’t libraries be figuring this stuff out, not waiting for someone else (who?) to figure it out and tell us?
Among the challenges facing libraries: hiring computer-savvy archivists to catalog material; acquiring the equipment and expertise to decipher, transfer and gain access to data stored on obsolete technologies like floppy disks; guarding against accidental alterations or deletions of digital files; and figuring out how to organize access in a way that’s useful.
It is a challenge, no doubt. Isn’t this the business we’re supposed to be in, though? If we want to convince everyone that we’re still relevant, well, we have to DO it. But that’s really up to administrators and funding priorities to some extent, I realize.
At the Emory exhibition, visitors can log onto a computer and see the screen that Mr. Rushdie saw, search his file folders as he did, and find out what applications he used. (Mac Stickies were a favorite.) They can call up an early draft of Mr. Rushdie’s 1999 novel, “The Ground Beneath Her Feet,” and edit a sentence or post an editorial comment.
Okay, that is pretty cool.
It may even be possible in the future to examine literary influences by matching which Web sites a writer visited on a particular day with the manuscript he or she was working on at the time.
Ha, when we have no privacy anymore, it’ll be a boon for historians! If we can manage to preserve it.
Located in Silicon Valley, Stanford has received a lot of born-digital collections, which has pushed it to become a pioneer in the field. This past summer the library opened a digital forensics laboratory — the first in the nation.
The heart of the lab is the Forensic Recovery of Evidence Device, nicknamed FRED, which enables archivists to dig out data, bit by bit, from current and antiquated floppies, CDs, DVDs, hard drives, computer tapes and flash memories, while protecting the files from corruption. (Emory is giving the Woodruff library $500,000 to create a computer forensics lab like the one at Stanford, Ms. Farr said.)
That’s pretty cool too. Okay, some libraries are indeed doing what needs be done.
Umlaut as a bibliographic web service aggregator
Michael Beccaria wrote:
We will be switching over to VuFind this summer and I will likely use GB in a similar way with that interface as well. I plan (hopefully this summer) to build a web service that uses OCLC Web Services, Open Library, Hathi Trust, and Google Books to search for and return similar items from those resources to display in our catalog. I really like the service overall.
Incidentally, my Umlaut software can provide just such a web service. Umlaut is intended to do a lot more (it’s intended to be an OpenURL link resolver front-end), so it _might_ be overkill for that purpose, but it might still make sense to use it even without the link resolver just for its ability to provide a web service aggregating these (and other) services.
Umlaut has an architecture allowing plugins that consult other web sources in real time, like OpenLibrary, Amazon, Google Books, and HathiTrust. (All those are included as plugins right now; OCLC isn’t; some of the current plugins will only search on identifiers like ISBN, LCCN; others will do keyword searches. This could be changed). Plugins can run in parallel using threads, or can have specified order to run one after another (with the possibility of not running later ones if earlier ones returned results).
Results can be returned in an HTML “link resolver” interface, or in XML or JSON. The response includes information on plugins that are still “in progress”, if you’ve set it up for “waves” of execution, and the client can keep polling until complete. (This waves/polling feature may be overkill for just what you want to do, but Umlaut supports it because I needed it for my more complicated use case.) There is also the option to return an XML or JSON response that has escaped rendered HTML embedded in the response, so the client can just plop already consistently rendered HTML in its own page somewhere, instead of re-rendering.
So, while intended as a “link resolver front-end”, what Umlaut has turned into is a pretty powerful framework for hosting external web service plugins, and aggregating them into a single web service. Might be overkill for what you want, but might come in handy. (Umlaut does NOT currently support any generic ‘caching’ architecture, which is something you’d want in a “general purpose framework for aggregating third party web services”. So I guess it’s missing that, I didn’t really need it enough to spend time on it, yet.)
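To give a flavor of the pattern (not of Umlaut's actual code or API, none of which is shown here), below is a Python sketch of "waves" of plugins: plugins within a wave run in parallel threads, waves run one after another, and later waves can be skipped once an earlier one has returned something. The three plugin functions are fake stand-ins that return canned data, and the function names and the `stop_on_hit` flag are invented for the example. Client polling of an in-progress response is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def run_wave(plugins, identifier):
    """Run every plugin in one wave in parallel; return {plugin_name: results}."""
    with ThreadPoolExecutor(max_workers=max(1, len(plugins))) as pool:
        futures = {name: pool.submit(fn, identifier)
                   for name, fn in plugins.items()}
        return {name: future.result() for name, future in futures.items()}


def aggregate(waves, identifier, stop_on_hit=True):
    """Run waves in order, optionally stopping once a wave yields any result."""
    responses = {}
    for wave in waves:
        results = run_wave(wave, identifier)
        responses.update(results)
        if stop_on_hit and any(results.values()):
            break                       # earlier wave succeeded; skip the rest
    return responses


# Fake plugins standing in for real lookups against external services.
def open_library(isbn):
    return [{"source": "OpenLibrary", "isbn": isbn}]


def hathi_trust(isbn):
    return []                           # pretend there was no hit


def google_books(isbn):
    return [{"source": "Google Books", "isbn": isbn}]


waves = [
    {"open_library": open_library, "hathi_trust": hathi_trust},
    {"google_books": google_books},
]
first_hit_only = aggregate(waves, "0000000000")
everything = aggregate(waves, "0000000000", stop_on_hit=False)
```

A generic cache would most naturally sit in front of `run_wave`, keyed on plugin name and identifier.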
I was gonna give you some examples, but I’m having trouble finding any that actually result in a GBS or HT or OL hit!
Directions in Metadata with Karen Coyle
Karen Coyle, digital library consultant and bibliographic data expert, will discuss the future of metadata and its role in bibliographic data and the semantic web. Coyle will address what the major transformations in the use and structure of data already underway mean for libraries, and what librarians can do to prepare, adapt, and take advantage of new possibilities.
Open Q&A and discussion will follow the presentation.
Implications of MARC Tag Usage on Library Metadata Practice
The working group offers a set of factors to consider when making decisions about local MARC metadata practices in this report, as well as its views on MARC's future. In addition, the report includes recommendations for enhanced library data mining.
Data wells: one big index
Dublin core: the first fifteen years ...
QOTD: a new Alexandria
A plea: SirsiDynix makes (for now) two (or maybe three) ILSs
SirsiDynix makes Unicorn. SirsiDynix also makes Horizon. SirsiDynix also makes Symphony, which you could call the new version of Unicorn or you could call yet another ILS.
For some reason Unicorn customers are in the habit of referring to their software as “SirsiDynix”. I guess when just talking among themselves, this is fine if this is what they want to do.
But when you put this in comments in open source code I’m reading, it makes things REALLY confusing. Does “SirsiDynix” mean Unicorn or Horizon? I guess usually Unicorn. But it makes me have to stop and analyze whenever I see this in comments or variable names in code I’m working on, or in a post on a listserv I find via google or see via my subscription, or whatever.
We Horizon customers have a “SirsiDynix” product too! Fellow ILS hackers, if you remember this, and write “Unicorn” instead, it will make things a lot less confusing for me.
Library Blog Awards
Omeka in the Cloud
Omeka.net will expand Omeka’s current offerings with a completely web-based service. No server or programming experience required. Similar to services offered by WordPress, the popular open-source blogging software, with the launch of Omeka.net users will be able to sign up for a free hosted Omeka site. Just create a username and password, and your online collection or exhibition is up and running.
This new hosted web service will further the Omeka project’s mission to make collections-based online publishing more accessible to small cultural heritage institutions, individual scholars, enthusiasts, educators, and students.
With Omeka.net, your online exhibit is one click away.
OCLC and OIX
Common Tag
MARC 21 Update No. 11: Full and Concise available online
The changes are indicated in red in Update 11. Update 10 (October 2009) changes have also been kept in red since that update was only recently issued and 10 and 11 are being combined. Each format also has an appendix, "Format Changes for Update No. 10 (October 2009) and Update No. 11 (February 2010)" that lists the changes that comprise the combined update. The Web version of the formats is the official version and is considered the start for implementation planning for MARC 21. Users are not expected to begin using the new features in the format until 60 days from the date of this announcement: May 5, 2010. For more information about format documentation see: http://www.loc.gov/marc/status.html
The printed version of the update will be available through the Cataloging Distribution Service in the future. The print format update will combine Updates 10 and 11 into one update dated 2009/February 2010. The printed publications will be announced when they are ready for distribution.
Ting: collaboratively sourced library infrastructure
Mission of the library redux
The context web
more on weird OCLC business decisions
Originally posted in shorter version as a comment on a post by Karen Coyle on this issue…
The frustrating thing here is that libraries ARE willing to pay a reasonable amount to SUBMIT their holdings to an ILL service, such as OCLC’s, which (unlike their cataloging copy service) really has no competitive peers (yet…?).
Libraries get no DIRECT benefit from this: submitting holdings just means other libraries can more easily request things from YOU, and I don’t think fulfilling ILL requests is usually a profit center. Libraries are willing to do it just to serve the larger community, and out of “generalized reciprocity” where they realize that we all need to submit holdings so we all can request from each other.
Libraries ARE still willing to pay a reasonable fee to fulfill their community responsibilities to resource sharing. They’re just not willing to pay an UNREASONABLE fee, or to be ‘locked in’ to buying cataloging from a service that is not the best quality-to-price point for them, in order to continue resource sharing!
(MSU noted they pay tens of thousands of dollars for the resource sharing/ILL service, and are willing to keep paying that, just not an unreasonable per-record rate for loading:)
Regarding these statements, MSU’s Haka wrote, “The contention has been made that actions such as ours seek to undermine the WorldCat database. I would simply respond that the price currently quoted to upload these records into the database is the factor that should be questioned.” He also notes that the $88,500 MSU pays for resource sharing “does not seem like freeloading.”
So… you think we’ll see a SkyRiver resource sharing network too? (I guess III already has one? Maybe they’ll provide infrastructure to open it up to non-III libraries? Although III’s own reputation/history for promoting ‘lock in’ at all costs… Is SkyRiver itself open and useable by non-III libraries?)
I don’t know if OCLC’s actions are an intentional attempt at forcing ‘lock in’, or due to unfortunate lack of technical flexibility in their back-end systems.
But if the former, it’s just as likely to backfire, and cause them to lose the Resource Sharing business that libraries were perfectly happy to keep with OCLC at a reasonable price!
What I would do if I were king of OCLC
Which I am clearly not.
1) Work with SkyRiver to get OCLC numbers added to as many SkyRiver records as possible. Not share cataloging copy, but merely get a SkyRiver cataloging record to have a MARC field somewhere meaning “this record represents the same manifestation as OCLC record # N.” OCLC already has quite a bit of technical expertise with this kind of record matching, from their ‘reclamation service’. Charge SkyRiver a reasonable rate for this service, which SkyRiver is going to be willing to pay if the price is reasonable, because SkyRiver is worried about not being able to attract cataloging customers if they become locked out of OCLC resource sharing, and having OCLC numbers on the records will lead to…
2) Perhaps some of the apparently unreasonable expense of OCLC’s “just add holdings” quotes to MSU is not just an attempt to punish/enforce lock-in, but is actually because it’s expensive to load holdings from non-OCLC records, because you’ve first got to figure out what OCLC record they correspond to. But if the ‘foreign’ records have an OCLC number equivalency in them, as above, then it becomes technically much more feasible/efficient/cheap. So OCLC can offer a much more reasonable per-record price for loading holdings (not loading the records themselves, with another vendor’s copy; just holdings attachments) from records that have an OCLC equivalency in them.
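To show why an embedded equivalency makes holdings loading cheap, here is a Python sketch. The assumptions are mine, not OCLC's or SkyRiver's: records are plain dicts, the equivalency is stored in a list under the key `"035"` as a string like `(OCoLC)12345` (the prefix conventionally used for OCLC control numbers in MARC field 035), and `holdings` is a simple in-memory dict. The library symbol and record ids are made up.

```python
import re

# Matches "(OCoLC)" followed by digits, ignoring any leading zeros.
OCLC_PATTERN = re.compile(r"\(OCoLC\)\s*0*(\d+)")


def oclc_number(record):
    """Return the OCLC number embedded in a record, or None if there is none."""
    for value in record.get("035", []):
        match = OCLC_PATTERN.search(value)
        if match:
            return match.group(1)
    return None


def attach_holdings(records, library_symbol, holdings):
    """Attach a library's holdings by OCLC number; return unmatched records.

    holdings maps an OCLC number to the set of library symbols holding it.
    """
    unmatched = []
    for record in records:
        number = oclc_number(record)
        if number is None:
            unmatched.append(record)      # would need expensive record matching
        else:
            holdings.setdefault(number, set()).add(library_symbol)
    return unmatched


# Hypothetical records from a non-OCLC cataloging source.
records = [
    {"id": "local-1", "035": ["(OCoLC)00012345"]},
    {"id": "local-2", "035": ["(XYZ)999"]},
    {"id": "local-3"},
]
holdings = {}
unmatched = attach_holdings(records, "XXX", holdings)
```

With the number present, attaching a holding is a dictionary lookup; only the leftovers in `unmatched` need the costly "which record is this?" matching.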
OCLC gets to retain resource sharing record loading income from customers moving to other vendors for cataloging, instead of losing them entirely. OCLC gets a new revenue stream from vendors such as SkyRiver, paying to establish OCLC equivalencies on their records. SkyRiver’s happy, because their customers aren’t being ‘locked in’ to OCLC. Libraries are happy, because services have been “de-coupled” and they can choose the service at the best quality/price point for them. OCLC members who request items via the resource sharing service are happy, because their database of holdings continues to be as comprehensive as possible (and OCLC is happy that their valuable database of holdings maintains and increases in value with as many holdings as possible).
If lots of OCLC members move their cataloging away from OCLC, OCLC is still going to lose net revenue, but I think that’s inevitable at this point; the train’s already left the station. Better to establish revenue streams for their (without peer) resource sharing network from libraries that are cataloging elsewhere, than to lose those too.
Many OCLC members are already purchasing cataloging from other vendors in addition to OCLC. Many of those records do not end up having holdings registered in OCLC. If those vendors could be brought into the fold as above, then it’s more revenue for OCLC, and more holdings in their database making their database more valuable (which is what gives value to their resource sharing service, and other services like WorldCat in the first place).
You can try to stick to the business model of 20 years ago, harming the interests of actual libraries in the process, and probably fighting a losing battle anyway. Or you can adjust to new environments with new business models. OCLC has still got a lot of valuable assets, and discouraging people from supplying records to their resource sharing network (with infeasible prices) threatens to reduce the value of those assets.
In fact, even calling that a 20 year old business model might not be accurate. When OCLC still had competitors like RLG, did OCLC allow libraries who purchased cataloging copy from other sources to attach holdings for resource sharing at a reasonable cost? I’m not sure if they did or not. But if they did… what’s changed? If they did, that would make it seem like it’s less of a technical issue, and more of OCLC trying to take advantage of their (possibly short-lived) monopoly position to lock in customers. (Anti-trust issues?).