Friday, January 18, 2013

open API is not open data

Today I made a comment on Twitter about the ownership of Mendeley's crowdsourced database, and so I decided to look for a few comments on the difference between open data and open API.

Publishing Open Data – Do you really need an API? | Peter Krantz:
An alternative to the direct integration API model is to publish data dumps in files. “Boring!” may be the initial reaction from developers but they will thank you later. In this model data from the database is exported, transformed to an open readable format [1] (e.g. CSV), properly named and stored on the web server [2]. This means entrepreneurs can get all your data, load it into their own system and design their API according to their use case. Also, high loads will hit their own infrastructure without affecting other apps.
Mendeley’s Open API Approach Is On Course To Disrupt Academic Publishing | TechCrunch:
Meanwhile, Elsevier has been trying to build a similar thing to Mendeley but their philosophy is the exact opposite. The API is not an open access one, but paid-for and closed, open only to paying university customers. They have put a lot of marketing behind it, holding ‘hack days’ etc. Their third party apps have now reached around the 100 mark – although they do allow for easier monetisation while Mendeley’s are free.
(this article wrongly claims Mendeley's open API is open data, but it's interesting that they compare with Elsevier)
iPhylo: On being open: Mendeley and open data versus open source:
For me the question of whether the source code for the Mendeley desktop will be made open source is a red herring, and ultimately a distraction from the real question — will the data be open?
Unilever Centre for Molecular Informatics, Cambridge - Does “Open API” mean anything? « petermr's blog:
I have no particular quarrel with Mendeley – they are innovating by putting power in the hands of clients and that’s gently useful. But unless they actually are Open according to the OKD then they aren’t giving us much (and this applies to many companies – and almost all in chemistry). (...)
I have no major complaint with a company which collects its own data and offers it as Free – Google does this and many more.
But don’t call it Open.

Don’t use that open API — it could be a trap!:
What the and Twitter cases reinforce is the dual nature of an open API, as entrepreneur Syed Iqbal noted recently: it can be an incredibly useful tool for other startups, especially if it allows them to tie into a larger platform or network — such as Twitter, or Google, or Facebook — and take advantage of the size and reach of that partner to grow more quickly. And for the platform company itself, all of those developers and outside services can add value relatively quickly (and cheaply) to the network.
The flipside is that when a network or platform gets large enough, as Twitter has, having all those tiny developers and outside services can seem more like an unnecessary bug than a crucial feature — especially if the company wants (or needs) to take control of some of those external features and apps in order to monetize its network effectively.

Update 2013.01.18

I remember this article saying that the difference between an "open" API and open source shows up when the API provider disappears. It was mentioned that Peter Murray-Rust and the Open Knowledge Foundation say that the Mendeley API is Open Data. While I'm in no position to argue with them, I think it depends very much on your definition of data: if by "data" they mean the bibliographic reference then maybe yes, Mendeley offers a way of getting all entries typed by the user (although the lack of a data dump may make them practically impossible to get). And in this sense PMR agreed on the openness, I believe.

But as for the relational data, which is where I think the crowdsourcing takes place, I don't think we have access to it. That is, the association between the PDF file (through its metadata or checksum) and its DOI or its bibliographic info may not be openly available. What sets Mendeley apart is not having a 'huge bibtex file' but having a 'huge curated bibtex file', where the curation process associated the entries with the PDF documents themselves.

I've never used Mendeley's API and I'm not familiar with its internals, so please read my rant cum grano salis. I might be wrong here. But still I'm afraid there's no way to migrate to another platform that can use this database as we use it now, in case we want to leave Mendeley: if I'm not mistaken, even if I manage to download all the information available through the API (is this downloaded data set all the info that's in their database, or is something missing? Dunno, not OSS), I cannot find the entry for a given PDF file I know is in the database. Or can I retrieve the filehash/metadata fields for all entries?
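To make the question concrete, here is a minimal sketch in Python of what such a migration would need. Everything in it is an assumption for illustration: the dump format (a CSV with `filehash`, `doi` and `title` columns), the use of SHA-1 as the checksum, and the sample row are all made up, and none of it reflects how Mendeley actually stores or exposes its data.

```python
import csv
import hashlib
import io


def sha1_of_bytes(data: bytes) -> str:
    """Checksum of a file's contents (SHA-1 is an assumption, not Mendeley's scheme)."""
    return hashlib.sha1(data).hexdigest()


def build_index(dump_file):
    """Index a hypothetical CSV dump by its 'filehash' column."""
    reader = csv.DictReader(dump_file)
    return {row["filehash"]: row for row in reader if row.get("filehash")}


def lookup_pdf(pdf_bytes: bytes, index: dict):
    """Return the bibliographic entry for a PDF, or None if the dump lacks it."""
    return index.get(sha1_of_bytes(pdf_bytes))


# Made-up stand-ins for a real PDF and a real data dump.
fake_pdf = b"%PDF-1.4 pretend this is a paper"
dump = io.StringIO(
    "filehash,doi,title\n"
    f"{sha1_of_bytes(fake_pdf)},10.1234/example,An Example Paper\n"
)

index = build_index(dump)
entry = lookup_pdf(fake_pdf, index)
print(entry["doi"] if entry else "not in dump")
```

The lookup only works because the dump carries the file checksum next to the bibliographic fields. If the exported data had only the `doi` and `title` columns, the link between a PDF on my disk and its curated entry would be lost, which is exactly the relational data I am worried about.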

This condition for openness was already observed by PMR:
An API should not precludes [sic] access to the raw data. And that’s where the “data” question still needs to be answered. 
And as for the OKFN, the current status of Mendeley is "unresolved" (nobody replied).

Just to be clear, I use Mendeley a lot and really think they are doing amazing things. I wish them the best irrespective of whom they partner with, and I certainly don't think Elsevier or any other company is evil incarnate. I always worried about not being the owner of the data I helped create, but then I realized that we do this all the time, and it's just a trade-off between what I give as input and what I get from it. Google does the same, as do all loyalty programs and even supermarkets. If they use the information I provide in innovative ways to offer me better products or services, then I can voluntarily accept this exchange. Which I'm glad I did with Mendeley. And I also understand that there might be legal limits to the info that Mendeley can offer freely -- I'm certainly not demanding that they release all the info even if it clashes with some copyright agreement.

I just wanted to highlight a drawback of APIs, whether we call them Open or not. APIs hinder access to the primary data: how much do I have to invest (in time and effort) to make sure I have the complete data set? With Open Data, Open Source, Open Access, that's easy: the one offering the information is responsible for guaranteeing it. And, in my view, we should not call crowdsourced data "Open" if we don't have access to 1) the labor we contributed and/or 2) all data components that make this data innovative.


  1. A few more links: (so all APIs are "open"?)

  2. In the wake of the death of Google Reader, Techdirt's Mike Masnick has this to say about how we are going in the wrong direction with apps versus data:

    But, for the most part, all of the stories that people talk about concerning "cloud" computing almost always involve services that tie together the app and the data and all you're really doing is trading the former limitations of local data for the limitations of a single service provider controlling your data. Many service providers want this, of course. It's a form of lock-in.



