
MITH API workshop

I'm excited about the upcoming MITH API Workshop to be held in two weeks, Feb 25-26, at UMD:

The Maryland Institute for Technology in the Humanities will host a two-day workshop on developing APIs (Application Programming Interfaces) for the digital humanities. The workshop will gather 40-50 digital humanities scholars and developers, who along with industry leaders will demonstrate their APIs during this “working weekend.” We will discuss ways that existing and future APIs could be leveraged for digital humanities projects.

As someone who has been fascinated by APIs for years, I hope to learn a lot from my fellow digital humanists about what they care about. One of my tasks is to give an introductory talk about APIs. What do I want to cover? I'm still working out the exact structure, but the following topics come to mind:

  • What are APIs? The relationship between web APIs (the focus of our workshop, I believe) and other APIs
  • How to learn more about APIs
  • APIs of specific interest to the digital humanities, with specific references to Freebase, Google geo-APIs, and OpenLibrary (organizations represented by fellow presenters)
  • Why REST matters (I'll only anticipate what fellow speaker Peter Keane will bring up in his talk about REST)
  • How to consume APIs; What are mashups
  • How to deploy APIs
  • Open questions I think about
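
Since "how to consume APIs" is on the list, here is a minimal sketch of what that looks like in practice, using the OpenLibrary Books API mentioned above. The endpoint and parameter names follow OpenLibrary's published conventions, but treat the details as illustrative; to keep the sketch self-contained, the response is a trimmed-down canned sample rather than a live network call.

```python
import json
from urllib.parse import urlencode

# Build a request URL for the OpenLibrary Books API.
base = "https://openlibrary.org/api/books"
params = {"bibkeys": "ISBN:0451526538", "format": "json", "jscmd": "data"}
url = base + "?" + urlencode(params)
print(url)

# A typical JSON response maps each bibkey to a record; here we parse a
# trimmed-down sample instead of fetching over the network.
sample_response = '{"ISBN:0451526538": {"title": "The adventures of Tom Sawyer"}}'
records = json.loads(sample_response)
for bibkey, record in records.items():
    print(bibkey, "->", record["title"])
```

The same three steps — construct a URL, issue an HTTP request, parse a structured response — cover most of the web APIs we'll discuss at the workshop.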

Stay tuned. Over the next two weeks, I'll work through these topics for myself (writing on this blog). I'll take this time as an opportunity to revisit what I wrote in Pro Web 2.0 Mashups: Remixing Data and Web Services and what I taught in my Mixing and Remixing Information course at UC Berkeley over five years.

Slides for my talk on open government + Freebase

I'm looking forward to giving a little talk on open government + Freebase + Recovery Act data tonight at the Freebase meeting.   I'm deeply excited about the potential of open government data to change how we work with government (not to mention how we understand its workings).    Here are some slides that will help frame my talk tonight.


Announcing Data Unbound: a new training and consulting firm

In 2006, after weblogging for some 6 years while working at UC Berkeley, I took on a new role as a data architect on campus.  I felt it important to keep blogging about my professional interests but to do so under a new moniker.  I came up with "data unbound" to name the passion I had for the myriad possibilities  latent in data, some of which I have strived to reveal.

A lot has happened since I started the weblog. I left my staff position at UC Berkeley so that I could devote myself more fully to the task of teaching others about the world of web APIs and mashups. I wrote my book on the subject, Pro Web 2.0 Mashups: Remixing Data and Web Services, which has been very well-received, I'm pleased to say. Right now, I'm teaching my course Mixing and Remixing Information for the fifth time at the School of Information at UC Berkeley. This year, I'm focusing the course on the rapidly expanding area of open government and the web.

And now, I (in partnership with my wife, Laura Shefler) have taken the next step of formally starting Data Unbound LLC:

Data Unbound LLC is a training and consulting company that helps organizations access and share data effectively. The value of your data, when it is scattered throughout multiple databases and applications, grows if you can make it all work together. This value increases further when you leverage your information resources with the vast world of data on the Web. Our specialty is helping you to use APIs (application programming interfaces) to integrate data across your organization and beyond.

We're open for business, ready to work with clients to solve their data problems. Our training will enable organizations to integrate data, both their own and that of others, through APIs and data standards. I encourage you to read more of what we have written on, in which we detail our approach and our offerings. In the coming months, I'll be describing how the general principles behind data integration and web APIs can solve problems in your specific context. And if you know anyone who could make use of Data Unbound, by all means, put them in touch with us.

ARRA Treasury Account Symbols: the outcome of our FOIA request

In July, I wrote about why I've been looking for Recovery TAFS and appropriations. In an attempt to get an official list from the US federal government, Eric Kansa and I sent a FOIA letter to OMB to request the release (in electronic form) of a complete and up-to-date list of all Recovery Act (ARRA) TAFS (Treasury Appropriation Fund Symbols). We had known of two out-of-date and potentially incomplete lists of the ARRA TAFS:

  1. the worksheet entitled "92_AARP_TAFS_DD_Detail" in the May 8, 2009 weekly report from USAID
  2. a pdf published by ProPublica on April 1, 2009.

We specifically asked for an up-to-date Excel spreadsheet with the same columns as the worksheet "92_AARP_TAFS_DD_Detail" — but with an explanation of what each of the columns meant.  We  also encouraged the OMB to make this data available on an ongoing basis as an XML document published on the OMB website and kept up to date, with an explanation of each field.

Last week, we got what we asked for: an Excel spreadsheet (see Internet Archive metadata), which I've also uploaded as a Google spreadsheet. Note the description of the spreadsheet found in the first sheet:

In a letter dated August 24 to OMB's Freedom of Information Officer, you requested that OMB provide you with an up-to-date Excel spreadsheet with the same columns as a worksheet you emailed on October 16. The Berk_FOIA_Data tab in this Excel file provides up-to-date information using the same columns in the file you sent. The information is up-to-date as of October 19, 2009, and shows a list of each Treasury Appropriation Fund Symbol (TAFS) associated with the Recovery Act (RA). Below is a description of each column in the Berk_FOIA_Data tab.

I've not had an opportunity to complete my analysis of the FOIA spreadsheet and to correlate the data with the recipient reporting. You'll note that there are 342 TAFS in the spreadsheet. To derive a list of Treasury Account Symbols (TAS, as opposed to TAFS), we concatenate the Treasury Agency Code with the Treasury Bureau Code (separated by a '-') and bundle all the corresponding TAFS. See the resulting list, with a total of 313 TAS. You'll note that a spreadsheet that lists the TAS as of Sept 13, 2009 has 309 symbols, while the HTML list on currently lists 327 TAS (along with 32 place-holder symbols). The differences among those lists are something to nail down next. At any rate, even something like the list of Treasury Accounts associated with the Recovery Act is more fluid than I would have expected at this point.
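
The TAS derivation described above amounts to a simple grouping operation. Here is a sketch in Python; the field names and the TAFS rows are illustrative, not the actual columns of the FOIA spreadsheet.

```python
from collections import defaultdict

# Derive TAS from TAFS rows as described above: concatenate the Treasury
# Agency Code and Treasury Bureau Code (separated by '-') and bundle all
# TAFS sharing that prefix. Field names and rows are made up for illustration.
tafs_rows = [
    {"agency": "91", "bureau": "0103", "tafs": "91-0103 2009/2010"},
    {"agency": "91", "bureau": "0103", "tafs": "91-0103 2009/2011"},
    {"agency": "91", "bureau": "1909", "tafs": "91-1909 X"},
]

tas_to_tafs = defaultdict(list)
for row in tafs_rows:
    tas = row["agency"] + "-" + row["bureau"]
    tas_to_tafs[tas].append(row["tafs"])

for tas, bundle in sorted(tas_to_tafs.items()):
    print(tas, bundle)
```

Run over the 342 TAFS in the FOIA spreadsheet, this kind of grouping is what yields the list of 313 TAS.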

One thing that has puzzled me is why there are so many TAFS with $0.00 for the Treasury warrant. You can find an explanation in the FOIA spreadsheet:

Treasury Warrant is the sum that Treasury warranted to the TAFS. You can think of a warrant as being the initial deposit in a new checking account. For many of the TAFSs on the list, you can track the amounts appropriated in the law to the amount of the Treasury warrant. In some cases, however, you cannot track back to actual amounts because the funding in the law is formula based. In many cases, a TAFS has a zero in the Treasury Warrant column. The primary reason for this is that these TAFSs receive RA funds via a transfer from other TAFSs.

Hmmm.  We're going to have to understand the relevant formulas.

Acknowledgement:  A big thanks to Brian Carver for providing us valuable advice on how to formulate, draft and send a FOIA request and helping us to interpret what's happening during a FOIA process.


Web Services for

Today, my colleagues Erik Wilde, Eric Kansa, and I are pleased to announce our new report "Web Services for" and its companion website. Last week, the redesign of was made public to much fanfare. is the U.S. government’s official website for publicly documenting how funds from the American Recovery and Reinvestment Act of 2009 (ARRA) have been allocated and spent. Our work focuses on a crucial aspect of that has yet to receive sufficient attention, namely, how data about Recovery Act spending will be made available in machine-readable form for analysis, interpretation, and visualization by third-party applications. In our report and on our website, we propose a reporting architecture, create some sample feeds based on that architecture, and demonstrate how that data could be used in a simple map-based mashup.

Here are some highlights from our report, which I quote (with a bit of editing):

  • Design priorities for need to shift from focusing on deploying an attractive Web site toward designing ARRA web services to support reuse of data in third-party applications.
  • These services should allow any party  to receive the complete set of ARRA reporting data in a timely and easily usable manner, so that in principle, the full functionality of could be replicated by a third party.
  • Our proposed architecture is based on the principles of Representational State Transfer (REST) and on always using the simplest and most widely known and supported technology for any given task.
  • We recommend the feed-based dissemination of ARRA reporting data using the most widely used technologies on the Internet today: HTTP for service access, Atom for the service interface, and XML for the data provided by the service. This approach allows access from sophisticated server-based applications as well as from resource-constrained devices such as mobile phones.
  • The manner in which data flows from to is of critical importance. Ideally, should use Web services offered by
  • We strongly recommend that Recovery reporting systems adopt the Atom syndication format for feeds.  Feeds represent a major positive development in making government data more open to citizen review and reuse and provide a unique ability to do so by merging utility for humans as well as machines.
  • While not formally standardized, feed autodiscovery is well supported by current browsers and could be implemented reliably with a well-defined set of implementation guidelines for Web pages offered by
  • We strongly recommend making feed paging and archiving mandatory, so that the feeds are not just a temporary way of communicating that information has become available. Instead, the feed pages should be available as persistent and permanent access points, so that accessing information via feeds can be done robustly and reliably.
  • ARRA data dissemination services should be more resource-oriented than service-oriented.  XML representations should contain links (in the form of URIs) to related data resources, thereby representing the relationships between the different concepts which are relevant for reporting.
  • The Recovery reporting schema uses many different coding systems and identifiers. Publication of resources related to some of these identifiers will be of great value.  (We list key identifiers in the report.)
  • There are many possible analyses that people may wish to perform on Recovery data,  making it difficult  to accommodate them all. Therefore, querying services should be oriented toward making machine-readable representations of data available, so that third party developers can easily populate their own analysis engines and run their own specialized algorithms on that data.
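
The consumption model these recommendations imply — HTTP + Atom + XML, with RFC 5005 paging links so archives are permanent — can be sketched with Python's standard library alone. The feed content below is a made-up sample, not real Recovery data; only the Atom namespace and the rel="next" paging convention come from the relevant specifications.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A minimal sample of an ARRA reporting feed: one entry plus an RFC 5005
# rel="next" link pointing at the next archive page. URL and IDs are invented.
sample_feed = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>ARRA reporting feed (sample)</title>
  <link rel="next" href="https://example.gov/arra/feed?page=2"/>
  <entry>
    <title>Sample award report</title>
    <id>urn:example:arra:0001</id>
  </entry>
</feed>"""

root = ET.fromstring(sample_feed)
entries = [e.findtext(ATOM + "title") for e in root.findall(ATOM + "entry")]
next_page = [link.get("href") for link in root.findall(ATOM + "link")
             if link.get("rel") == "next"]
print(entries)      # entry titles on this page
print(next_page)    # where a client would fetch the next archive page
```

A third-party consumer would loop, fetching each rel="next" page in turn, until it had replicated the complete reporting dataset — exactly the "full replication by a third party" property the report calls for.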

Erik Wilde has also commented on our report. We welcome and look forward to your feedback.

Finally, we are grateful to the Sunlight Foundation for a grant that helped to support this effort.


Advice for

Rusty Talbot posted the following request for feedback on the Sunlight Labs list this morning:

The Recovery, Accountability, & Transparency Board wishes to have an open discussion with all interested developers about how data should be made available via

As you are all aware, a new version of will be released soon. From a data standpoint, the initial release of the new site will replicate existing functionality. However, the Board aims to set a new standard of transparency with this site and would therefore like to make the data available in the most convenient and straightforward way (or ways) possible so you can use and analyze official, up-to-date Recovery Act data. We need your input to achieve this goal.

Please let us know how the site could best meet your needs in terms of  machine-readable data format(s) and standards, APIs, guidance, training, etc. [emphasis mine]

As I waited for Rusty to respond to my question of how best to provide feedback, Luigi Montanez went ahead and posted a series of excellent pointers. I second Luigi's advice, also commend the recent OMB Watch Recovery Act Transparency Status Report, and have similar general web development advice to offer, which I had written up as "Making Your Web Site Mashable" (pdf) (Chapter 12 of my book Pro Web 2.0 Mashups).

In terms of work specifically related to the Recovery Act, my Berkeley colleagues Erik Wilde, Eric Kansa, and I published a report "Proposed Guideline Clarifications for American Recovery and Reinvestment Act of 2009" in which we proposed and prototyped the use of Atom feeds to disseminate Recovery spending data. We are currently at work on updated recommendations based on the latest Recovery Act OMB guidance.

One of the most important things that has made Recovery spending less than transparent is how difficult it has been to locate basic accounting data. For example, after looking for months, I have yet to locate a reliable list of Recovery TAFS: basically, a list of all the pots of money (as tallied by Treasury) and the maximum amount of money we expect to see in each pot (the dollars appropriated). Now, does list the amounts obligated and spent by agency, but how much money has been appropriated? That basic data should be clearly documented at, so that we can track the flow of money reliably from the originating legislation to Treasury, out to the agencies, and then to contractors and grantees or the states. (I will note that ProPublica's Stimulus Tracker does break down the totals by agency but doesn't publish the list of individual accounts.)

At any rate, there is more to say — but I'll wait until Rusty responds to what is here.


calendar data from Educause put into a Google Calendar

I'm starting to prepare my notes for the pre-conference seminar Creating and Enabling Web Mashups that I'll be leading on November 3, 2009 at the 2009 EDUCAUSE Annual Conference. I'm looking for good examples to use in the seminar. One that I'm contemplating is showing how to import the EDUCAUSE 2009 calendar, which is available as an iCalendar file (linked from the main program page). If you import the iCalendar file, you can produce a Google calendar: (You have to navigate to November 2009 to see any events.)
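
For the seminar, it may also help to show what is inside an iCalendar file. Here is a minimal sketch of pulling event summaries out of .ics text; the sample below is invented, and a real parser would also need to unfold long lines and handle time zones, which this simple loop does not.

```python
# A minimal .ics sample in the shape of a conference program feed.
sample_ics = """BEGIN:VCALENDAR
BEGIN:VEVENT
DTSTART:20091103T140000
SUMMARY:Creating and Enabling Web Mashups
END:VEVENT
END:VCALENDAR"""

# Collect the properties of each VEVENT block into a dictionary.
events = []
current = None
for line in sample_ics.splitlines():
    if line == "BEGIN:VEVENT":
        current = {}
    elif line == "END:VEVENT":
        events.append(current)
        current = None
    elif current is not None and ":" in line:
        key, value = line.split(":", 1)
        current[key] = value

for event in events:
    print(event.get("DTSTART"), event.get("SUMMARY"))
```

Google Calendar's importer does the equivalent of this at scale, which is why handing it the EDUCAUSE .ics file "just works".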


plotting data for counties on Google Maps: Part I

There is a huge amount of government and socio-economic data gathered at the county level. It would be nice to be able to plot that data on a desktop or online map (e.g., Google Maps). This morning I posted a question on the Sunlight Labs mailing list asking for some help:

I would like to display US counties on a Google map based on some scalar value (e.g., population) for each county and a color map that associates values with colors. Does anyone know of a library that makes this easy to do? (I'm interested in doing the same for other administrative regions, such as zip codes and congressional districts.)

( contains a good discussion of the topic — and I have found other references that might be helpful,  but I have not seen the functionality I'm looking for distilled down into an easy-to-use library.)
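
Whatever library ends up doing the drawing, the heart of the question is the color map: a function from a county's scalar value to a fill color. Here is a minimal linear white-to-blue ramp, independent of any particular GIS library; the FIPS codes and population figures below are illustrative.

```python
def value_to_color(value, vmin, vmax):
    """Map value in [vmin, vmax] to an RGB hex string on a white-to-blue ramp."""
    t = (value - vmin) / (vmax - vmin) if vmax > vmin else 0.0
    t = min(max(t, 0.0), 1.0)
    shade = int(255 * (1 - t))          # higher values -> darker blue
    return "#{0:02x}{0:02x}ff".format(shade)

# Illustrative county data: FIPS code -> population.
county_population = {"06001": 1510271, "06075": 808976}
vmin, vmax = min(county_population.values()), max(county_population.values())
for fips, pop in sorted(county_population.items()):
    print(fips, value_to_color(pop, vmin, vmax))
```

The library I'm looking for would take such a value-to-color function plus county boundary geometry and produce the actual polygons or overlay image.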

Building a ground overlay

When I tweeted my question, I got a very helpful response from Sean Gillies:

That's a lot of polygons (3489, see to draw in the browser. Make an image layer with OpenLayers?

Sean confirmed what I was thinking: that I had to compute a static image to use as an overlay — otherwise, drawing 3,000+ polygons would slow down Google Maps prohibitively. In fact, in many ways, I've been trying to use the approach I've seen in the demo gallery of the Google Maps API v3: John Coryat's ProjectedOverlay example, which "uses OverlayView to render an image inside a given bounding box (LatLngBounds) on top of the map". (You can look at the overlay image (.png) directly and reuse ProjectedOverlay.js)

So one approach would be to calculate a PNG of the counties (colored appropriately), and this PNG would provide an efficient way to display county data. I had started down this road a while ago — Sean's post gave me more direct guidance on how to create a useful Python-based desktop GIS setup to handle tasks like creating my desired map in PNG form. To be honest, I've found the whole open source GIS world fairly confusing. I bought and read part of Gary Sherman's Desktop GIS: Mapping the Planet with Open Source Tools (Illustrated edition. Pragmatic Bookshelf, 2008) and was considering installing FWTools, GRASS GIS, and Quantum GIS. His post alerted me to, and convinced me to try, OSGeo4W, which is

a binary distribution of a broad set of open source geospatial software for Win32 environments (Windows XP, Vista, etc). OSGeo4W includes GDAL/OGR, GRASS, MapServer, OpenEV, uDig, QGIS as well as many other packages (about 70 as of summer 2008).

I installed OSGeo4W but have not been able to figure out the Python bindings (and hence can't yet try out the code that Sean posted). Neither has the Python setup from FWTools 2.4.3 worked for me. My next step is to follow the instructions at the Python Package Index: GDAL 1.6.1 to see whether I'll have better luck.

Joshua Tauberer's WMS service

Joshua Tauberer of responded to my query by referring me to his experimental WMS service, which produces WMS layers for entities ranging from Congressional and state districts to counties. I modified one of the examples to try to plot the counties. For some reason, not all the counties show up yet. Still, this approach is very promising, since it would save me the work of calculating the coordinates of the county boundaries to begin with. I'll have to come back to study and apply the techniques documented in the WMS Server API Documentation.
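
For readers unfamiliar with WMS: a client requests a rendered map tile by constructing a GetMap URL. The parameter names below come from the WMS 1.1.1 specification; the base URL and layer name are placeholders, not Joshua's actual service.

```python
from urllib.parse import urlencode

# Sketch of a standard WMS 1.1.1 GetMap request. Base URL and layer name
# are placeholders; only the parameter names come from the WMS spec.
base_url = "https://example.org/wms"
params = {
    "SERVICE": "WMS",
    "VERSION": "1.1.1",
    "REQUEST": "GetMap",
    "LAYERS": "counties",              # layer name is illustrative
    "SRS": "EPSG:4326",
    "BBOX": "-125,24,-66,50",          # roughly the continental US
    "WIDTH": "800",
    "HEIGHT": "400",
    "FORMAT": "image/png",
    "TRANSPARENT": "TRUE",
}
request_url = base_url + "?" + urlencode(params)
print(request_url)
```

The appeal of this approach is that the server does all the polygon rendering; the client only ever handles finished images.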

Other things to study further


I'm looking forward to Transparency Camp 2009

I'll be at TransparencyCamp 2009 tomorrow. (You can follow the conference tweets at #tcamp09, whether or not you'll be in physical attendance.) Since TCamp09 is an unconference, the formal agenda will be determined at the conference by sessions attendees propose there. I'd like to see and attend sessions on the following topics:

  • projects/techniques to track the finances of the US Government. I've been working on tracking the Recovery Act (aka the Stimulus) and would like to compare notes with others involved with understanding how budgets are created and how money is allocated and spent at the federal level.
  • projects/techniques on how to generate an ontology or mapping of the structures of the federal and state governments (e.g., how would we map the US Government Manual into structured machine-readable form?)
  • I'd love to hear Joshua Tauberer tell us about and Carl Malamud about
  • business/sustainability models around government transparency projects. I'd like to devote more time to government transparency, but how do we pay the bills?

A clarification of why I'm looking for Recovery TAFS and appropriations

In response to a question I received on a mailing list about my query "Does anyone know of a complete and up-to-date list of Recovery Act accounts?", concerning why I was looking for amounts appropriated and not just obligated and spent for the Recovery Act, I wrote the following clarification (which I have edited lightly):

In addition to the amount of money that is obligated and spent, isn't there also the amount of money that is appropriated? The amount obligated and spent goes up, but isn't the appropriation supposed to be the maximum that the obligated and spent amounts ever reach? (I'm an accounting newbie, so correct me if I misunderstand what these terms mean.) What I'm trying to understand right now are statements like "ARRA is a $787 billion dollar bill" and "the Department of Education is getting $100 billion". Specifically, I'd like to see how various line items add up to the totals quoted.

The amounts obligated used to be reported in the weekly Excel spreadsheets from the agencies. For example, consider the April 3 report from the Department of Education:

and the corresponding spreadsheet:

At, we're told that:

  • Total Available: $11,363,064,856
  • Total Paid Out: $0

The spreadsheet (specifically the "Weekly Update" worksheet) actually supports this statement — here, I copy the table and add the totals line.

(The first three columns together form the Program Source/Treasury Account Symbol; the Sub-Account Code is optional and empty here.)

Agency Code | Account Code | Sub-Account Code | Program Description (Account Title) | Total Appropriation | Total Obligations | Total Disbursements
91 | 0103 | | IMPACT AID, RECOVERY ACT | $100,000,000 | $0 | $0
91 | 0196 | | HIGHER EDUCATION, RECOVERY ACT | $100,000,000 | $0 | $0
91 | 0197 | | INSTITUTE OF ED SCIENCES, RECOVERY ACT | $250,000,000 | $0 | $0
91 | 0198 | | STUDENT AID ADMIN, RECOVERY ACT | $60,000,000 | $0 | $0
91 | 0199 | | STUDENT FINANCIAL ASST, RECOVERY ACT | $16,483,000,000 | $198,901,281 | $0
91 | 0207 | | INNOVATION & IMPROVEMENT, RECOVERY ACT | $200,000,000 | $0 | $0
91 | 0299 | | SPECIAL EDUCATION, RECOVERY ACT | $12,200,000,000 | $5,970,012,399 | $0
91 | 0302 | | REHAB SRVCS & DISABILITY RSRCH, RECOVERY ACT | $680,000,000 | $315,570,633 | $0
91 | 0901 | | ED FOR THE DISADVANTAGED, RECOVERY ACT | $13,000,000,000 | $4,878,580,543 | $0
91 | 1001 | | SCHOOL IMPROVEMENT PRG, RECOVERY ACT | $720,000,000 | $0 | $0
91 | 1401 | | OFC OF INSPECTOR GENERAL, RECOVERY ACT | $14,000,000 | $0 | $0
91 | 1909 | | ST FISCAL STABILIZATION FUND, RECOV ACT | $53,600,000,000 | $0 | $0
Total | | | | $97,407,000,000 | $11,363,064,856 | $0
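
The column totals in the table above can be checked with a few lines of code; the account rows below are transcribed from the table, labeled by agency-account code.

```python
# (account, total appropriation, total obligations) from the table above.
rows = [
    ("91-0103", 100_000_000, 0),
    ("91-0196", 100_000_000, 0),
    ("91-0197", 250_000_000, 0),
    ("91-0198", 60_000_000, 0),
    ("91-0199", 16_483_000_000, 198_901_281),
    ("91-0207", 200_000_000, 0),
    ("91-0299", 12_200_000_000, 5_970_012_399),
    ("91-0302", 680_000_000, 315_570_633),
    ("91-0901", 13_000_000_000, 4_878_580_543),
    ("91-1001", 720_000_000, 0),
    ("91-1401", 14_000_000, 0),
    ("91-1909", 53_600_000_000, 0),
]
total_appropriation = sum(r[1] for r in rows)
total_obligations = sum(r[2] for r in rows)
print(f"Total appropriation: ${total_appropriation:,}")   # $97,407,000,000
print(f"Total obligations:   ${total_obligations:,}")     # $11,363,064,856
```

The obligations total reproduces the "Total Available: $11,363,064,856" figure quoted above.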

You'll see that the total amounts obligated and disbursed match what's listed on the web. What my previous post is trying to get at is:

1) how to get an up-to-date list of all these accounts (there are 12 listed for education here, but in a tally I'm working on, there are 14)


2) what the appropriation for each account is.  I'm happy to see the total appropriation for the Dept of Ed as $97,407,000,000 — since it matches what ProPublica lists at — not to mention statements like "The American Recovery and Reinvestment Act of 2009 (ARRA) provides approximately $100 billion for education" (

Once I have an accurate list of TAFS (e.g., 91-1909 for the State Fiscal Stabilization Fund = $53.6 billion), I'll use that list to slot the spending data.