Data Unbound

Helping organizations access and share data effectively. Special focus on web APIs for data integration.

September 24th, 2009

Advice for recovery.gov

Rusty Talbot posted the following request for feedback on the Sunlight Labs list this morning:

The Recovery, Accountability, & Transparency Board wishes to have an open discussion with all interested developers about how data should be made available via Recovery.gov.

As you are all aware, a new version of Recovery.gov will be released soon. From a data standpoint, the initial release of the new site will replicate existing functionality. However, the Board aims to set a new standard of transparency with this site and would therefore like to make the data available in the most convenient and straightforward way (or ways) possible so you can use and analyze official, up-to-date Recovery Act data. We need your input to achieve this goal.

Please let us know how the site could best meet your needs in terms of  machine-readable data format(s) and standards, APIs, guidance, training, etc. [emphasis mine]

As I waited for Rusty to respond to my question of how best to provide feedback, Luigi Montanez went ahead and posted a series of excellent pointers. I second Luigi's advice, commend the recent OMB Watch Recovery Act Transparency Status Report, and have similar general web development advice to offer, which I wrote up as "Making Your Web Site Mashable" (pdf) (Chapter 12 of my book Pro Web 2.0 Mashups).

In terms of work specifically related to the Recovery Act, my Berkeley colleagues Erik Wilde, Eric Kansa, and I published a report "Proposed Guideline Clarifications for American Recovery and Reinvestment Act of 2009" in which we proposed and prototyped the use of Atom feeds to disseminate Recovery spending data. We are currently at work on updated recommendations based on the latest Recovery Act OMB Guidance.

One of the most important things that has made the Recovery spending less than transparent is how difficult it has been to locate basic accounting data. For example, after looking for months, I have yet to locate a reliable list of Recovery TAFS (Treasury Appropriation Fund Symbols), basically a list of all the pots of money (as tallied by Treasury) and the maximum amount of money we expect to see in each pot (the dollars appropriated). Now, Recovery.gov does list the amounts obligated and spent by agency, but how much money has been appropriated? That basic data should be clearly documented at Recovery.gov, so that we can track the flow of money reliably from the originating legislation to Treasury, out to the agencies, and then to contractors and grantees or the states. (I will note that ProPublica's Stimulus Tracker does break down the totals by agency but doesn't publish the list of individual accounts.)

At any rate, there is more to say — but I'll wait until Rusty responds to what is here.

July 7th, 2009

My project idea for the Freebase Hack Day

[Post in progress]

In this post, I will write about my project proposal for the upcoming Freebase HackDay.

The project is to elaborate on the prototype I described in "An org chart of the US Federal Government Based on OMB agency and bureau codes."

See what I've written at

http://www.freebase.com/discuss/threads/guid/9202a8c04000641f800000000c697aa0:

I'm writing up a longer post right now, but let me list a few things I'd love help with:

1) to do the reconciliation of government agencies to Freebase, I built a primitive Acre app to help me apply Freebase Suggest to a lot of items: http://suggest2reconcile.freebaseapps.com/ (see source: http://acre.freebase.com/#app=/user/rdhyee/suggest2reconcile&file=index and a background writeup of the idea: http://lists.freebase.com/pipermail/developers/2009-June/003014.html). Refining this app would be very useful!

2) as part of the reconciliation process, coming up with a good way to figure out from the suggest API whether a given suggestion is given with high confidence or not would be helpful.  Tom Morris has some ideas in http://lists.freebase.com/pipermail/developers/2009-June/003015.html

3) writing the data back from the reconciliation would be very useful. The data behind http://labs.dataunbound.com/doc/2009/06/govt.treeview.v0.1.html is http://labs.dataunbound.com/doc/2009/06/OMB_A_11_C_reconciled.v0.1.xml. How should we model the OMB codes and apply them to the government agencies in Freebase? And for the entities I couldn't find in Freebase, should we create new entities?

4) Re what Spencer wrote:  yes, I'd love to see someone come up with a better visualization than what I have at http://labs.dataunbound.com/doc/2009/06/govt.treeview.v0.1.html — especially if there is a generic viewer.
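
For item 2 above, one possible heuristic (my own assumption, not necessarily what Tom Morris proposes in his message) is to accept the top suggestion only when its relevance score clears an absolute floor and is clearly ahead of the runner-up. The sketch below assumes each suggestion has already been reduced to a (name, score) pair; the function name, the thresholds, and the shape of the data are all hypothetical and would need tuning against real suggest results.

```python
from typing import List, Optional, Tuple

# (candidate name, relevance score); this shape is an assumption for illustration
Suggestion = Tuple[str, float]


def confident_match(
    suggestions: List[Suggestion],
    min_score: float = 50.0,  # hypothetical absolute floor for the top score
    min_ratio: float = 2.0,   # hypothetical required lead over the runner-up
) -> Optional[str]:
    """Return the top candidate's name if it looks unambiguous, else None."""
    if not suggestions:
        return None
    # Rank candidates from highest to lowest score.
    ranked = sorted(suggestions, key=lambda s: s[1], reverse=True)
    top_name, top_score = ranked[0]
    if top_score < min_score:
        return None  # even the best candidate is weak
    if len(ranked) > 1:
        runner_up_score = ranked[1][1]
        if runner_up_score > 0 and top_score / runner_up_score < min_ratio:
            return None  # the top two are too close to call
    return top_name
```

Anything that comes back as None would go into a pile for manual review, which is what the suggest2reconcile app is for in the first place.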

April 30th, 2009

Previous recommendations would say "open the data" to Recovery.gov

As many have jumped into making recommendations on how Recovery data  should be packaged and disseminated, I'm reminded of some important previous work in this area.

The first is the ACM U.S. Public Policy Committee (USACM) Recommendations on Open Government. I have tremendous respect for the ACM as "the world's largest educational and scientific computing society". The ACM U.S. Public Policy Committee (USACM) "serves as the focal point for ACM's interaction with U.S. government organizations, the computing community, and the U.S. public in all matters of U.S. public policy related to information technology." The policy statement on "open government" first sets the context for its recommendations:

Individual citizens, companies and organizations have begun to use computers to analyze government data, often creating and sharing tools that allow others to perform their own analyses. This process can be enhanced by government policies that promote data reusability, which often can be achieved through modest technical measures. But today, various parts of governments at all levels have differing and sometimes detrimental policies toward promoting a vibrant landscape of third-party web sites and tools that can enhance the usefulness of government data.

The recommendations  "for data that is already considered public information" are:

  • Data published by the government should be in formats and approaches that promote analysis and reuse of that data.
  • Data republished by the government that has been received or stored in a machine-readable format (such as online regulatory filings) should preserve the machine-readability of that data.
  • Information should be posted so as to also be accessible to citizens with limitations and disabilities.
  • Citizens should be able to download complete datasets of regulatory, legislative or other information, or appropriately chosen subsets of that information, when it is published by government.
  • Citizens should be able to directly access government-published datasets using standard methods such as queries via an API (Application Programming Interface).
  • Government bodies publishing data online should always seek to publish using data formats that do not include executable content.
  • Published content should be digitally signed or include attestation of publication/creation date, authenticity, and integrity.

The second is a set of Open Government Data Principles formulated in December 2007 by the Open Government Working Group, "30 open government advocates gathered to develop a set of principles of open government data":

Government data shall be considered open if they are made public in a way that complies with the principles below:

1. Complete
All public data are made available. Public data are data that are not subject to valid privacy, security or privilege limitations.
2. Primary
Data are collected at the source, with the finest possible level of granularity, not in aggregate or modified forms.
3. Timely
Data are made available as quickly as necessary to preserve the value of the data.
4. Accessible
Data are available to the widest range of users for the widest range of purposes.
5. Machine processable
Data are reasonably structured to allow automated processing.
6. Non-discriminatory
Data are available to anyone, with no requirement of registration.
7. Non-proprietary
Data are available in a format over which no entity has exclusive control.
8. License-free
Data are not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.

Compliance must be reviewable.

The third is the paper “Government Data and the Invisible Hand” (Yale Journal of Law & Technology 11: 160) by David Robinson, Harlan Yu, and Edward Felten. The abstract contains the following recommendation:

Today, government bodies consider their own websites to be a higher priority than technical infrastructures that open up their data for others to use….It would be preferable for government to understand providing reusable data, rather than providing websites, as the core of its online publishing responsibility.

In  ProgrammableWeb last year, I distilled the paper's argument as follows:

The conclusion is based on a claim that the executive branch is comparatively ineffective at creating tools for presenting data and should therefore leave that work to a private sector (either nonprofit or commercial entities) that is best able to respond to a wide variety of possible uses for government data. That doesn’t mean that the government should provide no user interface to the data, but rather “should focus on creating a simple, reliable and publicly accessible infrastructure that exposes the underlying data.” Fancier interfaces and tools should be built by others.

Moreover, the authors have recommended a specific mechanism for ensuring that the government does not privilege any user interface over their public data infrastructure: “require that federal websites themselves use the same open systems for accessing the underlying data as they make available to the public at large.”

Let me now make sure that these recommendations are at least referenced somewhere at the "National Dialogue" around the Recovery.

April 10th, 2009

Typographical or semantic irregularities at recovery.gov?


Irregularities at recovery.gov

Originally uploaded by Raymond Yee

Why are there two reports with the same date? This screenshot is from the reports from the Department of Labor on recovery.gov.

March 20th, 2009

Wilde, Kansa, and Yee "Proposed Guideline Clarifications for American Recovery and Reinvestment Act of 2009"

Earlier in the week, Erik Wilde, Eric Kansa, and I published our technical report Proposed Guideline Clarifications for American Recovery and Reinvestment Act of 2009, a set of technical guidelines for how we think recovery.gov should publish data about how stimulus money is being spent and a prototype of what people can do with the data if data were published accordingly.  Here's the abstract of our report:

The Initial Implementing Guidance for the American Recovery and Reinvestment Act of 2009 provides guidance for a feed-based information dissemination architecture. In this report, we suggest some improvements and refinements of the initial guidelines, in the hope of paving the path for a more transparent and useful feed-based architecture. This report is meant as a preliminary guide to how the current guidelines could be made more specific and provide better guidance for providers and consumers of Recovery Act spending information. It is by no means intended as a complete or final set of recommendations.

The technical heart of the work is the XSD schemas for communications, formula block grant allocation, and weekly report feeds. But the most fun part is looking at how some fake data would appear, displayed in a mashup of Google Maps and Simile Timeline. We made up some data because at the time of analysis, there wasn't much in the way of real government data to use. We hope that situation will change soon.
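
To give a flavor of what consuming a feed-based architecture involves, here is a minimal Python sketch that reads an Atom document with the standard library and lists its entries, newest first. The sample feed is invented for illustration; its titles and identifiers are not taken from our report's schemas or from any real agency feed.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up feed for illustration only; this is not real Recovery Act data.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example agency weekly reports</title>
  <id>urn:example:feed</id>
  <updated>2009-03-20T00:00:00Z</updated>
  <entry>
    <title>Weekly report 1</title>
    <id>urn:example:report:1</id>
    <updated>2009-03-13T00:00:00Z</updated>
  </entry>
  <entry>
    <title>Weekly report 2</title>
    <id>urn:example:report:2</id>
    <updated>2009-03-20T00:00:00Z</updated>
  </entry>
</feed>"""


def list_entries(feed_xml):
    """Return (updated, id, title) tuples for each entry, newest first."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        entries.append((
            entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
            entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            entry.findtext("atom:title", default="", namespaces=ATOM_NS),
        ))
    # Timestamps in one consistent UTC format sort correctly as plain strings.
    return sorted(entries, reverse=True)


for updated, entry_id, title in list_entries(SAMPLE_FEED):
    print(updated, entry_id, title)
```

A real consumer would fetch the feed over HTTP and then validate the payload inside each entry against the relevant schema; neither step is shown here.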

What next for me on this front?  Revisiting the questions I posed in my March 7 post (Some questions about the implementation guidelines for the recovery feeds) to see whether I can now answer them.

March 4th, 2009

A meetup around government transparency: Wednesday, March 11, 2009 in SF

I'm co-leading a meetup around government transparency in San Francisco next Wednesday, March 11, 2009:

Democracy Web

President Barack Obama has promised an era of unprecedented transparency for the US government. In anticipation of vast flows of data from the federal government in the weeks and months to come, we are organizing a SF Bay Area local interest group around tracking and interpreting this data. An immediate catalyst and focus for this first meeting is the Sunlight Foundation's Apps for America Competition (with entries due March 31, 2009). However, we welcome folks interested in the larger topic of government transparency, whether or not they plan to take part in the immediate competition.

Location: Metaweb Technologies 631 Howard St, 4th floor San Francisco, CA 94105
Time: 6:30pm

All are welcome  but please RSVP on  meetup.com.

February 28th, 2009

New Creative Commons license: CC0 — “No Rights Reserved”

The new CC0 ("No Rights Reserved") Creative Commons license has been released as version 1.0. This new license gives "creators a way to waive all their copyright and related rights in their works to the fullest extent allowed by law." Great. I'm wondering when it'll be available for regular users to associate with their own photos on Flickr, as opposed to the public domain assertion licensing provision already in use in the Flickr Commons.

February 27th, 2009

IP restrictions on the Sunlight Labs APIs and associated data sets?

I just posed a question on the sunlightlabs api group (ok to push data sets and APIs to Freebase.com? – Sunlight Labs API Discussion | Google Groups):

My question is whether it's ok for me to upload some of the data I can get from http://services.sunlightlabs.com/api/ to freebase.com. Freebase then makes its collection of data available under a variety of licenses, including the CC-BY license (http://www.freebase.com/signin/licensing) .

I don't see any restrictions in the ToS against doing so. Moreover, I don't see any statement of how the data is licensed — if at all. What statement of copyright is there?

I'll update here on what I hear back.

February 27th, 2009

What I hope to learn at the Freebase Build-A-Base meeting tomorrow

I've been thinking about how to prepare for  tomorrow's Build-A-Base tutorial at Freebase. I've already started building two bases:

For the PolDB project, I should sit down to make a schema to model American politicians, first by identifying relevant ones already in Freebase and then finding one or two that are either missing from Freebase or whose entries are woefully inadequate. Tomorrow, I want to hear any war stories around using Freebase WEX, crawling government databases, and users pushing public domain info into Freebase.

On the data modeling front, I'd like to learn techniques for evolving schemas and how to involve the Freebase community in helping out. Perhaps data models for legislators are a well understood topic; I should ask on the openhouse list. I should also ask about the state level and postpone the municipal level to the future.

On the History of Art front, I'm interested in building some sort of history of art tools and/or website aimed at improving art history learning and teaching, both in the classroom and in informal settings. I'm currently crafting a proposal for the NEH Digital Humanities Start-Up program, which is due on Wednesday, April 8. My ideas are still forming but I'd like to build a "semantic" open history of art database, using Freebase, that is useful for learning and teaching the history of art.

I know that there's a good start in Freebase Visual Arts Commons. My working assumption is that I can build on top of that — meaning that

  1. the visual arts commons (actually, what's the difference between a "commons" and a "base" — is a commons an officially sanctioned base?) is the place to start from and contribute to
  2. that the commons is a pretty good corpus, something I have to check.  That is, is it comprehensive enough for people to start to ask meaningful questions and get reasonable answers using the data already there plus maybe a bit of supplementary data entry work?

With those assumptions in mind,   I'd like to

  1. work with people who have relevant data to try to convince them to give some of it to Freebase. That might take some doing, but perhaps some museums would be willing to contribute a subset of data once they see some benefits
  2. build services and tools to help in the learning and teaching aspects of the history of art

Some possible ideas:

  • tools to help people review facts in the history of art, maybe a slide reviewer or guessing game
  • tools to let art history instructors integrate timelines, little JavaScript widgets representing art works, artists, periods in the context of their own websites.

I'm trying to garner support for this project in the Freebase and art history communities.

February 27th, 2009

Plotting political boundaries on Google Maps

As I start to develop a database of politicians for a prototype of PolDB that I'm developing for the Apps for America contest, I will likely be using Google Maps to display state, county, and census boundaries. A good example of such maps to study is Webfoot Census Maps, which I found via Google Maps Mania: U.S. Census Bureau + Google Maps.

Of immediate interest to me is whether I can use Webfoot's Mapeteria: Map Colouring to quickly reproduce such maps as Stimulus Legislation, Breakdown by States – The Wall Street Journal Online, which color the US states according to some scalar value. I also wonder whether I can do almost as well using the Google Chart API, which I wrote about on ProgrammableWeb last year: Google Chart API’s New Schematic Maps.
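
Whichever tool draws the map, coloring states by a scalar value boils down to normalizing each value and interpolating between two colors. The following sketch shows one way to do that in Python. The function names, the white-to-blue color ramp, and the placeholder keys are my own illustrative choices, not something Mapeteria or the Google Chart API prescribes.

```python
def value_to_hex(value, vmin, vmax, low=(255, 255, 255), high=(0, 0, 255)):
    """Linearly interpolate between two RGB colors; return an 'RRGGBB' string."""
    if vmax == vmin:
        t = 0.0  # all values identical: use the low color
    else:
        t = (value - vmin) / (vmax - vmin)
    t = max(0.0, min(1.0, t))  # clamp out-of-range values to the ramp ends
    rgb = [round(lo + t * (hi - lo)) for lo, hi in zip(low, high)]
    return "".join(f"{channel:02X}" for channel in rgb)


def color_states(values):
    """Map {state: scalar} to {state: 'RRGGBB'} using the data's own range."""
    if not values:
        return {}
    vmin, vmax = min(values.values()), max(values.values())
    return {state: value_to_hex(v, vmin, vmax) for state, v in values.items()}


# Placeholder keys and numbers, not real stimulus figures.
print(color_states({"State A": 1.0, "State B": 3.0}))
```

The resulting hex strings can then be fed to whatever map-coloring service or overlay code is in use.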

Besides a potential quick win from using Mapeteria, I am looking in the medium term into techniques for creating Google Maps out of public data stored in the shapefile format. For example, I downloaded a shapefile for US states, which I now hope to convert to KML with FWTools.
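
Whatever tool does the conversion, the target is simple: a KML Placemark holding a Polygon whose outer ring is a list of longitude,latitude pairs. Here is a minimal sketch that writes such a document with the Python standard library. It assumes the boundary coordinates have already been read out of the shapefile by some other means (that step is not shown), and the function name and the square test ring are hypothetical.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def _tag(name):
    """Qualify a tag name with the KML namespace."""
    return f"{{{KML_NS}}}{name}"


def polygon_to_kml(name, ring):
    """Build a KML document with one Placemark from a ring of (lon, lat) pairs."""
    ring = list(ring)
    if ring and ring[0] != ring[-1]:
        ring.append(ring[0])  # a KML LinearRing must end where it starts
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(_tag("kml"))
    document = ET.SubElement(kml, _tag("Document"))
    placemark = ET.SubElement(document, _tag("Placemark"))
    ET.SubElement(placemark, _tag("name")).text = name
    polygon = ET.SubElement(placemark, _tag("Polygon"))
    outer = ET.SubElement(polygon, _tag("outerBoundaryIs"))
    linear_ring = ET.SubElement(outer, _tag("LinearRing"))
    coordinates = ET.SubElement(linear_ring, _tag("coordinates"))
    # KML coordinate order is lon,lat,altitude; altitude is fixed at 0 here.
    coordinates.text = " ".join(f"{lon},{lat},0" for lon, lat in ring)
    return ET.tostring(kml, encoding="unicode")


# A hypothetical unit square standing in for a real state boundary.
print(polygon_to_kml("Test square", [(0, 0), (1, 0), (1, 1), (0, 1)]))
```

A real state boundary has many rings (islands, for instance), so a fuller version would emit one Polygon per ring inside a MultiGeometry.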

In the longer term, I wonder whether it'd make sense to have alternatives besides Google Maps.  One that caught my eye recently is CloudMade » Introducing the CloudMade Developer Zone, which builds upon OpenStreetMap data.