April 2009

Previous recommendations would say "open the data" to Recovery.gov

As many have jumped into making recommendations on how Recovery data should be packaged and disseminated, I'm reminded of some important previous work in this area.

The first is the ACM U.S. Public Policy Committee (USACM) Recommendations on Open Government. I have tremendous respect for the ACM as "the world’s largest educational and scientific computing society". The USACM "serves as the focal point for ACM's interaction with U.S. government organizations, the computing community, and the U.S. public in all matters of U.S. public policy related to information technology." The policy statement on "open government" first sets the context for its recommendations:

Individual citizens, companies and organizations have begun to use computers to analyze government data, often creating and sharing tools that allow others to perform their own analyses. This process can be enhanced by government policies that promote data reusability, which often can be achieved through modest technical measures. But today, various parts of governments at all levels have differing and sometimes detrimental policies toward promoting a vibrant landscape of third-party web sites and tools that can enhance the usefulness of government data.

The recommendations  "for data that is already considered public information" are:

  • Data published by the government should be in formats and approaches that promote analysis and reuse of that data.
  • Data republished by the government that has been received or stored in a machine-readable format (such as online regulatory filings) should preserve the machine-readability of that data.
  • Information should be posted so as to also be accessible to citizens with limitations and disabilities.
  • Citizens should be able to download complete datasets of regulatory, legislative or other information, or appropriately chosen subsets of that information, when it is published by government.
  • Citizens should be able to directly access government-published datasets using standard methods such as queries via an API (Application Programming Interface).
  • Government bodies publishing data online should always seek to publish using data formats that do not include executable content.
  • Published content should be digitally signed or include attestation of publication/creation date, authenticity, and integrity.

The second is a set of Open Government Data Principles formulated in October 2007  by the Open Government Working Group,  "30 open government advocates gathered to develop a set of principles of open government data":

Government data shall be considered open if they are made public in a way that complies with the principles below:

1. Complete
All public data are made available. Public data are data that are not subject to valid privacy, security or privilege limitations.
2. Primary
Data are collected at the source, with the finest possible level of granularity, not in aggregate or modified forms.
3. Timely
Data are made available as quickly as necessary to preserve the value of the data.
4. Accessible
Data are available to the widest range of users for the widest range of purposes.
5. Machine processable
Data are reasonably structured to allow automated processing.
6. Non-discriminatory
Data are available to anyone, with no requirement of registration.
7. Non-proprietary
Data are available in a format over which no entity has exclusive control.
8. License-free
Data are not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.

Compliance must be reviewable.

The third is the paper “Government Data and the Invisible Hand” (Yale Journal of Law & Technology 11: 160) by David Robinson, Harlan Yu, William Zeller, and Edward Felten. The abstract contains the following recommendation:

Today, government bodies consider their own websites to be a higher priority than technical infrastructures that open up their data for others to use….It would be preferable for government to understand providing reusable data, rather than providing websites, as the core of its online publishing responsibility.

In ProgrammableWeb last year, I distilled the paper's argument as follows:

The conclusion is based on a claim that the executive branch is comparatively ineffective at creating tools for presenting data and should therefore leave that work to a private sector (either nonprofit or commercial entities) that is best able to respond to a wide variety of possible uses for government data. That doesn’t mean that the government should provide no user interface to the data, but rather “should focus on creating a simple, reliable and publicly accessible infrastructure that exposes the underlying data.” Fancier interfaces and tools should be built by others.

Moreover, the authors have recommended a specific mechanism for ensuring that the government does not privilege any user interface over their public data infrastructure: “require that federal websites themselves use the same open systems for accessing the underlying data as they make available to the public at large.”

Let me now make sure that these recommendations are at least referenced somewhere at the "National Dialogue" around the Recovery.

Uncategorized


Amazon Web services in education program

Next time I teach my Mixing and Remixing Information course, I'll probably apply for a grant from the AWS in Education program:

AWS in Education provides a set of programs that enable the worldwide academic community to easily leverage the benefits of Amazon Web Services for teaching and research. With AWS in Education, educators, academic researchers, and students can apply to obtain free usage credits to tap into the on-demand infrastructure of Amazon Web Services to teach advanced courses, tackle research endeavors and explore new projects – tasks that previously would have required expensive up-front and ongoing investments in infrastructure.

Amazon


I'll be teaching a seminar on mashups at the Educause 2009 Annual Conference

I'm excited to be teaching a pre-conference seminar at the Educause 2009 Annual Conference. My proposal for a half-day seminar, Creating and Enabling Web Mashups, was accepted. The seminar will take place on Tuesday, November 3, 2009 at 8:30 AM. I'm looking forward to spending some time in Denver.

Here's a short abstract for the session:

There are thousands of web mashups that recombine everything from Google Maps to Flickr with useful data drawn from multiple websites. Mashups are educational, fun, and even transformative. In this tutorial, you will begin to build mashups that address problems of interest to you. You will learn how to combine APIs and data into mashups. You will also learn how to let others recombine content from your website.

Here's a longer abstract:

The Web contains thousands of mashups that recombine everything from Google Maps, Flickr, Amazon.com, and the New York Times with useful information about travel, finance, real estate, and more. By fusing elements from multiple web sites, mashups are often informative, fun, and even transformative, representing the way the Web as a whole is heading.

In this hands-on tutorial, you will learn how to build basic mashups and how to develop mashups to address problems of interest to you.   You will learn how to exploit such web elements as URLs, tags, and RSS feeds in your mashups; and how to combine APIs and data into mashups.   You will also learn how to enable users to recombine content from your website.  Although the most sophisticated mashups demand a wide range of technical knowledge, anyone with a solid knowledge of HTML will be able to learn practical skills from this tutorial.

training
tutorial


Congressional Oversight Panel, TARP, and Elizabeth Warren

I wish I had time to follow the TARP carefully; following the Stimulus already keeps me busy enough. However, I learned a lot from Jon Stewart's April 15 interview with Elizabeth Warren, the head of the Congressional Oversight Panel: Part 1 and Part 2.

government


Participating in the national online dialogue around recovery.gov

Yesterday, I wrote a story on ProgrammableWeb (An Online Dialogue to Shape Recovery.gov) to educate readers on recovery.gov (the government website meant to let Americans track the spending of money arising from the American Recovery and Reinvestment Act of 2009, the "Stimulus Package") and to draw attention to a “national dialogue” this week (until May 3) to solicit ideas aimed at answering the key question:

What ideas, tools, and approaches can make Recovery.gov a place where all citizens can transparently monitor the expenditure and use of recovery funds?

I've been reading some of the ideas presented so far and voted on a couple.  I added comments to two so far.   In response to the proposal XML Web Services ("Make recovery data available as a web service via SOAP XML."), I wrote:

I agree that some type of rigorous programmatic interface that allows developers to access the data from recovery.gov is essential. I think that SOAP and the rest of the associated WS-* stack might be one way to implement such access mechanisms, but I would not want SOAP to be the exclusive protocol used. I would argue, for instance, that a RESTful approach is also an excellent alternative to consider for recovery.gov.
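To make the RESTful alternative concrete, here is a minimal sketch of how a client might address a weekly report with nothing more than a URL and a plain HTTP GET. Everything specific in it is an assumption for illustration: the host api.example.gov, the /recovery/reports path, and the week_ending and format parameters are hypothetical, and recovery.gov offers no such endpoint.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL, used only for illustration.
BASE_URL = "https://api.example.gov/recovery/reports"


def build_report_url(agency, week_ending, fmt="xml"):
    """Return a GET URL: the path names the resource, the query string filters it."""
    query = urlencode({"week_ending": week_ending, "format": fmt})
    return f"{BASE_URL}/{quote(agency)}?{query}"


if __name__ == "__main__":
    print(build_report_url("labor", "2009-04-24"))
```

The point of the sketch is that the whole request fits in a URL that can be bookmarked, cached, or pasted into a browser, with no SOAP envelope to construct.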

On a front closer to what our work has been about, in response to Making stimulus spending data accessible to the public, I wrote:

I'm one of the Berkeley researchers mentioned above involved with making recommendations on how data feeds should be used to make the recovery more transparent (see http://www.ischool.berkeley.edu/newsandevents/news/20090417recoveryguidelines and http://isd.ischool.berkeley.edu/stimulus/2009-029/).

Although some (but not all) agencies receiving and disbursing recovery funds are using feeds in their reporting (see a list that we compiled at http://isd.ischool.berkeley.edu/stimulus/feeds/feeds.html), the best data on dollars appropriated, obligated, or spent is in the Excel spreadsheets. Although there are apparently templates for the reports, they keep changing format and there's nothing to stop agencies from inserting extra fields or omitting other fields. We know this for a fact since we've written programs to scrape the data from the spreadsheets and find it a challenge to keep up with changes that keep breaking our scripts.
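One way to notice that kind of drift early is to compare each report's header row against the columns the scraper expects before reading any figures. The sketch below is not our actual scraper: it assumes the spreadsheet has already been exported to CSV, and the column names in EXPECTED_COLUMNS are invented for illustration.

```python
import csv
import io

# Assumed template columns, for illustration only.
EXPECTED_COLUMNS = ["Agency", "Program", "Appropriated", "Obligated", "Spent"]


def check_header(header):
    """Return (missing, extra) column names relative to the expected template."""
    seen = [name.strip() for name in header]
    missing = [c for c in EXPECTED_COLUMNS if c not in seen]
    extra = [c for c in seen if c not in EXPECTED_COLUMNS]
    return missing, extra


def read_report(text):
    """Parse CSV text; raise ValueError if the columns no longer match the template."""
    reader = csv.reader(io.StringIO(text))
    header = [name.strip() for name in next(reader)]
    missing, extra = check_header(header)
    if missing or extra:
        raise ValueError(f"template drift: missing={missing}, extra={extra}")
    return [dict(zip(header, row)) for row in reader]


if __name__ == "__main__":
    sample = "Agency,Program,Appropriated,Obligated,Spent\nLabor,Training,100,50,10\n"
    print(read_report(sample))
```

Failing loudly on an unexpected header is a deliberate choice here: a script that silently reads the wrong column produces wrong dollar figures, which is worse than a script that stops.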

The federal government should publish the data as XML feeds in the first place (backed by a schema so that we can check that the data is valid), instead of making people who want to use that data scrape it out of Excel in a highly fragile process.
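To illustrate what checking the data against a schema buys a consumer, here is a small sketch. Python's standard library has no XML Schema validator, so this hand-rolled check of required fields is only a stand-in for real validation with a schema language such as XML Schema or RELAX NG. The report element and its child names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names; a published schema would define the real ones.
REQUIRED_FIELDS = ("agency", "appropriated", "obligated", "spent")


def validate_report(xml_text):
    """Return a list of problems found; an empty list means the record passed."""
    root = ET.fromstring(xml_text)
    problems = []
    for field in REQUIRED_FIELDS:
        node = root.find(field)
        if node is None or not (node.text or "").strip():
            problems.append(f"missing <{field}>")
        elif field != "agency":
            # Dollar amounts must parse as numbers.
            try:
                float(node.text)
            except ValueError:
                problems.append(f"<{field}> is not a number: {node.text!r}")
    return problems


if __name__ == "__main__":
    sample = (
        "<report><agency>Labor</agency><appropriated>100</appropriated>"
        "<obligated>50</obligated><spent>ten</spent></report>"
    )
    print(validate_report(sample))
```

With a published schema, every consumer could run the same check and reject a malformed report before it reaches any analysis.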

As I wrote yesterday, it will be interesting to see how well the recovery.gov site actually does at aggregating a large number of proposals and surfacing the best ones. Moreover,

government
recovery.gov tracking


Tracking the stimulus/recovery in the news

Over the last couple of months, I've been studying the Stimulus through the lens of the weekly reports published on recovery.gov.   My colleagues Erik Wilde and Eric Kansa (at the School of Information at UC Berkeley) and I  made recommendations on how data feeds should be used to foster transparency around stimulus data,  in addition to developing prototypes of the types of visualizations one could do with such data feeds.   We're continuing work on that front, specifically scraping data currently found in Excel and transforming that data into XML (Atom) feeds.

It is much easier to transform the financial data into visualizations and analyses once it is in the form of feeds (rather than Excel). The federal government should publish the data as XML in the first place (backed by a schema so that we can check that the data is valid), instead of making people who want to use that data scrape it out of Excel in a highly fragile process.
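As a rough illustration of the Excel-to-Atom step, the sketch below turns one scraped row into an Atom entry using only Python's standard library. It is not our production code: the row keys, the urn:example identifier, and the choice to put the obligated amount in the content element are all assumptions made for the example. The entry carries id, title, and updated, which every Atom entry needs.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Serialize Atom elements without a prefix.
ET.register_namespace("", ATOM_NS)


def row_to_entry(row):
    """Build an Atom <entry> element from a dict describing one spreadsheet row."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = row["id"]
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = f"{row['agency']}: {row['program']}"
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = row["updated"]
    content = ET.SubElement(entry, f"{{{ATOM_NS}}}content", type="text")
    content.text = f"Obligated: {row['obligated']}"
    return entry


if __name__ == "__main__":
    # Hypothetical row, as a scraper might produce it.
    row = {
        "id": "urn:example:recovery:labor:2009-04-24",
        "agency": "Labor",
        "program": "Training",
        "updated": "2009-04-24T00:00:00Z",
        "obligated": "50",
    }
    print(ET.tostring(row_to_entry(row), encoding="unicode"))
```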

To discern the meaning of the data we are extracting from various government sites,  I am now trying to keep up with the news around the recovery.  Here are some of the sources I've been tracking so far:

This list represents my current starting points.  I naturally expect to find a lot of other useful sources as I go along.

recovery.gov tracking


Typographical or semantic irregularities at recovery.gov?

Why are there two reports with the same date? This screenshot is from the reports from the Department of Labor on recovery.gov.

Uncategorized


pageid/curid as a unique id for Wikipedia pages

While learning how to program Freebase, I've come across links to Wikipedia that make use of a curid parameter. For example,

http://en.wikipedia.org/wiki/index.html?curid=296716

is the same as

http://en.wikipedia.org/wiki/Daniel_Akaka

At least, the two pages seem to be the same thing as far as I can see.

How do you look up the page title from a curid, or the curid from a title? One way, if we're screen-scraping, is to note that the page source of http://en.wikipedia.org/wiki/Daniel_Akaka contains

var wgArticleId = "296716";
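Pulling that value out of a page's source can be sketched with a regular expression. This is a screen-scraping stopgap that assumes the variable keeps appearing in roughly this form, and the pattern tolerates the number being quoted or unquoted. The helper name extract_article_id is my own.

```python
import re

# Matches: wgArticleId = "296716"  or  wgArticleId = 296716
ARTICLE_ID_RE = re.compile(r'wgArticleId\s*=\s*"?(\d+)"?')


def extract_article_id(page_source):
    """Return the page id found in the HTML source as an int, or None if absent."""
    match = ARTICLE_ID_RE.search(page_source)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(extract_article_id('var wgArticleId = "296716";'))
```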

And if you go to http://en.wikipedia.org/wiki/index.html?curid=296716, you'll find plenty of indications of what the title is, including the permanent link (e.g., http://en.wikipedia.org/w/index.php?title=Daniel_Akaka&oldid=278490360).

To dig deeper, I might want to understand the mediawiki data structure and the mediawiki API.
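As a first step in that direction, the MediaWiki API's action=query module accepts a pageids parameter and returns the matching title, which avoids screen-scraping altogether. The sketch below only builds the request URL and parses a reply, so it runs without network access. The JSON in the example is a hand-written sample shaped like the API's default JSON format, not a captured response, and the function names are my own.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def title_lookup_url(page_id):
    """Return the API URL that asks for the page with this curid/pageid."""
    params = {"action": "query", "pageids": page_id, "format": "json"}
    return API_URL + "?" + urlencode(params)


def title_from_response(body, page_id):
    """Extract the title from a query reply; None if the page has no title (e.g. missing)."""
    pages = json.loads(body)["query"]["pages"]
    return pages[str(page_id)].get("title")


if __name__ == "__main__":
    # Hand-written sample reply, for illustration only.
    sample = '{"query": {"pages": {"296716": {"pageid": 296716, "ns": 0, "title": "Daniel Akaka"}}}}'
    print(title_lookup_url(296716))
    print(title_from_response(sample, 296716))
```

Passing titles instead of pageids in the same query goes the other way, from a title to its page id.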

Wikipedia
freebase


I'm confused: how to provide the proper attribution for a CC-license photo in Freebase?

I'm puzzled by how to provide the correct attribution for derivatives of Creative Commons-licensed works. Does one have to track the entire provenance of the object? I came across this problem when I wanted to upload a photo from Wikipedia to Freebase. Here's how I posed my question on the Freebase general support board:

I'd like to upload the latest photo from http://en.wikipedia.org/wiki/File:Garret_Dillahunt.jpg (e.g., http://upload.wikimedia.org/wikipedia/commons/0/0e/Garret_Dillahunt.jpg) to http://www.freebase.com/view/en/garret_dillahunt but am in a quandary about how to do the proper attribution. The photo in question is a derivative (cropping + light adjustment) of http://www.flickr.com/photos/28821738@N05/2843824072/, which is licensed under a CC-BY-SA license. If I want to use the Wikipedia photo (a derivative of the one in Flickr), who do I credit as the copyright holder? The uploader of the Flickr photo? (If so, do I enter http://www.flickr.com/people/28821738@N05 or watchwithkristin or Kristin Dos Santos?) The Wikipedia? The Wikipedia user who made the last derivative?

creative commons
freebase


journalism as an antidote to information overload?

I think that there is certainly an important role for professional journalism, which can act as an invaluable filter. Overload! : CJR:

To win the war for our attention, news organizations must make themselves indispensable by producing journalism that helps make sense of the flood of information that inundates us all.

The same issue of CJR carries a call to visualize news data (Picture This : CJR), made in reference to the example at Metrics – In the Shadow of Foreclosures – NYTimes.com.

journalism
