Data Unbound

Helping organizations access and share data effectively. Special focus on web APIs for data integration.

October 5th, 2009

Web Services for Recovery.gov

Today, my colleagues Erik Wilde, Eric Kansa, and I are pleased to announce our new report "Web Services for Recovery.gov" and its companion website recovery.berkeley.edu. Last week, the redesign of Recovery.gov was made public to much fanfare. Recovery.gov is the U.S. government’s official website for publicly documenting how funds from the American Recovery and Reinvestment Act of 2009 (ARRA) have been allocated and spent. Our work focuses on a crucial aspect of Recovery.gov that has yet to receive sufficient attention, namely, how data on Recovery Act spending will be made available in machine-readable form for analysis, interpretation, and visualization by third-party applications. In our report and on our website, we propose a reporting architecture, provide some sample feeds based on that architecture, and demonstrate how that data could be used in a simple map-based mashup.

Here are some highlights from our report, which I quote (with a bit of editing):

  • Design priorities for Recovery.gov need to shift from focusing on deploying an attractive Web site toward designing ARRA web services to support reuse of data in third-party applications.
  • These services should allow any party to receive the complete set of ARRA reporting data in a timely and easily usable manner, so that in principle, the full functionality of Recovery.gov could be replicated by a third party.
  • Our proposed architecture is based on the principles of Representational State Transfer (REST) and on always attempting to use the simplest and most widely known and supported technology for any given task.
  • We recommend the feed-based dissemination of ARRA reporting data using the most widely used technologies on the Internet today: HTTP for service access, Atom for the service interface, and XML for the data provided by the service. This approach allows access from sophisticated server-based applications or from resource-constrained devices such as mobile phones.
  • The manner in which data flows from FederalReporting.gov to Recovery.gov is of critical importance. Ideally, Recovery.gov should use Web services offered by FederalReporting.gov.
  • We strongly recommend that Recovery reporting systems adopt the Atom syndication format for feeds. Feeds represent a major positive development in making government data more open to citizen review and reuse and provide a unique ability to do so by merging utility for humans as well as machines.
  • While not formally standardized, feed autodiscovery is well supported by current browsers and could be implemented reliably with a well-defined set of implementation guidelines for Web pages offered by Recovery.gov.
  • We strongly recommend making feed paging and archiving mandatory, so that the feeds are not just a temporary way of communicating that information has become available. Instead, the feed pages should be available as persistent and permanent access points, so that accessing information via feeds can be done robustly and reliably.
  • ARRA data dissemination services should be more resource-oriented than service-oriented. XML representations should contain links (in the form of URIs) to related data resources, thereby representing the relationships between the different concepts which are relevant for reporting.
  • The Recovery reporting schema uses many different coding systems and identifiers. Publication of resources related to some of these identifiers will be of great value. (We list key identifiers in the report.)
  • There are many possible analyses that people may wish to perform on Recovery data, making it difficult to accommodate them all. Therefore, querying services should be oriented toward making machine-readable representations of data available, so that third-party developers can easily populate their own analysis engines and run their own specialized algorithms on that data.
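
To make the feed-related recommendations above a little more concrete, here is a minimal Python sketch of how a third-party client might read one page of an Atom feed, pick up the link to the next page, and collect the URIs that each entry uses to point at related resources. It is an illustration only, not code from the report: the feed content, the example.org URLs, the "related" link to a recipient resource, and the function name parse_feed_page are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed page: every URL and entry here is invented for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example ARRA award reports</title>
  <id>http://example.org/feeds/awards</id>
  <updated>2009-10-05T00:00:00Z</updated>
  <link rel="self" href="http://example.org/feeds/awards?page=2"/>
  <link rel="next" href="http://example.org/feeds/awards?page=3"/>
  <entry>
    <title>Award report 1</title>
    <id>http://example.org/awards/1</id>
    <updated>2009-10-01T00:00:00Z</updated>
    <link rel="alternate" href="http://example.org/awards/1.xml"/>
    <link rel="related" href="http://example.org/recipients/42"/>
  </entry>
</feed>
"""


def parse_feed_page(xml_text):
    """Return (entries, next_url) for a single Atom feed page."""
    root = ET.fromstring(xml_text)

    # Feed-level paging link: tells the client where the next page lives.
    next_url = None
    for link in root.findall(ATOM + "link"):
        if link.get("rel") == "next":
            next_url = link.get("href")
            break

    entries = []
    for entry in root.findall(ATOM + "entry"):
        # Group each entry's links by their rel value (default is "alternate").
        links = {}
        for link in entry.findall(ATOM + "link"):
            links.setdefault(link.get("rel", "alternate"), []).append(link.get("href"))
        entries.append({
            "id": entry.findtext(ATOM + "id"),
            "title": entry.findtext(ATOM + "title"),
            "links": links,
        })
    return entries, next_url


if __name__ == "__main__":
    page_entries, next_page = parse_feed_page(SAMPLE_FEED)
    for item in page_entries:
        print(item["title"], item["links"].get("related", []))
    print("next page:", next_page)
```

A client that wants the complete data set would fetch each page over HTTP and keep following the "next" link until none is present, which is why persistent, permanent feed pages matter for robust access.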

Erik Wilde has also commented on our report. We welcome and look forward to your feedback.

Finally, we are grateful to the Sunlight Foundation for a grant that helped to support this effort.

March 20th, 2007

Building the Berkeley Technology Platform: A Proposal

The single greatest challenge for UC Berkeley is retaining its pre-eminence as a world-famous university in the face of not only such traditional competitors as Stanford and Harvard but also the myriad distributed groups of individuals and organizations that use the Web to produce and disseminate information. A big lesson of Web 2.0 is the incredible amount of knowledge and skill available to be harvested, distributed throughout the Berkeley community (our faculty, our students, our staff, our alumni) as well as the world beyond UC Berkeley. To meet that challenge through technology, I would put my focus on building a collaborative platform (both virtual and "in real life") to enable all these people to contribute and work together. And because I do not know all the answers about what to do, I would encourage experimentation and invite many people to work with me.

Building services for faculty as researchers and teachers

We need to help our faculty apply computational techniques to their cutting-edge research. To that end, I suggest that we assemble teams that combine disciplinary and IT expertise; create a blend of centralized and discipline-specific computational infrastructure to support research and teaching; forge collaborations among IT organizations, libraries, and educational technologists to tackle institution-wide problems such as institutional repositories; and create packages of basic commodity hosting to support research and teaching.

Building a Berkeley Technology Platform (BTP) and an underlying SOA

This is a great time for UC Berkeley to develop an information technology architecture to support deep collaboration, specifically an SOA that will work for this context. Because there is little experience of deploying an SOA at the university, we can start with small pilot projects that emphasize the consumption of web services, followed by the deployment of a small set of web services. For example: a web service that gives the roster of a course and another web service that lists the courses a professor is currently teaching. I know that such web services would have an immediate audience. Once we gain experience with web services, we can look at building a larger framework for the deployment and consumption of web services in SOA fashion. At that point, I would advocate for the building of a Berkeley Technology Platform (BTP) that exploits XML and XML web services to create an underlying service-oriented architecture for the campus. By the BTP, I mean the equivalent of the Amazon technology platform: a set of services and infrastructure available both to internal programmers, who can use it to create web interfaces and access data, and to external audiences, who can build complementary services on top of the ones provided by the platform. The BTP would be a rallying point for integration. Departments have data that can be reused by other departments. The Berkeley Technology Platform would provide an integrated framework for that data. Moreover, the BTP provides a way for internal and external audiences to come together. The Berkeley platform is an opportunity for collaboration around campus, certainly among application, infrastructure, and data architects within IST.
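
As a rough illustration of how small the two pilot services mentioned above could be, here is a minimal Python sketch of a WSGI application that returns XML for a course roster and for the courses an instructor teaches. Everything specific in it is an assumption made for the example: the URL patterns (/courses/<id>/roster and /instructors/<id>/courses), the XML element names, and the sample data are invented, and a real service would need authentication and privacy controls before exposing rosters.

```python
import xml.etree.ElementTree as ET

# Invented sample data; a real service would query campus systems instead.
COURSES = {
    "EX-101": {"instructor": "prof-a", "roster": ["student-1", "student-2"]},
    "EX-202": {"instructor": "prof-a", "roster": ["student-3"]},
    "EX-303": {"instructor": "prof-b", "roster": []},
}


def roster_xml(course_id):
    """XML roster for one course, or None if the course is unknown."""
    course = COURSES.get(course_id)
    if course is None:
        return None
    root = ET.Element("roster", course=course_id)
    for student in course["roster"]:
        ET.SubElement(root, "student", id=student)
    return ET.tostring(root, encoding="unicode")


def courses_taught_xml(instructor_id):
    """XML list of the courses an instructor currently teaches."""
    root = ET.Element("courses", instructor=instructor_id)
    for course_id, course in sorted(COURSES.items()):
        if course["instructor"] == instructor_id:
            ET.SubElement(root, "course", id=course_id)
    return ET.tostring(root, encoding="unicode")


def app(environ, start_response):
    """Minimal WSGI app routing the two hypothetical URL patterns."""
    parts = environ.get("PATH_INFO", "").strip("/").split("/")
    body = None
    if len(parts) == 3 and parts[0] == "courses" and parts[2] == "roster":
        body = roster_xml(parts[1])
    elif len(parts) == 3 and parts[0] == "instructors" and parts[2] == "courses":
        body = courses_taught_xml(parts[1])
    if body is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    start_response("200 OK", [("Content-Type", "application/xml")])
    return [body.encode("utf-8")]
```

For local experimentation, an app like this can be served with the standard library's wsgiref.simple_server.make_server; the point of the sketch is only that each service maps a URL to a machine-readable representation that other departments or students could consume.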

In developing the BTP, we should invite students to be active co-developers, to use our web services and show us what can be done with them. If we are doing things right, we will be surprised by how people use them. Several years ago, I hired a student who had made a name for himself by web-scraping the Berkeley course catalog system to create an alternative, and reportedly superior, interface. Ideally, we can build our systems so that students do not have to web-scrape them but instead have an API to access the data and wrap their own interface around it. I would like to find more students like him. Moreover, from teaching my own course "Mixing and Remixing Information," I know that students who have very few computer skills are capable of building reasonably elaborate systems that bring together disparate elements. There is a lot of talent among students to be tapped.

Building collaboration systems that combine the virtual with our physical co-location

The internet has shown a profound capability for connecting people around the world. I believe that UC Berkeley can better apply networked technologies to supporting collaboration right on campus, where tens of thousands of people are co-located. For example, might it be worthwhile to set up something equivalent to the Stanford Wiki at Berkeley?

Building structures for IT staff to learn from each other

We can do more to enable UC Berkeley IT staff to learn from each other. I would like to personally teach campus staff a version of the School of Information course I teach on XML and web services. With the right opportunities to learn, mentor, and experiment, the staff will be inspired and empowered to create the elements we need in the BTP.

March 20th, 2007

Large scale IT Trends Facing the University

I identify three trends in IT that will have a large impact on the university:

  • increasingly inexpensive storage, network, and computation power for individuals. For $25/year, I am promised unlimited storage and bandwidth for all my photos by Flickr. I can upload all my videos to YouTube or Google Video for free. For $16/month, I have 400 GB of storage and 4 TB of monthly bandwidth from dreamhost.com. With this comparatively inexpensive infrastructure, I can create sophisticated web applications that fuse together a vast array of open source libraries and applications, as well as further storage (S3) and computation power (EC2) from amazon.com and numerous other providers.
  • the rise of peer production/mass collaboration in "Web 2.0". In naming "You" (that is, all the many, typically nameless, individuals who participate on the Web) as Person of the Year, Time summarizes this trend in the following way: "In 2006, the World Wide Web became a tool for bringing together the small contributions of millions of people and making them matter." It is easy to spot the plentiful junk emerging from Web 2.0, yet universities will find it increasingly difficult to dismiss the astounding richness of such entities as the Wikipedia and Flickr.
  • the continued deployment of XML web services. XML will continue to be used widely by organizations and, more recently, by individual users. Using service-oriented architectures, organizations/enterprises will re-factor their infrastructure in terms of reusable services that will be accessible through XML web services.

After first dismissing these technology trends as merely faddish, the university community will come to terms with them, taking advantage of their positive aspects and adapting them to the university environment while avoiding the negatives (which are very real, because of the difference in priorities between commercial enterprises and the university).

These technology trends will accentuate the computerization of research in academic disciplines. Some pioneers, especially those in disciplines that have a long history of computation, have already taken advantage of commodity hardware and built extensive computer-based collaborations. Many other researchers will be struggling to use the same technology. I argue that it is in the institution's interests to help all of its members to work at some baseline level. Moreover, there will be challenges, such as the long-term archiving of data, that the university as a whole will have to tackle, creating a demand for architectures and policies to handle these common needs.

The availability of cheap hardware and storage outside the university presents an immediate challenge to the university. Many pioneering university members will be tempted to use those systems because of their low prices, even if these services are not quite optimized for users' academic needs. Should people at the university be encouraged to use those outside services? Is there a way for the university to purchase those services and adapt them on behalf of the university community? What policies should be put in place concerning the use of outside services? I predict that the university will figure out a combination of industrial partnerships, system integration, and ways to help individuals cobble together the best solutions that will satisfy their research needs and also handle relevant policy issues.

The university community will have its own large collections of data and digital content to handle. Take, for example, the digitization of the UC library, which will result in a collection of millions of digitized books available to the university community. These data present incredible opportunities for education and research, ones that are best exploited if we work together as a community.

This is a great time for the university to develop an information technology architecture to handle these challenges, specifically an SOA that will work for this context.

March 20th, 2007

UC Berkeley's new Chief Technology Architect

Shel Waggener, the CIO of the campus, announced last week the appointment of the new CTA:

I am pleased to announce the appointment of Dr. Hébert Díaz-Flores as the campus's first Chief Technology Architect (CTA). Reporting to me as manager of the Technology Standards, Practices, and Architecture unit, Dr. Díaz-Flores will be the lead architect and evaluator in developing best-practices technology architecture and process assessments for the campus. He will work with Information Services and Technology and campus departments as a key stakeholder to develop and implement appropriate technology solutions.
