higher education

Cool to see a digital historian explain screen-scraping

I'm adding Digital History Hacks to my list of weblogs to follow on the strength of its author, William J. Turkel, being a historian who works in "digital history" and writes about web spidering and scraping. To wit, Digital History Hacks: Teaching Young Historians to Search, Spider and Scrape:

    To get the most out of the web, however, it is crucial that we begin to teach history students the rudiments of web programming. Spidering, for example, is the (automated) process of visiting a webpage, creating an index and a list of links to further pages, and then following each of those in turn and doing the same thing. Whenever we follow the citations in a footnote to another source, and then begin to read its footnotes, we are doing a kind of spidering. By teaching students how to implement this process on the computer we will not only teach them a crucial skill, we will make them more aware of the technologies that have long underlain the historian's craft. Scraping refers to the process of mechanically extracting information from sources (like webpages) that are intended to be read by people rather than machines. Because computers don't understand text in the way that people do, scraping has to rely on the form of the text to extract information, rather than the meaning. As a result, scrapers are 'brittle': if the form changes, the scraper breaks. For this reason, it is important for historians to be able to create their own tools, rather than using the tools created by others, and this, again, means that it is necessary to learn some rudimentary web programming.
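To make the spidering half of that description concrete, here is a minimal sketch in Python. It is an illustration only, not Turkel's code: the `PAGES` dictionary is a hypothetical stand-in for the web so the example runs without a network connection, and the names `LinkAndTextParser` and `spider` are made up for this post. A real spider would fetch each page over HTTP and would need to respect robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collects the link targets and the visible words of one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Record every whitespace-separated word, lowercased.
        self.words.extend(word.lower() for word in data.split())


# Hypothetical stand-in for the web: URL -> HTML source.
PAGES = {
    "/a": '<p>Footnote one</p><a href="/b">B</a><a href="/c">C</a>',
    "/b": '<p>Footnote two</p><a href="/a">A</a>',
    "/c": '<p>Archive</p><a href="/missing">gone</a>',
}


def spider(start, pages):
    """Visit pages breadth-first: index each page's words, then follow its links."""
    index = {}              # word -> set of URLs whose text contains it
    visited = []            # URLs in the order they were processed
    seen = {start}          # URLs already queued, so loops are not followed twice
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue        # dead link: nothing to index or follow
        visited.append(url)
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited, index


visited, index = spider("/a", PAGES)
print(visited)
print(sorted(index["footnote"]))
```

The `seen` set is what keeps the spider from chasing the `/a` to `/b` to `/a` loop forever, much as a reader following footnotes stops when a citation leads back to a source already read.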

digital scholarship
higher education
humanities
screen scraping


Large-scale IT Trends Facing the University

I identify three trends in IT that will have a large impact on the university:

  • increasingly inexpensive storage, network, and computation power for individuals. For $25/year, I am promised unlimited storage and bandwidth for all my photos by Flickr. I can upload all my videos to YouTube or Google Video for free. For $16/month, I have 400 GB of storage and 4 TB of monthly bandwidth from dreamhost.com. With this comparatively inexpensive infrastructure, I can create sophisticated web applications that fuse together a vast array of open source libraries and applications, as well as further storage (S3) and computation power (EC2) from amazon.com and numerous other providers.
  • the rise of peer production/mass collaboration in "Web 2.0". In naming "You" (that is, all the many, typically nameless, individuals who participate on the Web) as Person of the Year, Time summarizes this trend in the following way: "In 2006, the World Wide Web became a tool for bringing together the small contributions of millions of people and making them matter." It is easy to spot the plentiful junk emerging from Web 2.0, yet universities will find it increasingly difficult to dismiss the astounding richness of such entities as the Wikipedia and Flickr.
  • the continued deployment of XML web services. XML will continue to be used widely by organizations and, more recently, by individual users. Using service-oriented architectures, organizations/enterprises will re-factor their infrastructure in terms of reusable services that will be accessible through XML web services.

After first dismissing these technology trends as merely faddish, the university community will come to terms with them, adapting them to the university environment to take advantage of their positive aspects while avoiding the negatives (which are very real, because of the difference in priorities between commercial enterprises and the university).

These technology trends will accentuate the computerization of research in academic disciplines. Some pioneers, especially those in disciplines that have a long history of computation, have already taken advantage of commodity hardware and built extensive computer-based collaborations. Many other researchers will be struggling to use the same technology. I argue that it is in the institution's interests to help all of its members to work at some baseline level. Moreover, there will be challenges, such as the long-term archiving of data, that the university as a whole will have to tackle, creating a demand for architectures and policies to handle these common needs.

The availability of cheap hardware and storage outside the university presents an immediate challenge to the university. Many pioneering university members will be tempted to use those systems because of their low prices, even if these services are not quite optimized for users' academic needs. Should people at the university be encouraged to use those outside services? Is there a way for the university to purchase those services and adapt them on behalf of the university community? What policies should be put in place concerning the use of outside services? I predict that the university will figure out a combination of industrial partnerships, system integration, and ways to help individuals cobble together the best solutions that will satisfy their research needs and also handle relevant policy issues.

The university community will have its own large collections of data and digital content to handle. Take, for example, the digitization of the UC library, which will result in a collection of millions of digitized books available to the university community. These data present incredible opportunities for education and research, ones that are best exploited if we work together as a community.

This is a great time for the university to develop an information technology architecture to handle these challenges, specifically a service-oriented architecture (SOA) that will work for this context.

UC Berkeley
architecture
higher education
