Cool to see a digital historian explain screen-scraping

I'm adding Digital History Hacks to my list of weblogs to follow on the strength of its author, William J. Turkel, being a historian who works in "digital history" and writes about web spidering and scraping. To wit, from Digital History Hacks: Teaching Young Historians to Search, Spider and Scrape:

    To get the most out of the web, however, it is crucial that we begin to teach history students the rudiments of web programming. Spidering, for example, is the (automated) process of visiting a webpage, creating an index and a list of links to further pages, and then following each of those in turn and doing the same thing. Whenever we follow the citations in a footnote to another source, and then begin to read its footnotes, we are doing a kind of spidering. By teaching students how to implement this process on the computer we will not only teach them a crucial skill, we will make them more aware of the technologies that have long underlain the historian's craft. Scraping refers to the process of mechanically extracting information from sources (like webpages) that are intended to be read by people rather than machines. Because computers don't understand text in the way that people do, scraping has to rely on the form of the text to extract information, rather than the meaning. As a result, scrapers are 'brittle': if the form changes, the scraper breaks. For this reason, it is important for historians to be able to create their own tools, rather than using the tools created by others, and this, again, means that it is necessary to learn some rudimentary web programming.
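
For the curious, here is a minimal Python sketch of the two ideas in that passage. It is my own illustration, not Turkel's code. The spider visits a page, records something about it in an index, collects its links, and follows each in turn; the scraper pulls a "title" out of each page purely by form (the text of the first `<h1>`). Everything named here is an assumption made for the example: `LinkAndTitleParser`, `spider`, the `fetch` callback, and the three-page `PAGES` "web" held in memory so the sketch runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects href values from <a> tags and the text of the first <h1>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = None      # stays None if the page has no <h1>
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1" and self.title is None:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.title = (self.title or "") + data


def spider(start_url, fetch, max_pages=50):
    """Breadth-first crawl: visit a page, index it, queue its links, repeat."""
    index = {}                 # url -> scraped title (or None)
    queue = deque([start_url])
    seen = {start_url}         # avoids revisiting pages that link to each other
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)      # hypothetical callback: returns HTML text or None
        if html is None:
            continue
        parser = LinkAndTitleParser()
        parser.feed(html)
        index[url] = parser.title
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index


# Hypothetical three-page "web"; page c uses <h2> instead of <h1>.
PAGES = {
    "http://example.org/a": '<h1>Page A</h1><a href="b">b</a> <a href="/c">c</a>',
    "http://example.org/b": '<h1>Page B</h1><a href="a">back</a>',
    "http://example.org/c": "<h2>Page C</h2>",
}

index = spider("http://example.org/a", PAGES.get)
for url, title in index.items():
    print(url, "->", title)
```

Page c is there to show the brittleness the quote describes: its heading is an `<h2>`, so a scraper that keys on `<h1>` finds nothing and records `None`, even though a human reader would spot the title at once. To crawl real pages, one could pass a `fetch` function built on `urllib.request` instead of `PAGES.get`, ideally with some politeness (rate limiting, honoring robots.txt) that this sketch leaves out.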
