I recently read this great little article about doing MapReduce over music. The author put up a test dataset for people to use, but of course had to caveat it with “don’t abuse”. What struck me is that there is a host of interesting datasets out there:
- Some IMDB information (movies)
Could Amazon be great and construct a repository for this data, so that everybody from student researchers to people playing around with the next great startup idea could have access to it? After all, how many times do people want to start their project with the following steps: crawl/download; format/extract; … then finally process?
Be great! Offer up no-cost (or very, very low-cost) access to a bunch of shared data.