Research IT

Film User Reviews Web Scraper

Research IT's drop-in sessions are just one of our many ways of reaching out to researchers across the University. We get some very interesting requests at our sessions and this is a great example of how our Research Software Engineers (RSEs) were able to help out.

Joseph McGonagle, Senior Lecturer in Cultural Studies in the French-speaking World, came to a drop-in session looking for a way to collate user reviews of the films of Austrian auteur Michael Haneke from three popular film websites: IMDB, Rotten Tomatoes and Allociné.

Haneke was of particular interest as he's a very divisive director, especially amongst filmgoers. Previously Joseph had collated the data by hand, copying and pasting the information from the sites in question into an Excel spreadsheet. This is a long and tedious process for one film, let alone all twelve of Haneke's films and possibly others that may be of interest.

A programmatic solution would gather the data far more efficiently and save a great deal of time. One of our RSEs developed a script for each website that, with minimal input from the user, records various data from each review, such as the reviewer's username, the date the review was published, the score they gave, the text of the review and whether they were a 'Super Reviewer'. The data is formatted and ordered in a CSV file that can be mined for further quantitative analysis, such as trends in the reception of the films and which films received the most reviews.
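To illustrate the general idea of turning review pages into CSV rows, here is a minimal sketch using only the Python standard library. It is not the RSE's actual script: the HTML snippet, the class names and the `data-super-reviewer` attribute are all invented for illustration, and the real sites use different markup. A real scraper would also need to download and paginate the review pages, which this sketch leaves out.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: the real sites use different (and changing) structures.
SAMPLE_HTML = """
<div class="review" data-super-reviewer="true">
  <span class="username">cinephile42</span>
  <span class="date">2019-10-14</span>
  <span class="score">4.5</span>
  <p class="text">Unsettling, and brilliant.</p>
</div>
<div class="review" data-super-reviewer="false">
  <span class="username">popcorn_fan</span>
  <span class="score">1.0</span>
  <p class="text">Too slow for me.</p>
</div>
"""

# Column order of the output CSV.
FIELDS = ["username", "date", "score", "text", "super_reviewer"]


class ReviewParser(HTMLParser):
    """Collects one dict per <div class="review"> block."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._current = None  # the review currently being built
        self._field = None    # the field whose text is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class", "")
        if tag == "div" and css_class == "review":
            # Start a new review with every field present but empty.
            self._current = {name: "" for name in FIELDS}
            self._current["super_reviewer"] = attrs.get("data-super-reviewer", "false")
        elif self._current is not None and css_class in FIELDS:
            self._field = css_class

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self.reviews.append(self._current)
            self._current = None
        self._field = None


def reviews_to_csv(html):
    """Parse review blocks out of `html` and return them as CSV text."""
    parser = ReviewParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(parser.reviews)
    return buffer.getvalue()


csv_text = reviews_to_csv(SAMPLE_HTML)
print(csv_text)
```

The second sample review has no date, so its date column is simply left empty. Keeping every column in a fixed order, even when a value is missing, is what lets the resulting file be loaded directly into a spreadsheet or analysis tool.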

[Screenshot: Caché - Movie Reviews, 14 October 2019]

The challenge came from the differences in how each website displayed its data. Allociné and Rotten Tomatoes also changed how they displayed data in the middle of the project: older reviews were suddenly deleted, dates were removed completely and the sites generally became harder to retrieve data from. Keeping the scripts in step with the latest changes became a game of cat and mouse, played so that all the necessary data could still be collected. This demonstrates just how difficult data collection from the web can be, and why it is necessary to keep up with such changes if datasets are to be as complete as possible.
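One way to notice this kind of change early is to check each batch of collected rows for fields that have suddenly gone missing. The sketch below is an assumption about how such a check might look, not part of the project's scripts. The field names, the `missing_field_counts` helper and the example rows are all hypothetical.

```python
from collections import Counter

# Fields we expect every review to carry (an assumption for this sketch).
REQUIRED_FIELDS = ["username", "date", "score", "text"]


def missing_field_counts(rows, required=REQUIRED_FIELDS):
    """Count, per field, how many rows have an empty or absent value."""
    counts = Counter()
    for row in rows:
        for field in required:
            if not (row.get(field) or "").strip():
                counts[field] += 1
    return dict(counts)


# Invented example rows: two of the three reviews have lost their date.
rows = [
    {"username": "cinephile42", "date": "2019-10-14", "score": "4.5", "text": "Unsettling."},
    {"username": "popcorn_fan", "date": "", "score": "1.0", "text": "Too slow."},
    {"username": "arthouse_amy", "score": "3.0", "text": "Cold but clever."},
]

report = missing_field_counts(rows)
for field, count in report.items():
    print(f"{field}: missing in {count} of {len(rows)} rows")
```

A sudden jump in the number of rows missing a particular field suggests that the site's layout has changed and the scraper needs updating.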

Even though the project was specifically for Michael Haneke's films, the scripts work on any film from the IMDB, Rotten Tomatoes and Allociné sites. If you are interested in collecting user reviews from these sites, the scripts have been open-sourced and can be found on GitHub, along with set-up instructions.

If you have any questions about this case study, or would like to discuss your research project or idea with our team, please contact us.