## Introduction

In this blog post we use Scrapy, an open-source, collaborative web-crawling framework for Python, to extract data from https://www.themoviedb.org/. The goal is to recommend movies and TV shows based on one of the user's favorite movies or TV shows. In this post I will be the user, and my favorite TV show is "Westworld". The spider crawls through the page of each cast member of Westworld, extracts the names of the movies and TV shows those actors have played in, and returns a CSV file at the end. This file is then used to predict which other movies or TV shows I might like, based on the number of actors they share with Westworld.

## How the Scraper Works

First we need to create a Scrapy project, which can be done by running the following commands in a terminal:

```shell
scrapy startproject TMDB_scraper
cd TMDB_scraper
```

Next we create a spider that crawls through the webpage. This can be done by creating a .py file inside the spiders directory and adding the following lines:

```python
# to run:
# scrapy crawl tmdb_spider -o results.csv

import scrapy
from scrapy.http import Request


class TmdbSpider(scrapy.Spider):
    name = 'tmdb_spider'
    start_urls = ['https://www.themoviedb.org/tv/63247-westworld']
```

This block of code imports the scrapy package (along with `Request`, which we will use later) and creates a new spider class with a unique name and a start URL. Next we create three parsing methods to extract the actors who played in "Westworld" and the names of the movies and TV shows those actors have played in.
`parse(self, response)` assumes that we start on the favorite show's page, in our case https://www.themoviedb.org/tv/63247-westworld, and navigates to the Full Cast & Crew page. The parse method accepts two arguments, the object instance and the response. We can apply a CSS selector to the response to extract parts of the HTML. Here the selector retrieves the href of the `<a>` tag that is a descendant of the `new_button` class, which in turn is a descendant of the tag with id `media_v4`. The href is saved in the `next_page` variable. Since it is a relative link that must be joined to the original URL to make a valid URL, we use the `response.follow` function, which does this automatically and calls the `parse_full_credits` function. The function looks like this:
```python
def parse(self, response):
    """
    Find the hyperlink to the Full Cast & Crew page and
    pass it to another parser method.
    """
    # find the relative url of the Full Cast & Crew page
    next_page = response.css('#media_v4 .new_button a::attr(href)').get()
    # join the relative url to the initial url and parse the result
    yield response.follow(next_page, callback=self.parse_full_credits)
```

`parse_full_credits(self, response)` starts on the Full Cast & Crew page and uses a CSS selector to get the first 50 actors who played in "Westworld". This is done by extracting the hrefs of all `<a>` tags that are children of `<li>` tags, which are children of the `<ol>` tag inside the element with the class names `panel` and `pad`. To make complete URLs, we prepend the site's base URL to the extracted hrefs. This is done inside the `Request` function, which makes an HTTP request and calls `parse_actor_page` to parse the returned page.
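Relative hrefs like the ones extracted here must be resolved against the page URL before they can be requested. `response.follow` does this automatically, while in `parse_full_credits` we concatenate strings manually; under the hood the joining behaves like the standard library's `urljoin`. A small illustration (the href value shown is hypothetical):

```python
from urllib.parse import urljoin

# the page we are on and a hypothetical relative href scraped from it
base = "https://www.themoviedb.org/tv/63247-westworld"
href = "/tv/63247-westworld/cast"

# urljoin resolves the relative href against the base url
full_url = urljoin(base, href)
print(full_url)  # https://www.themoviedb.org/tv/63247-westworld/cast
```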
```python
def parse_full_credits(self, response):
    """
    Extract the link to each cast member's page and
    call parse_actor_page on it for further parsing.
    """
    # get the links to the cast members,
    # limited to the first 50 cast members
    cast_links = response.css('.panel.pad > ol > li > a::attr(href)').getall()[:50]
    # make a full url for each link and parse the actor's page
    for next_page in cast_links:
        yield Request("https://www.themoviedb.org" + next_page,
                      callback=self.parse_actor_page)
```

`parse_actor_page(self, response)` starts on each actor's page and yields a dictionary with two key-value pairs: the actor's name and a movie or TV show name. The method creates one such entry for every movie or TV show the actor has played in. The actor's name is the text between the opening and closing `<a>` tags that are descendants of an `<h2>` tag. Similarly, the movie and TV show names are the text between `<bdi>` tags that are descendants of a `<table>` tag. Since there is at least one movie or TV show, we loop and yield one dictionary per title.
```python
def parse_actor_page(self, response):
    """
    Yield a dictionary
    {"actor": actor_name, "movie_or_TV_name": movie_or_TV_name}
    for each movie or TV show the actor has played in.
    """
    # get the actor's name
    actor_name = response.css("h2 a::text").get()
    # get the list of movies and TV shows the actor played in
    movie_or_TV_names = response.css("table bdi::text").getall()
    # yield one dictionary per movie or TV show
    for movie_or_TV_name in movie_or_TV_names:
        yield {
            "actor": actor_name,
            "movie_or_TV_name": movie_or_TV_name
        }
```

We are done with the implementation of the spider. Now we run this command to extract the data and save it in the results.csv file:
```shell
scrapy crawl tmdb_spider -o results.csv
```
Here is the link to the GitHub repository that houses the entire code: https://github.com/Aram-1999/PIC16B_HW2
## Making Recommendations
Now that we have the data, we can make recommendations by processing it in Python. We count the shared actors for each movie or TV show, and the titles with the highest number of shared actors (at least 3) are recommended.
```python
import numpy as np
import pandas as pd
import sqlite3

df = pd.read_csv("results.csv")
df.head(100)
```

|  | actor | movie_or_TV_name |
|---|---|---|
| 0 | Evan Rachel Wood | The Adults |
| 1 | Evan Rachel Wood | All That I Am |
| 2 | Evan Rachel Wood | Queen |
| 3 | Evan Rachel Wood | One Thousand Paper Cranes |
| 4 | Evan Rachel Wood | Weird: The Al Yankovic Story |
| ... | ... | ... |
| 95 | Daniel Wu | Tai Chi Hero |
| 96 | Daniel Wu | Tai Chi Zero |
| 97 | Daniel Wu | The Last Supper |
| 98 | Daniel Wu | Inseparable |
| 99 | Daniel Wu | The Great Magician |
100 rows × 2 columns
We make an SQLite database called movie_or_TV and create a table inside it called movie that contains all the data we extracted.
```python
conn = sqlite3.connect('movie_or_TV')
# write the scraped data to a table called "movie"
df.to_sql("movie", conn, index=False, if_exists="replace")
```

Then we find the number of shared actors for each movie or TV show in our movie table.
```python
cmd = """
SELECT movie_or_TV_name, COUNT(actor) AS number_of_shared_actors
FROM movie
GROUP BY movie_or_TV_name
ORDER BY number_of_shared_actors DESC
"""
ordered_df = pd.read_sql(cmd, conn)
conn.close()
ordered_df.head()
```
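The same aggregation can also be done directly in pandas without SQL; here is a sketch using a toy DataFrame in place of the real scraped results:

```python
import pandas as pd

# toy stand-in for the scraped results.csv (hypothetical rows)
toy = pd.DataFrame({
    "actor": ["A", "A", "B", "B", "C"],
    "movie_or_TV_name": ["X", "Y", "X", "Z", "X"],
})

# value_counts gives the number of shared actors per title,
# already sorted in descending order
shared_counts = toy["movie_or_TV_name"].value_counts()
print(shared_counts.head())
```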
```python
# keep only titles that share at least 3 actors with Westworld
recommendations = ordered_df[ordered_df['number_of_shared_actors'] >= 3]
```

We use the plotly graphing library to make a table that lists the recommended movies and TV shows by the number of shared actors in descending order.
```python
import plotly.graph_objects as go

fig = go.Figure(data=[go.Table(
    header=dict(values=list(recommendations.columns),
                fill_color='paleturquoise',
                align='left'),
    cells=dict(values=[recommendations.movie_or_TV_name,
                       recommendations.number_of_shared_actors],
               fill_color='lavender',
               align='left'))
])
fig.show()
```