Computing Bacon Numbers
I spent the past week doing Google Foobar exercises, and that was really fun (and really addictive). I managed to stop doing that to start working on a fun project: calculating a given person’s Bacon number.
First version
My first idea was to parse IMDb pages to extract the information I would need. I discovered Beautiful Soup, which is amazing and super easy to use. To find out what movies an actor was in all I needed to do was search for /title/ in that actor’s IMDb page, and to find what actors were in a movie I searched for /name/, and it works! After 2 days of working in this version, I had a Command Line Interface that could tell if an actor’s Bacon number was 1 or greater.
import urllib2
import re
from BeautifulSoup import BeautifulSoup
def find_movies(actor_url):
"""
Function that receives an actor m.imdb url and returns urls for all the movies
the actor was in
"""
page = urllib2.urlopen(actor_url)
data = page.read()
soup = BeautifulSoup(data)
movies = []
for link in soup.findAll('a', href = re.compile('/title/')):
movies.append(("http://m.imdb.com" + str(link.get('href'))))
return movies
def find_actors(movie_url):
actors = []
page = urllib2.urlopen(movie_url)
data = page.read()
soup = BeautifulSoup(data)
for link in soup.findAll('a', href = re.compile('http://m.imdb.com/name/nm')):
actors.append(str(link.get('href')[:33]))
return actors
def get_neighbors(actor_url):
neighbors = []
for movie in find_movies(actor_url):
actors = find_actors(movie)
for x in actors:
if x not in neighbors:
neighbors.append(x)
return neighbors
def get_graph(actors_urls):
graph = {}
for actor in actors_urls:
print actor
graph[actor] = get_neighbors(actor)
return graph
Bacon = "http://m.imdb.com/name/nm0000102/filmotype/actor?ref_=m_nmfm_1"
def bacon_identifier_function(actor_url, graph):
if actor_url in graph[Bacon]:
return "1"
else:
return "Bigger than 1"
grafo = get_graph([Bacon] + get_neighbors(Bacon))
# Command Line Interface
while True:
actor = raw_input("Please insert an actor imdb identifier, or q to quit ")
if actor == "q":
break
print bacon_identifier_function("http://m.imdb.com/name/" + actor + "/", grafo)
But this version was just too slow to do anything other than telling if someone did a movie with Kevin Bacon or not.
Final version
Turns out IMDb lets a subset of its data available as plain text files, and everything I needed was in there. So I could create my graph by parsing text files. At first I was trying to have a graph a just the actors, and 2 actors were connected if they’d done a movie together. To do that I would have to look at every pair of actors and connect them if they have a movie in common. But I don’t need the graph to be this way in order to calculate someone’s Bacon number. If I have a graph in which an actor is connected to every movie he made and a movie is connected to every actor in it, the shortest path between an actor and Kevin Bacon will be twice that actor’s Bacon number. And I can generate this graph in a couple minutes.
Finding the shortest path
So now all I have to do is find the shortest path between 2 nodes in a graph. Since all edges have the same weight, I can solve this with a Breadth-First Search:
def BFS(graph, source):
line = deque()
distance = {source : 0}
line.append(source)
while line:
t = line.popleft()
for e in graph[t]:
if e not in distance:
distance[e] = distance[t] + 1
line.append(e)
return distance
So if I pass to BFS my graph with all the actors and movies and source = Bacon, I get a dictionary:
doubled_bacon_numbers = BFS(graph, Bacon)
And a someone’s Bacon number is just:
doubled_bacon_numbers[someone]/2
The final version of my code is here
tags