0

I am a python beginner trying to solve some projects at cs50. I did this project. This is a BFS implementation for some movie data from IMDB. I successfully solved it. But my code takes upwards of 5 minutes for some very far away inputs.By far away means for some very far nodes in the graph. Here's my code. Ignore the loading data part focus on the shortest_path method.

import csv
import sys


class Node():
    def __init__(self, state, parent, action):
        self.state = state
        self.parent = parent
        self.action = action
    def __str__(self):
        print(self.state)


# Maps names to a set of corresponding person_ids
names = {}

# Maps person_ids to a dictionary of: name, birth, movies (a set of movie_ids)
people = {}

# Maps movie_ids to a dictionary of: title, year, stars (a set of person_ids)
movies = {}


def load_data(directory):
    """
    Load data from CSV files into memory.
    """
    # Load people
    with open(f"{directory}/people.csv", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            people[row["id"]] = {
                "name": row["name"],
                "birth": row["birth"],
                "movies": set()
            }
            if row["name"].lower() not in names:
                names[row["name"].lower()] = {row["id"]}
            else:
                names[row["name"].lower()].add(row["id"])

    # Load movies
    with open(f"{directory}/movies.csv", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            movies[row["id"]] = {
                "title": row["title"],
                "year": row["year"],
                "stars": set()
            }

    # Load stars
    with open(f"{directory}/stars.csv", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            try:
                people[row["person_id"]]["movies"].add(row["movie_id"])
                movies[row["movie_id"]]["stars"].add(row["person_id"])
            except KeyError:
                pass


def main():
    if len(sys.argv) > 2:
        sys.exit("Usage: python degrees.py [directory]")
    directory = sys.argv[1] if len(sys.argv) == 2 else "large"

    # Load data from files into memory
    print("Loading data...")
    load_data(directory)
    print("Data loaded.")

    source = person_id_for_name(input("Name: "))
    if source is None:
        sys.exit("Person not found.")
    target = person_id_for_name(input("Name: "))
    if target is None:
        sys.exit("Person not found.")

    path = shortest_path(source, target)

    if path is None:
        print("Not connected.")
    else:
        degrees = len(path)
        print(f"{degrees} degrees of separation.")
        path = [(None, source)] + path
        for i in range(degrees):
            person1 = people[path[i][1]]["name"]
            person2 = people[path[i + 1][1]]["name"]
            movie = movies[path[i + 1][0]]["title"]
            print(f"{i + 1}: {person1} and {person2} starred in {movie}")


def shortest_path(source, target):
    """
    Returns the shortest list of (movie_id, person_id) pairs
    that connect the source to the target.

    If no possible path, returns None.
    """
    source = Node(source,None,None)
    queue = list(((Node(neighbor,source,None) for neighbor in neighbors_for_person(source.state) )))
    explored = list()
    path = list()
    while queue:
        current_node = queue.pop(0)
        if current_node.state not in explored:
            if current_node.state[1] == target:
                print('Goal reached')
                goal = current_node
                while goal:
                    path.append(goal.state)
                    goal = goal.parent
                return path[:-1][::-1]
            explored.append(current_node.state)
        queue.extend((Node(neighbor,current_node,None) for neighbor in neighbors_for_person(current_node.state[1])))





def person_id_for_name(name):
    """
    Returns the IMDB id for a person's name,
    resolving ambiguities as needed.
    """
    person_ids = list(names.get(name.lower(), set()))
    if len(person_ids) == 0:
        return None
    elif len(person_ids) > 1:
        print(f"Which '{name}'?")
        for person_id in person_ids:
            person = people[person_id]
            name = person["name"]
            birth = person["birth"]
            print(f"ID: {person_id}, Name: {name}, Birth: {birth}")
        try:
            person_id = input("Intended Person ID: ")
            if person_id in person_ids:
                return person_id
        except ValueError:
            pass
        return None
    else:
        return person_ids[0]


def neighbors_for_person(person_id):
    """
    Returns (movie_id, person_id) pairs for people
    who starred with a given person.
    """
    movie_ids = people[person_id]["movies"]
    neighbors = set()
    for movie_id in movie_ids:
        for person_id in movies[movie_id]["stars"]:
            neighbors.add((movie_id, person_id))
    return neighbors


if __name__ == "__main__":
    main()

What should I try to made it run fast?

martineau
  • 119,623
  • 25
  • 170
  • 301
  • You may be able to optimize your code if you first profile it to determine where it is spending most of its time — which is also where any changes will help the most. See [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) – martineau Nov 24 '20 at 17:58
  • Thank You.Will do and post as soon as possible – pratham powar Nov 24 '20 at 18:01
  • I wasn't suggesting that you post the results — just use them to know where you should focus your attention and it would pay off the most. If you can't figure out how to do that (after doing some research), _then_ ask a specific question here about it. – martineau Nov 24 '20 at 18:06

1 Answers1

2

Some hints:

  • not in explored has a time complexity of O(n) (where n is the size of explored), because explored is a list. Instead make it a set, so that not in explored executes in constant time:

    explored = set()
    # ...
    if current_node.state not in explored:
    # ...
    explored.add(current_node.state)
    
    
  • queue.pop(0) has a time complexity of O(n) (where n is the size of queue), because queue is a list. Instead make it a deque, so that queue.popleft() executes in constant time:

    from collections import deque
    queue = deque()
    # ...
    current_node = queue.popleft()
    # ...
    queue.extend((Node(neighbor,current_node,None) for neighbor in neighbors_for_person(current_node.state[1])))
    
trincot
  • 317,000
  • 35
  • 244
  • 286
  • It did go faster. A lot faster.Thank you.But now there is a memory problem. – pratham powar Nov 25 '20 at 02:52
  • Can you suggest something for making it memory efficient.As multiple times the script is terminating with the message Killed.Which I saw means that it is consuming more memory – pratham powar Nov 25 '20 at 02:52
  • 1
    That is indeed the downside of BFS: it needs a lot of memory. But there is one other thing: you currently will visit the same movie multiple times via different persons. You should better not jump from person (via movie) to person, but consider movies to be search nodes as well, so that you can avoid multiple visits to the same movie. So in odd BFS iterations you would be dealing with persons, and in even iterations with movies... They would all go in the queue. For that to work, make a Node type that supports finding neighbors whether the subject is a person or a movie. – trincot Nov 25 '20 at 09:23