0

Dataset

I have a movie dataset where there are over half a million rows, and this dataset looks like following (with made-up numbers)

MovieName  Date           Rating      Revenue
A          2019-01-15     3           3.4 million
B          2019-02-03     3           1.2 million
...        ...            ...         ...

Object

Select movies that are released "closed enough" in terms of date (for example, the release date difference of movie A and movie B is less than a month) and see when the rating is same, how the revenue could be different.

Question

I know I could write a double loop to achieve this goal. However, I am doubting this is the right/efficient way to do, because

  • Some posts (see comment of @cs95 to the question) suggested iterating over a dataframe is "anti-pattern" and therefore something not advisable to do.
  • The dataset has over half a million rows, I am not sure if writing double loop is something efficient to do.

Could someone provide pointers to the question I have? Thank you in advance.

Mr.Robot
  • 349
  • 1
  • 16

1 Answers1

1

In general, it is true that you should try avoiding loops when working with pandas. My idea is not ideal, but might point you in the right direction:

  1. Retrieve month and year from the date column in every row to create new columns "month" and "year". You can see how to do it here
  2. Afterwards, group your dataframe by month and year (grouped_df = df.groupby(by=["month","year])) and the resulting groups are dataframe with movies from the same month and year. Now it's up to you what further analysis you want to perform, for example mean (grouped_df = df.groupby(by=["month","year]).mean()), standard deviation or something more fancy with the apply() function.

You can also extract weeks if you want a period shorter than a month

jared3412341
  • 197
  • 3
  • 12