I want to get the 20 most common words from the descriptions of the 10 longest movies in data.csv, using Python. So far I have the 10 longest movies, but I can't get the most common words for just those movies: my code returns the most common words from the whole of data.csv. I tried Counter, Pandas, Numpy, and Mathlib, but I have no idea how to make Python look for the most common words only in specific rows and one column (the movie descriptions) of the data table.

My code:

import pandas as pd
import numpy as np
df = pd.read_csv("data.csv")
small_df = df[['title','duration_min','description']]
result_time = small_df.sort_values('duration_min', ascending=False)
print("TOP 10 LONGEST: ")
print(result_time.head(n=10))

most_common = pd.Series(' '.join(result_time['description']).lower().split()).value_counts()[:20]
print("20 Most common words from TOP 10 longest movies: ")
print(most_common)

My output:

TOP 10 LONGEST: 
                             title  duration_min                                        description
6840        The School of Mischief         253.0  A high school teacher volunteers to transform ...
4482                No Longer kids         237.0  Hoping to prevent their father from skipping t...
3687            Lock Your Girls In         233.0  A widower believes he must marry off his three...
5100               Raya and Sakina         230.0  When robberies and murders targeting women swe...
5367                        Sangam         228.0  Returning home from war after being assumed de...
3514                        Lagaan         224.0  In 1890s India, an arrogant British commander ...
3190                  Jodhaa Akbar         214.0  In 16th-century India, what begins as a strate...
6497                  The Irishman         209.0  Hit man Frank Sheeran looks back at the secret...
3277      Kabhi Khushi Kabhie Gham         209.0  Years after his father disowns his adopted bro...
4476  No Direction Home: Bob Dylan         208.0  Featuring rare concert footage and interviews ...
20 Most common words from TOP 10 longest movies: 
a        10134
the       7153
to        5653
and       5573
of        4691
in        3840
his       3005
with      1967
her       1803
an        1727
for       1558
on        1528
their     1468
when      1320
this      1240
from      1114
as        1050
is         988
by         894
after      865
dtype: int64

Here is the data table: https://www.dropbox.com/s/hxch4v08bkthvz1/data.csv?dl=1

  • Do you have to use pandas? If using Pandas have you reviewed the [Indexing and selecting data](https://pandas.pydata.org/docs/user_guide/indexing.html) section of the user guide? Which part of the filtering/selection is giving you problems? – wwii Mar 03 '22 at 20:04
  • In Python, the easiest way to count words is to create a dictionary where the words are the key, and their frequency is the value. In your case, all you gotta to is to iterate over the `small_df` and `split()` the field `description` of each line. And then run a frequency counting of each word found. – Hilton Fernandes Mar 03 '22 at 20:04

1 Answer

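`result_time.head(n=10)` only returns the first 10 rows for printing; it does not change `result_time`, so the `join` that follows still runs over every description in the file. You can select the first 10 rows of the sorted dataframe with `iloc[0:10]` before joining.

With the least modification to your existing code, the solution would look like this: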

import pandas as pd
import numpy as np
df = pd.read_csv("data.csv")
small_df = df[['title','duration_min','description']]
# Sort so that the longest movies come first
result_time = small_df.sort_values('duration_min', ascending=False)
print("TOP 10 LONGEST: ")
print(result_time.head(n=10))

# Join only the descriptions of the first 10 rows of the sorted frame, then count words
most_common = pd.Series(' '.join(result_time.iloc[0:10]['description']).lower().split()).value_counts()[:20]
print("20 Most common words from TOP 10 longest movies: ")
print(most_common)
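
If you would rather count with a dictionary, as suggested in the comments, here is a minimal sketch using `collections.Counter`. The small dataframe and the helper name `top_words` are made up for illustration; with the real file you would pass the frame loaded from data.csv instead. It assumes the same `duration_min` and `description` columns as in the question and splits on whitespace only, so punctuation stays attached to words.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for data.csv, using the column names from the question
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "duration_min": [90.0, 200.0, 150.0, 210.0],
    "description": [
        "A short film about a dog",
        "The king and the queen",
        "A cat and a mouse",
        "The king returns to the castle",
    ],
})


def top_words(frame, n_movies=10, n_words=20):
    """Count words in the descriptions of the n_movies longest movies only."""
    # nlargest returns just the rows with the largest duration_min values
    longest = frame.nlargest(n_movies, "duration_min")
    counts = Counter()
    # Skip missing descriptions, lowercase, and split on whitespace
    for text in longest["description"].dropna():
        counts.update(text.lower().split())
    return counts.most_common(n_words)


result = top_words(df, n_movies=2, n_words=3)
print(result)
```

On the real data you would call `top_words(df)` with the defaults to get 20 words from the 10 longest movies.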
hajben