
I have a file that I read and split into fields inside a class. I also want to return the top n years with the highest number of movies produced, using the lines attribute to get the data.

import re
import collections

class movie_analyzer:
    def __init__(self, s):
        self.lines = open(s, encoding="latin-1").read().split('\n')
        self.lines = [x.split('::') for x in self.lines]

    def freq_by_year(self):
        movies_years = [x[3] for x in self.lines]
        c = collections.Counter(movies_years)
        for movies_years, freq in c.most_common(3):
            print(movies_years, ':', freq)

movie = movie_analyzer("modified.dat")
movie.freq_by_year()

It gives this error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-627-51913258f9e4> in <module>
----> 1 movie.freq_by_year()
 
<ipython-input-624-8dc663c0b252> in freq_by_year(self)
      9     def freq_by_year(self):
     10 
---> 11         movies_years = [x[3] for x in self.lines]
     12 
     13         c = collections.Counter(movies_years)
 
<ipython-input-624-8dc663c0b252> in <listcomp>(.0)
      9     def freq_by_year(self):
     10 
---> 11         movies_years = [x[3] for x in self.lines]
     12 
     13         c = collections.Counter(movies_years)
 
IndexError: list index out of range    

Also, movie.lines looks like this:

[['1', 'Toy Story', "Animation|Children's|Comedy", '1995'],
 ['2', 'Jumanji', "Adventure|Children's|Fantasy", '1995'],
 ['3', 'Grumpier Old Men', 'Comedy|Romance', '1995'],
 ['4', 'Waiting to Exhale', 'Comedy|Drama', '1995'],
 ['5', 'Father of the Bride Part II', 'Comedy', '1995'],
 ['6', 'Heat', 'Action|Crime|Thriller', '1995'],
 ['7', 'Sabrina', 'Comedy|Romance', '1995'],
 ['8', 'Tom and Huck', "Adventure|Children's", '1995'],
 ['9', 'Sudden Death', 'Action', '1995'],
 ['10', 'GoldenEye', 'Action|Adventure|Thriller', '1995']]

The .dat file looks like this:

Movies = ["1::Toy Story::Animation|Children's|Comedy::1995\n",
          "2::Jumanji::Adventure|Children's|Fantasy::1995\n",
          '3::Grumpier Old Men::Comedy|Romance::1995\n',
          '4::Waiting to Exhale::Comedy|Drama::1995\n',
          '5::Father of the Bride Part II::Comedy::1995\n']

  • Please include code/errors/textual data as **text**, not images. See https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question – Thierry Lathuille Dec 12 '20 at 17:58
  • And check your data, you probably have a line with fewer fields than you expect. – Thierry Lathuille Dec 12 '20 at 17:59
  • Can you provide the content of the "modified.dat" file? – Turtlean Dec 12 '20 at 18:05
  • [Catch the error](https://docs.python.org/3/tutorial/errors.html#handling-exceptions) and inspect/print relevant data in the except suite. If you are using an IDE, **now** is a good time to learn its debugging features, or the built-in [Python debugger](https://docs.python.org/3/library/pdb.html). Printing *stuff* at strategic points in your program can help you trace what is or isn't happening. [What is a debugger and how can it help me diagnose problems?](https://stackoverflow.com/questions/25385173/what-is-a-debugger-and-how-can-it-help-me-diagnose-problems) – wwii Dec 12 '20 at 18:05
  • I edited error part. – user14800447 Dec 12 '20 at 18:06
  • Also, I provided content of data file. – user14800447 Dec 12 '20 at 18:10
  • I ran your code with a file I reconstructed from `Movies` and it worked for me. Perhaps you are referencing the wrong file. – quamrana Dec 12 '20 at 18:34
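
The advice in the comments above (check the data for a line with too few fields) can be sketched as follows. This is only a minimal illustration: `find_bad_rows` is a hypothetical helper, and the sample text is rebuilt from the `Movies` excerpt in the question rather than read from the real "modified.dat".

```python
def find_bad_rows(text, sep="::", expected=4):
    """Return (line_number, fields) for every row that lacks `expected` fields."""
    bad = []
    for number, raw in enumerate(text.split("\n"), start=1):
        fields = raw.split(sep)
        if len(fields) != expected:
            bad.append((number, fields))
    return bad


# Assumed sample: two rows from the question, ending with a newline
# exactly like the entries shown in the Movies list.
sample = (
    "1::Toy Story::Animation|Children's|Comedy::1995\n"
    "2::Jumanji::Adventure|Children's|Fantasy::1995\n"
)

bad_rows = find_bad_rows(sample)
print(bad_rows)  # the trailing newline produces one empty "row"
```

Any row reported here would make `x[3]` fail with the IndexError shown in the traceback.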

1 Answer


I found two potential problems in the __init__ method, given your code and the .dat file:

def __init__(self, s):
    self.lines = open(s, encoding="latin-1").read().split('\n')
    self.lines = [x.split('::') for x in self.lines]
    # keep only rows that have exactly four fields
    self.lines = [l for l in self.lines if len(l) == 4]  # <--(1)
    for line in self.lines:  # <--(2)
        # raw string, so that \D reaches the regex engine unchanged
        line[3] = re.sub(r'\D', '', line[3])

(1) An extra line is being parsed that contains only the empty string "". Judging by the excerpt, the file ends with a newline, so split('\n') yields a trailing empty string; splitting that on '::' gives a one-element list, and x[3] then raises the IndexError. To be safe, you could drop any line that doesn't have exactly the four elements you expect.

(2) Some years may be parsed wrongly because of non-digit characters attached to them, such as "" or \n. You can clean the year column with a regex that removes every character that isn't a digit.
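
Putting (1) and (2) together, here is a minimal sketch of the whole class that also returns the top n years instead of printing them. The class name `MovieAnalyzer`, the `n` parameter and the sample rows (including the made-up third row with a stray trailing space after the year) are assumptions for illustration, not the contents of your real file.

```python
import collections
import os
import re
import tempfile


class MovieAnalyzer:
    def __init__(self, path):
        with open(path, encoding="latin-1") as f:
            rows = [line.split("::") for line in f.read().split("\n")]
        # (1) drop anything that does not have exactly four fields,
        #     e.g. the empty string left after the final newline
        self.lines = [row for row in rows if len(row) == 4]
        # (2) strip every non-digit character from the year field
        for row in self.lines:
            row[3] = re.sub(r"\D", "", row[3])

    def freq_by_year(self, n=3):
        """Return the n most common years as (year, count) pairs."""
        counts = collections.Counter(row[3] for row in self.lines)
        return counts.most_common(n)


# Made-up sample data written to a temporary file for the demo.
sample = (
    "1::Toy Story::Animation|Children's|Comedy::1995\n"
    "2::Jumanji::Adventure|Children's|Fantasy::1995\n"
    "3::Hypothetical Film::Drama::1994 \n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.dat")
    with open(path, "w", encoding="latin-1") as f:
        f.write(sample)
    top_years = MovieAnalyzer(path).freq_by_year(n=2)

print(top_years)
```

Returning the list rather than printing inside the method leaves it to the caller to decide how to display the result.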

Turtlean