I have the following code, which gives me some unwanted results:

  f = open(filename, 'r')
  year = re.search(r'Popularity in (\d\d\d\d)', f.read())
  namerank = re.findall(r'<tr align="right"><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', f.read())
  if year:
      print(year.group(1))
  else:
      print('No year found')

  if namerank:
      for rank in namerank:
          print(rank)
  else:
      print('No names found')

Output:

1990
No names found

However, when I add another `f = open(filename, 'r')` as line 3:

f = open(filename, 'r')
year = re.search(r'Popularity in (\d\d\d\d)', f.read())
f = open(filename, 'r')
namerank = re.findall(r'<tr align="right"><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', f.read())

namerank now has the correct data, which prints on my console. Can anybody tell me why I have to open the file twice in order to get the correct data? Is there a better way to write this code?

Jwee
  • Opening a file gives you a file handle. Calling `read()` reads all the data from that handle, so a second `read()` call has nothing left to return. Generally speaking, if you want to do more than one thing with the data, store the result of `read()` in a variable, e.g. `my_data = f.read()`, then pass `my_data` to everything that needs it. – Chris Doyle Jun 27 '20 at 22:30
  • You need to put the file contents into a variable (i.e. `contents = f.read()`) and then pass that variable to `re.search`. – ekhumoro Jun 27 '20 at 22:31

1 Answer

In the first example, the first read() call reads the entire file and leaves the file position at the end. When you call read() again there is nothing left to read, so it returns an empty string and findall() finds no matches. You could call f.seek(0) to move back to the beginning of the file, but that is not your best option. More on that in a minute.
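To make that behaviour concrete, here is a minimal sketch. It uses an in-memory io.StringIO object as a stand-in for the file, purely so the example is self-contained; the sample text is made up, and a text file opened with open() should behave the same way for read() and seek(0).

```python
import io

# Stand-in for an opened file; the sample text is made up for illustration.
f = io.StringIO("Popularity in 1990\n")

first = f.read()    # reads everything and leaves the position at the end
second = f.read()   # nothing left to read, so this is an empty string

f.seek(0)           # move the position back to the start
third = f.read()    # the full contents are available again

print(repr(first))
print(repr(second))
print(repr(third))
```

The empty second result is why the regex in the question had nothing to match against.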

In the second example you are opening the file again, which gives you a new file object positioned at the beginning of the file, and then you re-read the entire file.

With either of those approaches you end up reading the whole file twice, which is unnecessary.

A quick solution is to save the data that you read from the file; you can then parse that data as many times as you need.

f = open(filename, 'r')
saved_data = f.read()
year = re.search(r'Popularity in (\d\d\d\d)', saved_data)
namerank = re.findall(r'<tr align="right"><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', saved_data)

Note that this loads the whole file into memory at once, so it is not a good fit for very large files.
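If the file really were too large to hold in memory, one option would be to iterate over it line by line and apply both patterns to each line. This is only a rough sketch: it assumes every match fits on a single line, which the question does not confirm, and the names scan_file, YEAR_RE and ROW_RE are made up for illustration.

```python
import re

# Same patterns as in the question, compiled once for reuse.
YEAR_RE = re.compile(r'Popularity in (\d\d\d\d)')
ROW_RE = re.compile(r'<tr align="right"><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>')


def scan_file(filename):
    """Return (year, rows), reading the file one line at a time.

    Assumes each match sits entirely on one line (hypothetical helper).
    """
    year = None
    rows = []
    with open(filename, 'r') as f:
        for line in f:
            # Only look for the year until it has been found once.
            if year is None:
                match = YEAR_RE.search(line)
                if match:
                    year = match.group(1)
            # findall returns a list of (rank, name, name) tuples.
            rows.extend(ROW_RE.findall(line))
    return year, rows
```

Only one line of file text is held at a time, although the list of matched rows still grows with the number of matches.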

Also, I would recommend opening files with a with statement, which ensures they are closed and cleaned up even if an error occurs.

with open(filename, 'r') as f:
    saved_data = f.read()
year = re.search(r'Popularity in (\d\d\d\d)', saved_data)
namerank = re.findall(r'<tr align="right"><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', saved_data)
Glenn Mackintosh