
1) I open a file

2) I run re.findall(), which returns a list as expected

3) then I run re.findall() again, looking for something else, but it returns an empty list.

But if I open the file again between 2 and 3, the second re.findall() works perfectly.

I can't figure out what is going on. Is re closing the file, or is something else happening?

Thanks in advance for any help!

Here's my code

def extract_names(filenames):
  for f in filenames: #grabs one file at a time
    file = open(f, 'r') #opens file

    #find year <h3 align="center">Popularity in 1992</h3>
    year = re.search(r'Popularity\sin\s\d{4}', file.read()) 
    print(year)

    file = open(f, 'r') #reopen file

    #find <tr align="right"><td>1</td><td>Michael</td><td>Ashley</td>
    rank_names = re.search(r'<td>\d*</td><td>\w*</td><td>\w*</td>', file.read())
    print(rank_names)
YCFlame
Robert M.
    First, use an HTML parser. As for your problem, you can't call `file.read()` twice on the same file handle: once the file has been read, there's nothing left to read the second time. – Casimir et Hippolyte Apr 06 '16 at 02:20

3 Answers


file.read() consumes the whole file and advances the file pointer to the end of the file. A subsequent call to file.read() just returns the empty string (because the file is already consumed). You could call file.seek(0) to return the file pointer to the beginning of the file, but it's silly to read the file twice when you can just read once and store the contents to avoid extra system calls.
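A minimal sketch of that behavior. It uses an in-memory io.StringIO in place of a real file, which is an assumption made only so the snippet runs without a file on disk; a text file opened with open() behaves the same way for read() and seek(0).

```python
import io

# io.StringIO stands in for an opened text file (assumption for this demo).
file = io.StringIO("Popularity in 1992\n")

first = file.read()    # reads everything and leaves the pointer at the end
second = file.read()   # nothing left to read, so this is the empty string

file.seek(0)           # move the pointer back to the beginning
third = file.read()    # the full contents again

print(repr(first), repr(second), repr(third))
```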

If you want to search the file data more than once, store the result of file.read() and use it instead of file.read() in your calls, e.g.:

filedata = file.read()  # Cache once

year = re.search(r'Popularity\sin\s\d{4}', filedata)  # Search in cache
print(year)

#find <tr align="right"><td>1</td><td>Michael</td><td>Ashley</td>
rank_names = re.search(r'<td>\d*</td><td>\w*</td><td>\w*</td>', filedata)  # Search cache again
print(rank_names)

Side-note: Use a real HTML parser
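As a rough illustration of that side-note, here is a sketch using the standard library's html.parser module. The class name RankRowParser and the sample markup are hypothetical, modeled on the row quoted in the question; a third-party parser may well be more convenient for real pages.

```python
from html.parser import HTMLParser


class RankRowParser(HTMLParser):
    """Collects the text of <td> cells, grouped by <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # cells of the row currently being read
        self._in_td = False   # True while inside a <td> element

    def _flush(self):
        # Store the current row if it has any cells, then reset the state.
        if self._row:
            self.rows.append(self._row)
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush()     # tolerate rows that were never closed with </tr>
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_td = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            self._flush()

    def close(self):
        super().close()
        self._flush()         # keep a trailing row that has no </tr>


# Sample markup modeled on the row quoted in the question.
sample = '<table><tr align="right"><td>1</td><td>Michael</td><td>Ashley</td></tr></table>'
parser = RankRowParser()
parser.feed(sample)
parser.close()
print(parser.rows)  # [['1', 'Michael', 'Ashley']]
```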

ShadowRanger

After file.read(), the file pointer is at the end of the file, so calling file.read() again cannot return the file's content a second time.

You can store the content of the file for subsequent operations:

content = file.read()
year = re.search(r'Popularity\sin\s\d{4}', content)
rank_names = re.search(r'<td>\d*</td><td>\w*</td><td>\w*</td>', content)
file.close()

Using the with keyword is recommended for file operations, since it closes the file automatically:

for f in filenames:
    with open(f, 'r') as file:
        content = file.read()
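Putting the two pieces together, the loop from the question might look like the sketch below. The function name extract_matches and the choice to return the match objects rather than print them are assumptions made for illustration; the regular expressions are the ones from the question.

```python
import re


def extract_matches(filenames):
    """Return a (year_match, rank_match) pair per file, reading each file once."""
    results = []
    for f in filenames:
        with open(f, 'r') as file:   # closed automatically when the block ends
            content = file.read()    # read once, reuse for both searches
        year = re.search(r'Popularity\sin\s\d{4}', content)
        rank_names = re.search(r'<td>\d*</td><td>\w*</td><td>\w*</td>', content)
        results.append((year, rank_names))
    return results
```

Each element of the result holds match objects (or None when a pattern is not found), so a caller can use .group(0) to get the matched text.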
YCFlame

Why don't you give a name to the string returned by read()?

with open("filename", "rt") as f:
    content = f.read()

Now you can refer to the string as content as many times as you wish. Each extra open() call adds overhead. The file object returned by open() keeps track of its position: read() with no argument reads everything up to the end of the file, and read(size) reads at most size characters per call, until there is nothing left. That's why the second read() returned an empty string and your second search found nothing.

C Panda