I am using RegEx to extract some data from a txt file. I've written the for-loops below to extract emails and birthdates and tried to append the results to a list. But when I print the list, only the matches from the first loop are there. The birthdate RegEx works fine when run by itself. I'm sure I'm doing something very basic wrong.

f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")

list = []

for i in f:
    if re.findall(r"((?i)[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.])", i):
        list.append(i)

for k in f:
    if re.findall(r'\d\d-\d\d-\d\d\d\d', k):
        list.append(k)

print(list)
f.close()
pam_param
  • Not an answer, but I notice you are using the case-insensitive modifier `(?i)` in your first pattern, so you could get rid of `A-Z`. Also, in your second regex, `\d\d\d\d` is better written as `\d{4}`. – JvdV Apr 10 '20 at 14:17
  • Does this answer your question? [Read multiple times lines of the same file Python](https://stackoverflow.com/questions/26294912/read-multiple-times-lines-of-the-same-file-python) – azro Apr 10 '20 at 14:17
  • Your iterator `f` has already reached the end of the file (EOF) when you enter the second loop. So you either need to call `f.seek(0)` before the second loop, or combine the two regexes with `|`, which I think should work just fine. – Javed Apr 10 '20 at 14:18

2 Answers

Try this:

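import re

list = []  # same name as in the question; note that it shadows the built-in list

with open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8") as f:
    for i in f:  # single pass: test every line against both patterns
        # (?i) moved to the start of the pattern; Python 3.11+ rejects it anywhere else
        if re.findall(r"(?i)([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.])", i):
            list.append(i)
        if re.findall(r'\d\d-\d\d-\d\d\d\d', i):
            list.append(i)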

In your code, after the first for loop, f is positioned at the end of the file, so the second for loop has nothing left to iterate over and its body never runs.

To make your original code do what you intended, close the file after the first loop and reopen it before the second loop, so that f points to the beginning of the file again:

import re

f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")

list = []

for i in f:
    # (?i) moved to the start of the pattern; Python 3.11+ rejects it anywhere else
    if re.findall(r"(?i)([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.])", i):
        list.append(i)

f.close()

# Reopen so that iteration starts from the first line again
f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")
for k in f:
    if re.findall(r'\d\d-\d\d-\d\d\d\d', k):
        list.append(k)

print(list)
f.close()
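
Alternatively, you can keep the file open and rewind it with f.seek(0) between the two loops, as suggested in the comments on the question. Below is a minimal sketch of that idea, not a drop-in replacement: the function name collect_with_rewind, the simplified email pattern, and the io.StringIO object standing in for scrape.txt are all made up for illustration.

```python
import io
import re

# Simplified patterns for illustration (assumption: not the exact ones from the question).
EMAIL = re.compile(r"[a-z0-9_.+-]+@[a-z0-9-]+\.[a-z0-9.-]+", re.IGNORECASE)
DATE = re.compile(r"\d\d-\d\d-\d{4}")


def collect_with_rewind(f):
    """Scan the seekable file object f twice, rewinding between the passes."""
    found = []
    for line in f:  # first pass: lines containing an email address
        if EMAIL.search(line):
            found.append(line)
    f.seek(0)  # rewind; without this the second loop would see no lines
    for line in f:  # second pass: lines containing a dd-dd-dddd date
        if DATE.search(line):
            found.append(line)
    return found


# Stand-in for the real file so the sketch runs without scrape.txt.
sample = io.StringIO("alice@example.com\nborn 01-02-1990\nnothing here\n")
print(collect_with_rewind(sample))
```

With a real file you would pass the object returned by open(...) instead of the StringIO sample.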
abhinonymous
  • Please, when answering, explain to the OP what their error is and how your code fixes it. The main goal of SO is to help people learn, not to hand out code that just works. – azro Apr 10 '20 at 14:18

You are trying to iterate over the same file object twice. The first loop consumes it, so the second for-loop has nothing left to read and does nothing. Have a look at this to understand:

f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")
print(list(f))
print("second time:")
print(list(f))

Output:

['1234567890abcdefghijklmopqrstuvwxyz'] # or whatever your content is :)
second time:
[]

To fix this you can read the file's lines into a list once and loop over that list as often as you like (if you are not dealing with huge files, of course):

f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")
content = list(f)


for i in content:
   ... 

for k in content:
   ... 

In your specific example it would be cleaner (and faster) to do all the processing in a single for-loop, though. Either way, the underlying mistake was trying to read the same file twice without resetting it.
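
A rough sketch of the single-loop variant, under a few assumptions: the name collect_single_pass, the simplified email pattern, and the in-memory sample lines are invented here for illustration. Note that it behaves slightly differently from two separate loops: results come out in file order, and a line containing both an email and a date is appended only once.

```python
import re

# Simplified patterns for illustration (assumption: not the exact ones from the question).
EMAIL = re.compile(r"[a-z0-9_.+-]+@[a-z0-9-]+\.[a-z0-9.-]+", re.IGNORECASE)
DATE = re.compile(r"\d\d-\d\d-\d{4}")


def collect_single_pass(lines):
    """Return every line that contains an email address or a dd-dd-dddd date."""
    found = []
    for line in lines:  # the input is consumed exactly once
        if EMAIL.search(line) or DATE.search(line):
            found.append(line)
    return found


# Any iterable of strings works, including an open file object.
sample_lines = ["bob@example.org\n", "no match\n", "date: 31-12-1999\n"]
print(collect_single_pass(sample_lines))
```

Because the function accepts any iterable of lines, it can be used directly on an open file, e.g. inside a with open(...) as f block.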

Lydia van Dyke