Python searching through a txt file for urls

Question

Inside a loop in my script, I write lots of URLs to a text file like this:

fwrite = open('visited.txt', 'a')
fwrite.write('\n{0}'.format(url))
fwrite.close()

Then, when I re-run the script later, I don't want to process links I've already visited, so I do this (visit is a list of new and old URLs):

for x in visit:
    if x in open('visited.txt').read().lstrip('\r\n'):
        visit.remove(x)
    else:
        continue

But this always skips half of the lines. If there are 1000 URLs, it removes only 500 of them. I tried both lstrip and rstrip with \n and \r\n, but couldn't get it to work.

"You are modifying the contents of the object `visit` that you are iterating over when you do `visit.remove(x)`" -- don't do that — chickity china chinese chicken, Jan 24 '19 at 20:07
Also, you shouldn't open the visited.txt file, write a line, and close it every time you want to add a URL. Either use `with open('visited.txt', 'a') as f: f.write('{0}\n'.format(url))` or collect all the required URLs in a list and write them to the file once. — Shirkan, Jan 24 '19 at 20:12

Shirkan · Answer 1 · 2019-01-24T20:07:17.413

1

Read the lines into a list only once:

with open('visited.txt', 'r') as f:
    # splitlines() drops the trailing '\n' that readlines() would keep on each line
    visited = f.read().splitlines()

If you want to keep only the unvisited URLs, you can convert both lists to sets, subtract one from the other, and convert the result back to a list:

non_visited = list(set(visit) - set(visited))
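Note that a set difference does not keep the original order of `visit`. If order matters, one possible sketch is to use the set only for membership tests and filter the list itself. The helper name `filter_unvisited` and the sample URLs are made up for illustration; in the real script the lines would come from visited.txt.

```python
def filter_unvisited(visit, visited_lines):
    """Return the URLs from `visit` that are not in `visited_lines`, keeping order."""
    # Strip line endings and ignore blank lines (the question's writer
    # emits a leading '\n', so the file may start with an empty line).
    seen = {line.strip() for line in visited_lines if line.strip()}
    return [url for url in visit if url not in seen]


# Hypothetical sample data standing in for the contents of visited.txt.
visited_lines = ['\n', 'http://a.example\n', 'http://c.example\n']
visit = ['http://d.example', 'http://a.example',
         'http://b.example', 'http://c.example']

non_visited = filter_unvisited(visit, visited_lines)
print(non_visited)  # ['http://d.example', 'http://b.example']
```

Looking up a URL in a set takes roughly constant time, so this should also scale better than testing against a list when visited.txt grows large.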

edited Jan 24 '19 at 20:07

answered Jan 24 '19 at 20:02

Shirkan


still getting the half of it – ggnoredo Jan 24 '19 at 20:05
I edited my answer which was incorrect, try now. – Shirkan Jan 24 '19 at 20:09
thanks for your suggestion but that made a totally random list. I mean the list should be in order but this makes it random – ggnoredo Jan 24 '19 at 20:36
Oh, you didn't state this. In that case, @wjandrea's answer will work. – Shirkan Jan 24 '19 at 20:37

wjandrea · Accepted Answer · 2019-01-24T21:13:14.010

1

This is a duplicate of "Python for loop skipping every other loop?", but for clarity, here's a solution for this case:

with open('visited.txt') as f:
    visited = f.read().splitlines()

visit = [url for url in visit if url not in visited]
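To see why the original loop removed only half of the entries, here is a small self-contained demonstration with made-up one-letter "URLs". Removing an item shifts the following items one position to the left, while the loop's internal index still moves forward, so every other item is never examined.

```python
visit = ['a', 'b', 'c', 'd']
visited = ['a', 'b', 'c', 'd']  # pretend every URL was already visited

# Buggy approach: mutate the list while iterating over it.
buggy = list(visit)
for x in buggy:
    if x in visited:
        buggy.remove(x)  # later items shift left, so the next one is skipped

# Fixed approach: build a new list instead of mutating the old one.
fixed = [x for x in visit if x not in visited]

print(buggy)  # ['b', 'd']
print(fixed)  # []
```

The original check `x in open('visited.txt').read()` is also a substring test on the whole file, so a URL that is merely a prefix of a visited one would count as visited. Comparing against the list of lines avoids that.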

By the way, your first snippet is easier with a context manager, and I moved the \n to the end, since newlines are line terminators, not separators, especially on Unix-like OSes:

with open('visited.txt', 'a') as fwrite:
    fwrite.write('{0}\n'.format(url))
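If many URLs are recorded in one run, the file can also be opened once for the whole batch instead of once per URL. This is only a sketch: the helper name `append_urls` is hypothetical, and the demo writes to a temporary directory rather than the real visited.txt.

```python
import os
import tempfile


def append_urls(path, urls):
    """Append each URL on its own line, opening the file only once."""
    with open(path, 'a') as f:
        f.writelines('{0}\n'.format(url) for url in urls)


# Demo in a temporary directory so no real file is touched.
path = os.path.join(tempfile.mkdtemp(), 'visited.txt')
append_urls(path, ['http://a.example', 'http://b.example'])
append_urls(path, ['http://c.example'])

with open(path) as f:
    lines = f.read().splitlines()
print(lines)  # ['http://a.example', 'http://b.example', 'http://c.example']
```

Because every line ends with '\n', later appends start on a fresh line and `splitlines()` returns no empty entries.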

edited Jan 24 '19 at 21:13

answered Jan 24 '19 at 20:09

wjandrea



I had to use f.read().splitlines(), but thanks for your answer, it explained a lot – ggnoredo Jan 24 '19 at 20:33
@ggnoredo Welcome! I forgot `readlines` includes newlines, so I edited that bit. – wjandrea Jan 24 '19 at 21:13
