
I have to merge two text files into one and create a new list from that. The first one contains URLs and the other one contains URL paths/folders, which have to be applied to EVERY URL. I'm working with lists, and it's really slow, because it's roughly 200,000 items.

Sample:

urls.txt:

 http://www.google.com
 ....

paths.txt:

 /abc
 /bce
 ....

Later, after the loop is finished, there should be a new list with

http://www.google.com/abc
http://www.google.com/bce

Python Code:

import re

URLS_TO_CHECK = []  # defined as global, needed later

def generate_list():
    urls = open("urls.txt", "r").read().splitlines()
    paths = open("paths.txt", "r").read().splitlines()
    done = open("done.txt", "r").read().splitlines()  # previously done URLs

    for i in range(len(urls)):
        for x in range(len(paths)):
            url = re.search('(http://(.+?)....)', urls[i])  # needed
            url = "%s%s" % (url.group(1), paths[x])
            if url not in URLS_TO_CHECK:
                if url not in done:
                    URLS_TO_CHECK.append(url)  # <<< slow!

I've already read some other threads about the map function and disabling the GC, but I can't use the map function with my program, and disabling the GC didn't really help.

abcd

3 Answers


This approach takes advantage of things such as:

  • quick look-ups in a set - O(1) instead of O(n)
  • generating values on demand instead of building the whole list at once
  • reading from the file in chunks instead of loading the whole data at once
  • avoiding an unnecessary regular expression

def yield_urls():
    with open("paths.txt") as f:
        paths = f.readlines()  # needed in every iteration and iterated over, a list is fine

    with open("done.txt") as f:
        done_urls = set(f.readlines())  # needed in every iteration and looked up; a set is O(1) vs O(n) for a list

    # file handles are closed automatically at the end of each "with" block

    with open("urls.txt", "r") as f:
        for url in f:  # iterate over the file directly, not over a big list of indices built beforehand - much quicker
            for subpath in paths:
                full_url = ''.join((url[7:], subpath))  # no regex means faster; maybe string formatting is quicker than join, you need to check
                # also, take care of the trailing newlines in strings read from a file
                if full_url not in done_urls:  # fast lookup in set
                    yield full_url  # yield instead of appending

# usage
for url in yield_urls():
    pass  # do something with url
Łukasz Rogalski
  • Don't call `readlines` unless you need its argument - just use `list(f)` and `set(f)` instead. You're also using `''.join` where you should just use `+`: `url[7:] + subpath`. – Veedrac Apr 24 '15 at 05:15
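A minimal sketch with those comment suggestions folded in and the trailing newlines stripped explicitly (the file names and the `url[7:]` slicing are carried over from the answer above as-is; the stripping comprehensions replace the bare `readlines()` calls):

def yield_urls():
    with open("paths.txt") as f:
        paths = [line.rstrip("\n") for line in f]  # newlines stripped once, up front

    with open("done.txt") as f:
        done_urls = set(line.rstrip("\n") for line in f)  # set for O(1) membership tests

    with open("urls.txt") as f:
        for url in f:
            url = url.rstrip("\n")
            for subpath in paths:
                full_url = url[7:] + subpath  # plain + instead of ''.join, as suggested
                if full_url not in done_urls:
                    yield full_url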

Searching in dictionaries is faster than searching in lists; see Python: List vs Dict for look up table.

import re

URLS_TO_CHECK = {}  # defined as global, needed later

def generate_list():
    urls = open("urls.txt", "r").read().splitlines()
    paths = open("paths.txt", "r").read().splitlines()
    done = dict([(l, True) for l in open("done.txt", "r").read().splitlines()])  # previously done URLs

    for i in range(len(urls)):
        for x in range(len(paths)):
            url = re.search('(http://(.+?)....)', urls[i])  # needed
            url = "%s%s" % (url.group(1), paths[x])
            if url not in URLS_TO_CHECK:
                if url not in done:
                    URLS_TO_CHECK[url] = True  # result is in URLS_TO_CHECK.keys()
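
A rough, illustrative timing sketch of that claim (the data is made up; the URLs and sizes are arbitrary, and a set would behave like the dict here):

import timeit

done_list = ["http://example.com/%d" % i for i in range(200000)]  # plain list, O(n) membership test
done_dict = dict.fromkeys(done_list, True)                        # dict with the same keys, O(1) membership test

# the list lookup scans the entries one by one...
print(timeit.timeit(lambda: "http://example.com/x" in done_list, number=100))
# ...while the dict hashes the key and answers in roughly constant time
print(timeit.timeit(lambda: "http://example.com/x" in done_dict, number=100))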
Jose Ricardo Bustos M.
import re

URLS_TO_CHECK = set(re.findall(r"http://.+?....", open("urls.txt", "r").read()))
for url in URLS_TO_CHECK:
    for path in paths:
        check_url(url + path)

will probably be much faster ... and I think it's essentially the same ...
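For completeness, a sketch of the same idea with `paths` loaded and the already-done URLs filtered out through a set, as in the question; the regex is carried over unchanged and `check_url` is assumed to be defined elsewhere:

import re

with open("urls.txt") as f:
    URLS_TO_CHECK = set(re.findall(r"http://.+?....", f.read()))
with open("paths.txt") as f:
    paths = [line.strip() for line in f]
with open("done.txt") as f:
    done = set(line.strip() for line in f)

for url in URLS_TO_CHECK:
    for path in paths:
        full_url = url + path
        if full_url not in done:   # O(1) lookup instead of scanning a list
            check_url(full_url)    # check_url assumed to exist, as in the answer above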

Joran Beasley