
I have to merge two text files into one and create a new list from that. The first one contains URLs and the other one contains URL paths/folders, which have to be applied to EVERY URL. I'm working with lists, and it's really slow, because it's roughly 200,000 items.

Sample:

urls.txt:

 http://www.google.com
 ....

paths.txt:

 /abc
 /bce
 ....

Later, after the loop is finished, there should be a new list with

http://www.google.com/abc
http://www.google.com/bce

Python Code:

import re

URLS_TO_CHECK = []  # defined as global, needed later

def generate_list():
    urls = open("urls.txt", "r").read().splitlines()
    paths = open("paths.txt", "r").read().splitlines()
    done = open("done.txt", "r").read().splitlines()  # previously done URLs

    for i in range(len(urls)):
        for x in range(len(paths)):
            url = re.search('(http://(.+?)....)', urls[i])  # needed
            url = "%s%s" % (url.group(1), paths[x])
            if url not in URLS_TO_CHECK:
                if url not in done:
                    URLS_TO_CHECK.append(url)  # <<< slow!

I've already read some other threads about the map function and disabling the GC, but I can't use the map function with my program, and disabling the GC didn't really help.

abcd

3 Answers


This approach takes advantage of things such as:

  • quick look-ups in a set - O(1) instead of O(n)
  • generating values on demand instead of building the whole list at once
  • reading from the file in chunks instead of loading the whole data at once
  • avoiding an unnecessary regular expression

def yield_urls():
    with open("paths.txt") as f:
        paths = f.readlines()  # needed in every iteration and iterated over, a list is fine

    with open("done.txt") as f:
        done_urls = set(f.readlines())  # needed in every iteration and looked up; a set is O(1) vs O(n) for a list

    # file handles are closed automatically at the end of each "with" block

    with open("urls.txt", "r") as f:
        for url in f:  # iterate over the file directly, not over a big list of indices built beforehand - much quicker
            for subpath in paths:
                full_url = ''.join((url[7:], subpath))  # no regex means faster; maybe string formatting is quicker than join, you need to check
                # also, take care of the trailing newlines in strings read from a file
                if full_url not in done_urls:  # fast lookup in set
                    yield full_url  # yield instead of appending

# usage
for url in yield_urls():
    pass  # do something with url
Łukasz Rogalski
  • Don't call `readlines` unless you need its argument - just use `list(f)` and `set(f)` instead. You're also using `''.join` where you should just use `+`: `url[7:] + subpath`. – Veedrac Apr 24 '15 at 05:15
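A minimal sketch with those comment suggestions folded in and the trailing newlines stripped explicitly (the file names and the `url[7:]` slicing are carried over from the answer above as-is; the stripping comprehensions replace the bare `readlines()` calls):

def yield_urls():
    with open("paths.txt") as f:
        paths = [line.rstrip("\n") for line in f]  # newlines stripped once, up front

    with open("done.txt") as f:
        done_urls = set(line.rstrip("\n") for line in f)  # set for O(1) membership tests

    with open("urls.txt") as f:
        for url in f:
            url = url.rstrip("\n")
            for subpath in paths:
                full_url = url[7:] + subpath  # plain + instead of ''.join, as suggested
                if full_url not in done_urls:
                    yield full_url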

Searching in dictionaries is faster than searching in lists; see Python: List vs Dict for look up table.

import re

URLS_TO_CHECK = {}  # defined as global, needed later

def generate_list():
    urls = open("urls.txt", "r").read().splitlines()
    paths = open("paths.txt", "r").read().splitlines()
    done = dict([(l, True) for l in open("done.txt", "r").read().splitlines()])  # previously done URLs

    for i in range(len(urls)):
        for x in range(len(paths)):
            url = re.search('(http://(.+?)....)', urls[i])  # needed
            url = "%s%s" % (url.group(1), paths[x])
            if url not in URLS_TO_CHECK:
                if url not in done:
                    URLS_TO_CHECK[url] = True  # result is in URLS_TO_CHECK.keys()
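
A rough, illustrative timing sketch of that claim (the data is made up; the URLs and sizes are arbitrary, and a set would behave like the dict here):

import timeit

done_list = ["http://example.com/%d" % i for i in range(200000)]  # plain list, O(n) membership test
done_dict = dict.fromkeys(done_list, True)                        # dict with the same keys, O(1) membership test

# the list lookup scans the entries one by one...
print(timeit.timeit(lambda: "http://example.com/x" in done_list, number=100))
# ...while the dict hashes the key and answers in roughly constant time
print(timeit.timeit(lambda: "http://example.com/x" in done_dict, number=100))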
Jose Ricardo Bustos M.
import re

URLS_TO_CHECK = set(re.findall(r"http://.+?....", open("urls.txt", "r").read()))
for url in URLS_TO_CHECK:
    for path in paths:
        check_url(url + path)

will probably be much faster ... and I think it's essentially the same ...
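For completeness, a sketch of the same idea with `paths` loaded and the already-done URLs filtered out through a set, as in the question; the regex is carried over unchanged and `check_url` is assumed to be defined elsewhere:

import re

with open("urls.txt") as f:
    URLS_TO_CHECK = set(re.findall(r"http://.+?....", f.read()))
with open("paths.txt") as f:
    paths = [line.strip() for line in f]
with open("done.txt") as f:
    done = set(line.strip() for line in f)

for url in URLS_TO_CHECK:
    for path in paths:
        full_url = url + path
        if full_url not in done:   # O(1) lookup instead of scanning a list
            check_url(full_url)    # check_url assumed to exist, as in the answer above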

Joran Beasley