I have a file with 40 million entries in the form of:

#No Username

And I have a list with 6 million items, where each item is a username.

I want to find the common usernames in the fastest way possible. Here’s what I’ve got so far:

import os
usernames=[]
common=open('/path/to/filf','w')
f=open('/path/to/6 million','r')
for l in os.listdir('/path/to/directory/with/usernames/'):
    usernames.append(l)
#noOfUsers=len(usernames)
for l in f:
    l=l.split(' ')
    if(l[1] in usernames):
        common.write(l[1]+'\n')
common.close()
f.close()

How can I improve the performance of this code?

– crazyaboutliv
  • Note you don't need to loop over os.listdir - just do usernames = os.listdir(...). It's a minor speedup for sure, but cleaner! – Spacedman Jan 19 '11 at 09:32
  • What do you mean by 'common usernames'? Most common? Do you want to count how many times each one appears? – TryPyPy Jan 19 '11 at 09:37
  • Why do you want to do this in your program? No matter what you do, it will still be painfully slow. You should see if you can use Lucene to index your document and use the index to find word frequency. – uncaught_exceptions Jan 19 '11 at 09:56
  • No, I want to find the usernames which occur in both. They will never occur more than once. @doc_180, that thought did cross my mind, but I was hoping that a small script would get the work done, albeit a little slowly. – crazyaboutliv Jan 19 '11 at 10:14

3 Answers

I see two obvious improvements: first, make usernames a set. Then, create a result list and write '\n'.join(resultlist) to the file once.

import os

usernames = []

for l in os.listdir('/path/to/directory/with/usernames/'):
    usernames.append(l)

usernames = set(usernames)

f = open('/path/to/6 million','r')
resultlist = []
for l in f:
    l = l.split()  # split on any whitespace; this also strips the trailing newline
    if l[1] in usernames:
        resultlist.append(l[1])
f.close()

common=open('/path/to/filf','w')
common.write('\n'.join(resultlist) + '\n')
common.close()

Edit: assuming all you want is to find the most common names:

from collections import Counter

usernames = set(os.listdir('/path/to/directory/with/usernames/'))

f = open('/path/to/6 million')
names = (line.split()[1] for line in f)
name_counts = Counter(name for name in names if name in usernames)
print name_counts.most_common()

Edit 2: Given the clarification, here's how to create a file that contains the names common to the usernames directory and the 6-million-line file:

import os
usernames = set(os.listdir('/path/to/directory/with/usernames/'))

f = open('/path/to/6 million')
names = (line.split()[1] for line in f)  # split() also strips the trailing newline
resultlist = [name for name in names if name in usernames]

common = open('/path/to/filf','w')
common.write('\n'.join(resultlist) + '\n')
common.close()
– TryPyPy
  • Thanks a lot @Try. I wrote the same code that you have written after reading your improvement suggestions. A noobish question, but how are sets faster than lists? – crazyaboutliv Jan 19 '11 at 10:15
  • For membership tests like `x in y`, sets and dictionaries are much faster than lists and tuples. The difference in speed grows with list/set length: a long set takes virtually the same time as a short one (a small timing sketch follows these comments). – TryPyPy Jan 19 '11 at 10:25
  • I ran the code for a test set having maybe some 3000 files or something, and even for that, it's been running for 20 minutes now and counting :( :( – crazyaboutliv Jan 19 '11 at 10:28
  • @Try, oh thanks. I will remember that. I have always used lists; time to change that. Thanks. – crazyaboutliv Jan 19 '11 at 10:28
  • Which version did you use in the test? How much memory is it using and how much do you have? – TryPyPy Jan 19 '11 at 10:33
  • I used the first one and that is still running. Then I realized that you had posted a third one, and I used that. The third one ran pretty fast; it's already done, and it seems there are no common users :'( :'( :'( :'( . I ran the 3rd version on the server and the 1st one on my system, and both have behaved nicely and not made my system cry like earlier. – crazyaboutliv Jan 19 '11 at 11:34
  • Maybe there's a bug in my code... like line[1] containing a newline ('\n') and the entries in usernames not having one? – TryPyPy Jan 19 '11 at 11:37
  • All usernames have a newline ending. They were in the form of URLs ("http://twitter.com/abcd"), and when I parsed them, I had not removed the newlines. But, just to make doubly sure, I will run it that way too. Thanks. – crazyaboutliv Jan 19 '11 at 11:44
  • I have a doubt though. Essentially, there is no difference between the first and third versions, so what is the reason for the 3rd one being so much faster than the 1st (keeping aside the fact that the 3rd one ran on double the RAM)? – crazyaboutliv Jan 19 '11 at 11:45
  • Great question. It would make sense if the condition in the 3rd version wasn't appending any name (and the 1st was), but I can't see that. Maybe the 1st machine's smaller RAM made it use virtual memory (i.e., slow disk access)? – TryPyPy Jan 19 '11 at 11:53
  • Ohh, btw, like there is Eclipse for Java, is there at least a code profiler and a nice text editor for Python? I am kinda tired of using gedit. – crazyaboutliv Jan 19 '11 at 12:19
  • See http://stackoverflow.com/questions/81584/what-ide-to-use-for-python . For Linux, Eric is fast, free and powerful; Komodo, Wing IDE, and Eclipse with PyDev are also popular. – TryPyPy Jan 19 '11 at 12:28
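
A minimal sketch of the speed difference described in the comments above, using the standard timeit module (illustrative only; the absolute numbers vary by machine):

import timeit

setup = "items = list(range(100000)); s = set(items)"

# list membership scans the elements one by one: O(n) in the worst case
print timeit.timeit("99999 in items", setup=setup, number=100)

# set membership is a single hash lookup: O(1) on average
print timeit.timeit("99999 in s", setup=setup, number=100)

The gap grows with the size of the collection, which is why converting usernames to a set matters so much at these data sizes.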

If you create a dict with usernames as keys, then testing for the existence of a key in a dict is much faster than testing for the presence of an element in a list.
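
For instance, a minimal sketch of that approach, assuming the same paths as the question (the values are irrelevant for a membership test, which is why a set expresses the same intent more directly, but the lookup cost is the same):

import os

# keys are the usernames; the values don't matter for a membership test
usernames = dict.fromkeys(os.listdir('/path/to/directory/with/usernames/'))

f = open('/path/to/6 million')
names = (line.split()[1] for line in f)
common = [name for name in names if name in usernames]  # hash lookup, not a linear scan
f.close()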

– Spacedman

If this is an operation you will perform more than once, may I suggest threading? The following is a rough sketch.

First, split the files up into 100,000-line chunks on Linux:

> split -l 100000 usernames.txt usernames_

Then, spawn some threads to process the chunks in parallel:

import glob
import threading

usernames_one = set()
usernames_two = set()
filereaders = []

# Each thread reads every line of its file into the given set
class Filereader(threading.Thread):
    def __init__(self, filename, username_set):
        threading.Thread.__init__(self)
        self.filename = filename
        self.username_set = username_set

    def run(self):
        for line in open(self.filename):
            self.username_set.add(line.strip())

# loop through the usernames_ chunk files and spawn a thread for each:
for chunk in glob.glob('usernames_*'):
    f = Filereader(chunk, usernames_one)
    filereaders.append(f)
    f.start()
# do the same loop over the second file's chunks, filling usernames_two

# at the end, wait for all threads to complete
for f in filereaders:
    f.join()

# then do a simple set intersection with '&' (note that '^' would be the
# symmetric difference: names appearing in exactly one of the two sets)
common_usernames = usernames_one & usernames_two

# then write the common set to a file:
common_file = open("common_usernames.txt", 'w')
common_file.write('\n'.join(common_usernames))
common_file.close()

You'll have to check whether set addition is a thread-safe procedure. If not, you can of course create a list of sets (one for each file processed by a thread) and at the end union them all before intersecting, as sketched below.
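
A sketch of that fallback, reusing the Filereader class and glob pattern from above: give each thread its own private set, so no object is shared across threads, and merge them once the threads finish:

per_thread_sets = []
filereaders = []

for chunk in glob.glob('usernames_*'):
    chunk_set = set()  # private to this one thread; nothing is shared
    per_thread_sets.append(chunk_set)
    f = Filereader(chunk, chunk_set)
    filereaders.append(f)
    f.start()

for f in filereaders:
    f.join()

# union the per-thread sets, then intersect with the other side as before
usernames_one = set().union(*per_thread_sets)
common_usernames = usernames_one & usernames_two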

– atp