
My goal for this script is to take a folder full of text files, capture each line in all files, and then output one file containing every unique line in descending order of frequency.

It doesn't just find the unique lines; it also counts how frequently each unique line appears across all the files.

The script needs to handle a LOT of text - around 2 GB at least - so it has to be memory efficient. So far, I have not achieved this goal.

import os, sys #needed for looking into a directory
from sys import argv #allows passing of arguments from command line, where I call the script
from collections import Counter #allows the lists to be sorted by number of occurrences

#Pass argument containing Directory of files to be combined
dir_string = str((argv[1]))

filenames=[]  

#Get name of files in directory, add them to a list
for file in os.listdir(dir_string):
    if file.endswith(".txt"):
        filenames.append(os.path.join(dir_string, file)) #add names of files to a list

#Declare name of file to be written
out_file_name = dir_string+".txt"

#Create output file
outfile = open(out_file_name, "w")

#Declare list to be filled with lines seen
lines_seen = []

#Parse All Lines in all files
for fname in filenames: #for all files in list
    with open(fname) as infile: #open a given file
        for line in infile: #for all lines in current file, read one by one
            #Here's the problem.
            lines_seen.append(str(line).strip('\n')) #add line to list of lines seen,
                                                     #removing the endline

#Organize the list by number of occurrences; this produces a list that contains
# [(item a, # of a occurrences), (item b, # of b occurrences), ...]
lines_seen = Counter(lines_seen).most_common()

#Write the lines one by one to the output file
for item in lines_seen:
    outfile.write(str(item[0])+"\n")

outfile.close()

When I get an error message, it points to the line `lines_seen.append(str(line).strip('\n'))`.

I first tried to add the lines without converting to string and stripping, but that left a visible '\n' in each string, which was not acceptable to me. For smaller lists, converting to string and stripping wasn't too memory-taxing. I couldn't figure out a more efficient way of getting rid of the endline character.

On my PC this causes a MemoryError; on my Mac it gives me Killed: 9. I haven't tried it on Linux yet.

Do I need to convert to binary, assemble my ordered list and then convert back? How else can this be done?

EDIT - It has become clear that the best overall way for me to do this was with Unix commands, since sort can spill to temporary files on disk instead of holding everything in memory:

cd DirectoryWithFiles
cat *.txt | sort | uniq -c | sort -n -r > wordlist_with_count.txt
cut -c6- wordlist_with_count.txt > wordlist_sorted.txt
berzerk0
  • instead of keeping the `List` in memory, why not write your lines to a temp file? – Nishanth Matha Mar 22 '17 at 01:04
  • as of writing this, I wasn't sure how to sort that file without putting it into a list or set, which brings me back to the same problem – berzerk0 Mar 22 '17 at 01:10
  • according to this thread: http://stackoverflow.com/questions/41315394/file-size-limit-for-read you can read files up to 2 GB – Nishanth Matha Mar 22 '17 at 01:31
  • if it's more than 2 GB, as you mentioned in the post... you're better off dividing it into chunks of files, or even smaller chunks of lists, and trying to sort each chunk individually and write it to one main output file – Nishanth Matha Mar 22 '17 at 01:34
  • I can, but that may defeat the purpose of getting total occurrences throughout the whole directory – berzerk0 Mar 22 '17 at 01:36
  • `defeat the purpose` how, I wonder? You're still getting the total occurrences throughout the directory... it's only that you're creating an intermediate buffer to read and sort in chunks, but your end output will still be what you desire – Nishanth Matha Mar 22 '17 at 01:46
  • Wouldn't I then have to sort the big output file at the end? What is the most common in one chunk might not be the most common in another – berzerk0 Mar 22 '17 at 01:49
  • nope you wouldn't... it's more like a binary search!!! For instance, if you have 3 chunks, you `sort(1,2)` then `sort(2,3)` then `sort(1,2)` again, and that will give you `sort of (1,2,3)`; you will be using logic similar to this: http://stackoverflow.com/questions/42893884/sorting-int-variables-using-a-function#42893938 – Nishanth Matha Mar 22 '17 at 02:12 (a rough sketch of this chunk-and-merge idea appears after these comments)
  • This sounds promising if the method below doesn't work out. And when I make a chunk, I sort it but do I delete the duplicates? If I understand you correctly, I don't, but in the end it is close enough. Is this correct? – berzerk0 Mar 22 '17 at 02:19
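
For reference, here is a minimal sketch of the chunk-and-merge idea discussed in the comments above: count each chunk of files on its own, spill the partial counts to temp files sorted by line text, then merge the temp files while summing the counts. The function name count_in_chunks, the chunk_size value, and the tab-separated temp-file format are all made up for illustration, and the final ranking step still holds one (line, count) pair per unique line in memory.

import heapq
import itertools
import os
import tempfile
from collections import Counter

def count_in_chunks(filenames, out_path, chunk_size=50):
    temp_paths = []

    # 1. Count each chunk of files separately and spill the partial counts to a
    #    temp file, one "line<TAB>count" record per row, sorted by line text.
    for i in range(0, len(filenames), chunk_size):
        counts = Counter()
        for fname in filenames[i:i + chunk_size]:
            with open(fname) as infile:
                for line in infile:
                    counts[line.rstrip('\n')] += 1
        with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
            for text in sorted(counts):
                tmp.write('{}\t{}\n'.format(text, counts[text]))
            temp_paths.append(tmp.name)

    # 2. Merge the sorted partial files, summing the counts of identical lines.
    def records(path):
        with open(path) as partial:
            for row in partial:
                text, _, count = row.rstrip('\n').rpartition('\t')
                yield text, int(count)

    merged = heapq.merge(*(records(p) for p in temp_paths))
    totals = [(text, sum(count for _, count in group))
              for text, group in itertools.groupby(merged, key=lambda rec: rec[0])]

    # 3. Sort by total count, most common first, and write the lines out.
    totals.sort(key=lambda rec: rec[1], reverse=True)
    with open(out_path, 'w') as outfile:
        for text, _ in totals:
            outfile.write(text + '\n')

    # Clean up the intermediate files.
    for p in temp_paths:
        os.remove(p)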

3 Answers


I would have solved this problem like this:

import os, sys #needed for looking into a directory
from sys import argv #allows passing of arguments from command line, where I call the script
from collections import Counter #allows the lists to be sorted by number of occurrences

#Pass argument containing Directory of files to be combined
dir_string = str((argv[1]))


#Get name of files in directory, add them to a list
filenames = []
for file in os.listdir(dir_string):
    if file.endswith(".txt"):
        filenames.append(os.path.join(dir_string, file)) #add names of files to a list


#Declare name of file to be written
out_file_name = os.path.join(dir_string, 'out.txt')


# write all the files to a single file instead of a list
with open(out_file_name, "w") as outfile:
    for fname in filenames: #for all files in list
        with open(fname) as infile: #open a given file
            for line in infile: #for all lines in current file, read one by one
                outfile.write(line)

# create a counter object from outfile
with open(out_file_name, "r") as outfile:
    c = Counter(outfile)



print "sorted by line alphabhitically"
from operator import itemgetter   
print sorted(c.items(),key=itemgetter(0))

print "sorted by count"
print sorted(c.items(), key=itemgetter(1))


def index_in_file(unique_line):
    with open(out_file_name, "r") as outfile:
        for num, line in enumerate(outfile, 1):
            if unique_line[0] in line:
                return num

print "sorted by apperance of line in the outfile"
s= sorted(c.items(),key=index_in_file)
print s

# Once you decide what kind of sort you want, write the sorted elements to the output file.
with open(out_file_name, "w") as outfile:
    for ss in s:
        outfile.write(ss[0].rstrip()+':'+str(ss[1])+'\n')
plasmon360

This is the approach to reduce memory consumption I was suggesting in the comments under one of the other answers:

import collections

lines_seen = collections.Counter()

for filename in filenames:
    with open(filename, 'r') as file:
        for line in file:
            line = line.strip('\n')
            if line:
                lines_seen.update([line])

with open(out_file_name, "w") as outfile:
    for line, count in lines_seen.most_common():
        outfile.write('{}, {}\n'.format(line, count))

Note that line.strip('\n') only ends up removing the newline at the end of each line read, so line.rstrip('\n') would be slightly more efficient. You might also want to remove leading and trailing whitespace entirely by using line.strip(); getting rid of that possibly considerable whitespace before it is stored would further reduce memory usage.
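
For example (the sample line here is made up):

line = '  spam and eggs \n'
print(repr(line.rstrip('\n')))  # '  spam and eggs '  - only the trailing newline is removed
print(repr(line.strip()))       # 'spam and eggs'     - the surrounding whitespace goes too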

martineau

Your problem is obviously a lack of memory.

You could eliminate the redundant lines in lines_seen during the process; that could help.

from collections import Counter
lines_seen = Counter()

# in the for loop, instead of appending to a list:
lines_seen[line.strip('\n')] += 1

# at the end:
for item in lines_seen.most_common():
    outfile.write(str(item[0])+"\n")

EDIT

Another solution would be, as mentioned in the comments:

from collections import Counter
lines_seen = Counter()

# get the files names

for fname in filenames: #for all files in list
    with open(fname) as infile: #open a given file
        lines_seen.update(infile.read().split('\n')) # count all lines of the file at once; note read() loads the whole file into memory

for item in lines_seen.most_common():
    print(item[0], file=outfile)
User9123
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/138799/discussion-on-answer-by-user9123-combine-gigabytes-worth-of-text-into-one-file). – Bhargav Rao Mar 23 '17 at 09:14