My goal for this script is to take a folder full of text files, capture every line across all the files, and output a single file containing each unique line in descending order of frequency.
It doesn't just find the unique lines; it counts how often each unique line appears across all the files.
The script needs to handle a lot of text - around 2 GB at least - so it has to be memory-efficient. So far, I have not achieved this goal.
import os, sys  # needed for looking into a directory
from sys import argv  # allows passing of arguments from the command line, where I call the script
from collections import Counter  # allows the lists to be sorted by number of occurrences

# Pass an argument containing the directory of files to be combined
dir_string = str((argv[1]))
filenames = []

# Get the names of the files in the directory and add them to a list
for file in os.listdir(dir_string):
    if file.endswith(".txt"):
        filenames.append(os.path.join(dir_string, file))  # add the file's path to the list

# Declare the name of the file to be written
out_file_name = dir_string + ".txt"
# Create the output file
outfile = open(out_file_name, "w")

# Declare the list to be filled with lines seen
lines_seen = []

# Parse all lines in all files
for fname in filenames:              # for all files in the list
    with open(fname) as infile:      # open a given file
        for line in infile:          # for all lines in the current file, read one by one
            # Here's the problem.
            lines_seen.append(str(line).strip('\n'))  # add the line to the list of lines seen,
                                                      # removing the newline

# Organizes the list by number of occurrences, but produces a list of the form
# [(item a, # of a occurrences), (item b, # of b occurrences), ...]
lines_seen = Counter(lines_seen).most_common()

# Write the result line by line to the output file
for item in lines_seen:
    outfile.write(str(item[0]) + "\n")
outfile.close()
When I get an error message, it points to the line lines_seen.append(str(line).strip('\n')).
I first tried adding the lines without converting to string and stripping, but then each string included a visible '\n', which was not acceptable to me. For smaller lists, converting to string and stripping wasn't too memory-taxing, but I couldn't figure out a more efficient way of getting rid of the newline character.
On my PC this causes a MemoryError; on my Mac it gives me Killed: 9. I haven't tried it on Linux yet.
Do I need to convert to binary, assemble my ordered list and then convert back? How else can this be done?
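One alternative I could try (a minimal sketch, untested at the 2 GB scale): stream each line straight into a Counter as it is read, so duplicate lines share a single dict entry instead of each occupying its own list slot; rstrip('\n') also drops the trailing newline without the str() conversion:

import os
from sys import argv
from collections import Counter

dir_string = argv[1]
counts = Counter()

for file in os.listdir(dir_string):
    if file.endswith(".txt"):
        with open(os.path.join(dir_string, file)) as infile:
            for line in infile:
                counts[line.rstrip('\n')] += 1  # duplicates collapse into one key

# most_common() yields (line, count) pairs sorted by descending count
with open(dir_string + ".txt", "w") as outfile:
    for line, count in counts.most_common():
        outfile.write(line + "\n")

Peak memory then scales with the number of unique lines rather than the total number of lines, which only helps to the extent that the input contains many duplicates; if the unique lines alone exceed RAM, an external sort (like the Unix pipeline below) is still needed.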
EDIT - It has become clear that the best overall way for me to do this was with Unix commands:
cd DirectoryWithFiles
cat *.txt | sort | uniq -c | sort -n -r > wordlist_with_count.txt
cut -c6- wordlist_with_count.txt > wordlist_sorted.txt
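One caveat on the cut -c6- step: it assumes the count column printed by uniq -c always has the same fixed width. Stripping the leading whitespace and digits with a substitution is less width-dependent (a sketch, assuming a POSIX sed):

sed 's/^ *[0-9]* //' wordlist_with_count.txt > wordlist_sorted.txt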