First, the (IMO) simplest solution
If, as it seems, the lines are completely independent, just split your file into N chunks, pass the filename to open as a program argument, and run multiple instances of your current script, starting them manually from multiple command lines.
Pros:
- No need to deal with multiprocessing, inter-process communication, etc.
- No need to alter the code much
Cons:
- You need to preprocess the big file by splitting it into chunks (although this will be much faster than your current execution, since you won't have an open-close-per-line scenario)
- You need to start the processes yourself, passing the appropriate filename for each of them
This would be implemented as:
Preprocessing:
    APPROX_CHUNK_SIZE = int(1e9)  # ~1 GB per file, adjust as needed

    with open('big_file.txt') as fp:
        chunk_id = 0
        next_chunk = fp.readlines(APPROX_CHUNK_SIZE)
        while next_chunk:
            with open('big_file_{}.txt'.format(chunk_id), 'w') as ofp:
                ofp.writelines(next_chunk)
            chunk_id += 1
            next_chunk = fp.readlines(APPROX_CHUNK_SIZE)
From the readlines docs:
If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read.
Doing it this way won't ensure an even number of lines in all chunks, but will make preprocessing much faster, since you're reading in blocks and not line-by-line. Adapt the chunk size as needed.
Also, note that by using readlines we can be sure we won't have lines broken between chunks, but since the function returns a list of lines, we use writelines to write it to our output file (which is equivalent to looping over the list and calling ofp.write(line) for each element). For the sake of completeness, let me note that you could also concatenate all the strings in memory and call write just once (i.e., do ofp.write(''.join(next_chunk))), which might bring you a (minor) performance benefit, paid for in (much) higher RAM usage.
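For reference, a minimal sketch of that variant (the same loop as above, with the list of lines joined and written in a single call):

    APPROX_CHUNK_SIZE = int(1e9)  # ~1 GB per file, adjust as needed

    with open('big_file.txt') as fp:
        chunk_id = 0
        next_chunk = fp.readlines(APPROX_CHUNK_SIZE)
        while next_chunk:
            with open('big_file_{}.txt'.format(chunk_id), 'w') as ofp:
                # Join the list of lines into one string and write it in a single call;
                # this temporarily duplicates the whole chunk in memory.
                ofp.write(''.join(next_chunk))
            chunk_id += 1
            next_chunk = fp.readlines(APPROX_CHUNK_SIZE)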
Main script:
The only modifications you need are at the very top:
    import sys

    file = sys.argv[1]
    ...  # rest of your script here
By using argv you can pass command-line arguments to your program (in this case, the file to open). Then, just run your script as:

    python process_the_file.py big_file_0.txt

This will run one process. Open multiple terminals and run the same command with big_file_N.txt for each, and they'll be independent from each other.
Note: I use argv[1] because for all programs the first value of argv (i.e., argv[0]) is always the program name.
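Optionally, instead of opening the terminals by hand, you could start all the instances from a small launcher script. This is just a sketch, assuming the big_file_N.txt naming from the preprocessing step; NUM_CHUNKS is made up and must match the number of chunks you actually produced:

    import subprocess
    import sys

    NUM_CHUNKS = 80  # made up: set this to the number of chunks you produced

    # Start one independent process per chunk file, then wait for all of them.
    # Note: this launches everything at once; batch the list if you have far
    # fewer CPU cores than chunks.
    processes = [
        subprocess.Popen([sys.executable, 'process_the_file.py',
                          'big_file_{}.txt'.format(i)])
        for i in range(NUM_CHUNKS)
    ]
    for p in processes:
        p.wait()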
Then, the multiprocessing solution
Although effective, the first solution is not quite elegant, especially since you'll end up with 80 files if you start from an 80 GB file.
A cleaner solution is to make use of Python's multiprocessing module (important: NOT threading! If you don't know the difference, look up "global interpreter lock" and why multithreading in Python doesn't work the way you think it would).
The idea is to have one "producer" process that opens the big file and continuously puts lines from it into a queue, and a pool of "consumer" processes that pull lines off the queue and do the processing (a minimal sketch of this idea with an explicit queue follows the pros and cons below).
Pros:
- One script does everything
- No need to open multiple terminals and type commands by hand
Cons:
- Complexity
- Uses inter-process communication, which has some overhead
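To make the producer/consumer picture concrete, here is a minimal stand-alone sketch with an explicit multiprocessing.Queue (the consumer function and N_CONSUMERS are names made up for the illustration). The actual script below doesn't do this by hand: it uses multiprocessing.Pool, which manages the queueing internally.

    import multiprocessing

    def consumer(queue):
        # Pull lines off the queue until the producer sends the sentinel.
        while True:
            line = queue.get()
            if line is None:  # sentinel: no more work
                break
            # ... per-line processing would go here

    if __name__ == '__main__':
        N_CONSUMERS = multiprocessing.cpu_count()
        # Bound the queue so the producer cannot read far ahead of the consumers.
        queue = multiprocessing.Queue(maxsize=10000)

        workers = [multiprocessing.Process(target=consumer, args=(queue,))
                   for _ in range(N_CONSUMERS)]
        for w in workers:
            w.start()

        # Producer: the father reads the big file and feeds lines to the queue.
        with open('80_gig_file.txt') as infile:
            for line in infile:
                queue.put(line)

        for _ in workers:
            queue.put(None)  # one sentinel per consumer
        for w in workers:
            w.join()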
This would be implemented as follows:
    # Libraries
    import os
    import multiprocessing

    outputdirectory = "sorted"
    depth = 4  # This is the tree depth

    # Process each line in the file
    def pipeline(line):
        # Strip symbols from line
        line_stripped = ''.join(e for e in line if e.isalnum())
        # Reverse the line
        line_stripped_reversed = line_stripped[::-1]
        file = outputdirectory
        # Create path location in folder-based tree
        for i in range(min(depth, len(line_stripped))):
            file = os.path.join(file, line_stripped_reversed[i])
        # Create folders if they don't exist
        os.makedirs(os.path.dirname(file), exist_ok=True)
        # Name the file, with "-file"
        file = file + "-file"
        # This is the operation that slows everything down.
        # It opens, writes and closes a lot of small files.
        # I cannot keep them open because in the worst case about half a million
        # files (n = 26^4) could be open at once.
        f = open(file, "a")
        f.write(line)
        f.close()

    if __name__ == '__main__':
        # Variables
        file = "80_gig_file.txt"

        # Preparations
        os.makedirs(outputdirectory, exist_ok=True)
        pool = multiprocessing.Pool()  # by default, one process per CPU
        LINES_PER_PROCESS = 1000  # adapt as needed; higher is better, but consumes more RAM
        with open(file) as infile:
            # Feed the file to the pool in chunks of LINES_PER_PROCESS lines and
            # consume the (None) results as they arrive, so worker errors surface here.
            for _ in pool.imap(pipeline, infile, LINES_PER_PROCESS):
                pass
            pool.close()
            pool.join()
The if __name__ == '__main__' line is a barrier that separates the code that runs in every process from the code that runs only in the "father" process. Every process defines pipeline, but only the father actually spawns a pool of workers and applies the function. You can find more details about multiprocessing.Pool.imap here.
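If you want to see that in action, here is a small stand-alone sketch (just an illustration, not part of the solution above): with the "spawn" start method, each worker re-imports the module, so the module-level print runs once per process, while the guarded block runs only in the father.

    import multiprocessing
    import os

    # Module-level code: re-executed in every worker process when the
    # "spawn" start method is used (the default on Windows and macOS).
    print("module imported in PID", os.getpid())

    def work(x):
        return x * x

    if __name__ == '__main__':
        # This block runs only in the "father" process.
        multiprocessing.set_start_method("spawn")
        print("main running in PID", os.getpid())
        with multiprocessing.Pool(2) as pool:
            print(pool.map(work, range(5)))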
Edit:
Added closing and joining of the pool to prevent the main process from exiting and killing the children in the process.