I need to use Python to take N random lines from a large txt file. These files are basically tab-delimited tables. My task has the following constraints:
- These files may contain headers (some have multi-line headers).
- Headers need to appear in the output in the same order.
- Each line can be taken only once.
- The largest file currently is about 150GB (about 60 000 000 lines).
- Lines are roughly the same length in a file, but may vary between different files.
- I will usually be taking 5000 random lines (I may need up to 1 000 000 lines).
Currently I have written the following code:
    import os
    import random

    inputSize = os.path.getsize(options.input)
    usedPositions = []  # Start positions of the lines already in output
    with open(options.input) as input:
        with open(options.output, 'w') as output:
            # Handling of header lines
            for i in range(int(options.header)):
                output.write(input.readline())
                usedPositions.append(input.tell())
            # Find and write all random lines
            for j in range(int(args[0])):
                input.seek(random.randrange(inputSize))  # Seek to random position in file (probably middle of a line)
                input.readline()  # Read the (probably incomplete) line; the next readline() yields a complete line
                while input.tell() in usedPositions:  # Take a new line if the current one is already taken
                    input.seek(random.randrange(inputSize))
                    input.readline()
                usedPositions.append(input.tell())  # Record the start position of the chosen line
                randomLine = input.readline()  # Complete line
                if len(randomLine) == 0:  # Take the first data line if the end of the file was reached
                    input.seek(0)
                    for i in range(int(options.header)):  # Skip the headers
                        input.readline()
                    randomLine = input.readline()
                output.write(randomLine)
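(For context: `options` and `args` in the snippet come from command-line parsing that isn't shown. Something like the following `optparse` setup would match the names used — the exact flag names and the sample arguments are my guesses, not part of my actual script:)

```python
from optparse import OptionParser

# Hypothetical parser matching the names used above:
# options.input, options.output, options.header, and args[0] = N.
parser = OptionParser(usage="usage: %prog [options] N")
parser.add_option("-i", "--input", dest="input", help="input file")
parser.add_option("-o", "--output", dest="output", help="output file")
parser.add_option("--header", dest="header", default="0",
                  help="number of header lines to copy to the output")

# Example invocation with synthetic arguments:
options, args = parser.parse_args(
    ["-i", "in.txt", "-o", "out.txt", "--header", "1", "5000"])
```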
This code seems to be working correctly.
I am aware that this code is biased toward lines that follow the longest lines in the input: seek() is most likely to land inside the longest line, and it is the next line that gets written to the output. This does not matter here, since the lines in one input file are roughly the same length. I am also aware that this code loops forever if N is larger than the number of lines in the input file. I will not implement a check for this, as getting the line count takes a lot of time.
RAM and HDD limitations are irrelevant. I am only concerned about the speed of the program. Is there a way to further optimize this code? Or perhaps there is a better approach?
EDIT: To clarify, the lines within one file have roughly the same length. However, I have multiple files that this script needs to run on, and the average line length differs between them. For example, file A may have ~100 characters per line and file B ~50000 characters per line. I do not know the average line length of any file beforehand.