-3

I have a .txt file which has the following format:

1 a 0.01 0.03 0.01 ...
2 b 0.04 0.03 0.01 ...

which may contain any number of additional numeric columns. I need to find the index of the maximum value in each row, for a large number of rows (2.5 million).

So far, my approach has been to build up an array of indices and read the file line by line. I've been trying to avoid reading the whole file into memory due to its size:

import numpy as np

indices = []

with open('file.txt') as f:
    for line in f:
        # Skip the first two columns (row number and label) and convert the rest to floats
        numbers = [float(s) for s in line.split()[2:]]
        indices.append(np.argmax(numbers))

However, this takes very long, and I am wondering if there is a more efficient method/package I could use.

petezurich
  • How long a time is "very long"? Also, FYI that array is called a `list` in Python. – jarmod Oct 27 '22 at 11:09
  • Welcome to stack overflow. Please review the posting guidelines. Seeking recommendations for books, tools, software libraries is likely to lead to opinion-based answers and is not the intended purpose of this site. – possum Oct 27 '22 at 11:12
  • Tried taking a look at pandas? This should be faster and will give you a nice Dataframe format. – vegiv Oct 27 '22 at 11:13
  • Does this answer your question? [How to read specific lines from a file (by line number)?](https://stackoverflow.com/questions/2081836/how-to-read-specific-lines-from-a-file-by-line-number) – roman_ka Oct 27 '22 at 11:14
  • Does this answer your question? [Efficiently parsing a large text file in Python?](https://stackoverflow.com/questions/8131197/efficiently-parsing-a-large-text-file-in-python) – wovano Oct 27 '22 at 11:18
  • Generator functions in Python are a pretty neat way to cycle through a large amount of data without loading it all at once: [link](https://realpython.com/introduction-to-python-generators/), combine it with `pandas.read_csv(sep=' ')` (see the sketch after these comments) – roman_ka Oct 27 '22 at 11:19
  • If your file contains data as shown in your question, your code will fail due to ValueError. That's because, for the first line, you will try to convert 'a' to float - and that won't work – DarkKnight Oct 27 '22 at 11:25
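
A minimal sketch of the chunked pandas approach suggested in the comments above. The chunk size, the `sep=r'\s+'` separator, and the assumption that the first two columns are the row number and a label are illustrative choices, not something given in the comments:

import pandas as pd

indices = []
# Read the file in chunks so the full 2.5M rows are never in memory at once
for chunk in pd.read_csv('file.txt', sep=r'\s+', header=None, chunksize=100_000):
    # Drop the first two columns (row number and label), then take the per-row argmax
    values = chunk.iloc[:, 2:].to_numpy()
    indices.extend(values.argmax(axis=1))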

2 Answers

0

Reading a text file line by line is very efficient and obviates the need to have the entire file content in memory.

For the purpose of demonstration I have created a text file with 2,500,000 lines. Each line contains ten pseudo-random floating point numbers.
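
The answer doesn't show how that demo file was created; a minimal sketch that would produce an equivalent file (assuming uniform pseudo-random values and the same path as used below) is:

import random

# Write 2,500,000 lines, each containing ten pseudo-random floats
with open('/Volumes/G-Drive/foo.txt', 'w') as out:
    for _ in range(2_500_000):
        out.write(' '.join(f'{random.random():.6f}' for _ in range(10)) + '\n')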

Processing goes like this:

from time import perf_counter

indexes = []
print('Processing...', end='', flush=True)
start = perf_counter()
with open('/Volumes/G-Drive/foo.txt') as data:
    for line in data:
        # Convert the whitespace-separated values and record the position of the maximum
        values = list(map(float, line.split()))
        idx = values.index(max(values))
        indexes.append(idx)
end = perf_counter()
print(f'Duration={end-start:.2f}')
print(len(indexes))

Output:

Processing...Duration=4.19
2500000

So, just over 4 seconds, and confirmation that we've saved 2.5 million indexes based on the position of the max value in each line.

DarkKnight
-1

Does this help? I took it from a similar question and adapted it to your case.

def recordsFromFile(inputFile):
    # Generator that yields the file one line at a time, so nothing is read ahead
    for line in inputFile:
        yield line

inputFile = open('test.txt')
for record in recordsFromFile(inputFile):
    pass  # Do stuff with each record here
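
To tie this back to the question, the loop body could compute the per-row argmax, for example (a sketch assuming the same layout as in the question, with two leading non-numeric columns):

indices = []
with open('test.txt') as inputFile:
    for record in recordsFromFile(inputFile):
        # Skip the row number and label, then find the position of the maximum value
        values = [float(s) for s in record.split()[2:]]
        indices.append(values.index(max(values)))
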
Whoeza