
I have a large text file (~7 GB) and I am looking for the fastest way to read it. I have been reading about several approaches, such as reading it chunk by chunk, in order to speed up the process.

For example, effbot suggests the following in order to process 96,900 lines of text per second:

# File: readline-example-3.py

file = open("sample.txt")

while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass  # do something with the line

Other authors suggest using islice():

from itertools import islice

with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines

`list(islice(f, n))` will return a list of the next `n` lines of the file `f`. Using this inside a loop will give you the file in chunks of `n` lines.
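
A reusable wrapper around this pattern might look like the sketch below; the function name read_in_chunks and the 100,000-line default chunk size are illustrative choices on my part, not something prescribed by the posts above.

from itertools import islice

def read_in_chunks(path, n=100000):
    """Yield successive lists of up to n lines from the file at path."""
    with open(path) as f:
        while True:
            chunk = list(islice(f, n))
            if not chunk:
                break
            yield chunk

# Usage: process the file 100,000 lines at a time.
for lines in read_in_chunks("sample.txt"):
    for line in lines:
        pass  # do something with each line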

Gianni Spear
  • Why won't you check yourself what's fastest for you? – piokuc Feb 18 '13 at 19:54
  • Check out the suggestions here: http://stackoverflow.com/questions/14863224/efficient-reading-of-800-gb-xml-file-in-python-2-7 – BenDundee Feb 18 '13 at 19:56
  • @Nix I don't wish to read line by line, but chunk by chunk – Gianni Spear Feb 18 '13 at 20:07
  • 3
    If you look through the answers, someone shows how to do it in chunks. – Nix Feb 18 '13 at 20:15
  • Dear @Nix, I read about "Speeding up line reading" at http://effbot.org/zone/readline-performance.htm, where the author suggests: "if you're processing really large files, it would be nice if you could limit the chunk size to something reasonable". That page is quite old (June 09, 2000), and I am looking for a newer (and faster) approach. – Gianni Spear Feb 18 '13 at 20:18

1 Answer

with open(<FILE>) as FileObj:
    for lines in FileObj:
        print lines # or do some other thing with the line...

will read one line at a time into memory, and close the file when done...

Morten Larsen
  • Morten, line-by-line became too slow. – Gianni Spear Feb 18 '13 at 20:05
  • 7
    aay, read too fast... – Morten Larsen Feb 18 '13 at 21:27
  • 1
    Looks like that the result of the loop of FileObj is a single character, not line. – Xb74Dkjb Jun 02 '17 at 01:13
  • A 7 GB file could contain only a single line, and in that case your solution would be as inefficient as just reading the whole file with `FileObj.read()`. It would be better to read MB-sized chunks here (for example, 5 MB chunks), which can be accomplished by calling `FileObj.read(5 * 1024 * 1024)` multiple times. – Demian Wolf Jun 21 '20 at 14:20
  • 1
    @DemianWolf Thanks for the comment, I have a question. What happens if the given input size truncates half of a word. For example, if the last word is Responsibility and you hit the chunk limit at Respon of the full word Responsibility, how would you handle it. Is there is way not to break the words or should we need to follow some other approach? Thanks! – Sunny Jul 17 '20 at 02:29
  • @Sunny, if the file is reasonably small, you can just get all the words from the whole file content (`with open("my_file.txt") as fp: print(fp.read().split())`). But in your case it seems you are reading a large file (otherwise why would you split it into chunks?). You can use the same chunking approach, with one difference: after you read a chunk, keep reading characters one at a time until you reach a space (or a similar character such as \n or \r), then append the newly read part to the last chunk; see the sketch after these comments. – Demian Wolf Jul 17 '20 at 14:47
  • 1
    @DemianWolf, I had a similar approach in mind but I was hoping maybe there will be a better way to handle it. Thanks anyway! – Sunny Jul 18 '20 at 00:14
  • I think this is the slowest method. It would be faster if it loaded the data into memory in portions rather than the complete file content. – Guru Bhandari Mar 12 '22 at 21:38
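
Picking up the chunk-reading suggestion from the comments above, here is a minimal sketch of reading a large file in fixed-size pieces without splitting a word across two chunks. The 5 MB chunk size and the name read_in_word_chunks are assumptions for illustration, not part of the original answer.

def read_in_word_chunks(path, chunk_size=5 * 1024 * 1024):
    """Yield the file in roughly chunk_size pieces, never splitting a word."""
    leftover = ""
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                if leftover:
                    yield leftover
                break
            chunk = leftover + chunk
            # Cut at the last whitespace so no word is split in half.
            cut = max(chunk.rfind(" "), chunk.rfind("\n"))
            if cut == -1:
                # No whitespace in this chunk yet; keep accumulating.
                leftover = chunk
                continue
            yield chunk[:cut]
            leftover = chunk[cut + 1:]

# Usage: process roughly 5 MB of text at a time.
for piece in read_in_word_chunks("sample.txt"):
    pass  # do something with each piece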