
I'm on Windows and using Python 3. Because file readers consume a file line by line by default, I'm having difficulty dealing with my 100 GB text file, which consists of a single line.

I'm aware of solutions such as this, which introduce a custom record separator by replacing a frequent character with \n, but I wonder: is there any way to consume and process my file using only Python?

I have only 8 GB of RAM. My file contains sales records (item, price, buyer, ...), and my processing consists mostly of editing the price numbers. Records are separated from each other by the | character.

user4157124
wiki

2 Answers

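#!/usr/bin/python3
import os

# Open the file; O_BINARY exists only on Windows, where it disables
# text-mode translation of the bytes being read
fd = os.open("foo.txt", os.O_RDWR | getattr(os, "O_BINARY", 0))

# Read at most 12 bytes from the current position
ret = os.read(fd, 12)
print(ret.decode())

# Close the file descriptor
os.close(fd)
print("Closed the file successfully!")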

or

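# filename, max_size and process() are placeholders for your own values
with open(filename, 'rb') as f:
    while True:
        buf = f.read(max_size)  # read at most max_size bytes
        if not buf:             # an empty bytes object means end of file
            break
        process(buf)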

or

from functools import partial

with open('somefile', 'rb') as openfileobject:
    # iter() keeps calling read(1024) until it returns b'' (end of file)
    for chunk in iter(partial(openfileobject.read, 1024), b''):
        do_something(chunk)
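
Note that a fixed-size chunk will usually end in the middle of a record. One way to get whole records out of the chunked reads above is to carry the unfinished tail over to the next chunk. This is only a rough sketch: iter_records and chunk_size are names made up for illustration, and the sample data in the demo is invented, not the asker's real format.

```python
import io


def iter_records(f, sep=b"|", chunk_size=1024 * 1024):
    """Yield sep-delimited records from the binary file object f, one at a time."""
    tail = b""  # incomplete record carried over from the previous chunk
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        parts = (tail + chunk).split(sep)
        tail = parts.pop()  # the last piece may be cut off mid-record
        yield from parts
    yield tail  # whatever follows the final separator


# Small in-memory demo with a deliberately tiny chunk size
demo = io.BytesIO(b"a,1|bb,22|ccc,333")
for record in iter_records(demo, chunk_size=4):
    print(record)
```

Memory use stays around one chunk plus the longest single record, so this should fit comfortably in 8 GB as long as individual records are small.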
kgr

If you're running a 64-bit OS, you could mmap the whole file and let your OS do the actual reading in the background for you. mmapped files mostly present the same interface as a bytearray, so you could do things like:

import mmap

with open('largefile.txt', 'rb') as fd:
    buf = mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ)

You'd then be able to use buf like a normal bytearray, with operations like this to iterate over the records between your separators:

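def split_sep(buf, sep=b'|'):
    pos = 0
    while True:
        # find() returns -1 once no further separator exists
        end = buf.find(sep, pos)
        if end == -1:
            break
        yield buf[pos:end]
        pos = end + len(sep)  # skip past the separator itself
    # the final record has no trailing separator
    yield buf[pos:]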

But this is just a demo. You'd probably want to do something more complicated, such as decoding from bytes before yielding.
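
As a rough illustration of the decoding step, here is a variant that yields str instead of bytes. It is only a sketch under assumptions: iter_decoded is a made-up name, UTF-8 is a guess at the file's encoding, and the "item,price" sample records written to the temporary file are invented for the demo.

```python
import mmap
import os
import tempfile


def iter_decoded(buf, sep=b"|", encoding="utf-8"):
    """Yield each sep-delimited record in buf as a decoded str."""
    pos = 0
    while True:
        end = buf.find(sep, pos)
        if end == -1:
            break
        yield buf[pos:end].decode(encoding)
        pos = end + len(sep)
    yield buf[pos:].decode(encoding)


# Demo on a small temporary file (hypothetical sample data)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sales.txt")
    with open(path, "wb") as f:
        f.write("caf\u00e9,3.50|tea,2.00".encode("utf-8"))
    # Close the mapping before the directory is cleaned up (required on Windows)
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        records = list(iter_decoded(buf))

print(records)
```

Decoding one record at a time means a multi-byte character is never split, because each slice starts and ends at a separator rather than at an arbitrary byte offset.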

Sam Mason