I have a text file with numbers. How do I read the numbers from the text file, one at a time? As in:

def readNumber( file ):
    ....
    return mynumber

The text file may contain arbitrary white space and empty lines, and the line length may be arbitrarily long. It might even be several GB without a line break.

shuhalo

2 Answers

You can read the file one byte at a time, building ints as you go. Since it sounds like the file can be pretty big, I've added a generator function to read the file in chunks:

# Buffered file reader - reads file in "bytes_per_read" chunks
# but returns 1 byte at a time
# I'm not familiar enough with Python I/O to know if this
# is necessary or not
def file_buf_gen(f, bytes_per_read=1024):
    while True:
        buffer = f.read(bytes_per_read)
        if not buffer:
            break
        yield from buffer

# Python 3.8+ version
#def file_buf_gen(f, bytes_per_read=1024):
#    while buffer := f.read(bytes_per_read):
#        yield from buffer


# Yields all numbers in a file. Any non-digit character acts as a separator.
def read_numbers(file):
    num = None
    for b in file_buf_gen(file):
        # isdecimal() rather than isnumeric(): isnumeric() also accepts
        # characters such as '²', which int() rejects
        if b.isdecimal():
            num = num * 10 + int(b) if num is not None else int(b)
        elif num is not None:
            yield num
            num = None
    if num is not None:
        yield num

with open(path_to_file, "r") as f:
    for n in read_numbers(f):
        print(n)
001
  • `isnumeric` is true for some characters that make `int` crash, for example `'²'`. But [`isdecimal` works](https://tio.run/##VU5LCsMgEN17itlFIYSEbkqhJwlZyKjNFKOiFtLTWxOS0r7VzJv3mfDOs3eXa4ilmOgXQG@txkzeJaAl@JhBaSNfNivCzJjSSIu0Ce6/PLeUsmDGRyAgB1G6h@b9Ogx9hbgxqMDqwTlyEvtKBrCjdATyQ7Th7BjJZY5i6mQI2qk6sr1i3SpSfU0rfmoPe4ibZ23Bave9jeskWmia7unpnxWlfAA). – Kelly Bundy Nov 29 '21 at 20:50
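
To make the point in the comment above concrete, here is a small check of how the two predicates treat a plain digit and the superscript two. The helper `int_or_none` is a hypothetical name introduced only for this illustration.

```python
def int_or_none(ch):
    """Return int(ch), or None if int() rejects the character."""
    try:
        return int(ch)
    except ValueError:
        return None

# Columns: character, isnumeric(), isdecimal(), result of int()
for ch in ("7", "²"):
    print(repr(ch), ch.isnumeric(), ch.isdecimal(), int_or_none(ch))
# '7' True True 7
# '²' True False None
```

So a digit test based on `isdecimal` only lets through characters that `int` can actually convert.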

An iterator producing the integers in the file:

from itertools import chain, groupby
from functools import partial

def numbers(file):
    chunks = iter(partial(file.read, 1024), '')
    chars = chain.from_iterable(chunks)
    for isspace, group in groupby(chars, str.isspace):
        if not isspace:
            yield int(''.join(group))

It splits on whitespace, so it also handles negative numbers.
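
As a rough usage sketch, the iterator can be pulled one number at a time with `next`, which is close to the `readNumber` call the question asks for. The `io.StringIO` object and the sample text are assumptions made for this demo; they stand in for a real file opened in text mode.

```python
import io
from functools import partial
from itertools import chain, groupby

def numbers(file):
    # Same pipeline as in the answer: read chunks, flatten to characters,
    # group runs of whitespace / non-whitespace, convert the latter.
    chunks = iter(partial(file.read, 1024), '')
    chars = chain.from_iterable(chunks)
    for isspace, group in groupby(chars, str.isspace):
        if not isspace:
            yield int(''.join(group))

# Sample input with blank lines, tabs, and signed numbers (demo data).
sample = io.StringIO("  12\n\n-7\t+3   40 ")
it = numbers(sample)

first = next(it)   # one number per call
rest = list(it)    # or drain whatever is left
print(first, rest)
# 12 [-7, 3, 40]
```

Once the input is exhausted, `next(it)` raises `StopIteration`; `next(it, None)` returns `None` instead if a sentinel is more convenient.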

Benchmark with shuffled range(10 ** 6):

2.03 seconds  numbers_Johnny_Mopp
0.99 seconds  numbers_Kelly_Bundy

2.06 seconds  numbers_Johnny_Mopp
0.92 seconds  numbers_Kelly_Bundy

2.04 seconds  numbers_Johnny_Mopp
0.95 seconds  numbers_Kelly_Bundy

Full code with sample data creator, correctness check, and benchmark:

from timeit import timeit
from random import randint, choices, shuffle
from itertools import chain, groupby
from functools import partial
from collections import deque

# Create test data
numbers = list(range(10 ** 6))
shuffle(numbers)
with open('test.txt', 'w') as f:
    for number in numbers:
        print(number, end=''.join(choices(' \t\n', k=randint(1, 2))), file=f)

def numbers_Kelly_Bundy(file):
    chunks = iter(partial(file.read, 1024), '')
    chars = chain.from_iterable(chunks)
    for isspace, group in groupby(chars, str.isspace):
        if not isspace:
            yield int(''.join(group))

def numbers_Johnny_Mopp(file):

    def file_buf_gen(f, bytes_per_read=1024):
        while buffer := f.read(bytes_per_read):
            yield from buffer
        
    # Yields all numbers in a file. Ignores anything that is not '0'-'9'
    def read_numbers(file):
        num = None
        for b in file_buf_gen(file):
            if b.isnumeric():
               num = num * 10 + int(b) if num != None else int(b)
            elif num != None:
                yield num
                num = None
        if num != None:
                yield num
    
    return read_numbers(file)

funcs = numbers_Johnny_Mopp, numbers_Kelly_Bundy

# Correctness check
for func in funcs:
    with open('test.txt') as file:
        print(list(func(file)) == numbers)

# Speed tests
for _ in range(3):
    for func in funcs:
        with open('test.txt') as file:
            t = timeit(lambda: deque(func(file), 0), number=1)
            print('%4.2f seconds ' % t, func.__name__)
    print()
Kelly Bundy
  • FWIW, I didn't downvote. As a primarily C++ and C# programmer, I find Python pretty amazing and am always happy to learn. Where in C++ you do most things "the hard way", it seems with Python that there's already an existing library for whatever you need. Maybe my code is preferred for its obviousness/readability, whereas yours may be considered "obscure" as most of the work is done by itertools and functools? Or maybe your answer needs a more in-depth explanation - most of your answer is about comparing the two codes. – 001 Nov 30 '21 at 15:27
  • +1. I don't use itertools or functools too much so I had to spend some time researching what those functions do. I have learned quite a bit and will be utilizing those libs more. Thanks. – 001 Dec 01 '21 at 13:58
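
For readers who, like the commenter, have not used these functions much, the following sketch breaks the first two lines of `numbers` apart. The chunk size of 4 and the `io.StringIO` input are assumptions chosen only to keep the output small.

```python
import io
from functools import partial
from itertools import chain

f = io.StringIO("12345 6")

# partial(f.read, 4) is a zero-argument callable equivalent to f.read(4).
# iter(callable, sentinel) keeps calling it until it returns the sentinel,
# here the empty string that read() returns at end of file.
chunks = list(iter(partial(f.read, 4), ''))
print(chunks)
# ['1234', '5 6']

# chain.from_iterable flattens the chunks into one stream of characters,
# so a number split across two chunks is still seen as one run of digits.
chars = list(chain.from_iterable(chunks))
print(''.join(chars))
# 12345 6
```

In the answer itself the chunks are not collected into a list; they are consumed lazily, so only one chunk at a time has to be held in memory.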