efficient way of reading integers from file

Question

I'd like to read all integers from a file into a single list. The numbers are separated by one or more spaces or newline characters. What is the most efficient and/or elegant way of doing this? I have two solutions, but I don't know whether they are any good.

Checking for digits:

for line in open("foo.txt", "r"):
    for i in line.strip().split(' '):
        if i.isdigit():
            my_list.append(int(i))

Dealing with exceptions:

for line in open("foo.txt", "r"):
    for i in line:
        try:
            my_list.append(int(i))
        except ValueError:
            pass

Sample data:

1   2     3
 4 56
    789         
9          91 56   

 10 
11

This is how I will probably do it `with open('foo.txt') as f: my_list = [int(i) for i in f if i.isdigit()]` — styvane, Jul 31 '15 at 09:13
@user3100115, It will not work because of trailing newlines. — falsetru, Jul 31 '15 at 09:15
I would prefer #2 over #1—`int()` validates the string you give it anyway, so validating yourself before calling `int()` just wastes time. — Blacklight Shining, Aug 01 '15 at 17:02
…except that [#2 doesn't actually do the same thing as #1](https://stackoverflow.com/questions/31742326/efficient-way-of-reading-integers-from-file#comment51460227_31742986). Take another look at it—it iterates over every character in each line and tries to add it to the list. — Blacklight Shining, Aug 01 '15 at 17:22

Anand S Kumar · Accepted Answer · 2015-08-01T17:11:50.400

An efficient way of doing it would be your first method, with the small change of using a with statement to open the file. Example -

with open("foo.txt", "r") as f:
    for line in f:
        for i in line.split():
            if i.isdigit():
                my_list.append(int(i))

Timing tests done with comparisons to other methods -

The functions -

def func1():
    my_list = []
    for line in open("foo.txt", "r"):
        for i in line.strip().split(' '):
            if i.isdigit():
                my_list.append(int(i))
    return my_list

def func1_1():
    return [int(i) for line in open("foo.txt", "r") for i in line.strip().split(' ') if i.isdigit()]

def func1_3():
    my_list = []
    with open("foo.txt", "r") as f:
        for line in f:
            for i in line.split():
                if i.isdigit():
                    my_list.append(int(i))
    return my_list

def func2():            
    my_list = []            
    for line in open("foo.txt", "r"):
        for i in line.split():
            try:
                my_list.append(int(i))
            except ValueError:
                pass
    return my_list

import csv  # needed by csv.reader below

def func3():
    my_list = []
    with open("foo.txt","r") as f:
        cf = csv.reader(f, delimiter=' ')
        for row in cf:
            my_list.extend([int(i) for i in row if i.isdigit()])
    return my_list

Results of timing tests -

In [25]: timeit func1()
The slowest run took 4.70 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 204 µs per loop

In [26]: timeit func1_1()
The slowest run took 4.39 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 207 µs per loop

In [27]: timeit func1_3()
The slowest run took 5.46 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 191 µs per loop

In [28]: timeit func2()
The slowest run took 4.09 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 212 µs per loop

In [34]: timeit func3()
The slowest run took 4.38 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 202 µs per loop

Of the methods that store the data in a list, I believe func1_3() above is the fastest (as shown by timeit).

That said, if you are really handling very large files, you may be better off using a generator rather than storing the complete list in memory.
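A minimal sketch of the generator idea, assuming the same token rules as func1_3(). The name iter_ints and the io.StringIO stand-in for the file are illustrative, not part of the timed code above:

```python
import io

def iter_ints(lines):
    """Yield every all-digit token from an iterable of text lines as an int."""
    for line in lines:
        for token in line.split():
            if token.isdigit():
                yield int(token)

# io.StringIO stands in for an open file here. With a real file you would write:
#     with open("foo.txt") as f:
#         total = sum(iter_ints(f))
sample = io.StringIO("1   2     3\n 4 56\n    789\n\n 10 \n11\n")

# The numbers are consumed one at a time; no intermediate list is built.
total = sum(iter_ints(sample))
```

Because the generator reads lazily, it has to be consumed while the file is still open, i.e. inside the with block.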

UPDATE: Since it was said in the comments that func2() is faster than func1_3() (though on my system it was never faster than func1_3(), even with integers only), I updated foo.txt to contain things other than numbers and ran the timing tests again -

foo.txt

1 2 10 11
asd dd
 dds asda
22 44 32 11   23
dd dsa dds
21 12
12
33
45
dds
asdas
dasdasd dasd das d asda sda

Test -

In [13]: %timeit func1_3()
The slowest run took 6.17 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 210 µs per loop

In [14]: %timeit func2()
1000 loops, best of 3: 279 µs per loop

In [15]: %timeit func1_3()
1000 loops, best of 3: 213 µs per loop

In [16]: %timeit func2()
1000 loops, best of 3: 273 µs per loop

Your `func2()` iterates over individual digits, rather than tokens as `func1()` does. Profiling on my system finds that a correct `func2()` (which I'll call `func2_1()`) is actually _faster_ than `func1()` and `func1_3()`. — Blacklight Shining, Aug 01 '15 at 17:00
What is a correct `func2()`? It's not my `func2()`; please check the OP's question, it's his code. — Anand S Kumar, Aug 01 '15 at 17:01
Anyway, I updated the post with the correct `func2()`; it is still slower on my system. — Anand S Kumar, Aug 01 '15 at 17:05
And this is despite the fact that the data did not contain anything other than numbers. If there were things other than numbers, `func2()` would be much slower than it is right now, since raising and catching exceptions is costly. — Anand S Kumar, Aug 01 '15 at 17:07
@BlacklightShining Updated the post with the timing results when `foo.txt` contains letters and words. — Anand S Kumar, Aug 01 '15 at 17:12
Also, while some implementations (CPython?) will delete your file object immediately as `func*()` returns, other implementations may keep it around for longer. If the file was written to, this means that it might not get flushed, which is a Bad Thing™. [You should really get into the habit of using `with`](http://blog.lerner.co.il/dont-use-python-close-files-answer-depends/), rather than relying on the interpreter to destroy out-of-scope file objects. — Blacklight Shining, Aug 01 '15 at 17:18

SuperBiasedMan · Answer 2 · 2015-07-31T13:20:06.620

It's pretty easy if you can read the whole file as a string (i.e. it's not too large to do that).

fileStr = open('foo.txt').read().split() 
integers = [int(x) for x in fileStr if x.isdigit()]

read() returns the file contents as one long string, and split() breaks that string into a list of strings on whitespace (i.e. spaces and newlines). So you can combine that with a list comprehension that converts the strings to integers if they consist of digits.

As Bakuriu noted, if the file is guaranteed to only have whitespace and numbers, then you don't have to check for isdigit(). Using list(map(int, open('foo.txt').read().split())) would be enough in that case. That method will raise errors if anything is an invalid integer whereas the other will skip anything that isn't a recognised digit.
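To make the difference between the two variants concrete, here is a small sketch that works on an in-memory string instead of foo.txt. The function names ints_filtered and ints_strict are made up for this illustration:

```python
def ints_filtered(text):
    """Keep only tokens made up entirely of digits; silently skip the rest."""
    return [int(tok) for tok in text.split() if tok.isdigit()]

def ints_strict(text):
    """Convert every token; raises ValueError on the first invalid one."""
    return list(map(int, text.split()))

text = "1 -2 +3 x4\n"

# str.isdigit() is False for signed tokens, so "-2" and "+3" are skipped too.
filtered = ints_filtered(text)

# int() accepts a leading sign, so the strict variant keeps negatives.
signed = ints_strict("1 -2 +3\n")
```

One consequence worth noting: the isdigit() filter drops negative numbers, whereas the map(int, ...) variant keeps them but fails on a token such as "x4".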

score 4 · Answer 3 · answered Jul 31 '15 at 10:29

Thank you all. I've mixed some solutions you posted. This seems very good to me:

with open("foo.txt","r") as f:
    my_list = [int(i)  for line in f for i in line.split() if i.isdigit()]

Marcel

It's more concise than using try-except (those aren't supported in comprehensions…_yet_), but less _efficient_ than it could be because you're duplicating the work `int()` does by calling `str.isdigit()`. – Blacklight Shining Aug 01 '15 at 17:04
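One possible way to keep a comprehension while letting int() do the validation is to move the try/except into a small helper. A minimal sketch; to_int and parse_ints are hypothetical names, not code from this thread:

```python
def to_int(token):
    """Return int(token), or None if the token is not a valid integer."""
    try:
        return int(token)
    except ValueError:
        return None

def parse_ints(lines):
    """Collect every valid integer from an iterable of text lines."""
    parsed = (to_int(tok) for line in lines for tok in line.split())
    # Compare against None explicitly so that a parsed 0 is not dropped.
    return [n for n in parsed if n is not None]

numbers = parse_ints(["1 a -2\n", " 0 3\n"])
```

An open file object can be passed as lines, for example parse_ints(f) inside a with open("foo.txt") as f: block. Whether this is faster than the isdigit() check likely depends on how many invalid tokens the file contains.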

The6thSense · Answer 4 · 2015-07-31T10:09:33.993

You could do it like this, using a list comprehension:

my_list = [int(i)  for j in open("1.txt","r") for i in j.strip().split(" ") if i.isdigit()]

Or with a with statement:

with open("1.txt","r") as f:
    my_list = [int(i)  for j in f for i in j.strip().split(" ") if i.isdigit()]

process:

1. First, iterate over the lines of the file.

2. Then iterate over the words in each line and check whether each one consists of digits; if so, add it to the list.

edit:

You need to call strip() on each line because every line (except possibly the last) ends with a newline character ("\n"), and "number\n".isdigit() returns False.

i.e.

>>> "1\n".isdigit()
False

edit2:

Input:

1
qw 2
23 we 32

File data when read:

a=open("1.txt","r")

repr(a.read())
"'1\\nqw 2\\n23 we 32'"

You can see the "\n" newline characters; they will affect the process.

When I run the comprehension without strip(), it does not treat 1 and 2 as digits, because those tokens still contain the newline character:

my_list = [int(i)  for j in open("1.txt","r") for i in j.split(" ") if i.isdigit()]
my_list
[23, 32]

From the output it is clear that 1 and 2 are missing. This can be avoided by using strip().
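The behaviour described above comes down to how str.split treats its argument. A short sketch on in-memory strings (the variable names are only for illustration):

```python
line = "qw 2\n"

# Splitting on a single space leaves the newline attached to the last token.
by_space = line.split(" ")            # ['qw', '2\n']

# Stripping first removes the trailing newline before splitting.
stripped = line.strip().split(" ")    # ['qw', '2']

# split() with no argument splits on any run of whitespace, newline included.
by_whitespace = line.split()          # ['qw', '2']

# With several spaces in a row, split(" ") also produces empty strings.
padded = "1   2\n".strip().split(" ")  # ['1', '', '', '2']
```

The empty strings in the last case are harmless here because "".isdigit() is False, but they are extra tokens to examine; split() with no argument avoids both issues.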

I've changed it a little and it seems good to me: ` with open("foo.txt","r") as f: my_list = [int(i) for line in f for i in line.split() if i.isdigit()] ` — Marcel, Jul 31 '15 at 09:57
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/84782/discussion-between-vignesh-kalai-and-marcel). — The6thSense, Jul 31 '15 at 10:17

score 3 · Answer 5 · answered Jul 31 '15 at 09:19

Why not use the yield keyword? The code would be:

def readInt():
    for line in open("foo.txt", "r"):
        for i in line.strip().split(' '):
            if i.isdigit():
                yield int(i)

Then you can read the numbers:

my_list = []
for num in readInt():
    my_list.append(num)

score 3 · Answer 6 · answered Jul 31 '15 at 09:19

my_list = []
with open('foo.txt') as f:
    for line in f:
        for s in line.split():
            try:
                my_list.append(int(s))
            except ValueError:
                pass

simleo

Totem · Answer 7 · 2015-07-31T10:31:33.427

Try this:

with open('file.txt') as f:
    nums = []
    for l in f:
        l = l.strip()
        nums.extend([int(i) for i in l.split() if i.isdigit() and l])

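l.strip() removes the trailing newline ('\n'), since '6\n'.isdigit() returns False. (l.split() with no arguments already discards surrounding whitespace, so the strip is a safeguard here.)

list.extend comes in handy here, as it appends every item of the comprehension's result to nums.

The and l at the end is meant to skip lines that are empty after stripping; an empty line yields no tokens from split() anyway.

str.split splits on whitespace by default, and the with block automatically closes the file once the code inside it has run. I've also made use of a list comprehension.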

Dunes · Answer 8 · 2015-08-01T17:40:42.313

This was the fastest way I found:

import re
regex = re.compile(r"\D+")

with open("foo.txt", "r") as f:
    my_list = list(map(int, regex.split(f.read())))

Though the results could depend on the size of the file.
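One edge case to be aware of with the split-based version: splitting on `\D+` yields an empty string when the text starts or ends with a non-digit (a trailing newline, for instance), and int("") raises ValueError. A hedged alternative sketch using re.findall, which returns only the digit runs; ints_from_text is a made-up name and this variant has not been timed against the others:

```python
import re

DIGITS = re.compile(r"\d+")

def ints_from_text(text):
    """Return every run of digits in text as an int."""
    return [int(match) for match in DIGITS.findall(text)]

text = "1   2     3\n 4 56\n"

# The split-based approach leaves a trailing empty string for this input.
split_tokens = re.split(r"\D+", text)

numbers = ints_from_text(text)
```

Note that both regex approaches pull digits out of mixed tokens (for example "x4" gives 4), which differs from the isdigit() and try/except versions that skip such tokens entirely.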
