An efficient way of doing it would be your first method with one small change: open the file with a with
statement so that it is closed automatically. Example -
my_list = []
with open("foo.txt", "r") as f:
    for line in f:
        # split() with no argument splits on any run of whitespace
        for i in line.split():
            if i.isdigit():
                my_list.append(int(i))
Timing tests comparing this against other methods -
The functions -
import csv  # needed by func3

def func1():
    # Original approach: open() without a with statement, split on single spaces
    my_list = []
    for line in open("foo.txt", "r"):
        for i in line.strip().split(' '):
            if i.isdigit():
                my_list.append(int(i))
    return my_list

def func1_1():
    # Same as func1, written as a list comprehension
    return [int(i) for line in open("foo.txt", "r") for i in line.strip().split(' ') if i.isdigit()]

def func1_3():
    # with statement plus split() on any whitespace
    my_list = []
    with open("foo.txt", "r") as f:
        for line in f:
            for i in line.split():
                if i.isdigit():
                    my_list.append(int(i))
    return my_list

def func2():
    # try/except instead of an isdigit() check
    my_list = []
    for line in open("foo.txt", "r"):
        for i in line.split():
            try:
                my_list.append(int(i))
            except ValueError:
                pass
    return my_list

def func3():
    # csv.reader with a space delimiter
    my_list = []
    with open("foo.txt", "r") as f:
        cf = csv.reader(f, delimiter=' ')
        for row in cf:
            my_list.extend([int(i) for i in row if i.isdigit()])
    return my_list
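One caveat when comparing these functions: the isdigit() filter and the try/int() approach do not accept exactly the same tokens, so they can return different lists on some inputs. A small sketch of the difference (the helper names and the sample tokens are made up for illustration, they are not part of the functions above):

```python
def keep_isdigit(tokens):
    # Mirrors the isdigit() filter used in func1, func1_1, func1_3 and func3
    return [int(t) for t in tokens if t.isdigit()]

def keep_try_int(tokens):
    # Mirrors the try/except approach used in func2
    result = []
    for t in tokens:
        try:
            result.append(int(t))
        except ValueError:
            pass
    return result

sample = ["12", "-5", "+7", "3.5", "abc"]
only_digits = keep_isdigit(sample)   # signed tokens are rejected by isdigit()
via_int = keep_try_int(sample)       # int() accepts a leading sign
```

If the file can contain signed numbers, the two styles are therefore not interchangeable, regardless of which one is faster.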
Results of timing tests -
In [25]: timeit func1()
The slowest run took 4.70 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 204 µs per loop
In [26]: timeit func1_1()
The slowest run took 4.39 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 207 µs per loop
In [27]: timeit func1_3()
The slowest run took 5.46 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 191 µs per loop
In [28]: timeit func2()
The slowest run took 4.09 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 212 µs per loop
In [34]: timeit func3()
The slowest run took 4.38 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 202 µs per loop
Of the methods that store the data in a list, func1_3() is the fastest in the timeit results above, though only by a small margin.
That said, if you are really handling very large files, you may be better off using a generator rather than storing the complete list in memory.
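A minimal sketch of the generator idea, assuming the same whitespace-separated format as foo.txt. The function name iter_ints is hypothetical; the body is func1_3() with the list replaced by yield:

```python
def iter_ints(path):
    """Yield every all-digit token in the file as an int, one at a time."""
    with open(path, "r") as f:
        for line in f:
            for token in line.split():
                if token.isdigit():
                    # Hand back one value at a time instead of building a list
                    yield int(token)
```

It can be consumed lazily, for example with `sum(iter_ints("foo.txt"))` or a plain for loop, so only one line of the file is held in memory at a time. If a list is needed after all, `list(iter_ints("foo.txt"))` gives the same result as func1_3().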
UPDATE: It was said in the comments that func2() is faster than func1_3() (though on my system it was never faster than func1_3(), even for a file containing only integers). I updated foo.txt to contain things other than numbers and re-ran the timing tests -
foo.txt
1 2 10 11
asd dd
dds asda
22 44 32 11 23
dd dsa dds
21 12
12
33
45
dds
asdas
dasdasd dasd das d asda sda
Test -
In [13]: %timeit func1_3()
The slowest run took 6.17 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 210 µs per loop
In [14]: %timeit func2()
1000 loops, best of 3: 279 µs per loop
In [15]: %timeit func1_3()
1000 loops, best of 3: 213 µs per loop
In [16]: %timeit func2()
1000 loops, best of 3: 273 µs per loop
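A likely reason func2() falls behind on mixed input is that every non-numeric token raises and catches a ValueError, which tends to cost more than an isdigit() check. The sketch below is one way to check this outside IPython with the standard timeit module. It works on an in-memory token list rather than on foo.txt, and the names TOKENS, with_isdigit, with_try and best_time are assumptions made for the example:

```python
import timeit

# A mix of numeric and non-numeric tokens, similar to the updated foo.txt
TOKENS = "1 2 10 11 asd dd dds asda 22 44 32 11 23 dd dsa dds".split()

def with_isdigit():
    return [int(t) for t in TOKENS if t.isdigit()]

def with_try():
    out = []
    for t in TOKENS:
        try:
            out.append(int(t))
        except ValueError:
            # Taken once for every non-numeric token
            pass
    return out

def best_time(func, number=10000, repeat=3):
    """Return the best per-call time in seconds over `repeat` runs."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number
```

Calling `best_time(with_isdigit)` and `best_time(with_try)` gives per-call times that can be compared directly. The absolute numbers depend on the machine and Python version, so treat them as relative only.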