Most efficient way to convert items of a list to int and sum them up

Question

I'm doing something like this to sum up a number of elements of a line:

for line in open(filename, 'r'):
    big_list = line.strip().split(delim)
    a = sum(int(float(item)) for item in big_list[start:end] if item)
    # do some other stuff

This is done line by line on a big file, where some items may be missing, i.e., equal to ''. When I use the statement above to compute a, the script becomes much slower than without it. Is there a way to speed it up?

what kind of data are you summing, are there actual floats like 3.4? — Padraic Cunningham, Aug 19 '14 at 16:54
Some quick `timeit` testing suggests that I may have given you bad advice on skipping the `[]`s - it's slightly quicker to pass a list to `sum`. Gains there will probably be marginal, though. — jonrsharpe, Aug 19 '14 at 17:05
@Padraic I'm truncating for other reasons as well, but even if I only use float, it still takes a long time to complete. I think the main problem is summation of the elements of the list — Bob, Aug 19 '14 at 17:30
It is about 40 percent faster just using float on my timings — Padraic Cunningham, Aug 19 '14 at 17:31
try `sum(map(float,filter(None,big_list)))`, where are your empty strings coming from anyway? — Padraic Cunningham, Aug 19 '14 at 17:38
also are you actually calculating the sum of each line or all lines? — Padraic Cunningham, Aug 19 '14 at 17:48
I'm computing the sum every time I read a new line. Because of the way the input is, some tokens in the line may be empty strings — Bob, Aug 19 '14 at 17:54
no, I'm not. using map improved a little bit, not much but didn't hurt — Bob, Aug 19 '14 at 18:14
What are you doing with `a` each time? Can you add a sample of your input? There may be a faster way using numpy — Padraic Cunningham, Aug 19 '14 at 18:15
There are several problems. I'm not sure whether numpy or pandas can handle very large files that won't fit in memory. Assuming they do, I have several quite complex conditions that decide whether I process a line or not. So I don't think I can use them — Bob, Aug 19 '14 at 18:20
can you add the content of the file ? I think we are re-inventing the wheel here — fabrizioM, Aug 19 '14 at 18:44

score 0 · Answer 1 · edited May 23 '17 at 12:22

As Padraic commented, use filter to trim out empty strings, then drop "if item":

>>> import timeit
>>> timeit.timeit("sum(int(float(item)) for item in ['','3.4','','','1.0'] if item)",number=10000)
0.04612559381553183
>>> timeit.timeit("sum(int(float(item)) for item in filter(None, ['','3.4','','','1.0']))",number=10000)
0.04827789913997549
>>> sum(int(float(item)) for item in filter(None, ['','3.4','','','1.0']))
4
>>>

In this small example filter is counterproductive (slightly slower), but it might reduce the run time on your data. Measure to see.
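To measure on something closer to a real line, a rough sketch like the one below could be used. The sample line, the "," delimiter, and the function names are made up for illustration; the third variant is the `map`/`filter` form suggested in the comments, which sums floats and so does not truncate each item the way `int(float(item))` does.

```python
import timeit

# Hypothetical sample line; the real file's delimiter and width are unknown.
delim = ","
line = delim.join(["3.4", "", "1.0", "", "12.75"] * 200)
big_list = line.strip().split(delim)


def gen_with_if(items):
    # Original approach: generator expression with an inline emptiness test.
    return sum(int(float(item)) for item in items if item)


def gen_with_filter(items):
    # filter(None, ...) drops the empty strings before the conversion.
    return sum(int(float(item)) for item in filter(None, items))


def map_float(items):
    # Sums the floats directly; not equivalent when truncation matters.
    return sum(map(float, filter(None, items)))


for fn in (gen_with_if, gen_with_filter, map_float):
    seconds = timeit.timeit(lambda: fn(big_list), number=200)
    print(f"{fn.__name__}: {seconds:.4f} s")
```

The absolute numbers depend on the machine and the Python version, so only the relative ordering on your own data is meaningful.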

see also this answer

score 0 · Answer 2 · answered Aug 19 '14 at 18:40

This isn't tested, but intuitively I would expect that skipping the intermediate float conversion helps. You want the integer to the left of the decimal point, so I would try grabbing it directly with a regular expression:

import re

pattern = re.compile(r"\d+")  # raw string, so \d is not treated as a string escape

Then replace the float parsing with the regex match:

sum(int(pattern.search(item).group(0)) for item in big_list[start:end] if item)
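Before swapping the regex in, it may be worth checking that it agrees with `int(float(item))` on the kinds of values in the file. The sketch below compares the two on a few hand-picked strings; the helper names `via_float` and `via_regex` are hypothetical, and the fallback to 0 when nothing matches is an assumption, not part of the answer above.

```python
import re

pattern = re.compile(r"\d+")


def via_float(item):
    # Reference behaviour from the question: parse as float, then truncate.
    return int(float(item))


def via_regex(item):
    match = pattern.search(item)
    # search() returns None when the item contains no digits at all;
    # falling back to 0 here is an assumption made for this sketch.
    return int(match.group(0)) if match else 0


for item in ["3.4", "10", "-2.7", ".5", "1e3"]:
    print(repr(item), via_float(item), via_regex(item))
```

The two agree on plain non-negative decimals such as "3.4" and "10", but diverge on a negative value ("-2.7"), a value with no leading digit (".5"), and exponent notation ("1e3"), so the regex is only a safe replacement if such values never occur.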

If you don't need to keep the old decimal strings, you could also get these on the fly as you build big_list. For example, say we have the line "6.0,,1.2,3.0,". We could get matches like this:

delim = ","
pattern = re.compile(r"(\d+)\.\d+|" + re.escape(delim) + re.escape(delim) + r"|$")

The results of this pattern on the line would be: ['6', '', '1', '3', ''], which could then be sliced and filtered as usual without the need of float parsing:

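with open(filename, 'r') as f:
    for line in f:
        # Strip the newline first; otherwise "$" matches twice (before and after it).
        big_list = pattern.findall(line.rstrip("\n"))
        a = sum(int(item) for item in big_list[start:end] if item)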
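As a quick check, the pattern can be run on the sample line from above. This is only a sketch: the `start` and `end` values are made up for illustration.

```python
import re

delim = ","
pattern = re.compile(r"(\d+)\.\d+|" + re.escape(delim) + re.escape(delim) + r"|$")

line = "6.0,,1.2,3.0,"
# findall() returns group 1 per match, or '' when the group did not participate.
big_list = pattern.findall(line)
print(big_list)

start, end = 0, 4  # hypothetical slice bounds
a = sum(int(item) for item in big_list[start:end] if item)
print(a)
```

Note that the first alternative requires a decimal point, so a bare integer field such as "7" would be skipped entirely, and a run of three or more delimiters does not produce one entry per empty field. If the file can contain either, positions in `big_list` will no longer line up with the columns.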

actually, I removed casting to int. I'm just checking that there is at least one non-zero element. Otherwise, I drop the line. — Bob, Aug 19 '14 at 20:11
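
If the goal is only the check described in the comment above, one possible sketch uses `any()`, which stops at the first non-zero value instead of converting and summing the whole slice. The helper name `has_nonzero` is hypothetical, and the sketch assumes every non-empty item parses as a float.

```python
def has_nonzero(items):
    # any() short-circuits: items after the first non-zero value are never parsed.
    return any(float(item) != 0 for item in items if item)


print(has_nonzero(["", "0.0", "", "3.4", "1.0"]))
print(has_nonzero(["", "0.0", "0", ""]))
```

Without the int cast, a value such as "0.4" counts as non-zero here, whereas `int(float("0.4"))` would have truncated it to 0.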
