3

I have about 50 million lists of strings in Python like this one:

["1", "1.0", "", "foobar", "3.0", ...]

And I need to turn these into a list of floats and Nones like this one:

[1.0, 1.0, None, None, 3.0, ...]

Currently I use some code like:

def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None

result = []
for record in database:
    result.append(map(to_float_or_None, record))

The to_float_or_None function is taking in total about 750 seconds (according to cProfile)... Is there a faster way to perform this conversion from a list of strings to a list of floats/Nones?

Update
I had identified the to_float_or_None function as the main bottleneck. I cannot find a significant difference in speed between using map and using list comprehensions. I applied Paulo Scardine's tip to check the input before converting, and that alone saves about 1/4 of the time.

def to_float_or_None(x):
    if not(x and x[0] in "0123456789."):
        return None
    try:
        return float(x)
    except ValueError:
        return None

The use of generators was new to me, so thank you for the tip Cpfohl and Lattyware! This indeed speeds up the reading of the file even more, but I was hoping to save some memory by converting the strings to floats/Nones.

Tader
  • 2
    Maybe you can run it in parallel over several lists (if you have >1 cores). – Lev Levitsky Mar 29 '12 at 12:55
  • 3
    Your last 3 lines can be written as `result = map(to_float_or_None, database)` – Paolo Moretti Mar 29 '12 at 13:02
  • 2
    `[ float(x) if x and x[0] in "0123456789" else None for x in yourlist ]` – Paulo Scardine Mar 29 '12 at 13:05
  • @PaoloMoretti Why use map when a list comprehension is [faster](http://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loops) and nicer to read? – Gareth Latty Mar 29 '12 at 13:08
  • @PauloScardine what about `"123helloEXception"`? – Lev Levitsky Mar 29 '12 at 13:08
  • 1
    @PauloScardine That's a hackish solution. What about this: ``["1", "1.0", "", "5 foobar", "3.0", ...]``? It'll throw an exception with your solution. – Gareth Latty Mar 29 '12 at 13:09
  • 2
    @Lattyware Faster? I don't think so. map may be faster in some cases, list comprehensions in other cases. See [Python List Comprehension Vs. Map](http://stackoverflow.com/questions/1247486/python-list-comprehension-vs-map) – Paolo Moretti Mar 29 '12 at 13:15
  • @PaoloMoretti Hrm, that contradicts what is said in the link I posted, but as your link provides concrete tests, I'd take that as a better source. – Gareth Latty Mar 29 '12 at 13:21
  • @LevLevitsky and others commenting about a hackish solution: that is why it's a comment instead of an answer ;-) - maybe the OP can elaborate from there. – Paulo Scardine Mar 29 '12 at 13:38
  • @PauloScardine The test `x and x[0] in "0123456789."` already made the to_float_or_None 1/4 faster :) (I still kept the try/except for the reasons mentioned by @Lattyware) – Tader Mar 29 '12 at 15:20

4 Answers

2

Edit: I have just realised I misread the question, and we are talking about a list of lists, not just a list. Updated to suit that.

You can use a list comprehension here to produce something a bit faster and nicer to read:

def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None

database = [["1", "1.0", "", "foobar", "3.0"], ["1", "1.0", "", "foobar", "3.0"]]

result = [[to_float_or_None(item) for item in record] for record in database]

Giving us:

[[1.0, 1.0, None, None, 3.0], [1.0, 1.0, None, None, 3.0]]

Edit: As noted by Paolo Moretti in the comments, if you want the absolute fastest result, then using map may be faster as we are not using a lambda function:

def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None

database = [["1", "1.0", "", "foobar", "3.0"], ["1", "1.0", "", "foobar", "3.0"]]

result = [list(map(to_float_or_None, record)) for record in database]

Giving us the same result. I would note, however, that premature optimization is a bad thing. If you have identified this as a bottleneck in your application, then fair enough, but otherwise prefer the more readable version over the faster one.

We still use a list comprehension for the outer loop, as we would need a lambda function to use map again given its dependence on record:

result = list(map(lambda record: list(map(to_float_or_None, record)), database))

Naturally, if you want to evaluate these lazily, you can use generator expressions:

((to_float_or_None(item) for item in record) for record in database)

Or:

(map(to_float_or_None, record) for record in database)

This would be the preferred method unless you need the entire list at once.
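Whether map or a list comprehension is actually faster for this data is best settled by measuring. Here is a minimal sketch using the standard-library timeit module. The sample database and the helper names with_listcomp and with_map are made up for illustration, so swap in a slice of the real records before trusting the numbers:

```python
import timeit


def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None


# Hypothetical sample data; substitute a slice of the real records.
database = [["1", "1.0", "", "foobar", "3.0"]] * 1000


def with_listcomp():
    return [[to_float_or_None(item) for item in record] for record in database]


def with_map():
    return [list(map(to_float_or_None, record)) for record in database]


# Both variants must agree before their timings are worth comparing.
assert with_listcomp() == with_map()

for func in (with_listcomp, with_map):
    # Take the best of several repeats to reduce noise from other processes.
    best = min(timeit.repeat(func, number=10, repeat=3))
    print("{}: {:.3f} s".format(func.__name__, best))
```

If the two timings come out close, most of the time is being spent inside to_float_or_None itself rather than in the loop around it.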

Gareth Latty
  • It would be nice to see some actual benchmarks. On my simple test case this is slightly slower than the OP's version. – NPE Mar 29 '12 at 13:10
  • For one, readability should trump speed unless you can prove that it's a bottleneck. Maybe in a few cases you have found a list comp will be slower than a loop, but in general the list comp will be faster - see my link in my answer for why that is. – Gareth Latty Mar 29 '12 at 13:15
  • 2
    The question is specifically about making things faster. I quote: "Is there a **faster** way to perform this conversion...?" (emphasis mine). – NPE Mar 29 '12 at 13:17
  • This is true, but my point still stands - list comps should be faster as the loop is implemented at a lower level, reducing the overhead. This may not be true in a few cases, but in general, it should be the fastest method. – Gareth Latty Mar 29 '12 at 13:19
  • 1
    You seem to assert that yours is the fastest method. Please show some actual benchmarks to back this up with regards to this specific question. – NPE Mar 29 '12 at 13:22
  • I have not seen any real difference between the `map` solution and the list comprehension. Thank you for introducing me to generator expressions! – Tader Mar 29 '12 at 15:56
2

I don't know about the performance aspect, but this should work for your case.

list_floats = [to_float_or_None(item) for item in original_list]
Nostradamnit
  • 1
    On my simple test case this is slightly slower than the OP's version. – NPE Mar 29 '12 at 13:10
  • I don't see any difference in speed between using a map or using list comprehension, most of the time is spent in the to_float_or_None function. – Tader Mar 29 '12 at 15:09
2

Or, if you really have that much data in lists, maybe use something like pandas Series and apply() a lambda function to convert:

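import re

import pandas

inlist = ["1", "1.0", "", "foobar", "3.0"] # or however long...
series = pandas.Series(inlist)
# Raw-string pattern: digits, optionally followed by a single fractional part,
# so that a string like "1.2.3" is rejected instead of crashing float().
series.apply(lambda x: float(x) if re.match(r"^\d+(\.\d+)?$", x) else None)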

Out[41]: 
0     1
1     1
2   NaN
3   NaN
4     3

Lots of other advantages - not least for specifying afterwards how you want to handle those missing values...

timlukins
  • Thank you for pointing me to pandas. For my current issue, I do not think (have not checked, sorry) that it will be faster than a map over the list. I will consider it by the next implementation :) – Tader Mar 29 '12 at 15:53
2

The answers given thus far don't fully answer the question. try/except versus a validating if/else can result in different performance (see: https://stackoverflow.com/a/5591737/456188). To summarize that answer: it depends on the ratio of failures to successes and on the MEASURED time of a failure and of a success in both cases. Basically we can't answer this for you, but we can tell you how to find out:

  1. Look at a few representative cases to get a ratio.
  2. Write an if/else check that tests the same condition as the try/except, optimize it, and then measure how long both versions of to_float_or_None take to fail 100 times and to succeed 100 times.
  3. Do a little math to figure out which will be faster.
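The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: with_try and with_check are hypothetical names for the two versions, the check is the first-character test from the question's update (which assumes valid values never start with a sign or whitespace), and the failure ratio of 0.4 is invented and should be replaced by the ratio from step 1:

```python
import timeit


def with_try(x):
    # "Ask forgiveness": cheap on success, pays for the exception on failure.
    try:
        return float(x)
    except ValueError:
        return None


def with_check(x):
    # "Ask permission": the first-character test from the question's update.
    # Assumption: valid values never start with a sign or whitespace.
    if not (x and x[0] in "0123456789."):
        return None
    try:
        return float(x)
    except ValueError:
        return None


def seconds_per_call(func, value, number=100000):
    # Step 2: measure one kind of input (a success or a failure) in isolation.
    return timeit.timeit(lambda: func(value), number=number) / number


def expected_cost(func, failure_ratio, good="1.0", bad="foobar"):
    # Step 3: weight the measured success and failure times by the ratio.
    return ((1 - failure_ratio) * seconds_per_call(func, good)
            + failure_ratio * seconds_per_call(func, bad))


failure_ratio = 0.4  # hypothetical; use the ratio found in step 1
for func in (with_try, with_check):
    print(func.__name__, expected_cost(func, failure_ratio))
```

Whichever version prints the smaller expected cost per call should be the faster one for data with that failure ratio.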

Side note about the list comprehension issue:

Depending on whether you want to be able to index the results, or just want to iterate over them, a generator expression may actually be even better than a list comprehension (just replace the [ ] characters with ( ) characters).

It takes essentially no time to create, and the actual execution of to_float_or_None (which is the expensive part) can be delayed until the result is needed.

This is useful for many reasons, but won't work if you're going to need to index it. It will, however, allow you to zip the original collection with the generator so you can still have access to the original string along with its float_or_none result.
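A small sketch of that zip idea, using the sample record from the question (the variable names converted and pairs are just for illustration):

```python
def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None


record = ["1", "1.0", "", "foobar", "3.0"]

# Nothing is converted yet; the generator only remembers what to do.
converted = (to_float_or_None(item) for item in record)

# zip pulls one value at a time from the generator, pairing it with its source.
pairs = list(zip(record, converted))
print(pairs)
# [('1', 1.0), ('1.0', 1.0), ('', None), ('foobar', None), ('3.0', 3.0)]
```

Note that a generator can only be consumed once: after the zip above, converted is exhausted and would have to be recreated to iterate again.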

Chris Pfohl
  • This is true - the _ask for forgiveness, not permission_ mantra only works if the exception is the exception to the rule. In the majority of cases this will be true - and I think, given the examples, we can presume so for the case as well, but if that isn't so, then it may be worth validating first, rather than catching the exception upon failure. – Gareth Latty Mar 29 '12 at 13:54
  • This is the only answer that does not only invite me to rewrite the "loop", but addresses the actual function. Indeed checking saved me time! – Tader Mar 29 '12 at 15:50