
I need to convert all strings in a large array to int or float types, if they can be converted. Usually, people suggest a try-except or regex approach (like in Checking if a string can be converted to float in Python), but both turn out to be very slow.

The question is: how to write that code the fastest way possible?

I found that strings have an .isdigit() method. Is there something like that for floats?

Here is the current (slow) code.

    import numpy as np

    def convert(lines):
        result = []
        for line in lines:
            resline = []
            for item in line:
                try:
                    # Try int first; an int string would also parse as a float
                    resline.append(int(item))
                except ValueError:
                    try:
                        resline.append(float(item))
                    except ValueError:
                        # Not a number at all; keep the original string
                        resline.append(item)
            result.append(resline)
        return np.array(result)

There is also some evidence (https://stackoverflow.com/a/2356970/3642151) that regex approach is even slower.

– DLunin
    possible duplicate of [How to check whether string might be type-cast to float in Python?](http://stackoverflow.com/questions/2356925/how-to-check-whether-string-might-be-type-cast-to-float-in-python) – Nir Alfasi Jul 26 '14 at 20:31
  • Could you provide an example of the array? – Jon Clements Jul 26 '14 at 20:33
  • @JonClements http://www.kaggle.com/c/digit-recognizer/data train.csv or test.csv – DLunin Jul 26 '14 at 20:34
  • @wrwt thanks - not quite sure I fancy downloading a 40mb+ file to assist :) – Jon Clements Jul 26 '14 at 20:40
  • I'm not sure why you call this code "slow". Did you actually do any kind of profiling? Mind to share the stats? I'd bet int/float + try-except approach is fast enough. – KurzedMetal Jul 26 '14 at 20:48
  • @KurzedMetal Yes, I did. – DLunin Jul 26 '14 at 20:49
  • @KurzedMetal I can't give the actual profiling results now, but it took about 20 seconds for converting 3 arrays of approx. 800x100 elements each. – DLunin Jul 26 '14 at 20:54
  • FWIW, I have run into the same problem before, and I found that the answer is **entirely dependent on your input data**. If your input is mostly strings that can be converted to floats, the try/except method is fastest. If most of them cannot, the regex method is fastest. If there is no way to tell beforehand, choose the one that makes you feel good inside. – SethMMorton Jul 26 '14 at 21:06
  • @SethMMorton I believe there should be better ways of doing such a simple task, aside from slowpoke exceptions, and regex, which is clearly an overkill – DLunin Jul 26 '14 at 21:10
  • Well, you've clearly found a hole in the python ecosystem. I'll bet you could write a pretty popular library if you can figure out a *general* way to do this quickly. I'd use it! – SethMMorton Jul 26 '14 at 21:12
  • If you familiarize yourself with the Python C-API, you will be able to imagine why this is a slow process. As far as I can tell, any hope of making this fast will likely be a C extension. – SethMMorton Jul 26 '14 at 21:14
  • @SethMMorton Yep, I thought of a C extension, but it's an overkill from another viewpoint :) – DLunin Jul 26 '14 at 21:21
  • For sure, but if you upload a library to PyPI that 10000 people use, it's probably not overkill anymore. Either way, best of luck. – SethMMorton Jul 26 '14 at 21:23
  • IMO this is an XY problem, I actually ran some profiling myself and try-except/float/int add almost no overhead. The inefficiency is somewhere else, like Seth said, probably on how you process your input data. – KurzedMetal Jul 26 '14 at 21:54

3 Answers

Your return value shows you are using NumPy, so you should be using np.loadtxt or np.genfromtxt to load the lines into a NumPy array. With np.genfromtxt, the dtype=None parameter will automatically detect whether each string can be converted to a float or int.

np.loadtxt is faster and requires less memory than np.genfromtxt, but it requires you to specify the dtype -- there is no dtype=None automatic-dtype-detection option. See Joe Kington's post for a comparison.

If you find that loading the CSV with np.loadtxt or np.genfromtxt is still too slow, then Pandas' read_csv function is much faster, but (of course) it would require you to install Pandas first, and the result would be a Pandas DataFrame, not a NumPy array. DataFrames have many nice features (and can be converted into NumPy arrays), so you may find this to be an advantage not only in terms of loading speed but also for data manipulation.
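
A minimal sketch of both loading paths (assuming the Kaggle train.csv mentioned in the comments; the delimiter and options are illustrative, not prescriptive):

    import numpy as np
    import pandas as pd

    # genfromtxt with dtype=None guesses a per-column dtype automatically
    arr = np.genfromtxt('train.csv', delimiter=',', dtype=None, names=True)

    # pandas infers per-column dtypes too, and is typically much faster
    df = pd.read_csv('train.csv')
    arr2 = df.values  # back to a NumPy array if needed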


By the way, if you don't specify the dtype in the call

np.array(data)

then np.array uses a single dtype for all the data. If your data contains both ints and floats, then np.array will return an array with a float dtype:

In [91]: np.array([[1, 2.0]]).dtype
Out[91]: dtype('float64')

Even worse, if your data contains numbers and strings, np.array(data) will return an array of string dtype:

In [92]: np.array([[1, 2.0, 'Hi']]).dtype
Out[92]: dtype('S32')

So all the hard work you go through checking which strings are ints or floats gets destroyed in the very last line. np.genfromtxt(..., dtype=None) gets around this problem by returning a structured array (one with a heterogeneous dtype).
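
A small sketch of that behavior, using made-up CSV contents:

    import numpy as np
    from StringIO import StringIO  # io.StringIO on Python 3

    data = StringIO("1,2.5,Hi\n3,4.5,Bye")
    arr = np.genfromtxt(data, delimiter=',', dtype=None)
    # arr is a structured array; each column keeps its own dtype,
    # something like [('f0', '<i8'), ('f1', '<f8'), ('f2', 'S3')]
    # (exact field names/dtypes depend on your NumPy version),
    # so the ints and floats survive alongside the strings.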

– unutbu

Try profiling your Python script; you'll find that try...except, float, and int are not the most time-consuming calls in your script.

import random
import string
import cProfile

def profile_str2float(calls):
    # Convert `calls` random 100-character strings
    for x in xrange(calls):
        str2float(random_str(100))

def str2float(s):
    # The conversion under test: parse a float or return None
    try:
        return float(s)
    except ValueError:
        return None

def random_str(length):
    # Random lowercase string (Python 2: xrange, string.lowercase)
    return ''.join(random.choice(string.lowercase) for x in xrange(length))

cProfile.run('profile_str2float(10**5)', sort='cumtime')

Running this script I get the following results:

         40400003 function calls in 14.721 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   14.721   14.721 <string>:1(<module>)
        1    0.126    0.126   14.721   14.721 str2float.py:5(profile_str2float)
   100000    0.111    0.000   14.352    0.000 str2float.py:15(random_str)
   100000    1.413    0.000   14.241    0.000 {method 'join' of 'str' objects}
 10100000    4.393    0.000   12.829    0.000 str2float.py:16(<genexpr>)
 10000000    7.115    0.000    8.435    0.000 random.py:271(choice)
 10000000    0.760    0.000    0.760    0.000 {method 'random' of '_random.Random' objects}
 10000000    0.559    0.000    0.559    0.000 {len}
   100000    0.242    0.000    0.242    0.000 str2float.py:9(str2float)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

As you can see from the cumulative time stats, the str2float function is not consuming much CPU time: 100,000 calls use barely 250 ms.
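
If you want to isolate the conversion itself, a quick timeit cross-check (assuming the script above is saved as str2float.py, matching the filenames in the profile output) gives the cost directly, without the random-string generation that dominates the profile:

    import timeit

    # 10**5 calls to str2float alone, on input that always fails to parse
    print(timeit.timeit("str2float('not a number')",
                        setup="from str2float import str2float",
                        number=10**5))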

– KurzedMetal

All generalizations are false (irony intended). One cannot say that try: except: is always faster than regex or vice versa. In your case, regex is not overkill and would be much faster than the try: except: method. However, based on our discussion in the comments on your question, I went ahead and implemented a C library that performs this conversion efficiently (since I see this question a lot on SO); the library is called fastnumbers. Below are timing tests of your try: except: method, of regex, and of fastnumbers.


from __future__ import print_function
import timeit

prep_code = '''\
import random
import string
x = [''.join(random.sample(string.ascii_letters, 7)) for _ in range(10)]
y = [str(random.randint(0, 1000)) for _ in range(10)]
z = [str(random.random()) for _ in range(10)]
'''

try_method = '''\
def converter_try(vals):
    resline = []
    for item in vals:
        try:
            resline.append(int(item))
        except ValueError:
            try:
                resline.append(float(item))
            except ValueError:
                resline.append(item)

'''

re_method = '''\
import re
int_match = re.compile(r'[+-]?\d+$').match
float_match = re.compile(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?$').match
def converter_re(vals):
    resline = []
    for item in vals:
        if int_match(item):
            resline.append(int(item))
        elif float_match(item):
            resline.append(float(item))
        else:
            resline.append(item)

'''

fn_method = '''\
from fastnumbers import fast_real
def converter_fn(vals):
    resline = []
    for item in vals:
        resline.append(fast_real(item))

'''

print('Try with non-number strings', timeit.timeit('converter_try(x)', prep_code+try_method), 'seconds')
print('Try with integer strings', timeit.timeit('converter_try(y)', prep_code+try_method), 'seconds')
print('Try with float strings', timeit.timeit('converter_try(z)', prep_code+try_method), 'seconds')
print()
print('Regex with non-number strings', timeit.timeit('converter_re(x)', prep_code+re_method), 'seconds')
print('Regex with integer strings', timeit.timeit('converter_re(y)', prep_code+re_method), 'seconds')
print('Regex with float strings', timeit.timeit('converter_re(z)', prep_code+re_method), 'seconds')
print()
print('fastnumbers with non-number strings', timeit.timeit('converter_fn(x)', prep_code+fn_method), 'seconds')
print('fastnumbers with integer strings', timeit.timeit('converter_fn(y)', prep_code+fn_method), 'seconds')
print('fastnumbers with float strings', timeit.timeit('converter_fn(z)', prep_code+fn_method), 'seconds')
print()

The output looks like this on my machine:

Try with non-number strings 55.1374599934 seconds
Try with integer strings 11.8999788761 seconds
Try with float strings 41.8258318901 seconds

Regex with non-number strings 11.5976541042 seconds
Regex with integer strings 18.1302199364 seconds
Regex with float strings 19.1559209824 seconds

fastnumbers with non-number strings 4.02173805237 seconds
fastnumbers with integer strings 4.21903610229 seconds
fastnumbers with float strings 4.96900391579 seconds

A few things are pretty clear:

  • try: except: is very slow for non-numeric input; regex beats that handily
  • try: except: becomes more efficient if exceptions don't need to be raised
  • fastnumbers beats the pants off both in all cases

So, if you don't want to use fastnumbers, you need to assess whether you are more likely to encounter invalid strings or valid strings, and base your algorithm choice on that.
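
For reference, fast_real performs the whole int-then-float-then-leave-as-string fallback in a single call; a quick sketch (by default, unconvertible input is returned unchanged):

    from fastnumbers import fast_real

    print(fast_real('56'))       # 56 (an int)
    print(fast_real('56.07'))    # 56.07 (a float)
    print(fast_real('invalid'))  # 'invalid' (returned as-is)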

– SethMMorton