I want to read text files such as CSV or tab-separated values, but only the columns that can be cast to numbers. For instance, if a column contains only strings, I'd rather not read it. This is because I'd like to avoid working with numpy arrays of multiple data types (I would be interested to know why I should, though I realize questions should not invite discussion).

Various questions were quite close to mine (see 1, 2 and 3). However, apart from 1, they focus on converting the string rather than checking which data type it could be converted to, which is why I mostly followed 1 to obtain the desired result.

Numpy's genfromtxt already does something close (through its "dtype" argument). I suppose I could use this (with "dtype" set to None) and then just check each column's data type.
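As a rough sketch of that genfromtxt idea: the sample data and the variable names below are made up for illustration, and the final cast to float assumes the numeric columns are not complex.

```python
import io

import numpy as np

# Hypothetical sample standing in for a real CSV file.
sample = io.StringIO("a,b,c\n1,x,2.5\n3,y,4.5\n")

# dtype=None lets genfromtxt pick a type per column; names=True reads the
# header row, so the result is a structured array with one field per column.
data = np.genfromtxt(sample, delimiter=",", dtype=None, names=True,
                     encoding="utf-8")

# Keep only the fields whose dtype is a numeric one.
numeric_names = [name for name in data.dtype.names
                 if np.issubdtype(data.dtype[name], np.number)]

# Stack the surviving columns into one homogeneous float array
# (assumption: no complex columns; cast to complex instead if there are).
numeric = np.column_stack([data[name].astype(float) for name in numeric_names])
```

Here the string column "b" is dropped and the remaining columns end up in a single two-dimensional float array.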

Here's what I have so far:

def data_type(string_to_test):
    """
    Checks to which data type a string can be cast. The main goal is to convert strings to floats.
    The hierarchy for upcasting goes like this: int->float->complex->string. Thus, int is the most restrictive.
    :param string_to_test: A string of characters that hypothetically represents a number
    :return: The most restrictive data type to which the string could be cast.
    """
    # Do this to also convert floats that use a comma instead of a dot:
    string_to_test = string_to_test.replace(',', '.')
    # First, try to switch from string to int:
    try:
        # This will yield True if string_to_test represents an int, or a float that is equal to an int (e.g.: '1.0').
        if int(float(string_to_test)) == float(string_to_test):
            return int
        else:
            int(string_to_test)
            return float
    except ValueError:
        # If it doesn't work, try switching from string to float:
        try:
            float(string_to_test)
            return float
        except ValueError:
            # Happens with complex numbers and types (e.g.: float(4 + 3j), or float(float64)).
            # If this still doesn't work, try switching from string to complex:
            try:
                # Make sure spaces between operators don't cause any problems (e.g.: '1 + 4j' will not work,
                # while '1+4j' will).
                complex(string_to_test.replace(' ', ''))
                return complex
            # If none of the above worked, the string is said not to represent any other data type (remember this
            # function is supposed to be used on data that is read from files, so checking only for those types
            # should be exhaustive enough).
            except ValueError:
                return str

My biggest problem with this is that I find it rather ugly, and also that there could be cases I haven't thought about. Thus my question is "could it be done in a better way?".

Also, I'm interested to know when it would be better to return a string naming the data type instead of the class itself (e.g.: return 'complex' as a string instead of complex, the class). For instance, I know I can use either (the string or the class) with the astype method of numpy arrays.

Thanks in advance!

1 Answer

Same logic, less-"ugly" presentation:

def data_type(string_to_test, types=(int,float,complex)):
    string_to_test = string_to_test.replace(' ', '')
    for typ in types:
        try: value = typ(string_to_test)
        except ValueError: pass
        else: break
    else: typ = str 
    # special cases:
    if typ is float and int in types and value == int(value): typ = int
    if typ is int and bool in types and value == bool(value): typ = bool
    return typ

This also lends itself to being extended a little more easily, in that you can pass a different hierarchy of types. Note that, analogous to your rule for "boiling down" a float into an int, I've included a rule for further boiling down an int into a bool if bool is one of the desired types (by default it isn't, since you don't specify it in the question, but it could be).
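A more defensive variant of data_type can be sketched as follows. It rests on two assumptions that are not taken from the answer. First, strings such as 'inf' and 'nan' should be reported as float: float() accepts them, but int() of an infinite float raises OverflowError and int() of NaN raises ValueError, so the comparison value == int(value) needs a guard. Second, bool should only be reached by boiling down an int, because bool() of any non-empty string is True and therefore never raises. The name data_type_checked is hypothetical.

```python
def data_type_checked(string_to_test, types=(int, float, complex)):
    """Illustrative variant of data_type with guards for inf/nan and bool."""
    text = string_to_test.replace(' ', '')
    for typ in types:
        if typ is bool:
            # bool('anything') is True, so bool cannot act as a parser here.
            continue
        try:
            value = typ(text)
        except ValueError:
            continue
        break
    else:
        # No numeric type accepted the string.
        return str
    # Boil a float down to int only when it is finite and whole;
    # is_integer() is False for inf and nan, so int() is never called on them.
    if typ is float and int in types and value.is_integer():
        typ, value = int, int(value)
    # Boil an int down to bool only for 0 and 1.
    if typ is int and bool in types and value in (0, 1):
        typ = bool
    return typ


with_bool = (int, float, complex, bool)
results = {
    '3': data_type_checked('3'),
    '1.0': data_type_checked('1.0'),
    '1.5': data_type_checked('1.5'),
    '1 + 4j': data_type_checked('1 + 4j'),
    'abc': data_type_checked('abc'),
    'inf': data_type_checked('inf'),
    '1 (with bool)': data_type_checked('1', with_bool),
    'abc (with bool)': data_type_checked('abc', with_bool),
}
```

With these guards, 'inf' is classified as float rather than raising, and a non-numeric string still comes back as str even when bool is part of the hierarchy.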

I would keep the resulting type object on the principle of not throwing away information when you don't need to (you can always access its .__name__ if you want a string).
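To illustrate the point about not throwing away information, here is a small sketch using only the standard library. Going from the class to its name is a single attribute access, while going back from the name needs a lookup; the builtins lookup below is just one possible way to do that, and the variable names are illustrative.

```python
import builtins

detected = complex                    # what data_type would return
name = detected.__name__              # class -> string is trivial

# string -> class needs a lookup, e.g. among the built-in names.
recovered = getattr(builtins, name)
```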

jez
  • Do not put your try / except / else / if on single lines; readability matters! – DevLounge Dec 26 '16 at 17:16
  • I'm all for readability, but I think that can be a more subjective thing than people admit. I definitely find the current layout *more* readable than the indented one. (Similarly, I hate a lot of PEP-8 and resist in Holy Warrior fashion the idea that it is somehow objectively "right".) – jez Dec 26 '16 at 17:20
  • That's precisely the kind of answer I was looking for, thanks a lot! I'll keep the replacement of commas to dots, but I suppose I should not have mentioned it in the first place, since it's not directly related to my question. – SHOWMEWHATYOUGOT Dec 26 '16 at 18:03