I want to read text files such as CSV or tab-separated values, but keep only the columns that can be cast to numbers. For instance, if a column contains only strings, I'd like to skip it. This is because I'd like to avoid working with numpy arrays of multiple data types (I would be interested to know why I should work with them, although I realise questions here should not invite open-ended discussion).
Various questions were quite close to mine (see 1, 2 and 3). However, apart from 1, they focus on converting the string rather than checking which data type it could be converted to, which is why I mostly followed 1 to obtain the desired result.
Numpy's genfromtxt already does something close (through its "dtype" argument). I suppose I could use it with "dtype" set to None, and then just check each column's data type.
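For reference, a minimal sketch of that genfromtxt idea could look like the following. It assumes a comma-delimited file with a header row, and the function name `read_numeric_columns` is made up for illustration. `names=True` is used so that the result is a structured array whose fields can be inspected one by one.

```python
import io

import numpy as np


def read_numeric_columns(source, delimiter=","):
    """Hypothetical helper: keep only the columns that genfromtxt parsed as numbers."""
    # dtype=None asks genfromtxt to guess a type per column; names=True reads the
    # header row, so the result is a structured array with one field per column.
    table = np.atleast_1d(
        np.genfromtxt(source, delimiter=delimiter, names=True, dtype=None, encoding="utf-8")
    )
    # Keep the fields whose guessed dtype is numeric (int, float or complex).
    numeric_names = [
        name for name in table.dtype.names
        if np.issubdtype(table.dtype[name], np.number)
    ]
    if not numeric_names:
        return [], np.empty((len(table), 0))
    # Stack the numeric fields into one plain 2-D array with a single dtype.
    columns = [table[name] for name in numeric_names]
    return numeric_names, np.column_stack(columns)


# Made-up sample data: one text column and two numeric columns.
sample = io.StringIO("label,count,value\na,1,2.5\nb,3,4.5\n")
names, data = read_numeric_columns(sample)
print(names)       # ['count', 'value']
print(data.dtype)  # float64
```

Note that this leaves all the type guessing to genfromtxt, so it does not handle a comma used as a decimal separator.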
Here's what I have so far:
```python
def data_type(string_to_test):
    """
    Checks to which data type a string can be cast. The main goal is to convert strings to floats.
    The hierarchy for upcasting goes like this: int->float->complex->string. Thus, int is the most restrictive.
    :param string_to_test: A string of characters that hypothetically represents a number
    :return: The most restrictive data type to which the string could be cast.
    """
    # Do this to also convert floats that use a comma instead of a dot as the decimal separator:
    string_to_test = string_to_test.replace(',', '.')
    # First, try to switch from string to int:
    try:
        # This will yield True if string_to_test represents an int, or a float that is equal to an int (e.g.: '1.0').
        if int(float(string_to_test)) == float(string_to_test):
            return int
        else:
            return float
    except (ValueError, OverflowError):
        # ValueError: the string is not a float, or it is 'nan' (int(float('nan')) fails).
        # OverflowError: the string is 'inf' or '-inf' (int(float('inf')) fails).
        # If it doesn't work, try switching from string to float:
        try:
            float(string_to_test)
            return float
        except ValueError:
            # Happens with strings that represent complex numbers (e.g.: float('4+3j')) or plain text.
            # If this still doesn't work, try switching from string to complex:
            try:
                # Make sure spaces between operators don't cause any problems (e.g.: '1 + 4j' will not work,
                # while '1+4j' will).
                complex(string_to_test.replace(' ', ''))
                return complex
            # If none of the above worked, the string is said not to represent any other data types (remember this
            # function is supposed to be used on data that is read from files, so checking only for those types should
            # be exhaustive enough).
            except ValueError:
                return str
```
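To make the intended use on whole columns concrete, here is a minimal, self-contained sketch. It is a flatter variant and not the function above: it loops over candidate casts instead of nesting try/except, so '1.0' is reported as float rather than int. The names `CASTS`, `HIERARCHY`, `cell_type` and `column_type` are hypothetical. A column takes the least restrictive type found among its cells, and only columns that do not end up as str would be read.

```python
# Candidate casts, ordered from most to least restrictive (int -> float -> complex).
CASTS = (int, float, complex)
# Full upcasting hierarchy, with str as the fallback.
HIERARCHY = (int, float, complex, str)


def cell_type(text):
    """Return the most restrictive type in HIERARCHY that can parse `text`."""
    # Assumption carried over from the function above: a comma is a decimal separator.
    cleaned = text.strip().replace(',', '.')
    for cast in CASTS:
        try:
            # complex() rejects inner spaces ('1 + 4j'), so drop them for that cast only.
            cast(cleaned.replace(' ', '') if cast is complex else cleaned)
            return cast
        except ValueError:
            continue
    return str


def column_type(cells):
    """Return the least restrictive type found among the cells of one column."""
    # Hypothetical choice: an empty column is treated as text.
    return max((cell_type(cell) for cell in cells), key=HIERARCHY.index, default=str)


# Made-up rows, already split into cells by a tab or semicolon delimiter.
rows = [['a', '1', '2,5', '1 + 4j'], ['b', '3', '4', '2']]
columns = list(zip(*rows))
numeric = [i for i, col in enumerate(columns) if column_type(col) is not str]
print(numeric)  # [1, 2, 3]
```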
My biggest problem with this is that I find it rather ugly, and also that there could be cases I haven't thought about. Thus my question is: "could it be done in a better way?"
Also, I'm interested to know when it would be better to return a string naming the data type instead of the class itself (e.g.: return 'complex' as a string instead of complex, the class). For instance, I know I can use both (the string or the class) when using the method astype for numpy arrays.
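As a small illustration of that last point, the sketch below checks that numpy resolves the string and the class to the same dtype, at least for `complex` and `float`. The array contents are made up.

```python
import numpy as np

values = np.array(['1', '2.5', '4'])

# The class and its name as a string resolve to the same numpy dtype.
print(np.dtype('complex') == np.dtype(complex))  # True
print(np.dtype('float') == np.dtype(float))      # True

# So both spellings give the same result with astype.
from_class = values.astype(float)
from_string = values.astype('float')
print(np.array_equal(from_class, from_string))   # True
```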
Thanks in advance!