0

I am trying to write code to convert CSV to ARFF. I read the values between each "," into the cells of a list. For example, an instance such as:

Monday,176,49,203,27,77,38,Second

is converted to:

['Monday', '176', '49', '203', '27', '77', '38', 'Second']

The problem is that Python treats every cell as a string. These are the types Python reports for the example:

[<type 'str'>, <type 'str'>, <type 'str'>, <type 'str'>, <type 'str'>, <type 'str'>, <type 'str'>, <type 'str'>]

I am looking for a way to distinguish between nominal and numerical attributes.

Sanjay T. Sharma
    Have you tried this? http://stackoverflow.com/questions/354038/how-do-i-check-if-a-string-is-a-number-in-python – 808sound Jan 06 '13 at 05:34

3 Answers

3

The best I can think of is something like this, using ast.literal_eval:

import ast

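def converter(x):
    try:
        # literal_eval safely parses Python literals (numbers, tuples, lists, ...)
        return ast.literal_eval(x)
    except (ValueError, SyntaxError):
        # not a literal (e.g. 'Monday' or 'hello world'), so keep the string
        return x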

which gives

>>> seq = ['Monday', '176', '49', '203', '27', '77', '38', 'Second']
>>> newseq = [converter(x) for x in seq]
>>> newseq
['Monday', 176, 49, 203, 27, 77, 38, 'Second']
>>> map(type, newseq)
[<type 'str'>, <type 'int'>, <type 'int'>, <type 'int'>, <type 'int'>, <type 'int'>, <type 'int'>, <type 'str'>]

The advantage of using ast.literal_eval is that it handles more cases in a nice fashion:

>>> seq = ['Monday', '12.3', '(1, 2.3)', '[2,"fred"]']
>>> newseq = [converter(x) for x in seq]
>>> newseq
['Monday', 12.3, (1, 2.3), [2, 'fred']]
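
For the ARFF use case in the question, the same per-value conversion could be applied column by column to decide whether an attribute is numeric or nominal. The following is only a minimal sketch, assuming Python 3: the names `convert` and `attribute_type` are hypothetical, and the second row is made up for illustration. It also catches `SyntaxError`, which `ast.literal_eval` raises for text such as `hello world` that does not parse as an expression.

```python
import ast

def convert(text):
    """Return the Python literal that text spells, or text itself otherwise."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text

def attribute_type(column):
    """Classify a column of raw CSV strings (hypothetical helper)."""
    values = [convert(v) for v in column]
    # bool is a subclass of int, so exclude it explicitly
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "nominal"

rows = [
    "Monday,176,49,203,27,77,38,Second".split(","),
    "Tuesday,180,51,198,30,75,40,First".split(","),  # made-up second instance
]
columns = list(zip(*rows))  # transpose rows into columns
types = [attribute_type(col) for col in columns]
print(types)
```

A column counts as numeric here only if every value in it converts to an int or float; anything else is treated as nominal.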
DSM
  • A good solution if there is a need to support hex literals, tuple types etc. in the text file but a bit heavy handed IMO if the text file contains simple text and numbers. Still, +1 – Sanjay T. Sharma Jan 06 '13 at 06:07
  • Well, the `isdigit` solutions will fail on negative numbers, or anything with a `+`, etc., so ISTM we might as well use the builtin. – DSM Jan 06 '13 at 06:12
  • I can always change the `isdigit` to check for the last digit though, that'll always work – Volatility Jan 06 '13 at 06:18
  • @Volatility: what about `"7-Eleven"` or `"Super-8"`? – DSM Jan 06 '13 at 06:18
  • @DSM: That's a very good point. @Volatility: But now that'll fail for short float literals like `2.` which end with `.` but is still a valid float. But I agree, this isn't a deal breaker since down the line we are anyway trying to parse the input using `int` and `float`. So we can always return back the original chunk if the `float` conversion fails. – Sanjay T. Sharma Jan 06 '13 at 06:19
  • Lol @DSM is that even valid? But yeah, I guess properly error-proofing the code would make it messy – Volatility Jan 06 '13 at 06:22
2
for i in lst:
    try:
        int(i)
    except ValueError:
        pass  # error handling: i is not an integer
    else:
        pass  # whatever you want to do: i is an integer

That will work, although this would be much better:

for i in lst:
    if i[-1].isdigit():  # last character is a digit, so treat it as a number
        pass  # whatever
    else:
        pass  # whatever else

Taken from here

See also: str.isdigit() method
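
To make the difference between these checks concrete, here is a small sketch, assuming Python 3. The helper `parses_as_int` is a hypothetical name, and the sample strings are chosen purely for illustration. For each string it records whether `int()` accepts it, whether `isdigit()` holds for the whole string, and whether it holds for the last character only.

```python
def parses_as_int(text):
    """Return True if int() accepts the string (hypothetical helper)."""
    try:
        int(text)
        return True
    except ValueError:
        return False

samples = ["176", "-5", "12.3", "7-Eleven", "Super-8"]

# (int() succeeds, whole string isdigit, last character isdigit)
results = {s: (parses_as_int(s), s.isdigit(), s[-1].isdigit()) for s in samples}

for s in samples:
    print(s, results[s])
```

The three checks agree on `176` and on `7-Eleven` but disagree on the other samples, so which one fits depends on what the data can contain.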

Volatility
1

If performance matters a lot here, I'd try a three-step approach. It uses a simple check on the first character to avoid needlessly casting a string to int or float and then failing.

  • For each chunk, check if the first character is a digit or not
  • If it is, first try parsing it as an int and if it fails, parse it as float
  • If all that fails, you have a big problem :)

Something like:

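def convert_chunk(chunk):
    if chunk[0].isdigit():
        try:
            return int(chunk)
        except ValueError:
            # Not an int; float() raises ValueError if this fails too
            return float(chunk)
    else:
        # It's a string (a non-numeric entity)
        return chunk

converted = [convert_chunk(chunk) for chunk in chunks]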

You'll of course need a bit more special handling for supporting hex/oct literals in the text/csv file but I don't think that's a normal case for you?
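
If hex/oct literals did need support, one possible approach is to pass base 0 to `int`, which makes it honour the `0x`, `0o` and `0b` prefixes. This is a minimal sketch, assuming Python 3. The name `parse_number` is hypothetical, and the sketch drops the first-character check in favour of plain try/except fallbacks.

```python
def parse_number(chunk):
    """Try int (any literal base), then float, else return the string unchanged."""
    try:
        return int(chunk, 0)  # base 0 honours 0x, 0o and 0b prefixes
    except ValueError:
        pass
    try:
        return float(chunk)
    except ValueError:
        return chunk  # non-numeric entity

print([parse_number(c) for c in ["0x1f", "0o17", "176", "12.3", "2.", "Monday"]])
```

Note that `float` also accepts strings such as `nan` and `inf`, so a nominal value spelled that way would be converted to a float by this sketch.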

EDIT: Come to think of it, Volatility has used a similar approach; the only difference is that isdigit is called on the entire string instead of just the first character. That might take a wee bit more time for long numeric sequences, since isdigit then examines each and every char, whereas my approach only ever checks the first char and so might be a bit faster.

Sanjay T. Sharma