I'm reading a CSV file into a Pandas DataFrame. The file looks like:

          A              B
  +--------------+---------------+
0 |              | ("t1", "t2")  |
  +--------------+---------------+
1 | ("t3", "t4") |               |
  +--------------+---------------+

Two of the cells have literal tuples in them, and two of the cells are empty.

import ast
import pandas as pd

df = pd.read_csv('my_file.csv', dtype=str, delimiter=',',
    converters={'A': ast.literal_eval, 'B': ast.literal_eval})

The converter ast.literal_eval works fine to convert the literal tuples into Python tuple objects within the code – but only as long as there are no empty cells. Because I have empty cells, I get the error:

SyntaxError: unexpected EOF while parsing

According to this S/O answer, I should try to catch the SyntaxError exception for empty strings:

ast uses compile to compile the source string (which must be an expression) into an AST. If the source string is not a valid expression (like an empty string), a SyntaxError will be raised by compile.
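A minimal reproduction of that behaviour, outside of read_csv:

```python
import ast

# An empty string is not a valid Python expression, so literal_eval
# fails at the compile step; on current CPython this surfaces as a
# SyntaxError, matching the error message in the question.
try:
    ast.literal_eval('')
    caught = False
except (SyntaxError, ValueError):
    caught = True
print(caught)  # True
```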

However, I am not sure how to catch exceptions for individual cells, within the context of the read_csv converters.

What would be the best way to go about this? Is there otherwise some way to convert empty strings/cells into objects which literal_eval would accept or ignore?

NB: My understanding is that having literal tuples in readable files isn't always the best thing, but in my case it's useful.

jpp
P A N

2 Answers

You can create a custom function which uses ast.literal_eval conditionally:

import pandas as pd
from ast import literal_eval
from io import StringIO

# replicate csv file
x = StringIO("""A,B
,"('t1', 't2')"
"('t3', 't4')",""")

def literal_converter(val):
    # replace first val with '' or some other null identifier if required
    return val if val == '' else literal_eval(val)

df = pd.read_csv(x, delimiter=',', converters=dict.fromkeys('AB', literal_converter))

print(df)

          A         B
0            (t1, t2)
1  (t3, t4)          

Alternatively, you can use try / except to catch the error. This solution is more lenient, as it also handles other malformed syntax, i.e. SyntaxError / ValueError raised for reasons other than empty values.

def literal_converter(val):
    try:
        return literal_eval(val)
    except (SyntaxError, ValueError):
        return val
jpp

I would first read the data as normal, without literal_eval(). That gives us:

              A             B
0           NaN  ("t1", "t2")
1  ("t3", "t4")           NaN

Then I would do this:

newdf = df.fillna('()').applymap(ast.literal_eval)

Which gives:

          A         B
0        ()  (t1, t2)
1  (t3, t4)        ()

I think it's convenient to have tuples in all the cells, even the empty ones. This will make it easier to operate on the tuples later, for example:

newdf.sum(axis=1)

Which gives you:

0    (t1, t2)
1    (t3, t4)

Because "adding" tuples is concatenation. And even trickier but still very useful:

newdf.A.str[0]

Gives you:

0    NaN
1     t3

Because pd.Series.str, despite looking like it would only work on strings, works just fine on lists and tuples. So you can efficiently and uniformly index elements within each column's tuples.
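A small self-contained sketch of that behaviour:

```python
import pandas as pd

# .str indexes elementwise, and works on tuples and lists, not just strings
s = pd.Series([('t1', 't2'), ('t3', 't4')])
print(list(s.str[0]))   # ['t1', 't3']
print(list(s.str[-1]))  # ['t2', 't4']
```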

John Zwinck