Description of the problem:

I have an array-like structure in a dataframe column as a string (I read the dataframe from a csv file).

One string element of this column looks like this:

In  [1]: df.iloc[0]['points']    
Out [2]: '[(-0.0426, -0.7231, -0.4207), (0.2116, -0.1733, -0.1013), (...)]'

so it's really an array-like structure, which looks 'ready for numpy' to me.

numpy.fromstring() doesn't help, as it doesn't like brackets (see "convert string representation of array to numpy array in python").

If I copy and paste the text itself into numpy.array(), I get a NumPy array back.
But if I pass the variable holding the string, as in np.array(df.iloc[0]['points']), it does not work and gives me a ValueError: could not convert string to float (see also "Convert string to numpy array").

The question:

Is there any function to do that in a simple way (without replacing or regex-ing the brackets)?

jpp
swiss_knight
  • The first question is: where did this data come from? Is it something you’re generating? Or something generated by some program or library? If you can fix things so that the data get created in a form that’s actually meant to be parsed, or at least find an explanation of exactly what the format is and how you’re supposed to use it, that will be a lot better than reverse engineering by guessing so you can write a hacky parser. – abarnert Aug 17 '18 at 15:53
  • Anyway, this looks like someone wrote the repr of a list of tuples to a file. That’s a really bad idea, but if you can’t change that, you may be able to reverse it by calling `ast.literal_eval` on each string. That will work with the example you posted, but no guarantee that it will work with all of your data, or that you won’t get float rounding problems that wouldn’t be there with properly serialized data. It’s a hack, not a solution. – abarnert Aug 17 '18 at 15:55
  • The string (print) representation of a `structured` array is a list of tuples. The `repr` string will include `dtype` information. But since this is a cell in a DataFrame, there may be other possibilites. Did you load this `df` from a `csv` file? Are there quote strings like this in that file? – hpaulj Aug 17 '18 at 21:03

1 Answer

You can use ast.literal_eval before passing to numpy.array:

from ast import literal_eval
import numpy as np

x = '[(-0.0426, -0.7231, -0.4207), (0.2116, -0.1733, -0.1013)]'

res = np.array(literal_eval(x))

print(repr(res))

array([[-0.0426, -0.7231, -0.4207],
       [ 0.2116, -0.1733, -0.1013]])

You can do the same for each string in a Pandas series. It's not clear whether you need to aggregate across rows; if you do, you can combine the list of NumPy arrays produced by the above logic into one array.
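As a rough sketch of the per-row and aggregated variants, assuming a column named points as in the question: the CSV text below is made up for illustration, and whether concatenating along the first axis is the right aggregation depends on what the rows mean in your data.

```python
from ast import literal_eval
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for the real file; each quoted
# 'points' cell holds the repr of a list of tuples.
csv_text = '''id,points
0,"[(-0.0426, -0.7231, -0.4207), (0.2116, -0.1733, -0.1013)]"
1,"[(0.5, 0.25, -1.0)]"
'''

df = pd.read_csv(StringIO(csv_text))

# One array per row, each of shape (n_points_in_row, 3).
arrays = [np.array(literal_eval(s)) for s in df['points']]

# Aggregate across rows. Rows may hold different numbers of points,
# so concatenate along axis 0 instead of stacking into a 3-D array.
all_points = np.concatenate(arrays, axis=0)

# Alternative: parse the column while reading, via read_csv's converters.
df_parsed = pd.read_csv(StringIO(csv_text), converters={'points': literal_eval})
first_row = df_parsed['points'].iloc[0]  # a list of tuples, not a string
```

If every row is known to contain the same number of points, np.stack(arrays) would instead give a 3-D array with one slice per row.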

The docs explain types acceptable to literal_eval:

Safely evaluate an expression node or a string containing a Python literal or container display. The string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None.

So we are effectively converting a string to a list of tuples, which np.array can then convert to a NumPy array.
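To make the two steps visible, here is a minimal sketch that keeps the intermediate list of tuples and also checks that literal_eval refuses input that is not a plain literal. The rejected expression is just an illustrative example.

```python
from ast import literal_eval

import numpy as np

x = '[(-0.0426, -0.7231, -0.4207), (0.2116, -0.1733, -0.1013)]'

parsed = literal_eval(x)  # step 1: a plain Python list of tuples
res = np.array(parsed)    # step 2: a 2-D float array of shape (2, 3)

# Anything that is not a literal (here, a function call) is rejected
# with a ValueError instead of being executed.
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```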

jpp
  • Wonderful! I didn't know this module! – swiss_knight Aug 17 '18 at 15:06
  • `literal_eval` is parsing the string that looks like a list of tuples. It handles basic Python structures. `JSON` does something similar, but for a more restricted syntax. – hpaulj Aug 17 '18 at 15:15
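To illustrate the JSON comparison in the last comment, a small sketch: JSON has no tuple syntax, so the parenthesised string from the question is rejected by json.loads, while the same data written with square brackets parses fine. The shortened strings are made up for the example.

```python
import json
from ast import literal_eval

x = '[(-0.0426, -0.7231), (0.2116, -0.1733)]'

# Parentheses are not valid JSON, so this raises JSONDecodeError.
try:
    json.loads(x)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# The same numbers as nested JSON arrays parse to a list of lists.
as_json = json.loads('[[-0.0426, -0.7231], [0.2116, -0.1733]]')

# literal_eval accepts the Python tuple syntax directly.
as_literal = literal_eval(x)
```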