I have what I assumed would be a super basic problem, but I'm unable to find a solution. In short, I have a column in a CSV that holds a list of numbers. The CSV was generated by pandas with to_csv. When I read it back in with read_csv, the list of numbers comes back as a string.

When I then try to use the column I get errors, since the values are strings. Using the to_numeric function fails as well, because each value is a list, not a single number.

Is there any way to solve this? Posting code below for form, but probably not extremely helpful:

def write_func(dataset):
    features = featurize_list(dataset[column])  # Returns numpy array
    new_dataset = dataset.copy()  # Don't want to modify the underlying dataframe
    new_dataset['Text'] = features
    new_dataset.rename(columns={'Text': 'Features'}, inplace=True)
    write(new_dataset, dataset_name)

def write(new_dataset, dataset_name):
    dump_location = feature_set_location(dataset_name, self)
    new_dataset.to_csv(dump_location)

def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(pd.to_numeric)

The Features column is the one in question. When I attempt to run the apply currently in read_func I get this error:

ValueError: Unable to parse string "[0.019636873200000002, 0.10695576670000001,...]" at position 0

I can't be the first person to run into this issue, is there some way to handle this at read/write time?
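For reference, the round trip can be reproduced with a minimal sketch like the one below. The column names and values are made up for illustration, and it assumes each cell holds a plain Python list, which matches the comma-separated format in the error message above.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: each cell of 'Features' holds a Python list.
df = pd.DataFrame({"Name": ["a", "b"], "Features": [[0.1, 0.2], [0.3, 0.4]]})

# Write to an in-memory buffer instead of a file on disk.
buffer = StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Reading it back gives the text form of each list, not the list itself.
restored = pd.read_csv(buffer)
first = restored["Features"].iloc[0]
print(type(first).__name__, repr(first))
```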

Slater Victoroff

2 Answers


You want to use literal_eval as a converter passed to pd.read_csv. Below is an example of how that works.

from ast import literal_eval
from io import StringIO
import pandas as pd

txt = """col1|col2
a|[1,2,3]
b|[4,5,6]"""

df = pd.read_csv(StringIO(txt), sep='|', converters=dict(col2=literal_eval))
print(df)

  col1       col2
0    a  [1, 2, 3]
1    b  [4, 5, 6]
piRSquared
  • Is this secure? `literal_eval` sketches me out quite a bit, and I don't have complete control over the input files here. They get pulled down from a remote server. – Slater Victoroff Apr 20 '17 at 18:13
  • I'm equally sketched out by `eval`... `literal_eval` is intended to alleviate that fear by safely parsing only literals. See [***this post***](http://stackoverflow.com/a/15197698/2336654) – piRSquared Apr 20 '17 at 18:14
  • This seems... doable, but is this really the only way to do it? It's pretty damn arcane for something that feels like a very basic use case. To be clear this *does* work though. – Slater Victoroff Apr 20 '17 at 18:16
  • No, it isn't... the other way is more painful. You can parse the string yourself. – piRSquared Apr 20 '17 at 18:19
  • You can also save the data as a `json` string within the csv and use `json.loads` in a converter. But I'd prefer `literal_eval` – piRSquared Apr 20 '17 at 18:21
  • @SlaterTyranus It is not that it's an uncommon use case, but pandas mainly deals with numbers and strings. It doesn't support these kinds of structures very well. If they are all lists, you can just use json to parse them (i.e. `json.loads('[1.0, 2.0]')`). I am not sure if this can be passed as a converter like piRSquared did, but it seems doable. – ayhan Apr 20 '17 at 18:21
  • But like, this isn't a random file. `pandas` wrote this out. Why should it default to writing something it can't read? I mean, it's not your fault obviously, but that just seems nuts to me. I thought `pandas` placed quite a bit of emphasis on clean interoperability with `numpy` data types. – Slater Victoroff Apr 20 '17 at 18:24
  • @SlaterTyranus I get your response. However, dates are analogous. You have to specify which fields are dates. pandas has a bit of api wrapped around parsing dates. I could see a `read_csv` flag that indicates which fields should be lists.. but I suspect it would be low priority. – piRSquared Apr 20 '17 at 18:29
  • @piRSquared I guess it's really on me. If I cared enough I would open a PR with that. – Slater Victoroff Apr 20 '17 at 18:35
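The JSON route mentioned in the comments could look roughly like the sketch below. It assumes you control the write side, so each list can be serialized with `json.dumps` before `to_csv`; the column names and values are hypothetical. On the read side, `json.loads` is passed through the same `converters` argument used for `literal_eval` above.

```python
import json
from io import StringIO

import pandas as pd

# Hypothetical data: each cell of 'Features' holds a Python list.
df = pd.DataFrame({"Name": ["a", "b"], "Features": [[0.1, 0.2], [0.3, 0.4]]})

# Serialize each list to a JSON string on a copy, leaving df untouched.
to_write = df.copy()
to_write["Features"] = to_write["Features"].apply(json.dumps)

buffer = StringIO()
to_write.to_csv(buffer, index=False)
buffer.seek(0)

# json.loads only parses JSON, so it cannot evaluate arbitrary expressions.
restored = pd.read_csv(buffer, converters={"Features": json.loads})
print(restored["Features"].iloc[0])
```

The converter still runs once per cell, so this changes the parser, not the overall shape of the work.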

I have modified your last function a bit and it works fine.

def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(lambda x : pd.to_numeric(x))
Mohammad Akhtar
  • This is not tractable for me due to performance reasons. It's quite a large file I'm converting and this iterates through every entry in every list. – Slater Victoroff Apr 20 '17 at 18:23
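If per-element conversion is the concern, one option is to parse the strings yourself with pandas string methods and convert the whole block to floats in one step. This is only a sketch: it assumes every list has the same length and contains plain numbers, the sample CSV text is made up, and whether it is actually faster on a large file is something to time rather than take for granted.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text in the same shape that to_csv produces for list cells.
csv_text = 'Name,Features\na,"[0.1, 0.2, 0.3]"\nb,"[0.4, 0.5, 0.6]"\n'
df = pd.read_csv(StringIO(csv_text))

# Drop the brackets and spaces, split on commas into one column per element,
# then cast the whole frame to float at once.
matrix = (
    df["Features"]
    .str.strip("[]")
    .str.replace(" ", "", regex=False)
    .str.split(",", expand=True)
    .astype(float)
    .to_numpy()
)
print(matrix.shape)
```

The result is a 2-D numpy array with one row per CSV row, rather than a column of Python lists.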