
I'm working on a pandas project in Python. I receive as input a .csv file like this:

Name,Time,Data
A,5.6,"[1,2,3]"
A,1.2,"[1.4,3,8.9]"
...
B,3.4,"[0.2,3,5.1]"
etc.

(I have thousands of data rows for every name, and about 10 names.) So in pandas the table is:

Name   Time       Data
A       5.6      [1,2,3]
A       1.2      [1.4,3,8.9]
...      
B       3.4      [0.2,3,5.1]
...

I need to convert all the numbers in the lists in the "Data" column to another unit of measure (so basically, multiply every number in each list by a scalar). I'm having problems because, in the csv I receive, the data are saved as strings. So first I have to convert each string to a list of floats, then multiply the 3 numbers in the list by a scalar (e.g. 2), and then convert the list back into a string.

I know that I can perform an operation on a whole column like this:

df['Data'] = df['Data'].apply(lambda x: x*2)

I can multiply every number of a list "a" this way:

[x*2 for x in a]

and I can convert the string into a list with ast:

a = ast.literal_eval(a)  # (and with a = str(a).strip('[]') I can go back to a string)

but I can't combine these three things.

Do you have any solution? (Not necessarily with the same methods I tried above.) Thank you in advance!

Kraton
  • Just join your 3 commands into `apply` and `lambda` as follows: `df.Data.apply(lambda x: str([y*2 for y in ast.literal_eval(x)]))`. It seems odd that in the last step you want to convert the lists back to strings. – Andy L. Sep 07 '19 at 18:56
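Expanded into a runnable sketch, the one-liner from the comment above looks like this (the demo frame mirrors the sample rows in the question):

```python
import ast

import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Time": [5.6, 1.2, 3.4],
    "Data": ["[1,2,3]", "[1.4,3,8.9]", "[0.2,3,5.1]"],
})

# parse the string, scale each element, and serialize back in one pass
df["Data"] = df["Data"].apply(
    lambda x: str([y * 2 for y in ast.literal_eval(x)])
)
print(df["Data"].tolist())  # ['[2, 4, 6]', '[2.8, 6, 17.8]', '[0.4, 6, 10.2]']
```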

2 Answers


I would suggest first converting everything into a list using ast.literal_eval:

import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'A'],
    'Time': [5.6, 1.2],
    'Data': ["[1, 2, 3]", "[1.4, 3, 8.9]"]
})

import ast

df['Data'] = df['Data'].apply(ast.literal_eval)

And then after that you can just use regular list operations:

df['Data'] = df['Data'].apply(lambda x: [i*2 for i in x])
print(df['Data'])

Output:

0         [2, 4, 6]
1    [2.8, 6, 17.8]
Name: Data, dtype: object

Edit:

To convert the series back into strings, simply use apply again:

df['Data'] = df['Data'].apply(str)

I wouldn't really suggest this though - it's much more convenient to store a list as a list instead of a string representation. If you want to continue to use list operations, keeping it as a list is preferable.
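That said, if the only reason for converting back is to write the file out again, `to_csv` already serializes each list cell through `str`, so the bracketed text form is restored automatically. A minimal sketch (writing to an in-memory buffer for illustration):

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["A"], "Data": [[2, 4, 6]]})

buf = io.StringIO()
# the list cell comes out quoted, because str([2, 4, 6]) contains commas
df.to_csv(buf, index=False)
print(buf.getvalue())
```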

Primusa
  • Thank you so much, it works! I have two questions (one closely linked and one about another operation on the same table): 1. In this solution I don't understand where Python converts the list back into a string. 2. Data are recorded at 80 Hz. If I want to resample the whole table at 50 Hz, is
     df.groupby('Name').apply(pd.DataFrame.sample, frac=0.625) 
    a valid solution? I used the groupby to have the same number of samples for every name (I have thousands of rows and 10 names) and to avoid a purely random sample.
    – Kraton Sep 07 '19 at 19:27
  • 1. In this solution Python doesn't convert the list back into a string. You can do that pretty easily though with `df['Data'] = df['Data'].apply(str)`, although I'm not sure why you would: all the operations you do would be a lot more convenient on lists – Primusa Sep 07 '19 at 19:30
  • 2. pandas actually has a `resample` method that you should look into – https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html – Primusa Sep 07 '19 at 19:31
  • 1. I asked because, after I do the operation and overwrite the file with .to_csv(), in the .csv file every list is converted back to a string (and that's what I wanted; I just wanted to understand how it happens. Maybe df.to_csv() does the job?) 2. I tried to use df.groupby('SensorID').resample("0.625S") and variants, but I always get errors like "TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'" – Kraton Sep 07 '19 at 20:00
  • @Elia resample is beyond the scope of this question. You'd have to convert your RangeIndex to a DatetimeIndex. See: https://stackoverflow.com/questions/30857680/pandas-resampling-error-only-valid-with-datetimeindex-or-periodindex – Primusa Sep 07 '19 at 20:07
  • Ok, thank you so much for the help, I'll open a new question! – Kraton Sep 07 '19 at 20:24
  • Hi @Primusa, using this solution on another table gave me a problem. In fact, in this table, I have some data in the form of "[nan, nan, nan]", and the conversion with ast.literal_eval fails and gives me an error. Do you have any solution compatible with dataframes both with and without "[nan, nan, nan]" strings (maybe one that recognizes when there is a nan string and skips it)? Thank you in advance :) – Kraton Sep 08 '19 at 22:01
  • @Kraton a really hacky solution would be to get rid of the nans by operating on the string and replacing them with Python's `None` so `literal_eval` recognizes it, so something like `df['Data'] = df['Data'].str.replace("nan", "None")`. Besides that you may have to write your own code to parse the list or use a pandas primitive for reading strings. – Primusa Sep 08 '19 at 23:53
  • Your code solution successfully replaces my "nan" with "None", but I get this error: `TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'` (probably when the program reaches `df['Data'] = df['Data'].apply(lambda x: [i*2 for i in x])`) – Kraton Sep 09 '19 at 00:56
  • A "nan" in your string implies a value that isn't filled: you shouldn't do multiplication on an empty value. You could try setting it to zero instead? – Primusa Sep 09 '19 at 01:10
  • Okay, setting it to 0 makes the program work and it does the right operations. But there is one last problem: I have to convert the [0,0,0] back into [nan,nan,nan] to maintain the same format as the input. I tried `df['Data'] = df['Data'].str.replace("[0,0,0]", "[nan,nan,nan]")` but it replaces every three consecutive zeros with nan, so I get numbers like 3.9[nan,nan,nan]85[nan,nan,nan] ahah (because in the Data column I have rows with both numbers and [nan,nan,nan] in the original .csv). Is there a hacky solution for that? It's not a big deal; I can ask if I'm allowed to leave [0,0,0] instead of [nan,nan,nan]. – Kraton Sep 09 '19 at 01:41
  • @Kraton better off using `None` then and handling the logic separately: `df['Data'] = df['Data'].str.replace("nan", "None"); df['Data'] = df['Data'].apply(lambda x: [i*2 if i is not None else None for i in x])` – Primusa Sep 09 '19 at 02:09
  • Ok it works, but one last problem showed up in the last table. In fact, in this table, in the 'Data' column, I have strings like in the other tables (so "[1,2,3]", "[1.2,2.4,3.6]") but also single values as floats like 2.0, 125.1 etc. (or even 0 instead of nan, so here I don't have to use the .str.replace solution). Naturally in this case ast fails because not all the column is made of strings. Is there some way I can keep ast in this case, or do I have to change to another solution? – Kraton Sep 09 '19 at 11:43
  • Alright Kraton, these are separate questions and you need to make a new post about them. At this stage you obviously need to write your own parser. Python doesn't have anything built in to parse lists/floats together. My understanding is that you can have an input like this: "[1,2,3] 3 [1,2,3]". If that's not the case and floats are always in a separate column, you can convert them all to strings first and then literal_eval. But given all the business logic you need to implement, a hacky solution isn't the way to go. – Primusa Sep 09 '19 at 15:08
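Following the "write your own parser" advice above, one possible sketch: assume each cell is either a bare number or a bracketed list string that may contain nan (the helper names here are illustrative, not from the thread):

```python
def parse_cell(cell):
    """Parse a cell that is either a bare number or a bracketed list string.

    'nan' entries become None so later arithmetic can skip them.
    """
    if isinstance(cell, (int, float)):  # already numeric: wrap it in a list
        return [cell]
    parts = cell.strip().strip("[]").split(",")
    return [None if p.strip() == "nan" else float(p) for p in parts]

def scale(values, k):
    # multiply the numbers, leave missing entries (None) untouched
    return [v * k if v is not None else None for v in values]

cells = ["[1,2,3]", "[nan,nan,nan]", 2.0, "125.1"]
print([scale(parse_cell(c), 2) for c in cells])
# [[2.0, 4.0, 6.0], [None, None, None], [4.0], [250.2]]
```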

You do not need to use ast: you can "clean" your list-text representation of the `[]` and `"` characters and use split(",") to get a list of strings.

Use map to convert to float and multiply by your constant:

Create demo data file:

data = """Name,Time,Data
A,5.6,"[1,2,3]"
A,1.2,"[1.4,3,8.9]"
B,3.4,"[0.2,3,5.1]" """

with open("d.txt","w") as f:
    f.write(data)

Process demo data file:

import pandas as pd

df = pd.read_csv("d.txt")
print(df)

constant = 3
df['Data_2'] = df['Data'].apply(
    lambda x: [v * constant for v in map(float, x.strip("[]\" ").split(","))])


print(df)

Output:

  Name  Time          Data
0    A   5.6       [1,2,3]
1    A   1.2   [1.4,3,8.9]
2    B   3.4  [0.2,3,5.1] 

Output after conversion:

  Name  Time          Data                                         Data_2
0    A   5.6       [1,2,3]                                [3.0, 6.0, 9.0]
1    A   1.2   [1.4,3,8.9]   [4.199999999999999, 9.0, 26.700000000000003]
2    B   3.4  [0.2,3,5.1]   [0.6000000000000001, 9.0, 15.299999999999999]
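If the original string format is needed downstream, the lists in Data_2 can be turned back into bracketed text; rounding first tames the float noise visible above. A sketch (the rounding precision is an arbitrary choice):

```python
import pandas as pd

s = pd.Series([[3.0, 6.0, 9.0], [4.199999999999999, 9.0, 26.700000000000003]])

# round away float noise, then rebuild the bracketed text form
s_text = s.apply(lambda lst: "[" + ",".join(str(round(v, 10)) for v in lst) + "]")
print(s_text.tolist())  # ['[3.0,6.0,9.0]', '[4.2,9.0,26.7]']
```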
Patrick Artner