Python/Pandas Unpacking of Field String into Multiple Fields

Question

Several time series were measured for the same objects. Unfortunately, the x and y coordinates are all put together into two comma-separated strings. To make things more complicated, the number of time series and the x coordinates varied between time series.

So, for example, I have a data frame that looks something like:

Object   Overall_Prop   X                                              Y
obj1         4.5        "0, 1, 3, 6, 1, 3, 5, 7, 0, 1, 3, 5, 7"   "3, 9, 10, 11, 8, 10, 12, 14, 3.1, 8.5, 9, 12.5, 14.5"
obj2         9.9        "1, 3, 6, 9"                               "7, 9, 10, 14.2"

What I would like to have is a data frame that looks like this one:

Object    Overall_Prop  Curve  X  Y
obj1            4.5        1   0  3
obj1            4.5        1   1  9
obj1            4.5        1   3  10
obj1            4.5        1   6  11
obj1            4.5        2   1  8
obj1            4.5        2   3  10
obj1            4.5        2   5  12
obj1            4.5        2   7  14
obj1            4.5        3   0  3.1
obj1            4.5        3   1  8.5
obj1            4.5        3   3  9
obj1            4.5        3   5  12.5
obj1            4.5        3   7  14.5
obj2            9.9        1   1  7
obj2            9.9        1   3  9
obj2            9.9        1   6  10
obj2            9.9        1   9  14.2

By the way, this question is different from pandas: how do I split a text in a column into multiple rows because here we have two columns and the resulting field must be paired appropriately.

Hence the additional complication.

Frankly, I do not have a clue how to get it started. I suppose I could try with for loops but I have a feeling that if I could create a multi-index somehow, I could then go from a wide to a long format. But then again, the number of time series is variable so I do not know how to do this without processing the strings using for loops somehow. — JDS, Apr 12 '18 at 14:18
Here's a similar post: https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows/21032532 — pault, Apr 12 '18 at 14:22

score 2 · Answer 1 · answered Apr 12 '18 at 14:28

Here is one way. I have not included Curve column as it is unclear how this is defined.

import pandas as pd
from itertools import chain

df = pd.DataFrame({'Object': ['obj1', 'obj2'],
                   'Overall_Prop': [4.5, 9.9],
                   'X': ['0, 1, 3, 6, 1, 3, 5, 7, 0, 1, 3, 5, 7', '1, 3, 6, 9'],
                   'Y': ['3, 9, 10, 11, 8, 10, 12, 14, 3.1, 8.5, 9, 12.5, 14.5', '7, 9, 10, 14.2']})

df['X'] = [list(map(float, x)) for x in df['X'].str.split(', ')]
df['Y'] = [list(map(float, x)) for x in df['Y'].str.split(', ')]

lens = list(map(len, df['X']))

res = pd.DataFrame({'Object': np.repeat(df['Object'], lens),
                    'Overall_Prop': np.repeat(df['Overall_Prop'], lens),
                    'X': list(chain.from_iterable(df['X'])),
                    'Y': list(chain.from_iterable(df['Y']))}).reset_index(drop=True)

print(res)

#    Object  Overall_Prop    X     Y
# 0    obj1           4.5  0.0   3.0
# 1    obj1           4.5  1.0   9.0
# 2    obj1           4.5  3.0  10.0
# 3    obj1           4.5  6.0  11.0
# 4    obj1           4.5  1.0   8.0
# 5    obj1           4.5  3.0  10.0
# 6    obj1           4.5  5.0  12.0
# 7    obj1           4.5  7.0  14.0
# 8    obj1           4.5  0.0   3.1
# 9    obj1           4.5  1.0   8.5
# 10   obj1           4.5  3.0   9.0
# 11   obj1           4.5  5.0  12.5
# 12   obj1           4.5  7.0  14.5
# 13   obj2           9.9  1.0   7.0
# 14   obj2           9.9  3.0   9.0
# 15   obj2           9.9  6.0  10.0
# 16   obj2           9.9  9.0  14.2

Thanks @jpp, this looks very good. I am going to try it next but before I wanted to clarify what "Curve" meant. Overtime that the X value goes from a higher to a lower value, a new curve is started. So in your print out, curve number 2 for obj1 starts when X drops from 6.0 to 1.0. By the way, this solution seems much different and of wider applicability from the ones suggested in "pandas: How do I split text in a column into multiple columns". — JDS, Apr 12 '18 at 15:30
@JDS. Yep, the solution is different. This one is more efficient, the other one is more pandas-oriented. Which you prefer is your choice. Re: Curve, I suggest you ask as a separate question if you can't add this column yourself (after the prerequisite research). — jpp, Apr 12 '18 at 15:32

Python/Pandas Unpacking of Field String into Multiple Fields

1 Answers1