-1

I am trying to convert multiple columns to multiple rows. Can someone please offer some advice?

I have DataFrame:

id .        values
1,2,3,4     [('a','b'), ('as','bd'),'|',('ss','dd'), ('ws','ee'),'|',('rr','rt'), ('tt','yy'),'|',('yu','uu'), ('ii','oo')]

I need it to look like this:

ID       Values
1         ('a','b'), ('as','bd')
2         ('ss','dd'), ('ws','ee')
3         ('rr','rt'), ('tt','yy')
4         ('yu','uu'), ('ii','oo')

I have tried groupby, split, izip. Maybe I am not doing it the right way?

S3DEV
  • 8,768
  • 3
  • 31
  • 42
riderg28
  • 15
  • 6

2 Answers2

0

I made a quick and dirty example how you could parse this dataframe

# example dataframe
df = [
    "1,2,3,4",
    [('a','b'), ('as','bd'), '|', ('ss','dd'), ('ws','ee'), '|', ('rr','rt'), ('tt','yy'), '|', ('yu','uu'), ('ii','oo')]
]

# split ids by comma
ids = df[0].split(",")

# init Id and Items as int and dict()
Id = 0
Items = dict()

# prepare array for data insert
for i in ids:
    Items[i] = []

# insert data
for i in df[1]:
    if isinstance(i, (tuple)):
        Items[ids[Id]].append(i)
    elif isinstance(i, (str)):
        Id += 1

# print data as written in stackoverflow question
print("id .\tvalues")
for item in Items:
    print("{}\t{}".format(item, Items[item]))
Sebastian Waldbauer
  • 674
  • 1
  • 10
  • 17
0

I came up with a quite concise solution, based on multi-level grouping, which in my opinion is to a great extent pandasonic.

Start from defining the following function, "splitting" a Series taken from individual values element into a sequence of lists representations, without surrounding [ and ]. The splitting occurs at each '|' element.:

def fn(grp1):
    grp2 = (grp1 == '|').cumsum()
    return grp1[grp1 != '|'].groupby(grp2).apply(lambda x: repr(list(x))[1:-1])

(will be used a bit later).

The first step of processing is to convert id column into a Series:

sId = df.id.apply(lambda x: pd.Series(x.split(','))).stack().rename('ID')

For your data the result is:

0  0    1
   1    2
   2    3
   3    4
Name: ID, dtype: object

The first level of MultiIndex is the index of the source row and the second level are consecutive numbers (within the current row).

Now it's time to perform similar conversion of values column:

sVal = pd.DataFrame(df['values'].values.tolist(), index= df.index)\
    .stack().groupby(level=0).apply(fn).rename('Values')

The result is:

0  0      ('a', 'b'), ('as', 'bd')
   1    ('ss', 'dd'), ('ws', 'ee')
   2    ('rr', 'rt'), ('tt', 'yy')
   3    ('yu', 'uu'), ('ii', 'oo')
Name: Values, dtype: object

Note that the MultiIndex above has the same structure as in the case of sId.

And the last step is to concat both these partial results:

result = pd.concat([sId, sVal], axis=1).reset_index(drop=True)

The result is:

  ID                      Values
0  1    ('a', 'b'), ('as', 'bd')
1  2  ('ss', 'dd'), ('ws', 'ee')
2  3  ('rr', 'rt'), ('tt', 'yy')
3  4  ('yu', 'uu'), ('ii', 'oo')
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41