2

I have the DF of this kind:

pd.DataFrame({'label':['A','test1: A','test2: A','B','test1: B','test3: B'],
             'value': [1,2,3,4,5,6]})

      label  value
0         A      1
1  test1: A      2
2  test2: A      3
3         B      4
4  test1: B      5
5  test3: B      6

And I need to convert to this:

pd.DataFrame({'label':['A','B'],
             'value': [1,4],
             'test1:':[2,5],
             'test2:':[3,None],
             'test3:':[None,6]})

  label  value  test1:  test2:  test3:
0     A      1       2     3.0     NaN
1     B      4       5     NaN     6.0

I need to keep label for unique value and keys are merged to the right if present in the data. Keys may vary and be of different names for one value.

Feel free to share how to rename the question because I could not find the better way to name the problem.

EDIT: Partly this solution contains what I need however there is no decent way to add columns representing key in the label column. Ideally something like a function with df input is needed.

sumixam
  • 77
  • 1
  • 8
  • What have you tried so far ? – Suraj Jul 12 '20 at 14:36
  • @SurajSubramanian really nothing worked out. Thought about lambda function with apply, however, have no clue how to construct the result:( – sumixam Jul 12 '20 at 14:40
  • Does this answer your question? [Pandas long to wide reshape, by two variables](https://stackoverflow.com/questions/22798934/pandas-long-to-wide-reshape-by-two-variables) – NelsonGon Jul 12 '20 at 14:40
  • 1
    @NelsonGon the link you provided contains mostly the same problem as I have really, but it uses for loop for naming and creating new columns which is not efficient in case of larger data – sumixam Jul 12 '20 at 14:55
  • Right, I normally link duplicates to enable one see if it solves their issues. Won't retract. Feel free to add your comment to your question so someone else knows why this may not work for you. – NelsonGon Jul 12 '20 at 14:57

2 Answers2

2

Extract information into two data frames and merge them.

df2 = df[df['label'].str.contains('test')]
df3 = df2['label'].str.split(expand=True).rename(columns={0: "test", 1: "label"})
df3['value'] = df2['value']
df3 = df3.pivot_table(index='label', columns='test', values='value')
df2 = df[~df['label'].str.contains('test')]
df4 = pd.merge(df2, df3, on='label')

Output

  label  value  test1:  test2:  test3:
0     A      1     2.0     3.0     NaN
1     B      4     5.0     NaN     6.0
Pramote Kuacharoen
  • 1,496
  • 1
  • 5
  • 6
1

Here's a way to do that:

df.loc[~df.label.str.contains(":"), "label"] = df.loc[~df.label.str.contains(":"), "label"].str.replace(r"(^.*$)", r"value:\1")
labels = df.label.str.split(":", expand = True).rename(columns = {0: "label1", 1:"label2"})
df = pd.concat([df, labels], axis=1)
df = pd.pivot_table(df, index="label2", columns="label1", dropna=False)
df.columns = [c[1] for c in df.columns]
df.index.name = "label"

The output is:

       test1  test2  test3  value
label                            
A        2.0    3.0    NaN    1.0
B        5.0    NaN    6.0    4.0
Roy2012
  • 11,755
  • 2
  • 22
  • 35
  • I have a bit different out with val columns concatenated below. Also, do you know how to remove the label1,label2? There is some truth in your solution:) – sumixam Jul 12 '20 at 15:02
  • Slightly improved - changed the column names to drop the extra level. – Roy2012 Jul 12 '20 at 15:06
  • See the edits. This all makes sense, however, produces a bit different output in pivoting i think – sumixam Jul 12 '20 at 15:53