Reshaping pandas DataFrame Python 3.x

Question

I have a pandas like so:

df =
cid    pid     purl    tid   turl     sid   surl
c1      p1     urlp1   t1    urlt1    s1    urls1
c1      p1     urlp1   t1    urlt1    s2    urls2
c1      p1     urlp1   t1    urlt1    s3    urls3
c2      p2     urlp2   t2    urlt2    s5    urls5
c2      p2     urlp2   t2    urlt2    s6    urls6

What I want is a result something like this:

cid     uid      url
c1      p1      urlp1 
c1      t1      urlt1
c1      s1      urls1   
c1      s2      urls2
c1      s3      urls3
c2      p2      urlp2   
c2      t2      urlt2
c2      s5      urls5
c2      s6      urls6

I tried using pd.melt to achieve this but I could only do it partially like so:

df2 = pd.melt(df, id_vars = 'cid', value_vars = ['pid','tid','purl'], value_name = 'userid')

How to get more columns? I need more columns for value_name? How can I achieve this?

You can split first dataframe into three data frame with these columns. `[cid, pid, urlp]`, `[cid, tid, urlt]`, `['cid, sid, urls']`. Once split you can merge all of these in one dataframe using cid. — Aditya, Aug 07 '18 at 09:13
Possible duplicate of [Simultaneously melt multiple columns in Python Pandas](https://stackoverflow.com/questions/51519101/simultaneously-melt-multiple-columns-in-python-pandas) — iacob, Aug 07 '18 at 09:20

jpp · Accepted Answer · 2018-08-07T09:22:41.757

One manual solution is to use a list comprehension. The steps are:

Set your index to the key column cid.
Calculate the length of your list comprehension loop from the number of columns.
Use concat with your list of dataframes and rename columns via pipe [add other formatting if required].
Finally, reset_index (to elevate the index to a series) and drop_duplicates.

Here's a demo:

df = df.set_index('cid')

def formatter(df):
    df.columns = ['uid', 'url']
    return df

n = int(len(df.columns) / 2)
L = [df.iloc[:, 2*i:2*(i+1)].pipe(formatter) for i in range(n)]
res = pd.concat(L, axis=0).reset_index().drop_duplicates()

print(res)

   cid uid    url
0   c1  p1  urlp1
3   c2  p2  urlp2
5   c1  t1  urlt1
8   c2  t2  urlt2
10  c1  s1  urls1
11  c1  s2  urls2
12  c1  s3  urls3
13  c2  s5  urls5
14  c2  s6  urls6

score 1 · Answer 2 · answered Aug 07 '18 at 09:21

Not sure this is the most straightforward way to do it but this is what I can think of:

import pandas as pd
from io import StringIO

s = """cid    pid     purl    tid   turl     sid   surl
c1      p1     urlp1   t1    urlt1    s1    urls1
c1      p1     urlp1   t1    urlt1    s2    urls2
c1      p1     urlp1   t1    urlt1    s3    urls3
c2      p2     urlp2   t2    urlt2    s5    urls5
c2      p2     urlp2   t2    urlt2    s6    urls6"""

df = pd.read_table(StringIO(s), sep='\\s+', header=0)
df2 = df.set_index('cid')
df3 = pd.concat([df2[['pid', 'purl']].rename(columns={'pid': 'uid', 'purl': 'url'}),
                 df2[['tid', 'turl']].rename(columns={'tid': 'uid', 'turl': 'url'}),
                 df2[['sid', 'surl']].rename(columns={'sid': 'uid', 'surl': 'url'})],
                axis=0)
result = df3.drop_duplicates().sort_index().reset_index()
print(result)

Output:

  cid uid    url
0  c1  p1  urlp1
1  c1  t1  urlt1
2  c1  s1  urls1
3  c1  s2  urls2
4  c1  s3  urls3
5  c2  p2  urlp2
6  c2  t2  urlt2
7  c2  s5  urls5
8  c2  s6  urls6

Thanks a lot guys, both answers are correct. I think that the first one is actually a better solution though and hence accepted. — Karan Gupta, Aug 07 '18 at 09:30
@KaranGupta Yes jpp's answer is cleaner and more extensible. — jdehesa, Aug 07 '18 at 09:31

score 0 · Answer 3 · answered Oct 18 '21 at 06:48

The columns have a pattern to them -> some end in id, some with url.

We can use this pattern to reshape the data, and remove duplicates.

This is only an alternative, and relies on pivot_longer from pyjanitor to help in the reshaping process:

# pip install pyjanitor
import pandas as pd
import janitor as jn
(df.pivot_longer(index = 'cid', 
                 names_to = ('uid', 'url'), 
                 names_pattern = (r".+id", r".+url"), 
                 sort_by_appearance = True)
   .drop_duplicates(ignore_index = True)
)
  cid uid    url
0  c1  p1  urlp1
1  c1  t1  urlt1
2  c1  s1  urls1
3  c1  s2  urls2
4  c1  s3  urls3
5  c2  p2  urlp2
6  c2  t2  urlt2
7  c2  s5  urls5
8  c2  s6  urls6

Reshaping pandas DataFrame Python 3.x

3 Answers3