
I'm using Python with pandas, and I have a dataset of 6 columns: R, Rc, J, T, Ca and Cb. I need to "aggregate" on the columns "R" then "J", so that for each R, each row is a unique "J". Rc is a characteristic of R; Ca and Cb are characteristics of T. It will make more sense looking at the table below.

I need to go from:

#______________________            ________________________________________________________________
#| R  Rc  J  T  Ca  Cb|           |# R  Rc  J  Ca(T=1)  Ca(T=2)  Ca(T=3)  Cb(T=1)  Cb(T=2)  Cb(T=3)|
#| a   p  1  1  x    d|           |# a  p   1    x         y        z        d        e        f   |
#| a   p  1  2  y    e|           |# b  o   1    w                           g                     |  
#| a   p  1  3  z    f|  ----->   |# b  o   2    v                           h                     | 
#| b   o  1  1  w    g|           |# b  o   3    s                           i                     |
#| b   o  2  1  v    h|           |# c  n   1    t         r                 j        k            |
#| b   o  3  1  s    i|           |# c  n   2    u                           l                     |
#| c   n  1  1  t    j|           |________________________________________________________________|
#| c   n  1  2  r    k|           
#| c   n  2  1  u    l|
#|____________________|

import pandas as pd

data = {'R' : ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
        'Rc': ['p', 'p', 'p', 'o', 'o', 'o', 'n', 'n', 'n'],
        'J' : [1, 1, 1, 1, 2, 3, 1, 1, 2], 
        'T' : [1, 2, 3, 1, 1, 1, 1, 2, 1], 
        'Ca': ['x', 'y', 'z', 'w', 'v', 's', 't', 'r', 'u'],
        'Cb': ['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']}

df = pd.DataFrame(data=data)

I don't want to lose the data in Rc, Ca, or Cb.

Rc (or each column that ends in 'c') is the same for each R, so that can just be grouped with R.

But Ca and Cb (or each column that starts with 'C') are unique for each T, so they would otherwise be lost in the aggregation. They instead need to be saved in new columns named Ca(T=1) for when T=1, Ca(T=2) for when T=2, and Ca(T=3) for when T=3. The same goes for Cb.

So, using T, I need to create one new column per value of T for each of Ca and Cb, and write the data from Ca and Cb into those new columns, as illustrated below.
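
To make the target concrete, for T=1 the idea is something like this (only illustrating the intended column name; it doesn't collapse the rows per R and J, which is the part I'm stuck on):

df.loc[df['T'] == 1, 'Ca(T=1)'] = df.loc[df['T'] == 1, 'Ca']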

P.S. If it helps, columns J and T each have an extra column with unique IDs:

J_ID = [1,1,1,2,3,4,5,5,6]
T_ID = [1,2,3,4,5,6,7,8,9]

What I tried so far:

(
    df.groupby(['R','J'])
    .apply(lambda x: x.Ca.tolist())        # collect the Ca values of each (R, J) group into a list
    .apply(pd.Series)                      # expand each list into columns 0, 1, 2, ...
    .rename(columns=lambda x: f'Ca{x+1}')  # rename them Ca1, Ca2, Ca3
    .reset_index()
)

Problem: this only works for one of the C columns at a time, and I lose Rc.

Any help would be greatly appreciated!


2 Answers


You can use pivot_table (see the docs) with a lambda function as the aggfunc argument:

table = pd.pivot_table(df, index=['R', 'Rc', 'J'], values=['Ca', 'Cb'],
                       columns=['T'], fill_value='',
                       aggfunc=lambda x: ''.join(str(v) for v in x)).reset_index()


   R Rc  J Ca       Cb      
T           1  2  3  1  2  3
0  a  p  1  x  y  z  d  e  f
1  b  o  1  w        g      
2  b  o  2  v        h      
3  b  o  3  s        i      
4  c  n  1  t  r     j  k   
5  c  n  2  u        l      

Then you can flatten the MultiIndex columns and rename them as follows (taken from this great answer):

table.columns = ['%s%s' % (a, ' (T = %s)' % b if b else '') for a, b in table.columns]

   R Rc  J Ca (T = 1) Ca (T = 2) Ca (T = 3) Cb (T = 1) Cb (T = 2) Cb (T = 3)
0  a  p  1          x          y          z          d          e          f
1  b  o  1          w                                g                      
2  b  o  2          v                                h                      
3  b  o  3          s                                i                      
4  c  n  1          t          r                     j          k           
5  c  n  2          u                                l                      
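
If each (R, Rc, J, T) combination occurs at most once, as in the example data, passing aggfunc='first' should give the same result; a minimal sketch (an assumption about the data, not part of the original answer):

table = pd.pivot_table(df, index=['R', 'Rc', 'J'], values=['Ca', 'Cb'],
                       columns=['T'], fill_value='', aggfunc='first').reset_index()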
  • This returns the error: "TypeError: '<' not supported between instances of 'int' and 'datetime.datetime'" – nielsen Apr 20 '20 at 17:25
  • @nielsen What version of pandas are you using? Also, do you have a datetime column in your dataframe? If so, please update the mwe so that it represents the data you are working on =), then I'll fix the answer. I tested this on the mwe you provided and it works fine. – CAPSLOCK Apr 21 '20 at 06:55
  • Looks like I'm using version '0.25.1'. There are no datetime variables. I googled a bit and apparently this error normally shows up when you try to compare (<, >, =) two arguments that are not comparable, like "is a > 14?" or "is 90 < [70, 89, 100, 110]?", but that makes no sense since we are not doing that here. Someone else had a similar problem, but the solution does not seem to help me: https://stackoverflow.com/questions/47076050/merging-and-transposing-columns-in-python-gives-typeerror-not-supported-bet – nielsen Apr 21 '20 at 09:18
  • I thought I added a @. @Gio – nielsen Apr 21 '20 at 09:31
  • @nielsen the first suggestion I have is to update to pandas 1.0.3 and see if this fixes the problem. If it doesn't then it has to be something that makes your dataframe different than the mwe you posted here. If you check what the difference is and update your question I'll try to fix the answer – CAPSLOCK Apr 21 '20 at 12:13
  • Updating it didn't work. I tried removing all the "object" type variables from my list of columns I included in the index and values lists, and it removed the error, but it created an empty dataframe. Could it be that there are NaN? Could it be that the order of the columns in the list of columns is wrong? Does it matter? – nielsen Apr 22 '20 at 12:17
  • Hey @Gio. I found a solution that works. The Rc's were added later, so I simply took the data before those were added and used R. Then I merged the Rc's in afterwards. It doesn't explain the problem, but I did come out the other side happy. Something was wrong with the respondent characteristics. – nielsen Apr 22 '20 at 20:32
  • @nielsen Great to hear! You could check the data type of Rc using `df.Rc.dtype`. That might give us a hint of the problem with this variable. – CAPSLOCK Apr 23 '20 at 05:27
  • It's more than one variable and they vary from float, to integer to string. (none are dates though). – nielsen Apr 23 '20 at 08:33
  • Btw, all the variables showed up to look like this: R ---> ('R', '') – nielsen Apr 23 '20 at 08:38

If I understand what you need, you can simply locate the needed rows like this:

df['Ca(T=1)']=df['Ca'].loc[df['T']==1]

You have to repeat this for the different values of T; see the sketch below.
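
A minimal sketch of that loop, followed by a groupby to collapse the rows to one per (R, Rc, J) (using first() assumes each new column holds at most one non-missing value per group):

for t in df['T'].unique():
    df[f'Ca(T={t})'] = df['Ca'].loc[df['T'] == t]
    df[f'Cb(T={t})'] = df['Cb'].loc[df['T'] == t]

# collapse to one row per (R, Rc, J); first() keeps the first non-missing value in each column
result = df.drop(columns=['T', 'Ca', 'Cb']).groupby(['R', 'Rc', 'J'], as_index=False).first()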

  • Ah I see, and then after I do that, I can simply groupby? (The solution also requires a reduction in rows according to J) – nielsen Apr 20 '20 at 09:31
  • 1
    yes, after you apply the function, you can do what ever you want the dataframe :) – coco18 Apr 20 '20 at 09:36