2

Many have referred me to this solution Pandas Merging 101. However, unlike the issue raised by the Pandas Merging 101, I am handling a single dataframe that needs to convert a specific column (Variable in my case) as a header for the Values Columns, while at the same time maintaining the Class column.

The dataframe has a Values column, a Variable column consisting of repetitive: var1,var2,var3 and a Class column of the repetitive classes:1,2,3,4,5 as shown below:

     Values    Variable   Class
0   1.274143     var1     1
1   0.120940     var2     1
2   2.031863     var3     1
3  -0.437285     var1     2
4  -0.530028     var2     2
5   1.057256     var3     2
6   2.540227     var1     3
7   4.169430     var2     3
8   3.561490     var3     3
9  -3.153675     var1     4
10 -4.907512     var2     4
11  0.419970     var3     4
12  1.788764     var1     5
13  0.901584     var2     5
14  0.823103     var3     5
15  1.740886     var1     1
16  3.215278     var2     1
17 -1.235741     var3     1
18  2.167600     var1     2
19  0.322950     var2     2
20  2.086752     var3     2
21  -0.708104    var1     3
22     .           .      .
23     .           .      .
24     .           .      .

I want the output to have the structure as follows:

    var1      var2     var3     Class
  1.274143  0.120940  2.031863   1
  1.740886  3.215278 -1.235741   1
   value      value      value   1
   value      value      value   1
   value      value      value   1
     .        .            .     .
     .        .            .     .
     .        .            .     .
-0.437285  -0.530028  1.057256   2
 2.167600  0.322950  2.086752    2
   value      value      value   2
   value      value      value   2
   value      value      value   2
     .        .            .     .
     .        .            .     .
     .        .            .     .
   value      value      value   3
   value      value      value   3
   value      value      value   3
     .        .            .     .
     .        .            .     .
     .        .            .     .

I tried to use pivot:

print(df.pivot(index = 'Class',columns ='Variable', values =['Values']))
#Note that I want the Class to be a column not as an index

but I got an error result:

ValueError: Index contains duplicate entries, cannot reshape

I also tried pivot table:

table = pd.pivot_table(df, values='Values', columns=['Variable'],index = ['Class'],aggfunc=None)

but my result is:

TypeError: Must provide 'func' or tuples of '(column, aggfunc).

EDIT:

This is a brute forced solution to my problem:

partial_result = pd.DataFrame(df,columns = ['Values','Variable','Class'])
column_headers = partial_result.Variable.unique().tolist()
Len_Var = int(len(column_headers))
result = pd.DataFrame(np.reshape(partial_result['Values'].values,(int(len(partial_result['Values'].values)/Len_Var),Len_Var)),columns = column_headers)
labels= pd.DataFrame(np.reshape(partial_result['Class'].values,(int(len(partial_result['Class'].values)/Len_Var),Len_Var)))
result['Class'] = labels[0]
print(result)

The code works but if you have a more elegant solution, would be great to look into it.

Gerard
  • 518
  • 4
  • 19
  • `pivot` works just fine on your sample data. Probably your full data is not consistent. Try `pivot_table` with `aggfunc="first"`? – Henry Yik Sep 13 '20 at 17:15

3 Answers3

0

Could it be possible that you have several observation per classes for the same variable ? It seems to be the issue in the first place.

If it is the case, you have to precise how you want your several observations aggregated.

cdorid
  • 21
  • 5
  • Yes, I think this might have caused the issue, I prefer not to aggregate the observations as each variable represents a point in the feature space. – Gerard Sep 13 '20 at 17:38
0

This is what you need:

pd.pivot_table(df,index = 'Class',columns ='Variable')
IoaTzimas
  • 10,538
  • 2
  • 13
  • 30
0

EDIT: I am adding an alternate solution here after understanding the problem better, i.e. assuming that every 3 rows in the original dataframe (var1, var2, var3 - Class) actually represent a point in 3-dimensional space. In this scenario, I suggest you first create an "ID" column that will let us group each 3-row group together, so we know how to group them later when unstacking or pivoting (therefore answering the question Which 3 values should go into each row?).

df['id'] = np.divmod(np.arange(len(df)), 3)[0] + 1 
df = df.set_index(['id', 'Variable', 'Class']).unstack('Variable') # similar to pivot

If you also want to manipulate the dataframe a bit to bring it to the format you needed, you can also run:

df.columns = [x[1] for x in df.columns] # replace the MultiIndex column names if you don't need them
df.reset_index(inplace=True) # Now you're back in the format you needed originally, with `Class` as a column and each row as a point in space

ORIGINAL ANSWER:

So, with the data you have provided, it seems like it doesn't need to be aggregated in some way. However, to get the error you got (about duplication), it means that you probably have multiple Value entries for each Class-Variable combinations. So you need to choose how you want to aggregate them, i.e. what function you want to apply. For example, you could do the following:

pd.pivot_table(data=df, index='Class', columns='Variable', aggfunc='mean') # you could also get the median, max, etc. 

This will return the column names as multi-index, and the Class as index. To fix that, I would do:

df.columns = [x[1] for x in df.columns] # gets the 2nd value of the MultiIndex, like var1, var2...
df.reset_index(inplace=True) # inplace=True if you want to replace it permanently

The output should look like this:

        Class   var1        var2        var3
0       1       1.274143    0.120940    2.031863
1       2       -0.437285   -0.530028   1.057256
2       3       2.540227    4.169430    3.561490
3       4       -3.153675   -4.907512   0.419970
tania
  • 2,104
  • 10
  • 18
  • Hi Tania, thanks for the detailed explanation but I cannot do aggregation because the variables var1, var2,var3 represents a unique point in the 3D feature space for the classes. – Gerard Sep 13 '20 at 18:04
  • Hi Gerard, maybe I'm not understanding the problem then well. So, if, for every class, there should be only 1 `var1`, but you have duplicates in your data, how do you want to deal with them? Or should there be no duplicates at all? Is it true that for every class - variable, there should only be 1 unique value in your data? – tania Sep 13 '20 at 18:19
  • Thanks for the reply, each var1,var2,var3 represents a point in space, each point belongs to one of the classes:1,2,3,4 and 5. Imagine a 3D scatter plot with 3 axis composed of var1, var2,var3, and let the class represented by a color [like a cluster plot in 3D space]. – Gerard Sep 13 '20 at 18:25
  • Hi Tania, posted a solution to my problem. I took the idea from your discussion. – Gerard Sep 13 '20 at 19:35
  • I took a look at it again, thanks for the heads up. I think maybe then there's something I'm missing here. If [var1, var2, var3] represents a unique point in space, that can then be mapped to a Class, and there are multiple points [var1, var2, var3] that can be of the same Class, then how do we know which key to group by the vars in this table? When I look at the first dataframe you posted, how do I know what connects each triplet of vars and therefore should tell me to put them in the same row? Is there some "observation ID" missing maybe? (see rows 0-2 and 15-17) – tania Sep 13 '20 at 20:44
  • 1
    @GerardOmpad I had an idea: if your "points" are grouped by the order that they appear in (i.e. rows 0-1-2 are one point, 15-16-17 are another point, and so forth), you could first assign them a unique ID based on this knowledge, and then use the unstack() trick: `df['id'] = np.divmod(np.arange(len(df)),3)[0]+1` `df.set_index(['id','Variable','Class']).unstack('Variable')` – tania Sep 13 '20 at 21:05
  • This is more elegant than my solution. – Gerard Sep 15 '20 at 15:17