1

I have a df and I would like to use 2 of the columns (sample, var) so that the unique values from var column become the index and the unique values from sample other col become the new column names. I would then like to populate the table with 'True' wherever there was a row where a given sample and var co-occured in the original df and populate with na where sample and var did not co-occur.

sample, var
s1, v1
s1, v2
s2, v1
s2, v3

Would become:

    s1, s2
v1, T,  T
v2, T,  na
v3, na, T

I apologise if there is already a question that answers this already. I am new to pandas and am not sure the technical words to search. I tried this possible anser but it didn't work for me and returned a table with samples still in a column rather than as col-headers like this:

pivot = df_all.pivot(index='VAR', columns='SAMPLE').stack(dropna=False)
print(pivot.head(20))

var, sample
v1, s1
v1, s2
v2, s1
v3, s2
Empty dataframe
Columns: []
Index: []

I also tried this:

df_all['MUT']=True
pivot = df_all.pivot(index='VAR', columns='SAMPLE', values='MUT').stack(dropna=False)
print(pivot.head(20))

This returned the right information but the samples were still in a column rather than being the names of the columns as I expected.

The aim is to ultimately make a heatmap so if there is a better solution to achieve this then that would also be fine. I figure the answer is something simple that I missed but I've tried searching and can't find it. I could probably solve this using iteration but am really looking for a vector/pandas type approach. Many thanks.

user3062260
  • 1,584
  • 4
  • 25
  • 53

3 Answers3

3

I will recommend pd.crosstab

pd.crosstab(df['var'],df['sample']).replace({1:'T',0:np.nan})

sample   s1   s2
var             
v1        T    T
v2        T  NaN
v3      NaN    T
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Thanks for this, its really helpful a to have the extra functionality of being able to replace how the table is filled with a custom string. I would put this as the accepted answer but Chris's answer is the closest fix for my original code. – user3062260 Jul 16 '17 at 10:45
  • @user3062260 glad it help, I give back my answer for help not for reputation ~ have a nice day – BENY Jul 16 '17 at 14:45
2

You can use assign for new True column, then reshape by set_index with unstack:

df1 = df.assign(a=True).set_index(['var', 'sample'])['a'].unstack().replace({None:np.nan})
print (df1)
sample    s1    s2
var               
v1      True  True
v2      True   NaN
v3       NaN  True

Solution with pivot - for values is created True Series with same lenght as df:

df1 = pd.pivot(index=df['var'], 
               columns=df['sample'], 
               values=pd.Series([True] * len(df.index))) \
        .replace({None:np.nan})
print (df1)
sample    s1    s2
var               
v1      True  True
v2      True   NaN
v3       NaN  True

If duplicates, solution is same, only add first drop_duplicates:

print (df)
  sample var
0     s1  v1 <-dupe
1     s1  v1 <-dupe
2     s1  v2
3     s2  v1
4     s2  v3

df = df.drop_duplicates()
print (df)
  sample var
0     s1  v1
2     s1  v2
3     s2  v1
4     s2  v3
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • This is a really useful and well explained answer but is quite complicated compared to the other 2 solutions. The drop duplicates is a great extra thought though. Many thanks – user3062260 Jul 16 '17 at 10:50
2
df = pd.DataFrame({'sample': ['s1', 's1', 's2', 's2'], 'var': ['v1', 'v2', 'v1', 'v3']})
df['MUT'] = True
df =df.pivot(index='var', columns='sample')

print(df)

Output:

         MUT      
sample    s1    s2
var               
v1      True  True
v2      True  None
v3      None  True
Chris
  • 15,819
  • 3
  • 24
  • 37