0

I have a tab separated file like this example:

small example:

chr5    112312630   112312650   31  chr5    112312630   112321662   DCP2    ENST00000543319.1
chr5    137676883   137676900   123 chr5    137676883   137676949   FAM53C  ENST00000434981.2
chr5    137676900   137676949   42  chr5    137676883   137676949   FAM53C  ENST00000434981.2
chr5    139944400   139944450   92  chr5    139944064   139946344   SLC35A4 ENST00000323146.3
chr5    139945450   139945500   77  chr5    139944064   139946344   SLC35A4 ENST00000323146.3

I want to group the lines based on 5th, 6th and 7th columns and sum the values of 4th column in each group. here is the expected output:

expected output:

chr5    112312630   112312650   31  chr5    112312630   112321662   DCP2    ENST00000543319.1
chr5    137676900   137676949   165 chr5    137676883   137676949   FAM53C  ENST00000434981.2
chr5    139944400   139944450   169 chr5    139944064   139946344   SLC35A4 ENST00000323146.3

I am trying to do that in python using the following command but it does not really work. do you know how to fix it?

import pandas as pd
df = pd.read_csv('myfile.txt', sep='\t', header=None)
df = df.groupby(5, 6, 7, 8).sum()
meW
  • 3,832
  • 7
  • 27
user10657934
  • 137
  • 10
  • Possible duplicate of [Pandas group-by and sum](https://stackoverflow.com/questions/39922986/pandas-group-by-and-sum) – Dani Mesejo Jan 02 '19 at 12:11

4 Answers4

1

You just have to group the cols:

df.groupby([5,6,7,8]).sum()
B. M.
  • 18,243
  • 2
  • 35
  • 54
  • No, it aggregate all numeric columns by `sum` and OP need only `4.` column, but also is necessary not lost all another columns - check my solution – jezrael Jan 02 '19 at 13:11
  • I put OP on the way ... After of course if can be tuned. – B. M. Jan 02 '19 at 13:35
1

You need aggregate by DataFrameGroupBy.agg with dictionary of columns with aggregated functions, here all column different by cols are aggregate by last or first, only 4 column is aggregated by sum:

cols = [5, 6, 7, 8]
d = dict.fromkeys(df.columns.difference(cols), 'last')
d[3] = 'sum'
print (d)
{0: 'last', 1: 'last', 2: 'last', 3: 'sum', 4: 'last'}

df = df.groupby([5, 6, 7, 8], as_index=False).agg(d).reindex(columns=df.columns)
print (df)
      0          1          2    3     4          5          6        7  \
0  chr5  112312630  112312650   31  chr5  112312630  112321662     DCP2   
1  chr5  137676900  137676949  165  chr5  137676883  137676949   FAM53C   
2  chr5  139945450  139945500  169  chr5  139944064  139946344  SLC35A4   

                   8  
0  ENST00000543319.1  
1  ENST00000434981.2  
2  ENST00000323146.3  

cols = [5, 6, 7, 8]
d = dict.fromkeys(df.columns.difference(cols), 'first')
d[3] = 'sum'
print (d)
{0: 'first', 1: 'first', 2: 'first', 3: 'sum', 4: 'first'}

df = df.groupby([5, 6, 7, 8], as_index=False).agg(d).reindex(columns=df.columns)
print (df)
      0          1          2    3     4          5          6        7  \
0  chr5  112312630  112312650   31  chr5  112312630  112321662     DCP2   
1  chr5  137676883  137676900  165  chr5  137676883  137676949   FAM53C   
2  chr5  139944400  139944450  169  chr5  139944064  139946344  SLC35A4   

                   8  
0  ENST00000543319.1  
1  ENST00000434981.2  
2  ENST00000323146.3  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

Try this:

df.groupby(['column'])[['another column']].sum()

It groups by column and add sum of another column. I used [] so that you understand you can group by multiple columns, like this:

df.groupby(['column1', 'column2'])
Mehrdad Pedramfar
  • 10,941
  • 4
  • 38
  • 59
0

Input Dataframe: considering only first 3 rows,

data = {'col1': ['chr5', 'chr5', 'chr5'],
        'col2': [112312630,137676883,137676900],
        'col3': [112312650,137676900,137676949],
        'col4': [31, 123,42],
        'col5': ['chr5', 'chr5', 'chr5'],
        'col6': [112312630 ,137676883 ,137676883 ],
        'col7': [112321662, 137676949, 137676949],
        'col8': ['DCP2', 'FAM53C', 'FAM53C'],
       'col9': ['ENST00000543319.1', 'ENST00000434981.2', 'ENST00000434981.2']
       }

df = pd.DataFrame(data = data)
df 

Do like this,

cols = ['col5', 'col6', 'col7', 'col8']
col_sum = df.groupby(cols)['col4'].sum()
col_sum

Output: this is a multi-level dataframe. Last column is your output,

col5  col6       col7       col8  
chr5  112312630  112321662  DCP2       31
      137676883  137676949  FAM53C    165
Abdur Rehman
  • 1,071
  • 1
  • 9
  • 15