I would like to group by specific columns and count rows within each group, using the data frame "foo.txt" below as an example:

label   type    var1    var2
A       name1   3       21
A       name1   2       18
A       name2   10      23
B       name3   6       19
C       name4   12      11
C       name4   4       9
C       name5   20      13
C       name5   1       5
C       name6   12      12

I wish to group by "label" and count the unique values of "type" in each group, giving the output below:

label   number
A       2
B       1
C       3

With the dplyr R package, the code below produces this output:

    library(dplyr)

    data <- read.table("foo.txt", header=T)
    data

    data2 <- data %>%
            group_by(label) %>%
            summarise(number=NROW(unique(type)))
    as.data.frame(data2)
  label number
1     A      2
2     B      1
3     C      3

In Python, I would like to do the same thing using the dplython module, with the code below:

    import pandas as pd
    from dplython import *

    data = pd.read_csv("foo.txt", sep="\t")

    data = DplyFrame(data)

    data2 = (data >>
            group_by(X.label) >>
            summarize(number=len(X.type.unique())))
    data2

However, I received the error below:

    Traceback (most recent call last):
      File "<stdin>", line 3, in <module>
    TypeError: object of type 'Later' has no len()

How can I get the same output using dplython? Thanks in advance.

Ronak Shah
bison72

1 Answer

I switched to the plydata Python module, and it works for me.

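    from plydata import *

    # 'contig' and 'strands' do not appear in foo.txt; for that file the
    # corresponding names would be 'label' and 'number'.
    data2 = (data >>
            group_by('contig') >>
            define(strands='len(type.unique())'))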
bison72
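
For comparison, the same per-group count of unique values can also be obtained with plain pandas, without dplython or plydata. The snippet below is a minimal sketch, not part of the original answer: it assumes the table from the question, inlines it through `io.StringIO` so the example is self-contained, and uses `FOO_TXT` and `result` as illustrative names only.

```python
import io

import pandas as pd

# Inline copy of the foo.txt table from the question (assumption: the real
# file is whitespace- or tab-separated with the same header).
FOO_TXT = """label type var1 var2
A name1 3 21
A name1 2 18
A name2 10 23
B name3 6 19
C name4 12 11
C name4 4 9
C name5 20 13
C name5 1 5
C name6 12 12
"""

data = pd.read_csv(io.StringIO(FOO_TXT), sep=r"\s+")

# Group by "label", count distinct "type" values per group, and turn the
# resulting Series back into a two-column DataFrame.
result = (
    data.groupby("label")["type"]
    .nunique()
    .reset_index(name="number")
)

print(result)  # one row per label: A -> 2, B -> 1, C -> 3
```

`nunique()` counts distinct values directly, so there is no need to call `len()` on the result of `unique()`.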