I would like to group by specific columns and count the number of rows in each group. Using the data frame "foo.txt" below as an example:
label type var1 var2
A name1 3 21
A name1 2 18
A name2 10 23
B name3 6 19
C name4 12 11
C name4 4 9
C name5 20 13
C name5 1 5
C name6 12 12
I wish to group by "label" and count the unique values of "type" in each group, giving the output below:
label number
A 2
B 1
C 3
Using the dplyr R package, I can get this output with the code below:
library(dplyr)
data <- read.table("foo.txt", header=T)
data
data2 <- data %>%
group_by(label) %>%
summarise(number=NROW(unique(type)))
as.data.frame(data2)
label number
1 A 2
2 B 1
3 C 3
In Python, I would like to do the same thing using the dplython module, with the code below:
import pandas as pd
from dplython import *
data = pd.read_csv("foo.txt", sep="\t")
data = DplyFrame(data)
data2 = (data >>
group_by(X.label) >>
summarize(number=len(X.type.unique())))
data2
However, I received the error below:
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
TypeError: object of type 'Later' has no len()
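Judging from the message, X.type.unique() seems to be a deferred `Later` expression rather than an actual Series at the point where len() is called on it. For reference, plain pandas does produce the result I am after. This is only a minimal sketch of the target output, not a dplython solution; it assumes the sample rows are inlined and whitespace-separated (my real file may be tab-separated), and the name `raw` is just for illustration:

```python
import io
import pandas as pd

# Sample rows from foo.txt, inlined so the snippet runs on its own.
# Assumption: whitespace-separated here; the real file may use tabs.
raw = """label type var1 var2
A name1 3 21
A name1 2 18
A name2 10 23
B name3 6 19
C name4 12 11
C name4 4 9
C name5 20 13
C name5 1 5
C name6 12 12
"""
data = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Count the distinct "type" values within each "label" group,
# then turn the result back into a two-column data frame.
data2 = (data.groupby("label")["type"]
             .nunique()
             .reset_index(name="number"))
print(data2)
```

This gives one row per label with the counts 2, 1 and 3, matching the dplyr output above.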
How can I get the same output using dplython? Thanks in advance.