
I have the following dataframe (79000 rows):

ID       P1      P2      P3      P4        P5        P6      P7     P8  
1       38005   28002   38005   38005    28002    34002      NA     NA
2       28002   28002   28002   38005    28002    NA         NA     NA

I want to count the number of times each number (code) appears in a row of the data frame. So the output would be something like this:

38005 appears 3   28002 appears 2    34002 appears 1     NA appears 2 
28002 appears 4   38005 appears 1    NA appears 3 

So far I tried to find the most frequent number (code):

df$frequency <- apply(df, 1, function(x) names(which.max(table(x))))

But I don't know how to count the number of times each number (code) appears in a row.
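A rough sketch of what dropping which.max() from the attempt above would return (one table of counts per row, ID column excluded), though not yet in the layout wanted:

apply(df[-1], 1, function(x) table(x, useNA = "ifany")) #one table of counts per row, NAs included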

tmfmnk
user10672282

4 Answers

1

Using the tidyverse and reshape2 packages you can do:

library(tidyverse)
library(reshape2)

df %>%
 gather(var, val, -ID) %>% #Transforming the data from wide to long format
 group_by(val, ID) %>% #Grouping
 summarise(count = n()) %>% #Performing the count
 dcast(ID ~ val, value.var = "count") #Reshaping the data

  ID 28002 34002 38005 NA
1  1     2     1     3  2
2  2     4    NA     1  3
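As a commenter notes below, tidyr::spread() can stand in for the reshape2::dcast() step. A minimal sketch of that variant, using dplyr::count() as a shortcut for the group_by() + summarise() pair:

df %>%
 gather(var, val, -ID) %>%
 count(ID, val) %>% #count() is shorthand for group_by() + summarise(n = n())
 spread(val, n)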

Showing the first two non-NA columns with the biggest count per ID:

df %>%
 gather(var, val, -ID) %>% #Transforming the data from wide to long format
 group_by(val, ID) %>% #Grouping
 mutate(temp = n()) %>% #Performing the count
 group_by(ID) %>% #Grouping
 mutate(temp2 = dense_rank(temp)) %>% #Creating the rank based on count
 group_by(ID, val) %>% #Grouping
 summarise(temp3 = first(temp2), #Summarising 
           temp = first(temp)) %>%
 arrange(ID, desc(temp3)) %>% #Arranging
 na.omit() %>% #Deleting the rows with NA
 group_by(ID) %>%
 mutate(temp4 = ifelse(temp3 == first(temp3) | temp3 == nth(temp3, 2), 1, 0)) %>% #Identifying the highest and the second highest count
 filter(temp4 == 1) %>% #Selecting the highest and the second highest count
 dcast(ID~val, value.var = "temp") #Reshaping the data

  ID 28002 38005
1  1     2     3
2  2     4     1
tmfmnk
    You can also use `tidyr::spread` instead of `dcast`. `... %>% spread(val, count)`. – Ritchie Sacramento Dec 11 '18 at 07:17
  • Thanks. Actually, my data is not limited to the codes 28002, 34002, 38005 and NA; I have more than 200 codes. What if I want to find the two most frequent ones? So for our example, row one would give 38005 and 28002, and row two would give 28002 and 38005. Can we do that? Can we add four columns to the data frame (two columns for the codes 28002 and 38005 and another two columns for the number of times these codes appear)? – user10672282 Dec 11 '18 at 07:45
  • That's awesome, but can you please help me to put the output into the original data frame (by adding two columns for the highest value and its count and another two columns for the second highest and its count)? – user10672282 Dec 11 '18 at 22:51
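A rough sketch of one way to address that follow-up (not part of the original answer): attach the two most frequent codes per row and their counts back onto the original data frame. It assumes a recent dplyr/tidyr; the resulting column names val_1, val_2, n_1 and n_2 are just what pivot_wider() produces here, not anything from the question.

library(dplyr)
library(tidyr)

top2 <- df %>%
 gather(var, val, -ID) %>%
 filter(!is.na(val)) %>%            #Dropping NAs
 count(ID, val) %>%                 #Counting each code per row (adds a column n)
 arrange(ID, desc(n)) %>%
 group_by(ID) %>%
 slice(1:2) %>%                     #Keeping the two most frequent codes per ID
 mutate(rank = row_number()) %>%
 ungroup() %>%
 pivot_wider(id_cols = ID, names_from = rank,
             values_from = c(val, n)) #-> val_1, val_2, n_1, n_2

left_join(df, top2, by = "ID")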
0
ID <- c("P1","P2","P3","P4","P5","P6","P7","P8",
        "P1","P2","P3","P4","P5","P6","P7","P8")
count <- c("38005","28002","38005","38005","28002","34002","NA","NA",
           "28002","28002","28002","38005","28002","NA","NA","NA")

df <- cbind.data.frame(ID, count)

table(df$count)

Use this code to find out the count of each code across the whole data.

Hunaidkhan
  • Thanks. This is ok for two rows, but as I mentioned I have more than 79000 rows, which makes it difficult to convert to this format. – user10672282 Dec 11 '18 at 07:11
  • This is for two columns, and it will also work for 79000 rows. You can use `as.data.frame(table(df$count))` – Hunaidkhan Dec 11 '18 at 07:15
  • Thanks. Actually, my data is not limited to the codes 28002, 34002, 38005 and NA; I have more than 200 codes. What if I want to find the two most frequent ones? So for our example, row one would give 38005 and 28002, and row two would give 28002 and 38005. Can we do that? Can we add four columns to the data frame (two columns for the codes 28002 and 38005 and another two columns for the number of times these codes appear)? – user10672282 Dec 11 '18 at 07:45
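For illustration (not part of the answer), the one-liner suggested in the comment above, applied to the df built in this answer; note that it still counts over the whole count column rather than per original row:

as.data.frame(table(df$count)) #two columns: Var1 (the code) and Freq (its count)

as.data.frame(sort(table(df$count), decreasing = TRUE)) #same, most frequent code first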
0

I think you're looking for this.

sort(table(unlist(df1[-1])), decreasing=TRUE)
# 31002 38005 24003 34002 28002 
# 13222 13193 13019 13018 12625 

That is, you exclude column 1, which contains the IDs, and unlist() the rest of your data frame into a vector. table() then counts the appearances of each value, which you can also sort(). Set the option decreasing=TRUE and the first two values are the two most frequent ones.

If the output gets too long because there are a lot of values, you can wrap the code in head(). The default length of the output is six, but you can limit it to two by specifying n=2, which gives you exactly what you want. No need for any packages.

head(sort(table(unlist(df1[-1])), decreasing=TRUE), n=2)
# 31002 38005 
# 13222 13193

DATA:

set.seed(42)  # for sake of reproducibility
df1 <- data.frame(id=1:9750,
                  matrix(sample(c(38005, 28002, 34002, NA, 24003, 31002), 7.8e4, 
                                replace=TRUE), nrow=9750,
                         dimnames=list(NULL, paste0("P", 1:8))))
jay.sf
0

A data.table solution:

library(data.table)

#read sample data
dt <- fread( "ID       P1      P2      P3      P4        P5        P6      P7     P8  
1       38005   28002   38005   38005    28002    34002      NA     NA
             2       28002   28002   28002   38005    28002    NA         NA     NA")
#melt
dt.melt <- melt(dt, id = 1, measure = patterns("^P"), na.rm = FALSE)
#and cast
dcast( dt.melt, ID ~ value, fun = length, fill = 0 )

#    ID 28002 34002 38005 NA
# 1:  1     2     1     3  2
# 2:  2     4     0     1  3
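Not part of the original answer, but a rough data.table sketch for the follow-up question in the comments above (the two most frequent codes per row), reusing dt.melt; on the sample data it should give something like the output shown:

#count each code per ID, ignoring NAs
counts <- dt.melt[ !is.na(value), .N, by = .(ID, value) ]
setorder( counts, ID, -N )
#keep the two largest counts per ID
counts[ , head(.SD, 2), by = ID ]

#    ID value N
# 1:  1 38005 3
# 2:  1 28002 2
# 3:  2 28002 4
# 4:  2 38005 1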
Wimpel