Count unique values by column and group of rows

Question

I have this example: df.Journal.Conferences

venue author0 author1 author2 ... author19
A     John    Mary
B     Peter   Jacob   Isabella  
C     Lia
B     Jacob   Lara    John
C     Mary
B     Isabella

I want to know how many unique authors there are in each venue.

Result:

A 2
B 5
C 2

Edit: Here is the link to my data: GoogleDrive Excel sheet.

Mouad_Seridi · Accepted Answer · 2017-05-12T14:24:13.983

Because your data was hard to reproduce, I generated a "similar" data set. This should work:

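set.seed(1984)
# Simulated data: id is the grouping column, v1..v4 hold the values to count
df <- data.frame(id = sample(1:5, 10, replace = TRUE),
                 v1 = sample(letters[1:5], 10, replace = TRUE),
                 v2 = sample(letters[1:5], 10, replace = TRUE),
                 v3 = sample(letters[1:5], 10, replace = TRUE),
                 v4 = sample(letters[1:5], 10, replace = TRUE),
                 stringsAsFactors = FALSE)

# One row per distinct id; n will hold the number of unique values
z <- data.frame(id = unique(df$id), n = NA)

for (i in z$id) {
  # Pool all non-id columns across the rows of this id, then count unique values
  z$n[z$id == i] <- length(unique(unlist(df[df$id == i, -1])))
}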

z
#   id n
# 1  4 4
# 2  3 4
# 3  2 4
# 4  5 4
# 5  1 3
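
One caveat when carrying this loop over to the venue data: `unique()` treats `NA` as a value of its own, so empty author cells stored as `NA` would add one to a venue's count. A minimal sketch of the same loop with the `NA` values dropped first, assuming the question's table is reconstructed as below (the names `venues` and `counts` are made up for illustration):

```r
# Hypothetical reconstruction of the question's table; NA marks an empty cell
venues <- data.frame(
  venue   = c("A", "B", "C", "B", "C", "B"),
  author0 = c("John", "Peter", "Lia", "Jacob", "Mary", "Isabella"),
  author1 = c("Mary", "Jacob", NA, "Lara", NA, NA),
  author2 = c(NA, "Isabella", NA, "John", NA, NA),
  stringsAsFactors = FALSE
)

counts <- data.frame(venue = unique(venues$venue), n = NA_integer_)

for (v in counts$venue) {
  # Pool every author column for the rows of this venue
  authors <- unlist(venues[venues$venue == v, -1])
  # Drop NA before counting so empty cells are not counted as an author
  counts$n[counts$venue == v] <- length(unique(authors[!is.na(authors)]))
}

counts
#   venue n
# 1     A 2
# 2     B 5
# 3     C 2
```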

edited May 12 '17 at 14:24

answered May 11 '17 at 20:19


Didn't work. If I have more than one row of a venue, it brings both rows with different values. – ABueno May 11 '17 at 20:27
sorry about that, I didn't realize the venues were not distinct, I edited the answer. – Mouad_Seridi May 11 '17 at 20:37
I spotted an error, I edited again, check the last version. – Mouad_Seridi May 11 '17 at 20:38
In your example, it should return: `id = n, 4 = 4, 3 = 4, 2 = 4, 5 = 4, 1 = 3` – ABueno May 11 '17 at 21:07
I did that a little too quickly, it should be fixed now. – Mouad_Seridi May 12 '17 at 14:24
You're welcome. Consider hitting the check mark to indicate your satisfaction with the answer. – Mouad_Seridi May 12 '17 at 18:38

zx8754 · Answer 2 · 2017-05-12T21:29:05.667

Using dplyr and tidyr, reshape the data from wide to long, then group by venue and count the distinct author names.

library(dplyr)
library(tidyr)

gather(df1, key = author, value = name, -venue) %>% 
  select(venue, name) %>% 
  group_by(venue) %>% 
  summarise(n = n_distinct(name, na.rm = TRUE))
# # A tibble: 3 × 2
#   venue     n
#   <chr> <int>
# 1     A     2
# 2     B     5
# 3     C     2

data

df1 <- read.table(text ="
venue,author0,author1,author2
A,John,Mary,NA
B,Peter,Jacob,Isabella
C,Lia,NA,NA
B,Jacob,Lara,John
C,Mary,NA,NA
B,Isabella,NA,NA
", header = TRUE, sep = ",", stringsAsFactors = FALSE)
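
In tidyr 1.0.0 and later, `gather()` is superseded by `pivot_longer()`. A sketch of the same wide-to-long approach with the newer function, assuming a recent tidyr is installed; the object names `df_demo` and `res` are made up for illustration, and `values_drop_na = TRUE` is used here in place of `na.rm = TRUE`:

```r
library(dplyr)
library(tidyr)

# Same sample data as above, rebuilt so this block runs on its own
df_demo <- read.csv(text = "venue,author0,author1,author2
A,John,Mary,NA
B,Peter,Jacob,Isabella
C,Lia,NA,NA
B,Jacob,Lara,John
C,Mary,NA,NA
B,Isabella,NA,NA", stringsAsFactors = FALSE)

res <- df_demo %>%
  # One row per (venue, author slot); empty slots are dropped while reshaping
  pivot_longer(cols = -venue, names_to = "author", values_to = "name",
               values_drop_na = TRUE) %>%
  group_by(venue) %>%
  summarise(n = n_distinct(name))

res
```

On the sample data this should give the same counts as the `gather()` version: 2 for A, 5 for B and 2 for C.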

Edit: I saved your Excel sheet as CSV and read it in using read.csv; the code above then returns the output below:

df1 <- read.csv("Journal_Conferences_Authors.csv", na.strings = "#N/A")

# output

# # A tibble: 427 × 2
#                                     venue     n
#                                    <fctr> <int>
# 1                                    AAAI     4
# 2                                     ACC     4
# 3                               ACIS-ICIS     5
# 4  ACM SIGSOFT Software Engineering Notes     1
# 5       ACM Southeast Regional Conference     5
# 6                                ACM TIST     3
# 7       ACM Trans. Comput.-Hum. Interact.     3
# 8                                    ACML     2
# 9                                    ADMA     2
# 10             Advanced Visual Interfaces     3
# # ... with 417 more rows
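
The `na.strings = "#N/A"` argument is what makes the missing cells in the exported sheet count as missing. Without it, `"#N/A"` is read as an ordinary string and is counted as one more author. A small base R sketch of the difference, using a made-up three-row CSV rather than the real file:

```r
# Hypothetical CSV standing in for the exported sheet
csv_text <- "venue,author0,author1
A,John,Mary
B,Peter,#N/A
B,Peter,Lara"

# Default na.strings = "NA": "#N/A" is kept as a plain string
raw   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
# With na.strings = "#N/A" the marker becomes a real NA
clean <- read.csv(text = csv_text, na.strings = "#N/A",
                  stringsAsFactors = FALSE)

is.na(clean$author1)
# [1] FALSE  TRUE FALSE

# Unique authors for venue B under each reading
n_raw   <- length(unique(unlist(raw[raw$venue == "B", -1])))
n_clean <- length(unique(na.omit(unlist(clean[clean$venue == "B", -1]))))
c(raw = n_raw, clean = n_clean)
#   raw clean
#     3     2
```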

Didn't work for me. It returned a 1x1 table with the sum of all items — ABueno, May 12 '17 at 13:27
@ABueno Please provide [reproducible example data](http://stackoverflow.com/questions/5963269) — zx8754, May 12 '17 at 13:29
[dataframe](https://drive.google.com/file/d/0ByL6JlH9HswOc196dnVCNDNZcnM/view?usp=sharing) — ABueno, May 12 '17 at 16:53
@ABueno tested on your data, solution works just fine, see edit on how to read your file into R. — zx8754, May 12 '17 at 21:29

Osdorp · Answer 3 · 2017-05-12T13:34:41.397

Using @zx8754's data for testing, this code gives what you wanted (assuming you have NA for empty cells in the data frame):

sapply(split(df1[,-1], df1$venue), function(x) length(unique(x[!is.na(x)])))
# A B C 
# 2 5 2
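
The one-liner works because indexing a data frame with the logical matrix `!is.na(x)` returns a plain vector of the non-missing cells. A slightly more explicit sketch of the same split-and-count idea, with `unlist()` and `vapply()` swapped in for readability and a fixed return type; the names `df_demo`, `by_venue` and `n_authors` are made up for illustration:

```r
# Same sample data as in the other answer, rebuilt so this block runs on its own
df_demo <- data.frame(
  venue   = c("A", "B", "C", "B", "C", "B"),
  author0 = c("John", "Peter", "Lia", "Jacob", "Mary", "Isabella"),
  author1 = c("Mary", "Jacob", NA, "Lara", NA, NA),
  author2 = c(NA, "Isabella", NA, "John", NA, NA),
  stringsAsFactors = FALSE
)

# One data frame of author columns per venue
by_venue <- split(df_demo[-1], df_demo$venue)

n_authors <- vapply(by_venue, function(block) {
  authors <- unlist(block, use.names = FALSE)  # flatten all author cells
  length(unique(authors[!is.na(authors)]))     # count distinct non-missing names
}, integer(1))

n_authors
# A B C 
# 2 5 2
```

Using `df_demo[-1]` rather than `df_demo[, -1]` keeps the result a data frame even if only one author column remains.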

edited May 12 '17 at 13:34

answered May 12 '17 at 08:20

It must count unique values. The result must be `# A 2 # B 5 # C 2` – ABueno May 12 '17 at 13:29
Sorry, then it should be: `sapply(split(df1[,-1], df1$venue), function(x) length(unique(x[!is.na(x)])))`. I will edit it. – Osdorp May 12 '17 at 13:33
