R Tidy solution to select from group_by output based on a column's data availability

Question

I have the following R dplyr data frame in df_pub (Science/Nature publication data).

Please note that the same PMID (paper) appears in several rows, one per contributing author (author information is not shown here).

I need to select the publications (PMIDs) that have no email attached and store the last observation of each in a data frame.

More precisely, I want to remove every PMID that has an email in any observation. I need to collect the publications (PMIDs) that have no attached email and then find the last author, i.e. the last observation (usually the group leader or PI; we'll contact them manually and ask them to update their email).

So for the example above, the expected output will not contain PMID 22522932 because it has an email attached. For the other PMIDs, only the last row of each will be stored.

I started with this but then got lost:

df_pub %>%
  group_by(pmid) %>%
  filter(is.na(email)) # This does not give the expected result

Without a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), I don't know why that code wouldn't have worked — camille, May 30 '19 at 18:11
This might help: https://nsaunders.wordpress.com/2013/02/13/basic-r-rows-that-contain-the-maximum-value-of-a-variable/ — Pomul, May 30 '19 at 18:17

Giovanni Colitti · Accepted Answer · 2019-05-30T18:53:23.270

1

If I understand correctly, this will do what you want:

df_pub %>% 
  group_by(pmid) %>% 
  filter(!any(!is.na(email)),
         row_number() == n())
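A minimal sketch of why the group-wise test matters, using a small made-up `df_pub` (the `author` column and all values are hypothetical, not the asker's data). The row-wise `filter(is.na(email))` from the question keeps any row whose own email is missing, whereas `any()` reduces each group to a single TRUE/FALSE, so the whole PMID is kept or dropped.

```r
library(dplyr)

# Hypothetical sample data (assumption: not the asker's real df_pub)
df_pub <- tibble(
  pmid   = c(1, 1, 1, 2, 2),
  email  = c(NA, NA, NA, "a@b.com", NA),
  author = c("A", "B", "C", "D", "E")
)

# Row-wise test from the question: each row is judged on its own email,
# so pmid 2 survives through its row with a missing email
row_wise <- df_pub %>%
  group_by(pmid) %>%
  filter(is.na(email)) %>%
  ungroup()

# Group-wise test from the answer: any() yields one value per group,
# so a pmid with at least one email is dropped entirely
group_wise <- df_pub %>%
  group_by(pmid) %>%
  filter(!any(!is.na(email)),
         row_number() == n()) %>%
  ungroup()

print(row_wise)
print(group_wise)
```

With this sample, `row_wise` still contains a row for pmid 2, while `group_wise` holds only the last row of pmid 1.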

edited May 30 '19 at 18:53

answered May 30 '19 at 18:08

Giovanni Colitti

Thanks! I don't know how you figured out the problem from my very lousy description. Cakes and cookies. – ghosh'. May 30 '19 at 18:24
Do you want to keep the last observation in group _and then_ remove PMIDs with emails? So if the last observation in group (PMID) has an email, that PMID will not be included in the final dataset. I edited the answer to reflect this. – Giovanni Colitti May 30 '19 at 18:29
Actually, I want to remove every PMID that has an email in any observation. I need to collect the publications (PMIDs) that have no attached email and then find the last author, i.e. the last observation (usually the group leader or PI; we'll contact them manually and ask them to update their email). – ghosh'. May 30 '19 at 18:36
@jorge-mendes provided the correct solution, I think. So, I am marking their answer as correct. – ghosh'. May 30 '19 at 18:43
I updated the answer again. Let me know if that's what you want. – Giovanni Colitti May 30 '19 at 18:47
Thanks. Yes, both your solution and Jorge Mendes' work and yield the same output. I am not expert enough to tell which solution is tidier. – ghosh'. May 30 '19 at 18:51
Definitely mine. – Giovanni Colitti May 30 '19 at 18:54
Yes. It's yours. Because it is done in a single filter query. – ghosh'. May 30 '19 at 18:58

score 1 · Answer 2 · answered May 30 '19 at 18:38

I think this is what you wanted. It checks which pmids have no email attached and then shows only the last row.

df_pub %>% 
    group_by(pmid) %>% 
    filter(sum(is.na(email)) == n()) %>% # keeps pmids where the number of NAs equals the number of rows
    filter(row_number() == n()) # keeps the last row of each pmid
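Counting the NAs and comparing with `n()` asks the same question as `all(is.na(email))`. The sketch below, on made-up data (the `author` column and all values are assumptions), runs both forms side by side; `slice(n())` is used as an alternative way to take the last row per group.

```r
library(dplyr)

# Hypothetical sample data (assumption: not the asker's real df_pub)
df_pub <- tibble(
  pmid   = c(1, 1, 1, 2, 2, 3),
  email  = c(NA, NA, NA, "a@b.com", NA, NA),
  author = c("A", "B", "C", "D", "E", "F")
)

# Form used in the answer: count the NAs and compare with the group size
counted <- df_pub %>%
  group_by(pmid) %>%
  filter(sum(is.na(email)) == n()) %>%
  filter(row_number() == n()) %>%
  ungroup()

# Equivalent form: all() states the "every email is missing" condition directly
with_all <- df_pub %>%
  group_by(pmid) %>%
  filter(all(is.na(email))) %>%
  slice(n()) %>%
  ungroup()

print(counted)
print(with_all)
```

Both pipelines select the same rows here; which one reads better is largely a matter of taste.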

answered May 30 '19 at 18:38

Jorge Mendes

This is the correct solution. Bravo. I am still puzzled how you three figured out what I was asking for. Thanks. – ghosh'. May 30 '19 at 18:44

score 0 · Answer 3 · answered May 30 '19 at 18:22

Try this. It might not be the most concise code, but I think it answers your question.

# Sample dataframe
  pmid   email No
1    1    <NA>  1
2    1    <NA>  2
3    1    <NA>  3
4    2 a@b.com  4
5    2    <NA>  5

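# Logic
library(dplyr)

# pmids that have an email in at least one row
val <- unique(df$pmid[!is.na(df$email)])
# drop those pmids, then keep the last row of each remaining pmid
df[!df$pmid %in% val, ] %>% 
  group_by(pmid) %>% 
  slice(n()) %>% 
  ungroup()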

# Result
# A tibble: 1 x 3
   pmid email      No
  <dbl> <fct>   <int>
1     1 NA          3
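The sample data frame above is shown only as printed output. One way to build it so the logic can be run end to end is sketched below; the column types (double, factor, integer) are inferred from the header of the printed result, and the construction itself is an assumption rather than the answerer's original code.

```r
library(dplyr)

# Reconstruction of the printed sample (assumption: types taken from the
# <dbl>, <fct>, <int> header shown in the result)
df <- data.frame(
  pmid  = c(1, 1, 1, 2, 2),
  email = factor(c(NA, NA, NA, "a@b.com", NA)),
  No    = 1:5
)

# pmids that have an email in at least one row
val <- unique(df$pmid[!is.na(df$email)])

# drop those pmids, then keep the last row of each remaining pmid
result <- df[!df$pmid %in% val, ] %>%
  group_by(pmid) %>%
  slice(n()) %>%
  ungroup()

print(result)
```

With this data, only pmid 1 remains, represented by its last row.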

answered May 30 '19 at 18:22

skillsmuggler

Thanks. Jorge Mendes has provided the elegant tidy code. – ghosh'. May 30 '19 at 18:46
