0

I scraped some data from a website but it was really janky and for some reason had little mistakes in it. So, I scraped the same data 3 times, and produced 3 tables that look like:

library(data.table)
df1 <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
                  id = c(1, 2, 3, 4),
                  thing=c(2, 1, 3, 4),
                  otherthing = c(2,1, 3, 4)
                  )

df2 <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
                  id = c(1, 2, 3, 4),
                  thing=c(1, 1, 1, 4),
                  otherthing = c(2,2, 3, 4)
)

df3 <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
                  id = c(1, 2, 3, 4),
                  thing=c(1, 1, 3, 4),
                  otherthing = c(2,1, 3, 3)
)

Except I have many more columns. I want to combine the 3 tables together, and when the values for "thing" and "other thing" etc. conflict, I want it to pick the value that has at least 2/3 and perhaps return an N/A if there is no 2/3 value. I'm confident the "name" and "id" field are good and they're what I want to sort of merge on.

I was considering setting the names for the tables to be, "thing1" "thing2" and "thing3" in the 3 tables respectively, merging together, and then writing some loops through the names. Is there a more elegant solution? It needs to work for 300+ value columns although I'm not super worried about speed.

In this example, the solution I think should be:

final_result <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
                  id = c(1, 2, 3, 4),
                  thing=c(1, 1, 3, 4),
                  otherthing = c(2,1, 3, 4)
)
Sam Asin
  • 131
  • 9

2 Answers2

3

To generalize the approach from @IceCreamToucan, we can use:

library(dplyr)

n_mode <- function(...) {
  x <- table(c(...))
  if(any(x > 1)) as.numeric(names(x)[which.max(x)])
  else NA
}

bind_rows(df1, df2, df3) %>%
  group_by(name, id) %>%
  summarise_all(funs(n_mode(.)))

N.B. Be careful with your namespace and how you name the function...preferring something like n_mode() to avoid conflicts with base::mode. Finally, if you extend this to more data.frames, you probably want to put them in a list. If that's not possible/practical, you could replace the bind_rows with purrr::map_df(ls(pattern = "^df[[:digit:]]+"), get)

JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116
  • This is amazing. So much slicker than how I was doing it. Can I ask what the ... means in the function? Does it mean it can take different numbers of arguments? – Sam Asin Dec 06 '18 at 21:07
  • My data.frames were actually already in a list. if the data.frames are df[[1]], df[[2]], and df[[3]], does that make it easier? – Sam Asin Dec 06 '18 at 21:16
  • Yes, it can take different numbers of arguments. See https://stackoverflow.com/questions/3057341/how-to-use-rs-ellipsis-feature-when-writing-your-own-function – IceCreamToucan Dec 06 '18 at 21:17
  • Yes, in that case, you should be able to replace `bind_rows(df1, df2, df3)` with `bind_rows(df)`. – JasonAizkalns Dec 06 '18 at 21:18
  • Although with this way of using the function it isn't needed. You can just use `function(x) { x <- table(x)` for the first part of the definition. – IceCreamToucan Dec 06 '18 at 21:19
  • FYI data.table has it's own `bind_rows`, called `rbindlist`, which tends to be faster. `rbindlist(list_of_dfs)` gives a new data.table. – IceCreamToucan Dec 06 '18 at 21:19
  • Thanks guys. To be honest, I can't follow the solution totally yet, so I can't think about trying to do the data.table version. I'm really impressed with this, though. It's probably more me making data.table look bad then anything, but you're making dplyr look really good to me. – Sam Asin Dec 06 '18 at 21:23
  • added a data.table version of this answer in mine – IceCreamToucan Dec 06 '18 at 21:52
1

data table version of Jason's solution (you should leave his as accepted)

library(data.table)
n_mode <- function(x) {
  x <- table(x)
  if(any(x > 1)) as.numeric(names(x)[which.max(x)])
  else NA
}

my_list <- list(df1, df2, df3)

rbindlist(my_list)[, lapply(.SD, n_mode), .(name, id)]

#    name id thing otherthing
# 1: adam  1     1          2
# 2:  bob  2     1          1
# 3: carl  3     3          3
# 4:  dan  4     4          4

Here's the output of rbindlist. Hopefully this makes it more clear why just taking n_mode of all the columns, grouped by name and id, gives the output you want.

rbindlist(my_list)[order(name, id)]

#     name id thing otherthing
#  1: adam  1     2          2
#  2: adam  1     1          2
#  3: adam  1     1          2
#  4:  bob  2     1          1
#  5:  bob  2     1          2
#  6:  bob  2     1          1
#  7: carl  3     3          3
#  8: carl  3     1          3
#  9: carl  3     3          3
# 10:  dan  4     4          4
# 11:  dan  4     4          4
# 12:  dan  4     4          3
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
  • This looks pretty good, but could you make it general so it would work with as many columns as needed? If there are more than thing and other thing? I did a similar solution in data.table but got hung up on naming issues. – Sam Asin Dec 06 '18 at 21:01
  • Thanks for the data.table solution as well! Yeah, I think I finally get it. i was being silly the whole time and thinking about building out horrizontally, but of course since all the columns are the same name to start with we can rbind them together and THEN mode everything. I get it now! – Sam Asin Dec 07 '18 at 22:09