fill data.frame by matching value with column name

Question

I have two data.frames:

df1 (empty, but with specific colnames)

apple orange banana pear grape
0      0       0       0     0

df2

fruit1  count1 fruit2 count2
apple   2      pear   1
grape  4      orange 2
banana 1      NA     NA

this is the output I'd like:

apples oranges bananas pears grapes
2      0       0       1     0
0      2       0       0     4
0      0       1       0     0

I've considered doing something along the lines of:

for f (in range(nrow(df2))){
  for (i in range(ncol(df1))){
     if(fruit1==columnName[i]){
         df1[f,i]<-count1
         ect...

However, i'm dealing with a rather large dataset, and this doesn't seem like the right way to do it.

You should really consider being careful with plurals, apple vs apples, etc. Kind of a pain to handle those exceptions. — Frank, May 24 '16 at 21:34
Is your need to reorganize `df2` so that the output contains fruits that are columns in `df1`? If so, why not `data.frame(apples = 2, oranges = 2, bananas = 1, pears = 1, grapes = 4)` What new information is carried in those zero cells? — Gopala, May 24 '16 at 21:35
Frank you are right, I made a mistake in the names of my example. — Luke Anderson- Trocme, May 24 '16 at 23:23
Gopala, that's a good point, however I need the zeros as I will be doing further steps with the output — Luke Anderson- Trocme, May 24 '16 at 23:24

Frank · Accepted Answer · 2016-05-25T21:41:46.870

4

A data.table option:

library(data.table)
setDT(df2)

# add row number
df2[, r := .I]

# "melt" common columns together
cols = c("fruit", "count")
m2 = melt(df2, measure=patterns(cols), value.name=cols)

# add unobserved fruits, if any
m2[, fruit := factor(fruit, levels = names(df1))]

# "cast" each fruit to its own column, ignoring NA rows
dcast(m2[!is.na(fruit)], r ~ fruit, fill = 0L, drop = FALSE)


   r apple orange banana pear grape
1: 1     2      0      0    1     0
2: 2     0      2      0    0     4
3: 3     0      0      1    0     0

The factor thing is just in case you have some extra levels in df1 that don't appear in df2. If you have a row where both fruit1 and fruit2 are blank, you'll have to figure out how you want to extend this approach.

edited May 25 '16 at 21:41

answered May 25 '16 at 02:38

Frank

66,179
8
96
180

I'm still trying to wrap my head around your answer. Could you explain what you did here a little more? – Luke Anderson- Trocme May 25 '16 at 15:36
1

@Luke I've added some comments. You can look at `m2` to get a better idea of what "melting" means; "casting" with `dcast` does the opposite operation (going from long data to wide, https://www.google.com/#q=long+wide+data+r). The syntax is much nicer for these than for the base function `reshape`, in my opinion. The data.table way is covered in the fifth vignette for the package https://github.com/Rdatatable/data.table/wiki/Getting-started – Frank May 25 '16 at 15:40
1

Yep, that makes things a lot easier to follow. Great answer and edits. Thanks – Luke Anderson- Trocme May 25 '16 at 16:27
I'm trying to extend your line adding unobserved fruit. For instance if in we add another fruit to df1 (that doesn't exist in df2), eg: `df1 <- data.frame(apple=0, orange=0, banana=0, pear=0, grape=0, pineapple=0)`. I think maybe the easiest way to deal with this pineapple is to add it to the final table afterwards. – Luke Anderson- Trocme May 25 '16 at 20:54
1

@Luke I didn't realize `dcast` was dropping unused levels, oops. Adding `drop = FALSE` do the `dcast` call (edited in above) should fix that. Alternately, yeah, you could just add columns afterwards, like `res = dcast(...); res[, (my_extra_cols) := 0L]`. I prefer the `factor` way. – Frank May 25 '16 at 21:42

MKG · Answer 2 · 2016-05-25T23:42:24.673

Here is another way to accomplish it with dplyr and tidyr with less lines of code. But this does not use df1.

library(dplyr)
library(tidyr)

# Create Data Frame
df2 <- data.frame(fruit1 = c("apple", "grape", "banana"), 
                  count1 = c(2, 4, 1), 
                  fruit2 = c("pear", "orange", NA), 
                  count2 = c(1, 2, NA))

# Add row numbers
df2$row_num <- seq(1:3)

final <- df2 %>%
         select(fruit1, count1, row_num) %>% #select the first group of fruit variables
         rename(fruit2 = fruit1, count2 = count1) %>% # rename variables so they have the same name
         bind_rows(df2[, c(3:5)]) %>% # combine with the second set of fruit variables
         filter(!is.na(fruit2)) %>% # remove rows without a fruit
         spread(fruit2, count2, fill = 0) %>% # spread data
         select(-row_num) # remove the row numbers

Hi, thanks for the answer. I think there may be a mistake in the example you gave. I think you made a typo here: `spread(fruit2, count2, fill = 0) %>%` — Luke Anderson- Trocme, May 25 '16 at 15:00

score 1 · Answer 3 · answered May 25 '16 at 00:50

There's probably a more elegant solution, but here is one way to approach it. The idea is move df2 into long format and then join it onto a long format data.frame that contains all possible fruits for each row. Convert the missing values to zero. Then shift the data from long format back to wide.

# assemble df2
df2 <- structure(list(fruit1 = structure(c(1L, 3L, 2L), 
                                  .Label = c("apple", "banana", "grape"), 
                                  class = "factor"), 
               count1 = c(2L, 4L, 1L), 
               fruit2 = structure(c(2L, 1L, NA), 
                                  .Label = c("orange", "pear"), 
                                  class = "factor"), 
               count2 = c(1L, 2L, NA)), 
          .Names = c("fruit1", "count1", "fruit2", "count2"), 
          class = "data.frame", row.names = c(NA, -3L))


# add a column for row numbers
df2$r_number <- 1:nrow(df2)

# load libraries
library(tidyr)
library(dplyr)


# assemble a data.frame in long form by row numbers and fruits
fruit <- df2 %>%
  select(starts_with("fruit"), r_number) %>%
  gather(condition, fruits, starts_with("fruit")) %>%
  mutate(condition = gsub("fruit", "key", condition)) %>%
  filter(!is.na(fruits))


# assemble a data.frame in long form by row numbers and counts
count <- df2 %>%
  select(starts_with("count"), r_number) %>%
  gather(condition, counts, starts_with("count")) %>%
  mutate(condition = gsub("count", "key", condition),
         counts = ifelse(is.na(counts), 0, counts))

# join counts onto fruits
fruit %<>%
  left_join(count, by = c("condition", "r_number")) %>%
  mutate(fruits = paste0(fruits, "s")) %>%
  select(-condition)

# create data.frame for all possible row numbers and fruits
all_possibles <- data.frame( r_number = rep(c(1:nrow(df2)), length(unique(fruit$fruits))),
            fruits = unlist(lapply(unique(fruit$fruits), function(x) rep(x, nrow(df2)))))

# join actual fruits/counts onto all possible fruits and counts
# and shift data from long to wide
output <- all_possibles %>%
  left_join(fruit, by = c("r_number", "fruits")) %>%
  mutate(counts = ifelse(is.na(counts), 0, counts)) %>%
  spread(fruits, counts) %>%
  select(-r_number)

score 1 · Answer 4 · answered May 25 '16 at 01:16

Consider this data wrangling base solution:

df1 <- data.frame(apple=0, orange=0, banana=0, pear=0, grape=0)    
df2 <- read.table(text="apple   2      pear   1
                        grape   4      orange 2
                        banana  1      NA     NA",
                  col.names = c("fruit1", "count1", "fruit2", "count2"))

# ITERATE BY ROW TAKING FRUIT AS HEADER AND COUNT AND CELLS
dfList <- lapply(1:nrow(df2), function(x) {
                     temp <- data.frame(df2[x, c('count1','count2')]) 
                     names(temp) <-c(as.character(df2[x,1]), 
                                     as.character(df2[x,3]))
                     return(temp)
          })

# ADD/REMOVE FRUIT COLUMNS
temp <- lapply(dfList, function(y) {
                for (x in names(df1)) {
                    if (!(x %in% names(y))) { y[[x]] <- 0 }
                }
                for (x in names(y)){
                    if (!(x %in% names(df1))) { y[[x]] <- NULL }
                }
                return(y)
         })

finaldf <- do.call(rbind, temp)
names(finaldf) <- paste0(names(finaldf), 's')
finaldf <- finaldf[c('apples','oranges','bananas','pears','grapes')]
finaldf
#   apples oranges bananas pears grapes
# 1      2       0       0     1      0
# 2      0       2       0     0      4
# 3      0       0       1     0      0

fill data.frame by matching value with column name

4 Answers4

Linked