2

I have two dataframes like below that shares the column names but information can be conflicting. (First dataframe has more information (ranging 0-5), second has less (range 0 and 4). I merge this two dataframes using dplyr or other R package but what I want to do is to use the information from the first dataframe when available, and when it isn't I use the information from the second dataframe.

For example, two dataframes to be merged are

ID    col1   col2  
A      0      0
B      3      0
C      5      3
ID   col1    col2 
A     4       4
B     0       0
C     4       4
D     0       4
E     4       4

What I would like after merging is:

ID    col1   col2
A      4      4
B      3      0
C      5      3
D      0      4
E      4      4
Yoshiki
  • 33
  • 5
  • 2
    Can you expand on "more useful". It looks like you mean "bigger". Row `C` demonstrates you wanting the data from df1 because it contains the bigger range (5). Is that correct? – Amar Feb 20 '18 at 06:49
  • @Amar Yes. I meant by _more info_ that df1 has more _detailed_ information bc the range is 0-5, while df2 has two level (0 and 4). On the other hand, df1 has three rows and df2 has five rows. In that sense, df2 indeed has more info. Sorry for the confusing expression – Yoshiki Feb 20 '18 at 08:14

1 Answers1

3

One option can be using dplyr and tidyr. First replace 0 in dataframes with NA which helps in deciding if that value is valid or not. Merge both dataframe using bind_rows. Then group on ID so that rows are arrange on that column. Use fill to replace NA from other row of same ID.

# Data
df1 <- read.table(text="ID    col1   col2  
A      0      0
B      3      0
C      5      3", header = T, stringsAsFactors = F)

df2 <- read.table(text = "ID   col1    col2 
A     4       4
B     0       0
C     4       4
D     0       4
E     4       4", header = T, stringsAsFactors = F)

df1[df1==0] <- NA
df2[df2==0] <- NA

library(dplyr)
library(tidyr)

df1 %>% bind_rows(df2) %>%
  group_by(ID) %>%
  fill(col1, col2, .direction = "up")  %>%
  fill(col1, col2, .direction = "down") %>%
  unique() %>%
  filter((row_number() == 1)) 

#Result
#  ID     col1  col2
#  <chr> <int> <int>
#1 A         4     4
#2 B         3    NA
#3 C         5     3
#4 D        NA     4
#5 E         4     4

Note: The NA can be replaced with 0

MKR
  • 19,739
  • 4
  • 23
  • 33
  • if in second `data.frame` 2nd line was `B 0 1`, your output for second row would be `B 3 1`, and as I understand it should stay `B 3 0` – moodymudskipper Feb 20 '18 at 14:51
  • @Moody_Mudskipper You are correct. In my answer individual values (if non-zero) would be get preference. Hence it can have data mixed from both rows. I think that was intention of OP. Otherwise things could have been easier. – MKR Feb 20 '18 at 15:14
  • Yes, `OP` will have to clarify, I'm confused as how he's going to make sense of this data if it doesn't mean the same in both dfs. The `C` line for example shows a bigger value in `col2` so mixing both dfs without keeping info on source doesn't make sense to me. If it wasn't for this C line I would have just suggested `group_by` `summarize_all` `max` – moodymudskipper Feb 20 '18 at 15:21
  • Not sure I correctly understand your discussions above, but my intention is as follows (a-e); a. I have a dataset of two (df1 and df2) in which I have sth like customer preference. b. df1 has five ranges (1-5) and zero but for smaller number of customers c. df2 has binary input (0,1) but for larger number of customers d. I arbitrarily assume that 1 in the second data frame is equivalent to 4 in df1 e. I want to merge this two df’s but because of the reasons above, df1 has more precise input hence more priority. – Yoshiki Feb 21 '18 at 05:39
  • @Yoshiki That description is understood by us. Thx. The confusion is what will happen when say a row in `df1` was `B 3 0` and in `df2` it was `B 0 4`. My solution will merge those as `B 3 4`. Is that correct? – MKR Feb 21 '18 at 06:00
  • 1
    @MKR That's correct. I want row `B` to be `3 4` in that case – Yoshiki Feb 21 '18 at 06:58