0

I'm trying to fill in missing data in data frames.

I've got two data frames, however each one includes information that is missing in the other one. They look something like this, where the x are numbers:

           DATA FRAME 1                                      DATA FRAME 2    
    Headword   Spelling   Freq                    Headword     Spelling   Freq
     Word1       Sp1a      x                        Word1         Sp1a      x
     Word1       Sp1b      x                        Word1         Sp1c      x
     Word1       Sp1d      x                        Word2         Sp2a      x
     Word2       Sp2a      x                        Word2         Sp2b      x     etc

So, DF1 has spellings 1a, 1b, and 1d for Word 1. DF2 has spellings 1a and 1c for Word1. However, DF1 has only spelling 2a for Word2, but DF2 has spellings 2a and 2b for Word2.

What I need is for both data frames to include all the spellings, even if they're not present in that data. So where Sp1c is missing in data frame 1, I'd like it to be there and the frequency = 0.

So I'd like it to look like this:

    DATA FRAME 1                               DATA FRAME 2
Headword Spelling Freq                    Headword Spelling Freq
  Word1    Sp1a    x                        Word1    Sp1a     x
  Word1    Sp1b    x                        Word1    Sp1b     0
  Word1    Sp1c    0                        Word1    Sp1c     x
  Word1    Sp1d    x                        Word1    Sp1d     0
  Word2    Sp2a    x                        Word2    Sp2a     x
  Word2    Sp2b    0                        Word2    Sp2b     x

I think I need to use more than one join and combine them together to make this work but I'm not sure how.

How do I add any missing values from data frame 2 to the headword and spelling columns in data frame 1, and then set the frequency to be 0? (And vice versa to add missing values to data frame 1.)

Rose
  • 137
  • 2
  • 10

2 Answers2

3

using dplyr library, Firstly creating the data frames

library("dplyr")
df1<- data.frame(headword = c("word1","word1","word1","word2"),
                spelling = c("sp1a","sp1b","sp1d","sp2a"),
                freq = runif(1:4))
df2 <-data.frame(headword = c("word1","word1","word2","word2"),
                 spelling = c("sp1a","sp1c","sp2a","sp2b"),
                 freq = runif(1:4))

Now, find the values in df1 that are not in df2

sub_res1 <- anti_join(df1,df2,by=c("headword","spelling"))
#       headword spelling      freq
#  1    word1     sp1b 0.6738556
#  2    word1     sp1d 0.4972938
sub_res1$freq <- 0
df2 <- full_join(df2,sub_res1,by=c("headword","spelling","freq"))

#    headword spelling       freq
#  1    word1     Sp1a 0.50293511
#  2    word1     sp1c 0.67857973
#  3    word2     sp2a 0.05604982
#  4    word2     sp2b 0.83378253
#  5    word1     sp1b 0.00000000
#  6    word1     sp1d 0.00000000

the vice versa will merge the values of df2 that are not in df1 with the freq as 0

clarification: In your question, you had used x to mean some arbitrary number, So, I used runif to generate some arbitrary numbers instead of using x.

ArunK
  • 1,731
  • 16
  • 35
  • This works perfectly, thank you! I've been having a look through the code to make sure that I understand, and it's very clear. The only thing I'm unsure of is the runif part in the df at the top. What does that bit do? (I read the documentation but couldn't follow it.) – Rose Jul 08 '16 at 13:15
0

Use merge function to join the missing values in both the dataframes

df <- merge (dataframe1, dataframe2 , all.x=T, all.y=T)
Arun kumar mahesh
  • 2,289
  • 2
  • 14
  • 22