I'm trying to write a function in R that will do a few things all at once, and I think that function has to take two data frames to work.

In a previous question, I asked how to add rows from data frames to one another. I ended up using this code to do that, as provided in one of the answers:

MissingFromC1 <- anti_join(C2, C1, by = c("HW", "Var"))
MissingFromC1$Freq <- 0
All_c1 <- full_join(C1, MissingFromC1, by = c("HW", "Var", "Freq"))

Here C1 and C2 are two data frames, each with three columns: HW (headword), Var (variant spelling), and Freq. Each HW has several Var with various frequencies. They look like this:

            C1                                               C2    
Headword   Spelling   Freq                    Headword     Spelling   Freq
 Word1       Sp1a      x                        Word1         Sp1a      x
 Word1       Sp1b      x                        Word1         Sp1c      x
 Word1       Sp1d      x                        Word2         Sp2a      x
 Word2       Sp2a      x                        Word2         Sp2b      x     
 Word3       Sp1a      x                        

C1 and C2 aren't the same - each includes HW and Var that aren't in the other. I wanted to make sure the two were both the same length and so the code above adds missing rows from C2 to C1 (and then I ran it again but on the other data frame).

What I want to do now is turn this into a function. But with a change - I only want to join rows where the Var is missing from a HW. I don't want to add new HW to C1 or C2, just missing Var. In fact, if a HW is in C1 but not C2, for example, then I'd like it filtering out - i.e. in the example above, Word3 is in C1 but there are no Word3 Vars in C2 at all, so I'd like it filtering out completely. (I'm wanting to compare ratios of Var for each HW, but this won't work if I have any HW made up of Var that all have Freq = 0). I hope this makes sense!

I had a go at writing the code for it, just to try and show what I'm trying to do (I realise this code is very wrong! I just thought it might help).

add.missing.to.df1 <- function(df1, df2) {
if(is.element(df2$HW, df1$HW))) 
  missing.val <- anti_join(df2, df1, by = c("HW", "Var"))
  missing.val$Fr <- 0
  All_df2 <- full_join(df1, miss.val, by = c("HW", "Var", "Fr"))
  df2_fin <- filter(All_df2, if(!is.element(df2$HW, df1$HW)))
  } 

So in the end, I want to have two data frames. Each one includes HW that has at least one Var in both data frames. If HW is in C1 but not C2 (or vice versa) then I want to filter it out.

Is it possible to do all this? And is it possible to tie it all up into a function? If so, how?

Thank you to anyone who can help!

Rose
  • `function(df1, df2)` already takes two data frames as arguments. Are you trying to return two data frames? If so, return a list containing both. – zx8754 Jul 11 '16 at 10:27
  • Ah, I thought that was one of the parts that wasn't working in my very confused script - good to know that part was right. I'll change the question title so it's closer to asking what I want then. Thank you! – Rose Jul 11 '16 at 10:29
  • I may have the wrong end of the stick, but would a simple `dplyr::inner_join()` do the trick? This returns rows where there are matching cases in both dfs only? See for example http://stat545.com/bit001_dplyr-cheatsheet.html#inner_joinsuperheroes-publishers – Phil Jul 11 '16 at 10:35
  • As an aside, I've voted your question up because you've shown what you've tried to do. If you need to ask another question in future this post might help you even more to produce a reproducible example: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Phil Jul 11 '16 at 10:38
  • Thanks, Phil. Do you know, I think an inner join might do it. Feel a bit embarrassed now that I've massively over complicated things. Am I best leaving this question here now, or deleting it? – Rose Jul 11 '16 at 10:57
  • @Rose I'll post an answer (which you can accept if it solves your problem), and that way it's documented in case someone else has a similar question in the future. – Phil Jul 11 '16 at 11:42

1 Answer

As we discussed in the comments, it looks like a `dplyr::inner_join()` will do what you need. From the documentation:

inner_join return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

So using your data you could try:

library("dplyr")
df <- inner_join(C1, C2, by = c("Headword", "Spelling"))
df
#   Headword Spelling Freq.x Freq.y
# 1    Word1     Sp1a      1      1
# 2    Word2     Sp2a      4      3

As for your original question about passing two data frames to a function, this is just done with:

my_function <- function(df1, df2, ...) {
  # do some stuff here
}

The function is then called with `my_function(df1, df2)`.
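
If the inner join turns out to be too strict (it drops any spelling that occurs in only one data frame), something closer to the original question might look like the sketch below. The function name `align_headwords` and the example frequencies are hypothetical, and it assumes columns named Headword, Spelling and Freq. It keeps only headwords found in both data frames, adds each missing spelling with Freq = 0, and returns both results in a list, as suggested in the comments.

```r
library(dplyr)

# Hypothetical helper, not part of dplyr: aligns two frequency tables
# so that both contain the same Headword/Spelling pairs.
align_headwords <- function(df1, df2) {
  keys <- c("Headword", "Spelling")

  # Drop headwords that do not occur in both data frames
  df1 <- semi_join(df1, df2, by = "Headword")
  df2 <- semi_join(df2, df1, by = "Headword")

  # Spellings present in the other data frame only, added with Freq = 0
  missing_from_1 <- mutate(anti_join(df2, df1, by = keys), Freq = 0)
  missing_from_2 <- mutate(anti_join(df1, df2, by = keys), Freq = 0)

  list(
    df1 = arrange(bind_rows(df1, missing_from_1), Headword, Spelling),
    df2 = arrange(bind_rows(df2, missing_from_2), Headword, Spelling)
  )
}

# Made-up example data (the frequencies are assumptions)
C1 <- data.frame(
  Headword = c("Word1", "Word1", "Word1", "Word2", "Word3"),
  Spelling = c("Sp1a", "Sp1b", "Sp1d", "Sp2a", "Sp1a"),
  Freq     = c(1, 2, 5, 4, 7),
  stringsAsFactors = FALSE
)
C2 <- data.frame(
  Headword = c("Word1", "Word1", "Word2", "Word2"),
  Spelling = c("Sp1a", "Sp1c", "Sp2a", "Sp2b"),
  Freq     = c(1, 6, 3, 2),
  stringsAsFactors = FALSE
)

aligned <- align_headwords(C1, C2)
aligned$df1  # C1 without Word3, plus Sp1c and Sp2b at Freq 0
aligned$df2  # C2 plus Sp1b and Sp1d at Freq 0
```

Using `semi_join()` first means a headword such as Word3 never gets zero-filled rows in the other data frame, so no headword ends up with all frequencies equal to 0.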

Phil