I've been struggling with the following for some time now:

I want to calculate the difference in word counts (frequency of occurrence of features) between two data frames. Each data frame contains two columns: feature (the word) and frequency.

Given data frames A and B, I want the following result: all features/words from A, each with the frequency in A minus the frequency in B. However, when a feature in A does not appear in B, I want just the frequency from A.

I've tried two sapply calls: one to obtain a named vector with the feature and frequency from A, and one to obtain the frequency of the same feature in B if the feature exists, otherwise 0. These two vectors were then combined to obtain the desired data frame. The solution works, but it is really slow.

Does anyone know a faster way of obtaining this result?

DCB

2 Answers

The basic operation you want here is a left join of the first data frame to the second data frame, using the feature/word as the join condition. One option would be to use the sqldf package:

library(sqldf)
sql <- "select a.feature, a.frequency - coalesce(b.frequency, 0) as difference "
sql <- paste0(sql, "from dfA a left join dfB b on a.feature = b.feature")

result <- sqldf(sql)

This probably isn't the fastest solution available in R, and base R likely offers a more efficient one. But the above solution is brief, requiring only a few lines of code, and it is easy to read.
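
For comparison, here is a minimal base R sketch of the same left-join idea using `match()`. The sample data frames are hypothetical; only the column names `feature` and `frequency` come from the question. It also assumes each feature appears at most once in `dfB`, since `match()` returns only the first hit.

```r
# Hypothetical sample data; column names follow the question.
dfA <- data.frame(feature = c("apple", "banana", "cherry"),
                  frequency = c(5, 3, 2),
                  stringsAsFactors = FALSE)
dfB <- data.frame(feature = c("banana", "cherry", "date"),
                  frequency = c(1, 4, 7),
                  stringsAsFactors = FALSE)

# Row position of each dfA feature in dfB (NA when the feature is absent).
idx <- match(dfA$feature, dfB$feature)

# Frequency in dfB for each dfA feature, with 0 for features not in dfB.
freqB <- ifelse(is.na(idx), 0, dfB$frequency[idx])

# All features from dfA with frequency of A minus frequency of B.
result <- data.frame(feature = dfA$feature,
                     difference = dfA$frequency - freqB,
                     stringsAsFactors = FALSE)
print(result)
```

Because `match()` is vectorized, this avoids a per-row `sapply` lookup, which may help with the speed problem described in the question; whether it is actually faster on your data is worth measuring.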

Tim Biegeleisen
  • Of course! SQL, brilliant. Thanks. I searched around on the web for R and SQL and also found that `merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)` does a left join. (source: https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) – DCB Sep 05 '18 at 07:28
  • Yes, but then you'll also have to handle the coalesce part on the second data frame. Also, I have found that `merge` can be messy to get the exact output you want. But whichever method you prefer is fine. – Tim Biegeleisen Sep 05 '18 at 07:29
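
The `merge` route discussed in these comments could look roughly like the sketch below, including the coalesce step. The sample data and the `_A`/`_B` suffixes are assumptions chosen for illustration; only the column names `feature` and `frequency` come from the question.

```r
# Hypothetical sample data; column names follow the question.
dfA <- data.frame(feature = c("apple", "banana", "cherry"),
                  frequency = c(5, 3, 2),
                  stringsAsFactors = FALSE)
dfB <- data.frame(feature = c("banana", "cherry", "date"),
                  frequency = c(1, 4, 7),
                  stringsAsFactors = FALSE)

# Left join: keep every row of dfA; features missing from dfB get NA.
merged <- merge(x = dfA, y = dfB, by = "feature",
                all.x = TRUE, suffixes = c("_A", "_B"))

# The "coalesce" step: treat a missing frequency in dfB as 0.
merged$frequency_B[is.na(merged$frequency_B)] <- 0

# Frequency of A minus frequency of B for every feature in dfA.
merged$difference <- merged$frequency_A - merged$frequency_B
result <- merged[, c("feature", "difference")]
print(result)
```

Note that `merge` sorts the result by the join column by default, so the row order may differ from `dfA`.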

You can use tidy text mining for this.

Please refer to the link below: tidy text mining