
I have a data frame with two columns, word1 and word2, like this:

head(bigrams_filtered2, 20)
# A tibble: 20 x 2
   word1       word2      
   <chr>       <chr>      
 1 practice    risk       
 2 risk        management 
 3 management  rational   
 4 rational    meansend   
 5 meansend    based      
 6 based       process    
 7 process     risks      
 8 risks       identified 
 9 identified  analysed   
10 analysed    solved     
11 solved      mitigated  
12 objective   involves   
13 involves    human      
14 human       perceptions
15 perceptions biases     
16 opportunity jack       
17 differences stakeholder
18 stakeholder perceptions
19 perceptions broader    
20 broader     risk  

I am trying to add two additional column variables to this data.frame so that my output looks like this:

##     word1     word2    n totalbigrams           tf
## 1     st     louis 1930      3426965 0.0005631805
## 2  happy  birthday 1802      3426965 0.0005258297
## 3      1         2 1701      3426965 0.0004963576
## 4    los   angeles 1385      3426965 0.0004041477
## 5 social     media 1256      3426965 0.0003665051
## 6    san francisco 1245      3426965 0.0003632952

I'm following an example from here http://www.rpubs.com/pnice421/347328

Under the heading "Generating Bigrams" they give the following code as a way of achieving this, but when I run it I get an error:

totalbigrams <- bigrams_filtered2 %>%
    summarize(total=sum(n))

Error in summarise_impl(.data, dots) : 
Evaluation error: invalid 'type' (closure) of argument.

If anyone has any advice on where I might be going wrong it would be greatly appreciated! Thank you.

Davide Lorino
  • You can have `summarize(total = sum(n()))` or you could calculate n first `summarize(n = n())`. I assume you also want to `group_by` word1 or word2 or both? But it is not clear from your question, you might want to read this on providing a simple reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Sarah Apr 20 '18 at 04:47

2 Answers


First, let's make an example data set that has the same structure as what you are dealing with.

library(tidyverse)
library(tidytext)
library(janeaustenr)


bigram_df <- tibble(txt = prideprejudice) %>%
    unnest_tokens(bigram, txt, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ")

bigram_df

#> # A tibble: 122,203 x 2
#>    word1     word2    
#>    <chr>     <chr>    
#>  1 pride     and      
#>  2 and       prejudice
#>  3 prejudice by       
#>  4 by        jane     
#>  5 jane      austen   
#>  6 austen    chapter  
#>  7 chapter   1        
#>  8 1         it       
#>  9 it        is       
#> 10 is        a        
#> # ... with 122,193 more rows

Now we can count how many times each bigram occurs with dplyr's count(), then compute the total number of bigrams and the term frequency tf. The key here is to use tidyr's unite() and separate() to stick the two word columns together and then break them apart again.

bigram_df %>%
    unite(bigram, word1, word2, sep = " ") %>%
    count(bigram, sort = TRUE) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>% 
    mutate(totalbigrams = sum(n),
           tf = n / totalbigrams)

#> # A tibble: 54,998 x 5
#>    word1 word2     n totalbigrams      tf
#>    <chr> <chr> <int>        <int>   <dbl>
#>  1 of    the     464       122203 0.00380
#>  2 to    be      443       122203 0.00363
#>  3 in    the     382       122203 0.00313
#>  4 i     am      302       122203 0.00247
#>  5 of    her     260       122203 0.00213
#>  6 to    the     252       122203 0.00206
#>  7 it    was     251       122203 0.00205
#>  8 mr    darcy   243       122203 0.00199
#>  9 of    his     234       122203 0.00191
#> 10 she   was     209       122203 0.00171
#> # ... with 54,988 more rows

Created on 2018-04-22 by the reprex package (v0.2.0).
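As an alternative sketch, count() also accepts several columns at once, so the same counts can be obtained without the unite()/separate() round trip. The toy_bigrams data frame below is made-up illustration data standing in for bigram_df, and bigram_tf is a name chosen only for this example.

```r
library(dplyr)

# Made-up bigrams standing in for bigram_df (assumption, not real data)
toy_bigrams <- data.frame(
  word1 = c("of", "to", "of", "of", "to"),
  word2 = c("the", "be", "the", "her", "be"),
  stringsAsFactors = FALSE
)

bigram_tf <- toy_bigrams %>%
  count(word1, word2, sort = TRUE) %>%  # one row per distinct word pair, with column n
  mutate(totalbigrams = sum(n),         # n is now a real column, so sum(n) works
         tf = n / totalbigrams)         # share of all bigrams

bigram_tf
```

Because tf is each count divided by the overall total, the tf column sums to 1.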

It sounds like you have done some filtering. You certainly can do that with dplyr's filter() whenever the words are separated out into two columns.
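A minimal sketch of that kind of filtering, assuming the goal is to drop bigrams that contain a stop word. Both toy_bigrams and my_stop_words are hypothetical names with made-up contents; in practice the word column of tidytext's stop_words data set is a fuller list.

```r
library(dplyr)

# Made-up bigrams (assumption, not real data)
toy_bigrams <- data.frame(
  word1 = c("risk", "of", "management", "the"),
  word2 = c("management", "the", "process", "risk"),
  stringsAsFactors = FALSE
)

# Hypothetical short stop word list; tidytext::stop_words$word is a fuller option
my_stop_words <- c("of", "the", "and", "to")

# Keep only rows where neither word is a stop word
bigrams_kept <- toy_bigrams %>%
  filter(!word1 %in% my_stop_words,
         !word2 %in% my_stop_words)

bigrams_kept
```

Filtering before counting means the total number of bigrams, and therefore tf, is computed over the filtered set only.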

Julia Silge
  • Thank you so much for this, really helpful answer! I don't have the requisite experience to 'upvote' but this was perfect! – Davide Lorino Apr 24 '18 at 03:02

You're getting an error because there is no column called n in your data frame, so you need to create it first. The specific message arises because n is also a dplyr function, n(), which counts the rows in the data (or in the current group). With no n column to find, sum(n) is handed that function (a "closure") rather than a vector of counts.

I don't know what n should be in your data, but you need to get that before you can use that particular function.
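If n is meant to be the number of times each word pair occurs, which is only a guess, one way to create it is with count() before summarizing. The toy_bigrams data frame is made-up illustration data, and bigram_counts is a name chosen for this sketch.

```r
library(dplyr)

# Made-up bigrams with one repeated pair (assumption, not real data)
toy_bigrams <- data.frame(
  word1 = c("risk", "risk", "human"),
  word2 = c("management", "management", "perceptions"),
  stringsAsFactors = FALSE
)

# There is no n column yet, so inside summarize() the name n would
# resolve to dplyr's n() function instead of a vector of counts
"n" %in% names(toy_bigrams)

# Create the count column first ...
bigram_counts <- toy_bigrams %>%
  count(word1, word2)

# ... then the original summarize() call has a numeric n to add up
totalbigrams <- bigram_counts %>%
  summarize(total = sum(n))

totalbigrams
```

Here total equals the number of rows in the original table, since every row is one bigram occurrence.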

Melissa Key
  • Thank you so much for showing me that - it's true that that part is where it's going wrong. The number of rows is 22384, but I'm having trouble storing n as a column variable. Thanks again for pointing that out! – Davide Lorino Apr 20 '18 at 04:00