I want an efficient (and elegant) way to do a group_by where the groups are defined by regular expressions that are run against a vector of texts (tweets). A tweet may show up in more than one count, but that's not an issue, since I just want to count how many times each candidate is mentioned.

The following code works, but I want the code to take the regular expressions and group by them automatically. I've tried through str_count but didn't achieve much.

##Data example:
library('dplyr')
all.t <- data.frame(text =  c("@dottore_marcelo @LorranParadiso @1pedroOsilva @Ronaldocampos00 @jairbolsonaro Mas quem disse que @jairbolsonaro vai resolver todos os problemas do país tem 4 anos? Ele é um ponto de inflexão, quem sabe depois de 8 anos elegeremos um rocha ou um Amoedo, pois a estrada já estará pavimentada. Vamos pensar que no longo prazo a disputa será entre liber e conser",
                              "@Ideias_Radicais Opiniao sobre a Marina Silva? Geraldo Alckmin? vai fazer oq se eles ganhar as eleiçoes?",                                                                                                                                                                          
                              "@pkogos E se a Marina Silva ou o Ciro gomes ganhar?",
                              "@pkogos A França está dominada pela mentalidade esquerdista ! Se a Marina Silva ou o Ciro Ganhar vai acontecer o mesmo" ,                                                                                                                                                                                                                                                
                              "@cirogomes @guilhermefpenna @geraldoalckmin @MarinaSilva @jairbolsonaro @alvarodias_ Passo. Próximo.",                                                                                                                                                                                                                                                                   
                              "@joaopedro27696 @marx_araujo @folha 1) Não sou robô; 2) É \"Amoêdo\" e não \"Amoado\"; 3) Não voto com base em pesquisa, e sim em ideias, currículo e histórico... @jairbolsonaro é populista"),
                    stringsAsFactors = FALSE
)

##regex I want to group_by 
candidatos <- c('bolsonaro|@jairbolsonaro',
                'amoedo|@joaoamoedonovo',
                'marina silva|@marinasilva')

## this is the part I want to improve
bind_rows(
  all.t %>% filter(grepl(candidatos[1], text, ignore.case = TRUE)) %>% 
    count() %>% mutate(candidato = 'Bolsonaro'),

  all.t %>% filter(grepl(candidatos[2], text, ignore.case = TRUE)) %>% 
    count() %>% mutate(candidato = 'João Amoêdo'),

  all.t %>% filter(grepl(candidatos[3], text, ignore.case = TRUE)) %>%
    count() %>% mutate(candidato = 'Marina Silva')
)

The output I get is exactly what I want, but with many candidates it is a pain to repeat this block for each one.

      n candidato
  <int> <chr>
1     3 Bolsonaro
2     1 João Amoêdo
3     4 Marina Silva
Ochetski
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Give a sample `all.t`. So you potentially will count tweets in multiple categories if they contain more than one name? – MrFlick Jun 18 '18 at 19:19
  • Edited with a data example; I actually forgot it, thanks for reminding me. It's in Portuguese, but I think that shouldn't matter much. And yes, tweets can be counted in multiple categories, no problem. – Ochetski Jun 18 '18 at 19:29

1 Answer

You could use map to iterate over each regular expression. You'll need to provide the regular expressions, but you'll need only one copy of the code to count values. If you provide a named vector of regular expressions, then you can also easily substitute the actual candidate names in place of the regexes in the output (or you can include both in the output if you want to have a record of the regular expression that was used for each candidate):

library(tidyverse)

# `dat` is the data frame of tweets from the question (called all.t there)
dat <- all.t

##regex I want to group_by 
candidatos <- c(Bolsonaro='bolsonaro|@jairbolsonaro',
                "João Amoêdo"='amoedo|@joaoamoedonovo',
                "Marina Silva"='marina silva|@marinasilva')

map_df(candidatos, 
       ~ dat %>% 
           filter(grepl(.x, text, ignore.case=TRUE)) %>% 
           count(), 
       .id="Candidato")
  Candidato        n
  <chr>        <int>
1 Bolsonaro        3
2 João Amoêdo      1
3 Marina Silva     4
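
For comparison, the same "one function applied to each named regex" idea can be sketched in base R with `vapply` instead of `purrr::map_df`. This is only an illustration: `tweets` is a hypothetical three-row stand-in for the asker's data, and `counts` is a name chosen here.

```r
# Hypothetical miniature of the question's data (not the real all.t)
tweets <- data.frame(
  text = c("@jairbolsonaro ou um Amoedo",
           "Opiniao sobre a Marina Silva?",
           "@MarinaSilva @jairbolsonaro Passo."),
  stringsAsFactors = FALSE
)

# Named vector: the names become the candidate labels in the result
candidatos <- c(Bolsonaro      = "bolsonaro|@jairbolsonaro",
                "João Amoêdo"  = "amoedo|@joaoamoedonovo",
                "Marina Silva" = "marina silva|@marinasilva")

# For each regex, count the tweets that match it at least once
n <- vapply(candidatos,
            function(rx) sum(grepl(rx, tweets$text, ignore.case = TRUE)),
            integer(1))

# Turn the named integer vector into a two-column data frame
counts <- data.frame(Candidato = names(n), n = unname(n),
                     stringsAsFactors = FALSE)
counts
```

As with `.id` in `map_df`, the labels come from the names of `candidatos`, so adding a candidate only means adding one element to that vector.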

To keep the regex in the output:

map_df(candidatos, 
       ~ dat %>% 
           filter(grepl(.x, text, ignore.case=TRUE)) %>% 
           mutate(regex=.x) %>% 
           count(regex), 
       .id="Candidato")

The counting can also be done without filter:

map_df(candidatos, 
       ~ dat %>% 
           summarise(regex=.x,
                     n=sum(grepl(.x, text, ignore.case = TRUE))), 
       .id="Candidato")
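
Note that `sum(grepl(...))` counts tweets that mention a candidate at least once, which is what the question asks for. If total mentions were wanted instead (an assumption, not something the question requires), a base R sketch could count every match with `gregexpr` and `regmatches`. The two example tweets are shortened from the question's data, and the variable names are chosen here for illustration.

```r
# Two shortened tweets; the first mentions @jairbolsonaro twice
tweets <- c("@jairbolsonaro Mas quem disse que @jairbolsonaro vai resolver",
            "@pkogos E se a Marina Silva ou o Ciro gomes ganhar?")
rx <- "bolsonaro|@jairbolsonaro"

# Tweets with at least one match (what the answer's code counts)
n_tweets <- sum(grepl(rx, tweets, ignore.case = TRUE))

# Every individual match across all tweets
hits <- gregexpr(rx, tweets, ignore.case = TRUE)
n_mentions <- sum(lengths(regmatches(tweets, hits)))

n_tweets
n_mentions
```

Here the two numbers differ because one tweet repeats the handle, which is the distinction between `grepl` and a `str_count`-style approach.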
eipi10
  • Thanks, exactly what I was looking for. Just a little correction the all.t is a df (probably my bad in explaining) so the grepl line should be `grepl(.x, all.t$text)`. Overall, great. – Ochetski Jun 18 '18 at 19:43
  • The `grepl` line should still be `grepl(.x, text)`. The data frame is passed to `filter` by the pipe (`%>%`) and shouldn't be restated inside filter. – eipi10 Jun 18 '18 at 19:44
  • Wonderful solution, I'll dig deeper into the `map` functions. – Ochetski Jun 18 '18 at 20:08