
I have a dataset of companies that have published articles in different blogs. The same company often appears under similar, but not always identical, names. I want to group the rows by similar name and count in how many blogs each company has published articles.

For each group of similar names, I want to keep the name and address of the first result and then check, for each blog variable, whether there is a 1 (published article) or a 0 (no published article) among the rest of the results.

I asked a similar question here for the first part, but I don't know how to handle both actions at the same time.

This is a sample of my dataset:

   name           address           sports_blog nutrition_blog lifestyle_blog nature_blog
   <chr>          <chr>                   <dbl>          <dbl>          <dbl>       <dbl>
 1 Wellington     Adam Martin Sq. 1           1              0              0           0
 2 Wellingtoon    Adam Martin Sq. 1           0              1              0           0
 3 Wellington Co. Adam Martin Sq. 1           0              0              1           0
 4 Welinton       Adam Martin Sq. 1           0              0              0           1
 5 Cornell        Blue cross street           1              0              0           0
 6 Kornell        Blue cross street           0              1              0           0
 7 Coornell       Blue cross street           0              0              0           1
 8 Bleend         Aloha avenue                0              0              1           0
 9 Blind          Aloha avenue                0              0              0           1
10 Laguna         River street                1              0              0           0
11 Papito         Carnival street             1              0              0           0
12 Papeeto        Carnival street             0              0              1           0

As a result, I'm looking for something like this:

  name       address           sports_blog nutrition_blog lifestyle_blog nature_blog
  <chr>      <chr>                   <dbl>          <dbl>          <dbl>       <dbl>
1 Wellington Adam Martin Sq. 1           1              1              1           1
2 Cornell    Blue cross street           1              1              0           1
3 Bleend     Aloha avenue                0              0              1           1
4 Laguna     River street                1              0              0           0
5 Papito     Carnival street             1              0              1           0

1 Answer


You can simply include the address in your grouping. Using the similarGroups() function from the answer to your previous question (given by @RuiBarradas), then

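library(dplyr)

# similarGroups() is the helper from the answer to the previous question.
df %>%
  # Group by the matched name (via similarGroups()) and by the address.
  group_by(name = name[similarGroups(name)], address) %>%
  # Sum every remaining (blog) column within each group.
  summarise_all(sum)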

which gives,

# A tibble: 5 x 6
# Groups:   grp [5]
  name       address         sports_blog nutrition_blog lifestyle_blog nature_blog
  <fct>      <fct>                 <int>          <int>          <int>       <int>
1 Bleend     Alohaavenue               0              0              1           1
2 Cornell    Bluecrossstreet           1              1              0           1
3 Laguna     Riverstreet               1              0              0           0
4 Papito     Carnivalstreet            1              0              1           0
5 Wellington AdamMartinSq1             1              1              1           1
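
Since similarGroups() is not shown in this thread, here is a self-contained sketch of the same idea that can be run on the sample data. The helper similar_groups() is a hypothetical stand-in, not @RuiBarradas's function: it assumes that clustering the names by Levenshtein distance (single linkage, cut at 4 edits) is good enough, which happens to hold for this sample but would need tuning on real data. It also assumes dplyr 1.0 or later for across().

```r
library(dplyr)

# Sample data from the question.
df <- data.frame(
  name = c("Wellington", "Wellingtoon", "Wellington Co.", "Welinton",
           "Cornell", "Kornell", "Coornell", "Bleend", "Blind",
           "Laguna", "Papito", "Papeeto"),
  address = rep(c("Adam Martin Sq. 1", "Blue cross street", "Aloha avenue",
                  "River street", "Carnival street"),
                times = c(4, 3, 2, 1, 2)),
  sports_blog    = c(1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0),
  nutrition_blog = c(0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
  lifestyle_blog = c(0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1),
  nature_blog    = c(0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0)
)

# Hypothetical stand-in for similarGroups(): returns one integer group id
# per name, merging names linked by chains of at most `max_dist` edits.
# The threshold of 4 is an assumption chosen for this sample only.
similar_groups <- function(x, max_dist = 4) {
  d <- as.dist(adist(x))                       # pairwise edit distances
  cutree(hclust(d, method = "single"), h = max_dist)
}

result <- df %>%
  mutate(grp = similar_groups(name)) %>%
  group_by(grp) %>%
  summarise(
    name    = first(name),                     # keep the first name variant
    address = first(address),                  # keep the first address
    across(ends_with("_blog"), sum),           # combine the blog flags
    .groups = "drop"
  ) %>%
  select(-grp)

print(result)
```

Taking first(address) instead of grouping by address also covers the case where the address differs slightly between rows of the same company.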
Sotos
Cool! I didn't realise that a simple sum is OK because I have a maximum of 1 for each row! Thank you very much, @Sotos! – kikusanchez Dec 23 '19 at 10:41
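
A simple sum only stays a 0/1 flag while each blog column has at most one 1 per group. As a small base R sketch with made-up data (the `flags` data frame and its `grp` column are hypothetical, not from the question), using max instead of sum keeps the result an indicator even when two name variants of the same company published in the same blog:

```r
# Hypothetical data: in group 1, two name variants both published in sports_blog.
flags <- data.frame(
  grp         = c(1, 1, 2),
  sports_blog = c(1, 1, 0),
  nature_blog = c(0, 1, 0)
)

# sum() counts the matching rows, so the flag can exceed 1.
by_sum <- aggregate(. ~ grp, data = flags, FUN = sum)

# max() returns 1 if any row in the group has a 1, otherwise 0.
by_max <- aggregate(. ~ grp, data = flags, FUN = max)

print(by_sum)
print(by_max)
```

With dplyr the same change would be to pass max instead of sum to summarise_all().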