How do I group similar strings and create the index variable for each group in R?

Question

I have the dataset, which has variable company. The original data structure was a drop-off list. I scraped the data and now I have several rows belonging to the same company.

I need to wish to create the index variable which will identify each group.

Basically, I need to compare the rows' values of the variable company and create a group variable. I am very new to R and don't know how to start.

structure(list(bond = c("PFE 4.65 03/01/18 Corp", "PFE 4.65 03/01/18 Corp", 
"PFE 4.65 03/01/18 Corp", "PFE 4.65 03/01/18 Corp", "PFE 4.65 03/01/18 Corp", 
"PFE 4.65 03/01/18 Corp", "PFE 4.65 03/01/18 Corp", "PFE 4.65 03/01/18 Corp", 
"PFE 4.65 03/01/18 Corp", "PFE 4.65 03/01/18 Corp"), year = c(2007, 
2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007), company = c("T Rowe Price Group Inc", 
"T Rowe Price Associates Inc", "T ROWE PRICE ASSOCIATES INC", 
"Nationwide Fund Advisors", "Nationwide Fund Advisors", "Nationwide Financial Services Inc", 
"Nationwide Financial Services Inc", "Citizens Inc/TX", "Citizens Inc/TX", 
"Woodmen of the World Life Insurance Society"), portfolio = c(NA, 
"Multiple Portfolios", "T Rowe Price Balanced Fund Inc", "Multiple Portfolios", 
"NVIT Bond Index Fund", "Multiple Portfolios", "Nationwide Bond Index Fund", 
"Multiple Portfolios", "SECURITY INDUSTRIAL INSURANCE COMPANY", 
"Multiple Portfolios"), q1 = c(0, 0, 0, 450, 0, 450, 450, 1000, 
1000, 15000), q2 = c(0, 0, 0, 450, 0, 450, 450, 1000, 1000, 15000
), q3 = c(0, 0, 0, 450, 0, 450, 450, 1000, 1000, 15000), q4 = c(0, 
0, 0, 450, 265, 185, 185, 1000, 1000, 15000), ind = c(TRUE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), index = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), total = c(18635, 18635, 
18635, 18635, 18635, 18635, 18635, 18635, 18635, 18635), mp = c(0, 
1, 0, 1, 0, 1, 0, 1, 0, 1), total_1 = c(0, 0, 0, 450, 450, 635, 
635, 1635, 1635, 16635), match = c(FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)), row.names = c(NA, 
-10L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), groups = structure(list(
    index = 1L, .rows = list(1:10)), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE))
> head(short,10)

It is not really clear to me what your asking. You provided an input example, can you please provide an output example? This will likely be a duplicate question in SO (in other words, the answer you are looking for is likely already in SO). — steveb, Jun 07 '21 at 21:33
@steveb I would like to add a categorical variable with the following values: group=c("T Rowe","T Rowe","T Rowe", "Nationwide","Nationwide","Nationwide","Nationwide","Citizens","Citizens","Woodmen") — Hanna Onyshchenko, Jun 07 '21 at 22:09
@HannaOnyshchenko In general it is best to update the question with the clarification instead of adding to a comment. You should add the output data.frame you would like. You may want to have a look at the link for [asking a good R question](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Also, you should google using factors in R. That is most likely what you are looking for. — steveb, Jun 07 '21 at 23:41
@HannaOnyshchenko Another reason for a well formed question, where one can grab the data you provided (you did that in the snippet), and the exact output you desire (and / or what you want to do to the data), is it makes it much easier for others to help. Your question may get answered within a minute or so, instead of hours or days. — steveb, Jun 07 '21 at 23:45

score 0 · Answer 1 · answered Jun 07 '21 at 23:48

Here is an idea how to homogenize the company name with a regular expression.

your_data %>%
  mutate(company_index = company %>%
           tolower() %>%
           str_replace("([a-z]{2})[^a-z].*", "\\1") %>%
           str_to_title(),
         .before = 1)

Returning:

# A tibble: 10 x 15
# Groups:   index [1]
   company_index bond    year company  portfolio     q1
   <chr>         <chr>  <dbl> <chr>    <chr>      <dbl>
 1 T Rowe        PFE 4…  2007 T Rowe … NA             0
 2 T Rowe        PFE 4…  2007 T Rowe … Multiple …     0
 3 T Rowe        PFE 4…  2007 T ROWE … T Rowe Pr…     0
 4 Nationwide    PFE 4…  2007 Nationw… Multiple …   450
 5 Nationwide    PFE 4…  2007 Nationw… NVIT Bond…     0
 6 Nationwide    PFE 4…  2007 Nationw… Multiple …   450
 7 Nationwide    PFE 4…  2007 Nationw… Nationwid…   450
 8 Citizens      PFE 4…  2007 Citizen… Multiple …  1000
 9 Citizens      PFE 4…  2007 Citizen… SECURITY …  1000
10 Woodmen       PFE 4…  2007 Woodmen… Multiple … 15000
# … with 9 more variables: q2 <dbl>, q3 <dbl>,
#   q4 <dbl>, ind <lgl>, index <int>, total <dbl>,
#   mp <dbl>, total_1 <dbl>, match <lgl>

Thank you! I used your code to reconstruct the dataset! – Hanna Onyshchenko Jun 08 '21 at 23:14 — Hanna Onyshchenko, Jun 08 '21 at 23:14

score 0 · Answer 2 · answered Jun 08 '21 at 23:21

The solution to my problem:

test=test%>%mutate(company_index = company %>%
     tolower() %>%
         str_replace("([a-z]{2})[^a-z].*", "\\1") %>%
         str_to_title(),
       .before = 1)

     test=test[!duplicated(data.table::rleid(test$portfolio,test$year,test$bond)),]        
    shortest=test
    shortest=shortest[!is.na(shortest$company),]
    for (i in 1:nrow(shortest)){
    if (i==1){
    #shortest$d[i]=1
    new=shortest[FALSE,]
    past=shortest[i,]
    new[i,]=past
    i=i+1
      }
      else{
        if (past$portfolio=="Multiple Portfolios"|is.na(past$portfolio))
    {
      new[nrow(new)+1,]=shortest[i,]
      past=shortest[i,]
      past2=shortest[i-1,]
      i=i+1
    }
    else
    { d=stringdist(shortest$company_index[i], past$company_index, method = "soundex")*stringdist(shortest$company_index[i], past2$company_index, method = "soundex")
      if (d==1){
      new[nrow(new)+1,]=shortest[i,]
      new$portfolio[nrow(new)]=NA
      new[nrow(new)+1,]=shortest[i,]
      past=shortest[i,]
      past2=shortest[i-1,]
      i=i+1
      }
    else {
      new[nrow(new)+1,]=shortest[i,]
      past=shortest[i,]
      past2=shortest[i-1,]
      i=i+1
    }
    }
      }

    }
    new=new[!duplicated(data.table::rleid(new$portfolio,new$year,new$bond)), ]

How do I group similar strings and create the index variable for each group in R?

2 Answers2