I'm working with questionnaire datasets from which I need to extract brand names that appear in several questions. The problem is that each dataset might phrase the question differently, for example:

Data #1
What do you know about AlphaToy?

Data #2
What comes to your mind when you heard AlphaCars?

Data #3
What do you think of FoodTruckers?

What I want to extract are the words AlphaToy, AlphaCars, and FoodTruckers. In Excel, I can get those brand names via Flash Fill; the illustration is below.

As I'm working in R, I need to convert the Flash Fill step into an R function, but I couldn't figure out how to do it. Here's the sample data and the desired output:

brandName <- list(
  Toy = c(
    "1. What do you know about AlphaToy?",
    "2. What do you know about BetaToyz?",
    "3. What do you know about CharlieDoll?",
    "4. What do you know about DeltaToys?",
    "5. What do you know about Echoty?"
  ),
  Car = c(
    "18. What comes to your mind when you heard AlphaCars?",
    "19. What comes to your mind when you heard BestCar?",
    "20. What comes to your mind when you heard CoolCarz?"
  ),
  Trucker = c(
    "5. What do you think of FoodTruckers?",
    "6. What do you think of IceCreamTruckers?",
    "7. What do you think of JellyTruckers?",
    "8. What do you think of SodaTruckers?"
  )
)

extractBrandName <- function(...) {
  #some codes here
}

#desired output
> extractBrandName(brandName$Toy)
[1] "AlphaToy"    "BetaToyz"    "CharlieDoll" "DeltaToys"   "Echoty"

As the title says, the function should work on dynamic strings, so when it is applied to each element of brandName the desired output is:

> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy"    "BetaToyz"    "CharlieDoll" "DeltaToys"   "Echoty"     

$Car
[1] "AlphaCars" "BestCar"   "CoolCarz" 

$Trucker
[1] "FoodTruckers"     "IceCreamTruckers" "JellyTruckers"    "SodaTruckers"

Edit:

  • The brand name can be in lowercase, in uppercase, or consist of two or more words, for instance: IBM, Louis Vuitton
  • The brand name might appear in the middle of the sentence; it doesn't always come at the end. The sentences are unpredictable because each client might provide data that differs from the others

Can anyone help me with the function code to achieve the desired output? Thank you in advance!

Edit: here's my attempt

The idea (thanks to shs' answer) is to find the words that all input sentences share, then exclude them, leaving only the unique words (which should be the brand names). Following this post, I use intersect() wrapped inside Reduce() to get the common words, then I exclude them via lapply() and make sure brand names of two or more words are merged back together with str_c(collapse = " ").

Code

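library(stringr)

extractBrandName <- function(x) {
  # strip leading numbering and punctuation, then split each sentence into words
  cleanWords <- x %>%
    str_remove_all("^\\d+|\\.|,|\\?") %>%
    str_squish() %>%
    str_split(" ")
  # words that occur in every sentence
  commonWords <- cleanWords %>%
    Reduce(intersect, .)
  # drop the common words; re-join what is left so multi-word brands stay together
  extractedWords <- cleanWords %>%
    lapply(., function(y) {
      y[!y %in% commonWords] %>%
        str_c(collapse = " ")
    }) %>% unlist()
  return(extractedWords)
}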
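
To make the intermediate step concrete, here is a minimal base R sketch of what Reduce(intersect, ...) produces for three of the Toy questions. The variable names questions, commonWords, and brands are only for illustration, and the cleaning regex is a simplified stand-in for the stringr calls above.

```r
# Three of the sample questions from brandName$Toy
questions <- c(
  "1. What do you know about AlphaToy?",
  "2. What do you know about BetaToyz?",
  "3. What do you know about CharlieDoll?"
)

# Simplified cleaning (an assumption for this sketch): drop "N. ", "?" and ","
cleanWords <- strsplit(gsub("^\\d+\\.\\s*|[?,]", "", questions), " ")

# Reduce() folds intersect() over the list, so only words found in
# every sentence survive
commonWords <- Reduce(intersect, cleanWords)
# commonWords is c("What", "do", "you", "know", "about")

# Whatever is not common to all sentences is treated as the brand name
brands <- vapply(
  cleanWords,
  function(w) paste(w[!w %in% commonWords], collapse = " "),
  character(1)
)
# brands is c("AlphaToy", "BetaToyz", "CharlieDoll")
```

Two things follow from this: at least two sentences are needed (with a single sentence every word counts as common, so nothing is left), and a brand name that contains one of the common words would lose that word.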

Output (1st test case)

> #output
> extractBrandName(brandName$Toy)
[1] "AlphaToy"    "BetaToyz"    "CharlieDoll" "DeltaToys"   "Echoty"     
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy"    "BetaToyz"    "CharlieDoll" "DeltaToys"   "Echoty"     

$Car
[1] "AlphaCars" "BestCar"   "CoolCarz" 

$Trucker
[1] "FoodTruckers"     "IceCreamTruckers" "JellyTruckers"    "SodaTruckers"    

Output (2nd test case)

This test case includes brand names of two or more words, located in the middle and at the beginning of the sentence.

brandName2 <- list(
  Middle = c("Have you used any products from AlphaToy this past 6 months?",
             "Have you used any products from BetaToys Collection this past 6 months?",
             "Have you used any products from Charl TOYZ this past 6 months?"),
  First = c("AlphaCars is the best automobile dealer, yes/no?",
            "Best Vehc is the best automobile dealer, yes/no?",
            "CoolCarz & Bike is the best automobile dealer, yes/no?")
)

> #output
> lapply(brandName2, extractBrandName)
$Middle
[1] "AlphaToy"            "BetaToys Collection" "Charl TOYZ"         

$First
[1] "AlphaCars"       "Best Vehc"       "CoolCarz & Bike"

In the end, I found a solution to this problem. Thanks to shs, who gave the initial idea, and to the answer from the post I linked above. If you have any suggestions, please feel free to comment. Thank you.

rifset
  • Would the target names for extraction, such as `AlphaToy`, always have two capital letters? – Tim Biegeleisen Sep 03 '21 at 11:16
  • @TimBiegeleisen no, it might be in lowercase, uppercase, or even two words or more. edited. – rifset Sep 03 '21 at 11:18
  • Could you get a list of the brand names and run a `str_match()`? This would make it so that if the brand name is present in your list of strings it will return the brand that it matched with, or do you have no knowledge of all of the brands beforehand? – Hansel Palencia Sep 03 '21 at 11:29
  • Also, are the sentences always the same structure? – Hansel Palencia Sep 03 '21 at 11:34
  • Unfortunately I can't get a list of all the brand names, as they are small local brands. The question sentences might differ, but each has a brand name in it. – rifset Sep 03 '21 at 11:45
  • Have you come up with a rule to identify brand name? – Ronak Shah Sep 03 '21 at 12:40
  • @RonakShah I don't quite understand what you mean by "rule". Could you please elaborate? – rifset Sep 03 '21 at 13:16
  • I mean unless you are planning to use ML or AI for this task, you need to explicitly "tell" the program what you want. For example, `subset(mtcars, cyl == 4)` will give you rows that have a `cyl` value of 4. Not 4.1 or 3.9 or 100, but exactly 4. If you don't have an exact pattern that you are looking for, then it will be difficult to solve this problem. – Ronak Shah Sep 03 '21 at 13:25
  • I don't think it requires ML or AI; I think it can be done using analytical skills only. @shs' answer gave the idea of finding the similarity between the sentences and then extracting the unique brand names from them. The question sentence within one dataset is always the same but might differ in another dataset. I haven't figured out how to combine these ideas. – rifset Sep 03 '21 at 14:04
  • You could do the same thing I did to identify common words on the left for the right side as well, and thus eliminate both. I get the feeling though, if I wrote an answer like that, then you would come up with another case that should be recognized. Because you apparently have not properly thought this through and just want the function to magically adapt to any edge case it encounters – shs Sep 04 '21 at 09:24
  • Yes, I'm working on the code at the moment. THANKS for the idea and the comment @shs – rifset Sep 04 '21 at 13:28

1 Answer


This function checks which words the first two strings have in common and then removes everything from the beginning of the strings up to and including the common element, leaving only the desired part of the string:

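library(stringr)

extractBrandName <- function(x) {
  x %>%
    str_split(" ") %>%
    # keep the words of the first string that also occur in the second
    {.[[1]][.[[1]] %in% .[[2]]]} %>%
    str_c(collapse = " ") %>%
    # build a pattern matching everything up to and including the common words
    str_c("^.+", .) %>%
    str_remove(x, .) %>%
    str_squish() %>%
    # drop the trailing question mark
    str_remove("\\?")
}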

lapply(brandName, extractBrandName)
#> $Toy
#> [1] "AlphaToy"    "BetaToyz"    "CharlieDoll" "DeltaToys"   "Echoty"     
#> 
#> $Car
#> [1] "AlphaCars" "BestCar"   "CoolCarz" 
#> 
#> $Trucker
#> [1] "FoodTruckers"     "IceCreamTruckers" "JellyTruckers"    "SodaTruckers"
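
For brand names that are not at the end of the sentence, one possible extension is to strip the words shared on the left and on the right separately, as suggested in the comments on the question. The following is only a sketch under that assumption, not part of the code above: extractBetween and commonRun are hypothetical names, and the second test sentence (with "Beta Toys") is made up for illustration.

```r
# Hypothetical helper: drop the words that all strings share at the start
# and at the end, and keep what lies in between.
extractBetween <- function(x) {
  # Remove leading numbering such as "1. " and the punctuation "?" and ","
  x <- gsub("^\\d+\\.\\s*|[?,]", "", x)
  words <- strsplit(x, " +")

  # Number of leading positions at which every string has the same word
  commonRun <- function(w) {
    n <- min(lengths(w))
    same <- vapply(seq_len(n), function(i) {
      length(unique(vapply(w, `[`, "", i))) == 1
    }, logical(1))
    sum(cumprod(same))  # length of the initial run of TRUE values
  }

  nPre <- commonRun(words)               # shared words on the left
  nSuf <- commonRun(lapply(words, rev))  # shared words on the right

  vapply(words, function(w) {
    keep <- seq_along(w) > nPre & seq_along(w) <= length(w) - nSuf
    paste(w[keep], collapse = " ")
  }, character(1))
}

result <- extractBetween(c(
  "Do you think AlphaToy is the best?",
  "Do you think Beta Toys is the best?"
))
# result is c("AlphaToy", "Beta Toys")
```

Because the comparison is positional, a word inside the brand name survives even if it also occurs elsewhere in every sentence. Like the function above, it needs at least two strings that differ only in the brand name.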
shs
  • What if the brand is in the middle of the sentence? For example, "Do you think AlphaToy is the best?" causes the function to fail. It's a good start, but not a robust solution. – Hansel Palencia Sep 03 '21 at 12:03
  • Sure, it does not do that, but it is also not what OP asked for – shs Sep 03 '21 at 12:50
  • The comment section states that the questions might differ, therefore your solution might not work. (i.e. not making it a robust solution) – Hansel Palencia Sep 03 '21 at 12:55
  • Having said that, your solution most definitely works on the original data the OP included with their question. – Hansel Palencia Sep 03 '21 at 12:55
  • Yes, @HanselPalencia is right, the brand names can appear in the middle of the sentence. I edited my question for clarity. – rifset Sep 03 '21 at 13:13
  • Edited, added my attempt using shs' idea. +1 – rifset Sep 06 '21 at 15:00