
I am trying to remove certain words from sentences according to specific conditions.

Let's say we have this data (a character matrix built with cbind):

responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)

> df
     questions                           responses          
[1,] "The highest mountain in the world" "The Himalaya"     
[2,] "A cold war serie from 2013"        "The Americans"    
[3,] "A kiwi which is not a fruit"       "A bird"           
[4,] "Widest liquid area on earth"       "The Pacific ocean"

And the following lists of specific words:

articles <- c("The","A")
geowords <- c("mountain","liquid area")

I would like to do 2 things:

  1. Remove the article in first position in the responses column when it is followed by a word starting with a lower-case letter

  2. Remove the article in first position in the responses column when (it is followed by a word starting with an upper-case letter) AND (a geoword is in the corresponding question)

The expected result should be:

     questions                           responses      
[1,] "The highest mountain in the world" "Himalaya"     
[2,] "A cold war serie from 2013"        "The Americans"
[3,] "A kiwi which is not a fruit"       "bird"         
[4,] "Widest liquid area on earth"       "Pacific ocean"

I tried gsub without success, as I'm not at all familiar with regex. I searched Stack Overflow without finding a really similar problem. If an R and regex all-star could help me, I would be very thankful!

Tau

4 Answers


The two conditions you mentioned are computed as two logical columns, then ifelse picks the rows to change and gsub removes the article:

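library(stringr)  # provides str_split()

responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- data.frame(cbind(questions, responses), stringsAsFactors = FALSE)

df

# Trailing space so the separator is removed together with the article
articles <- c("The ", "A ")
geowords <- c("mountain", "liquid area")

# TRUE when the second word of the response starts with an upper-case letter
df$f_caps <- unlist(lapply(df$responses, function(x) {
  grepl('[A-Z]', str_split(str_split(x, ' ', simplify = TRUE)[2], '', simplify = TRUE)[1])
}))

# TRUE when the question contains one of the geowords
df$geoword_flag <- grepl(paste(geowords, collapse = '|'), df[, 1])

# Strip the leading article (anchored with ^) when the next word is lower case,
# or when it is upper case and the question contains a geoword
df$new_responses <- ifelse((df$f_caps & df$geoword_flag) | !df$f_caps,
                           gsub(paste0('^(', paste(articles, collapse = '|'), ')'), '', df$responses),
                           df$responses)

df$new_responses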


> df$new_responses
[1] "Himalaya"      "The Americans" "bird"          "Pacific ocean"
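
For comparison, here is a minimal base R sketch of the same two conditions, written with anchored patterns so that only an article in first position can match. The names art, has_geo, next_lower and next_upper are made up for this illustration, and perl = TRUE is an assumption made so that the non-capturing group is handled by PCRE.

```r
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world", "A cold war serie from 2013",
               "A kiwi which is not a fruit", "Widest liquid area on earth")
articles <- c("The", "A")
geowords <- c("mountain", "liquid area")

# Leading article plus the whitespace after it (hypothetical helper pattern)
art <- paste0("^(?:", paste(articles, collapse = "|"), ")\\s+")

# One logical flag per condition
has_geo    <- grepl(paste(geowords, collapse = "|"), questions)
next_lower <- grepl(paste0(art, "[a-z]"), responses, perl = TRUE)
next_upper <- grepl(paste0(art, "[A-Z]"), responses, perl = TRUE)

# Rule 1: lower-case follower; rule 2: upper-case follower and a geoword question
new_responses <- ifelse(next_lower | (next_upper & has_geo),
                        sub(art, "", responses, perl = TRUE),
                        responses)
new_responses
```

On the sample data this should leave "The Americans" untouched, because its question contains no geoword.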
amrrs
  • Thanks amrrs, you're the master (I was far from getting the code working). Just one question: I don't really understand how stringsAsFactors = F works. Why does "The Americans" become "2" if I don't specify stringsAsFactors = F? – Tau Dec 01 '17 at 13:05
  • Without `stringsAsFactors = F` the columns are stored as factors, and ifelse returns the factor's integer code instead of the actual value, which is why keeping the column as character returns the right text. – amrrs Dec 01 '17 at 13:42
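
The factor behaviour discussed in these comments can be reproduced in a few lines. This is only a sketch: the keep vector and the "x" placeholder are made up for the illustration. Note also that since R 4.0.0 data.frame() defaults to stringsAsFactors = FALSE, so the problem mainly shows up on older versions or when factors are created explicitly.

```r
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")

# Levels are sorted alphabetically, so "The Americans" is level 2
as_factor <- factor(responses)

# Hypothetical selection: keep only the second response, replace the rest with "x"
keep <- c(FALSE, TRUE, FALSE, FALSE)

# With a factor, ifelse() copies the underlying integer code, not the label
out_factor <- ifelse(keep, as_factor, "x")

# With a character vector, the text itself is copied
out_char <- ifelse(keep, responses, "x")

out_factor
out_char
```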

For fun, here's a tidyverse solution:

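library(dplyr)    # mutate(), if_else(), %>%
library(stringr)  # str_detect(), str_replace(), regex()
library(tibble)   # as_tibble()

# Collapse the geowords into one alternation so that every question is tested
# against all of them (a bare vector would be recycled row by row)
geo_or <- paste0(geowords, collapse = "|")

df2 <-
  df %>%
  as_tibble() %>%
  mutate(responses =
        # geoword in the question: drop the first word before any letter
        if_else(str_detect(questions, geo_or),
                str_replace(string = responses,
                            pattern = regex("^\\w+\\b\\s(?=[A-Za-z])"),
                            replacement = ""),
                # no geoword: drop the first word only before a lower-case letter
                str_replace(string = responses,
                            pattern = regex("^\\w+\\b\\s(?=[a-z])"),
                            replacement = ""))
        )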

Edit: without the "first word" regex, with inspiration from @Calvin Taylor

# Assumes dplyr, stringr and tibble are loaded
# Define articles
articles <- c("The", "A")

# Make regex alternations
art_or <- paste0(articles, collapse = "|")
geo_or <- paste0(geowords, collapse = "|")

# Leading article before any letter / before a lower-case letter
art_any   <- paste0("^(?:", art_or, ")", "\\s", "(?=[A-Za-z])")
art_lower <- paste0("^(?:", art_or, ")", "\\s", "(?=[a-z])")

# Work on df
df4 <-
  df %>%
  as_tibble() %>%
  mutate(responses =
        if_else(str_detect(questions, geo_or),
                str_replace_all(string = responses,
                                pattern = regex(art_any),
                                replacement = ""),
                str_replace_all(string = responses,
                                pattern = regex(art_lower),
                                replacement = "")
                )
        )
meriops
  • By the way, I was wondering whether it would be more robust to use the articles list instead of the "first word" regex. My point is that this won't work in another language (like French) where an article may be attached to the following word without white space, e.g. "L'inspecteur Clouzot": in that case the "L'" won't be removed, because the third word is treated as the second one... – Tau Dec 01 '17 at 14:36
  • I solved the issue by changing the code from amrrs a bit: str_split(x,"[' ]+", simplify = T), but I don't know how to do it the tidyverse way... – Tau Dec 01 '17 at 15:07
  • Thanks meriops, very good solution too. I think you just have to replace "\\s*" with "\\s" in the regex definition; otherwise the "A" from "The Americans" will be removed... – Tau Dec 03 '17 at 14:01
  • @Tau: I edited but with the "positive look-ahead" following, the "A" shouldn't be eaten up, even with "\s*" (ie 0 or more whitespaces) – meriops Dec 04 '17 at 16:19
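
The "\\s*" versus "\\s" question in these comments can be checked directly. The sketch below uses base gsub with perl = TRUE rather than stringr, on the assumption that both engines treat this pattern the same way; the variable names are made up for the illustration. With an unanchored pattern, "\\s*" can match zero spaces, so the "A" inside "Americans" (followed by the lower-case "m") satisfies the look-ahead on its own.

```r
# Unanchored lower-case patterns, with and without a mandatory space
lower_star <- "(?:The|A)\\s*(?=[a-z])"
lower_one  <- "(?:The|A)\\s(?=[a-z])"

# Zero spaces allowed: the capital "A" of "Americans" is removed
res_star <- gsub(lower_star, "", "The Americans", perl = TRUE)

# One space required: nothing matches, the response is left alone
res_one <- gsub(lower_one, "", "The Americans", perl = TRUE)

# Anchoring at the start also protects the word, even with \\s*
res_anchored <- gsub(paste0("^", lower_star), "", "The Americans", perl = TRUE)

# The intended case still works
res_bird <- gsub(lower_one, "", "A bird", perl = TRUE)

c(res_star, res_one, res_anchored, res_bird)
```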

I taught myself some R today. I used a function to get the same result.

#!/usr/bin/env Rscript

# References
# https://stackoverflow.com/questions/1699046/for-each-row-in-an-r-dataframe

responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)

articles <- c("The","A")
geowords <- c("mountain","liquid area")

common_pattern <- paste( "(?:", paste(articles, "", collapse = "|"), ")", sep = "")
pattern1 <- paste(common_pattern, "([a-z])", sep = "")
pattern2 <- paste(common_pattern, "([A-Z])", sep = "")
geo_pattern <- paste(geowords, collapse = "|")

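# Return the cleaned response for one row (x[1] = question, x[2] = response)
f <- function(x) {
  q <- x[1]
  r <- x[2]
  # Rule 1: always drop an article followed by a lower-case letter
  a1 <- gsub(pattern1, "\\1", r, perl = TRUE)
  # Rule 2: drop an article followed by an upper-case letter only for geoword questions
  if (grepl(geo_pattern, q)) {
    a1 <- gsub(pattern2, "\\1", a1, perl = TRUE)
  }
  a1
}

apply(df, 1, f)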

Running it:

Rscript stacko.R
[1] "Himalaya"      "The Americans" "bird"          "Pacific ocean"
Calvin Taylor

You can use a simple regex with ifelse, grepl and gsub, as below:

# cbind() gives a matrix, so convert it to a data frame; stringsAsFactors = FALSE keeps the columns as character
df <- data.frame(cbind(questions, responses), stringsAsFactors = FALSE)
regx <- paste0(geowords, collapse = "|")        # "or" condition between the geowords
articlegrep <- paste0(articles, collapse = "|") # "or" condition between the articles
# Drop the first word when the question has a geoword, or when a leading article precedes a lower-case word
df$responses <- ifelse(grepl(regx, df$questions) | grepl(paste0("^(", articlegrep, ")\\s[a-z]"), df$responses),
       gsub("^\\w+ (.*)", "\\1", df$responses), df$responses)

> print(df)
                          questions     responses
#1 The highest mountain in the world      Himalaya
#2        A cold war serie from 2013 The Americans
#3       A kiwi which is not a fruit          bird
#4       Widest liquid area on earth Pacific ocean
PKumar