I am trying to remove certain words from sentences according to specific conditions.
Let's say we have this data (a character matrix built with cbind):
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)
> df
     questions                           responses
[1,] "The highest mountain in the world" "The Himalaya"
[2,] "A cold war serie from 2013"        "The Americans"
[3,] "A kiwi which is not a fruit"       "A bird"
[4,] "Widest liquid area on earth"       "The Pacific ocean"
And the following vectors of specific words:
articles <- c("The","A")
geowords <- c("mountain","liquid area")
I would like to do two things:
1. Remove the article in first position in the responses column when the word right after it starts with a lower case letter.
2. Remove the article in first position in the responses column when the word right after it starts with an upper case letter AND a geoword appears in the corresponding question.
The expected result should be:
     questions                           responses
[1,] "The highest mountain in the world" "Himalaya"
[2,] "A cold war serie from 2013"        "The Americans"
[3,] "A kiwi which is not a fruit"       "bird"
[4,] "Widest liquid area on earth"       "Pacific ocean"
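To make the two rules concrete, here is a rough base R sketch of the logic as described above. It is only an illustration under assumptions, not a confirmed solution: the names art_pat, geo_pat, next_lower, next_upper_geo and strip are made up for this example, the article is assumed to be followed by a single space, and the articles and geowords are assumed to contain no regex special characters.

```r
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world", "A cold war serie from 2013",
               "A kiwi which is not a fruit", "Widest liquid area on earth")
articles <- c("The", "A")
geowords <- c("mountain", "liquid area")

# "^(The|A) " : an article in first position, followed by one space (assumption)
art_pat <- paste0("^(", paste(articles, collapse = "|"), ") ")
# "mountain|liquid area" : any geoword anywhere in the question
geo_pat <- paste(geowords, collapse = "|")

# Rule 1: the word after the article starts with a lower case letter
next_lower <- grepl(paste0(art_pat, "[[:lower:]]"), responses)
# Rule 2: the word after the article starts with an upper case letter
# AND the corresponding question contains a geoword
next_upper_geo <- grepl(paste0(art_pat, "[[:upper:]]"), responses) &
  grepl(geo_pat, questions)

# Strip the leading article only where one of the two rules applies
strip <- next_lower | next_upper_geo
responses[strip] <- sub(art_pat, "", responses[strip])

df <- cbind(questions, responses)
```

The idea is to compute one logical vector per rule with grepl and only then call sub on the selected rows, so the question column never has to be matched inside the same regex as the response.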
I tried gsub without success, as I'm not familiar with regex at all. I have searched Stack Overflow without finding a really similar problem. If an R and regex all-star could help me, I would be very thankful!