gsub with exception in R

Question

I'm removing English characters from Hebrew text, but I would like to keep a short list of English words, e.g. words2keep <- c("ok", "hello", "yes*"). My current regex is text <- gsub("[A-Z,a-z]", "", text), but how can I add an exception so that it does not remove all English words?

Reproducible example:

text = "ok אני מסכים איתך Yossi Cohen"

After gsub with the exception:

text = "ok אני מסכים איתך"
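For reference, a minimal sketch of what the current regex does on this example. The variable name stripped is only for illustration. Note that the comma inside the brackets is a literal character, so commas are removed as well.

```r
text <- "ok אני מסכים איתך Yossi Cohen"

# Every ASCII letter (and any comma) is deleted, including the wanted "ok".
stripped <- gsub("[A-Z,a-z]", "", text)

# The comma in the character class is matched literally.
no_commas <- gsub("[A-Z,a-z]", "", "a,b 1,2")
# no_commas is " 12"
```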

Thank you for all suggestions

This post seems like it might have your answer: http://stackoverflow.com/questions/2404010/match-everything-except-for-specified-strings – Mir Henglin, Aug 21 '16 at 06:43

score 3 · Accepted Answer · answered Aug 21 '16 at 07:12

This is a tricky one. I think we can do it by matching whole words with the \b word-boundary assertion, preceded by a negative lookahead assertion that rejects any of the words (again, whole words) that you want to keep. This appears to be working:

gsub(paste0('(?!\\b', paste(collapse='\\b|\\b', words2keep), '\\b)\\b[A-Za-z]+\\b'), '', text, perl=TRUE)
[1] "ok אני מסכים איתך  "
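To make the moving parts easier to see, here is the same idea as a sketch built up in steps. The names keep, pattern and cleaned are only illustrative, and the final whitespace cleanup is an extra step, not part of the one-liner above; I'm assuming you want the result without the leftover spaces, as in the question's expected output.

```r
# Inputs taken from the question's example.
words2keep <- c("ok", "hello", "yes*")
text <- "ok אני מסכים איתך Yossi Cohen"

# Wrap each kept word in word boundaries and join with "|":
# \bok\b|\bhello\b|\byes*\b
keep <- paste0("\\b", words2keep, "\\b", collapse = "|")

# The negative lookahead rejects kept words; whatever else is a whole
# ASCII-letter word gets matched (and therefore removed).
pattern <- paste0("(?!", keep, ")\\b[A-Za-z]+\\b")

# Lookaheads need the PCRE engine, hence perl = TRUE.
cleaned <- gsub(pattern, "", text, perl = TRUE)

# Removed words leave their surrounding spaces behind, so collapse runs
# of spaces and trim the ends (trimws() needs R >= 3.2.0).
cleaned <- trimws(gsub(" +", " ", cleaned))
# cleaned is "ok אני מסכים איתך"
```

One caveat: the entries of words2keep are used as regex fragments, so "yes*" means "ye" followed by zero or more "s" (it keeps "ye", "yes", "yess"), not "any word starting with yes". If the latter is what you meant, something like "yes[A-Za-z]*" would probably be closer.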

Arun kumar mahesh · Answer 2 · 2016-08-21T07:55:13.863

0

Use gsub with [A-Z], which matches an uppercase letter from A to Z. Appending .* extends the match to the rest of the string, so everything from the first uppercase letter onward is removed:

gsub("[A-Z].*","",text)

[1] "ok אני מסכים איתך "

#data

text = "ok אני מסכים איתך Yossi Cohen"

edited Aug 21 '16 at 07:55

answered Aug 21 '16 at 07:52


2

This is just working by chance and has nothing to do with the question – David Arenburg Aug 21 '16 at 07:53
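A quick sketch of why that pattern only happens to work here. The second and third inputs are made-up variations of the question's example, not data from the original post.

```r
pattern <- "[A-Z].*"

# The question's example: the unwanted words are capitalised and come last.
as_posted <- gsub(pattern, "", "ok אני מסכים איתך Yossi Cohen")
# as_posted is "ok אני מסכים איתך " (one trailing space)

# Hypothetical variation 1: lowercase unwanted words are not touched at all.
lowercase <- gsub(pattern, "", "ok אני מסכים איתך yossi cohen")
# lowercase is unchanged

# Hypothetical variation 2: an early capital letter wipes out everything
# after it, Hebrew text and the wanted "ok" included.
capital_first <- gsub(pattern, "", "Yossi אני מסכים איתך ok")
# capital_first is ""
```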
