I'm removing English characters from Hebrew text, but I would like to keep a short list of English words, e.g. words2keep <- c("ok", "hello", "yes*"). My current regex is text <- gsub("[A-Z,a-z]", "", text). How can I add an exception so that it does not remove all English words?

Reproducible example:

text = "ok אני מסכים איתך Yossi Cohen"

After gsub with the exception:

text = "ok אני מסכים איתך"

Thank you for all suggestions

Dmitry Leykin
    This post seems like it might have your answer: http://stackoverflow.com/questions/2404010/match-everything-except-for-specified-strings – Mir Henglin Aug 21 '16 at 06:43

2 Answers

This is a tricky one. I think we can do it by matching whole words with the \b word boundary assertion, and placing a negative lookahead assertion just before the match that rejects the words (again, whole words) you want to preserve. In other words, the words in words2keep are whitelisted, and every other English word is removed. This appears to be working:

gsub(paste0('(?!\\b', paste(words2keep, collapse = '\\b|\\b'), '\\b)\\b[A-Za-z]+\\b'), '', text, perl = TRUE)
[1] "ok אני מסכים איתך  "
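The output above still has trailing spaces where the removed words were. As a minimal sketch, the same lookahead idea could be wrapped in a helper that also tidies the whitespace. The function name remove_english_except is hypothetical, not something from the question. Note that each entry in the keep list is treated as a regex fragment, so "yes*" matches "ye", "yes", "yess" and so on, and matching is case-sensitive.

```r
# Hypothetical helper (name is an assumption): drop English words
# except those in `keep`, then tidy the leftover whitespace.
remove_english_except <- function(text, keep) {
  # Builds e.g. (?!\b(?:ok|hello|yes*)\b)\b[A-Za-z]+\b
  # The negative lookahead blocks a match that starts at a kept word.
  pattern <- paste0(
    "(?!\\b(?:", paste(keep, collapse = "|"), ")\\b)",
    "\\b[A-Za-z]+\\b"
  )
  stripped <- gsub(pattern, "", text, perl = TRUE)
  # Collapse runs of whitespace left behind and trim both ends.
  trimws(gsub("\\s+", " ", stripped))
}

words2keep <- c("ok", "hello", "yes*")
text <- "ok אני מסכים איתך Yossi Cohen"
result <- remove_english_except(text, words2keep)
print(result)
```

With the example from the question, this should leave only "ok" and the Hebrew words, with no trailing spaces.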
bgoldst

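Use gsub with the pattern [A-Z].*: [A-Z] matches the first uppercase letter and .* consumes every character after it, so everything from the first uppercase letter to the end of the string is removed. This works for the example because the unwanted words are capitalized and come at the end of the text.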

gsub("[A-Z].*","",text)

[1] "ok אני מסכים איתך "

#data

text = "ok אני מסכים איתך Yossi Cohen"
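A quick sketch of where this pattern's limit lies, using a made-up input (text2 is hypothetical, not from the question): if a capitalized word comes first, everything after it is removed too, including the Hebrew text and the word to keep.

```r
# Hypothetical input: the capitalized English word is at the start.
text2 <- "Yossi אני מסכים איתך ok"

# [A-Z] matches the leading "Y" and .* consumes the rest of the string,
# so nothing is left.
result2 <- gsub("[A-Z].*", "", text2)
print(result2)
```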
Arun kumar mahesh