I have strings that were formed by concatenating several names, and I would like to split them apart again. I am dealing with things such as

name="o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"

which in this case should be split into "o-n-Butylhydroxylamine", "1-Methylpropylhydroxylamine" and "Amino-2-butanol".

How could I use strsplit and/or a gsub regular expression to achieve this? The rule I would like to apply is to split wherever a lowercase letter is followed by a digit, an opening bracket ("("), or a capital letter.

Tom Wenseleers

3 Answers

You could use positive look-around assertions (a lookbehind plus a lookahead) to find, and then split at, the inter-character positions that are preceded by a lowercase letter and followed by an uppercase letter, a digit, or a "(".

name <- "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
pat <- "(?<=[[:lower:]])(?=[[:upper:][:digit:](])"
strsplit(name, pat, perl=TRUE)
# [[1]]
# [1] "o-n-Butylhydroxylamine"      "1-Methylpropylhydroxylamine"
# [3] "Amino-2-butanol"
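
The sample name contains no opening parenthesis, so the "(" part of the pattern is not exercised above. A minimal sketch that checks it, using a made-up name (`x` is hypothetical and not from the question):

```r
# Same pattern as above: split between a lowercase letter and an
# uppercase letter, a digit, or "(".
pat <- "(?<=[[:lower:]])(?=[[:upper:][:digit:](])"

# Hypothetical input, invented for illustration only.
x <- "ethanol(2R)-butanolPropane"

# strsplit returns a list; [[1]] extracts the character vector.
parts <- strsplit(x, pat, perl = TRUE)[[1]]
print(parts)
```

Note that the lookbehind matters here: with only the lookahead, strsplit treats a zero-length match at the start of the remaining string specially and would split off single characters.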
Josh O'Brien
  • Many thanks to all for the answers - I'll accept this one as it's closest to my exact question, but all seem to work perfectly!! thx!! – Tom Wenseleers Jan 16 '14 at 22:15
  • Josh, I am learning something here. It looks like you can include the `(` within the `[ ]` set without needing to escape it.. do I understand that correctly? – Ricardo Saporta Jan 16 '14 at 22:39
  • @RicardoSaporta -- Yep, that's right. Quoting from `?regex`, "only `'^ - \ ]'` are special inside character classes" (where "character class" refers to that outer pair of brackets, `[]`). – Josh O'Brien Jan 16 '14 at 22:43
  • This is a sexayyyy regex – Carl Boneri Feb 24 '17 at 18:43
strsplit(name, "(?<=([a-z]))(?=[A-Z]|[0-9]|\\()", perl=TRUE)
# [[1]]
# [1] "o-n-Butylhydroxylamine"      "1-Methylpropylhydroxylamine" "Amino-2-butanol"

Remember that the return value is a list, so use [[1]] if appropriate.
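
As a rough sketch of why that matters: strsplit is vectorised, so it returns one list element per input string. The vector `names_vec` below is made up for illustration.

```r
pat <- "(?<=[a-z])(?=[A-Z]|[0-9]|\\()"

# Hypothetical inputs, invented for illustration only.
names_vec <- c("o-n-Butylhydroxylamine1-Methylpropylhydroxylamine",
               "Amino-2-butanolEthanol")

parts <- strsplit(names_vec, pat, perl = TRUE)

length(parts)   # one list element per input string
parts[[1]]      # pieces of the first string only
unlist(parts)   # all pieces flattened into a single character vector
```

Use `[[1]]` when there is a single input string, and `unlist()` only when it does not matter which input each piece came from.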

Ricardo Saporta

Try this:

name="o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
print(strsplit(gsub("([a-z])(\\d)","\\1#\\2",
                    gsub("([a-z])([A-Z])","\\1#\\2",name)),"#")[[1]])

It treats a lowercase letter followed by a digit, or by a capital letter, as a split point. It does not cover the "(" case from the question, and it relies on "#" not occurring in the name.
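
One possible variant of the same marker idea, sketched under two assumptions: that "#" never appears in the input, and that the "(" case from the question should be covered as well. It inserts the marker in a single gsub call using a lookahead, then splits on it.

```r
name <- "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"

# Insert "#" after any lowercase letter that is followed by an uppercase
# letter, a digit, or "(". Assumes "#" never occurs in the input.
marked <- gsub("([a-z])(?=[A-Z0-9(])", "\\1#", name, perl = TRUE)

# Split on the literal marker; fixed = TRUE avoids regex interpretation.
parts <- strsplit(marked, "#", fixed = TRUE)[[1]]
print(parts)
```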

crogg01