Could someone help me with a regular expression for this? I am really struggling.

Basically, I want to write a regular expression that separates a string into two substrings.

For example, I want to separate the full string below into "comp99810_c0_seq1" and "|m.8409".

test <- "comp99810_c0_seq1|m.8409" 
c1 <- sub("([A-Za-z1-9])(\\|)(m.\\d+)", "\\1", test) 
c2 <- sub("([A-Za-z1-9])(\\|)(m.\\d+)", "\\2\\3", test) 

I was able to get c1 to work but not c2. Can somebody help me?

Thanks, Upendra

upendra
  • What programming language is this? Doesn't it have a `split` function that you can use to split on `|`? – jwodder Mar 21 '14 at 00:44
  • What is the separator? Just `|`? What is the expected pattern of both sides? – Szymon Mar 21 '14 at 00:44
  • Define "won't work": you are using the same regex, so how is one "not working"? Also, what host language are you using? It looks like R. –  Mar 21 '14 at 00:44
  • possible duplicate of [Regex group capture in R](http://stackoverflow.com/questions/952275/regex-group-capture-in-r) –  Mar 21 '14 at 00:49

2 Answers

Try using a function similar to split("|") from the language you are using.

Alternatively, change [A-Za-z1-9] to \\w+ and your regex will work.

Currently your character class matches only a single character, whereas \\w+ matches one or more characters from a-z, A-Z, 0-9 and _.
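
As a minimal sketch of both suggestions, assuming the language is R (as the question's code suggests): the first part swaps in \\w+, the second uses strsplit with fixed = TRUE so the pipe is treated literally. The variable names pattern, parts, c1_split and c2_split are illustrative only, and the dot is escaped here as m\\. on the assumption that a literal dot is intended.

```r
test <- "comp99810_c0_seq1|m.8409"

# \\w+ matches one or more of [A-Za-z0-9_], so group 1 spans the whole prefix
pattern <- "(\\w+)(\\|)(m\\.\\d+)"
c1 <- sub(pattern, "\\1", test)      # keep only the part before the pipe
c2 <- sub(pattern, "\\2\\3", test)   # keep the pipe and everything after it

# Alternative without a regex: split on the literal pipe character
parts <- strsplit(test, "|", fixed = TRUE)[[1]]
c1_split <- parts[1]
c2_split <- paste0("|", parts[2])    # strsplit drops the separator, so re-add it
```

Note that strsplit removes the separator, so the pipe has to be pasted back if the second piece should start with it.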

Sabuj Hassan
  • I did this instead and it worked... `c1 <- sub("(\\w+\\d+)(\\|)(m.\\d+)", "\\1", test)` and `c2 <- sub("(\\w+\\d+)(\\|)(m.\\d+)", "\\2\\3", test)`. Thanks for the hint though... – upendra Mar 21 '14 at 00:55
  • The `\\d+` will hurt you when the input is like `comp99810_c0_seq|m.8409`. Try that input now, then remove the `\\d+` and try again :-) – Sabuj Hassan Mar 21 '14 at 00:58
  • You are right. Even though all of my inputs contain both words and numbers, so `\\d+` works, I don't need it. Thanks again. – upendra Mar 21 '14 at 01:30
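If you don't want to split on "\|", the issue is that the first group is missing a quantifier, e.g. ([A-Za-z1-9]+) or ([A-Za-z1-9]*). As written, it matches only a single character from that set and then looks for the pipe. Put the quantifier inside the parentheses so the group captures the whole run rather than only its last character, and note that the class must also include 0 and _ to cover the entire prefix of your example.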
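
A small sketch of the difference, assuming R as in the question. The widened class [A-Za-z0-9_] is an assumption added here so that the 0 and _ in the sample input are covered; single_char and whole_prefix are illustrative names.

```r
test <- "comp99810_c0_seq1|m.8409"

# Original pattern: the class matches exactly one character, so only
# "1|m.8409" is matched and the rest of the prefix is left in place
single_char <- sub("([A-Za-z1-9])(\\|)(m.\\d+)", "\\2\\3", test)

# Quantifier inside the group, class widened to include 0 and _:
# group 1 now spans the whole prefix, so the replacement drops all of it
whole_prefix <- sub("([A-Za-z0-9_]+)(\\|)(m.\\d+)", "\\2\\3", test)

print(single_char)
print(whole_prefix)
```

With the original pattern the result still carries most of the prefix, which is why c2 looked wrong while c1 happened to look right.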

Phoenix