Regex - Extract a substring from a given string

Question

I have a string: `This is a string: AAA123456789`.

The idea is to extract the substring `AAA123456789` using a regex.

I am using this together with XPath.

Note: If there is already a post about this, kindly point me to it.

I think I should use something like `substring(myNode, [^AAA\d+{9}])`.

I am not really sure about the regex part.

The idea is to extract the substring that starts with "AAA" followed by digits only, and exactly 9 consecutive digits.

What have you tried? You must have at least one regex you have tried that didn't work (unless you just came in and expected us to do your work for you...) — John3136, Sep 20 '12 at 07:00
Will the string always have the same format? And will you always have exactly nine digits? Because if so, you don't need a regex, just simple substring processing. — Jere Käpyaho, Sep 20 '12 at 07:01
Almost right, just use `\d{9}` (both `+` and `{9}` are repeat operators). — nneonneo, Sep 20 '12 at 07:15
The string isn't always in the same format. It can be "This is a string: AAA123456789 but not an double", so I can't really use the plain XPath string functions. The foolproof solution is to extract the substring AAA123456789. I have tried the expression shown above, but maybe it's wrong. The other way is what Jere mentioned, which is common but not foolproof: `substring-after(upper-case(myNode), "STRING")`. Then again, this is not suitable. Thanks — Vincent, Sep 20 '12 at 07:24
Alright, this is what is being tried at the moment: `substring(upper-case(myNode), "[^AAA\d{9}]")`. This fails because the 2nd parameter of substring must be a number, so it can't make sense of the pattern. I also tried converting myNode with string(). Advice please — Vincent, Sep 20 '12 at 07:33
Which regex flavor does your XPath processor use? If you don't know, which XPath processor do you use? — Stephan, Sep 20 '12 at 08:01

Dimitre Novatchev · Accepted Answer · 2012-09-20T13:01:15.307

11

Pure XPath solution:

substring-after('This is a string: AAA123456789', ': ')

produces:

AAA123456789
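
For comparison outside XPath, the same "everything after the first separator" idea can be sketched in Python. The helper name `substring_after` is made up here to echo the XPath function; it is not part of the Python standard library, and the sketch assumes a non-empty separator.

```python
def substring_after(text: str, sep: str) -> str:
    """Return the part of text after the first occurrence of sep.

    Like XPath substring-after() with a non-empty separator, this
    yields an empty string when sep does not occur in text.
    """
    # partition() returns (head, sep, tail); tail is "" if sep is absent
    _head, _found, tail = text.partition(sep)
    return tail


print(substring_after("This is a string: AAA123456789", ": "))
```

As in the XPath version, this only works while nothing follows the code in the input string.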

XPath 2.0 solutions:

tokenize('This is a string: AAA123456789 but not an double',
              ' '
              )[starts-with(., 'AAA')]

or:

tokenize('This is a string: AAA123456789 but not an double',
              ' '
              )[matches(., 'AAA\d+')]

or:

replace('This is a string: AAA123456789 but not an double',
              '^.*?(A+\d+).*$',
              '$1'
              )
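
The tokenize-and-filter idea and the capture-group replace can be tried out with Python's `re` module, which is used here only as a stand-in for an XPath 2.0 processor (its replacement syntax is `\1` rather than `$1`). The sketch also compares a greedy leading `.*` with a reluctant `.*?`, since that choice decides where the captured group starts.

```python
import re

TEXT = "This is a string: AAA123456789 but not an double"

# Analogue of tokenize(..., ' ')[matches(., 'AAA\d+')]:
# split on spaces, keep tokens containing AAA followed by digits.
tokens = [tok for tok in TEXT.split(" ") if re.search(r"AAA\d+", tok)]

# Capture-group replace with a greedy leading .*
# The greedy .* gives back as little as possible, so the group
# starts at the last 'A' in front of the digits.
greedy = re.sub(r"^.*(A+\d+).*$", r"\1", TEXT)

# Same pattern with a reluctant leading .*?
# The group now starts at the first 'A' of the run.
reluctant = re.sub(r"^.*?(A+\d+).*$", r"\1", TEXT)

print(tokens)     # ['AAA123456789']
print(greedy)     # A123456789
print(reluctant)  # AAA123456789
```

If the input contains no match at all, the substitution leaves the string unchanged, so a caller may want to test for a match first.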

edited Sep 20 '12 at 13:01

answered Sep 20 '12 at 12:49

Dimitre Novatchev

Hi Dimitre, can you please explain the replace part. This is what I understand, I think, replace everything else but *(A+\d+).*$ with $1...is this correct...but what does $1 do... – Vincent Sep 21 '12 at 08:45
@Vincent, This means: Replace the whole string (if it contains a substring of the form `A+\d+` ) with just the subexpression that is inside the (1st pair of) brackets. The third argument of `replace` must contain a string specifying with what to replace every target. It allows "capture references" by number (position). Read more about `replace()` here: http://www.w3.org/TR/xpath-functions/#func-replace – Dimitre Novatchev Sep 21 '12 at 11:36
Sir, is there an XPath question that you have not answered, or do not know the answer to? :-) – Alptigin Jalayr Aug 30 '13 at 19:02
@AlptiginJalayr, It can be immediately seen that I haven't attempted to answer all XPath questions at SO. :) – Dimitre Novatchev Aug 30 '13 at 19:39
The possibility to combine `tokenize` and `matches` was new to me. Has helped in my case, thank you very much. – Stefan Jung Mar 31 '22 at 10:01
@StefanEike Yes, almost all XPath functions can be combined and composed (take as argument the result of a previously called function). – Dimitre Novatchev Mar 31 '22 at 13:22

score 4 · Answer 2 · answered Sep 21 '12 at 09:08

Alright, after going through the answers and comments by the wonderful people here, this is the solution I opted for:

concat("AAA", substring(substring-after(., "AAA"), 1, 9)).

First, substring-after with "AAA" as the separator returns everything that follows "AAA"; substring(..., 1, 9) then keeps characters 1 to 9 of that and ignores anything beyond. Since "AAA" was used as the separator, it is not part of the result, so I concatenate "AAA" in front of the value. This means I get the first 9 characters after AAA (the digits), with AAA prepended since it is static data.

This gives the correct value no matter what other text surrounds it.
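
The same prefix-plus-fixed-width logic, sketched in Python for illustration. The function name `extract_code` and its parameters are hypothetical and only mirror the XPath expression above.

```python
def extract_code(text: str, prefix: str = "AAA", width: int = 9) -> str:
    """Return prefix plus the `width` characters after its first occurrence."""
    # Everything after the first occurrence of prefix ("" if it is absent)
    _head, _found, tail = text.partition(prefix)
    # Keep at most `width` characters and put the static prefix back in front
    return prefix + tail[:width]


print(extract_code("This is a string: AAA123456789 but not an double"))
```

Like the XPath expression, the sketch does not check that the nine characters are digits, and it returns just the prefix when the prefix does not occur in the input.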

But I like the regex by @Dimitre, especially the replace part. I am less keen on tokenize, since it breaks down if there is no space to split on. The replace with a regex is wonderful too. Thanks.

And thanks to all of you out there too...

score 1 · Answer 3 · answered Sep 20 '12 at 07:44

First, I'm pretty sure you don't mean to have the [^ ... ]. That defines a negated character class, i.e. your current regex says, "Give me a single character that is not one of the following: A0123456789+{}". You probably meant, plainly, "AAA(\d{9})". Now, according to this handy website, XPath does support capture groups, as well as backreferences, so take your pick:

"AAA(\d{9})"

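and extract $1, the first capture group, or:

"(?<=AAA)\d{9}"

and take the whole match ($0). Note that the second form needs a regex engine with lookbehind support, and the XPath 2.0 regex syntax does not include lookaround.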
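
The three patterns discussed in this answer can be compared side by side. Python's `re` is used here only because it accepts all of them; whether a particular XPath processor accepts the lookbehind form is a separate question.

```python
import re

TEXT = "This is a string: AAA123456789 but not an double"

# Negated character class from the question: matches ONE character
# that is not 'A', a digit, '+', '{' or '}'.
first_other = re.search(r"[^AAA\d+{9}]", TEXT).group(0)

# Capture group: match the prefix, capture only the nine digits.
digits = re.search(r"AAA(\d{9})", TEXT).group(1)

# Lookbehind: the whole match is nine digits that are preceded by AAA.
whole = re.search(r"(?<=AAA)\d{9}", TEXT).group(0)

print(repr(first_other))  # 'T'
print(digits)             # 123456789
print(whole)              # 123456789
```

The first result shows why the original attempt could not work: the negated class simply matches the first character of the sentence.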

score 1 · Answer 4 · answered Sep 20 '12 at 08:02

Can you try this:

A{3}(\d{9})

Stephan
