
I have an inline XML file (XML tags mixed with text). I want to grab the 4 words before a specific tag. For example:

Case 1:

I used to live in <Location>London</Location>.

Case 2:

I work for <Organization> Microsoft Tech.</Organization>
which is in <Location>London</Location>

I want to grab the 4 words before the Location tag in both cases.

OUTPUT:

Case 1:

used to live in

Case 2:

</Organization> which is in

Is this possible? Can anyone please help me?

anubhava
  • See: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ;-) – Matthias Mar 12 '12 at 14:23
  • @winSharp93 The OP's particular problem is simple to the extent that it can be solved by regular expressions. It's surprising how many people regurgitate that you can't apply regular expressions to xml without understanding the reasons why. – Sam I am says Reinstate Monica Mar 21 '12 at 18:31

3 Answers


Well, the easiest would be:

((?:\S+\s+){4}\s*)<Location>

Note that this will not yield the desired result in your second case. The pattern treats a "word" as a run of non-whitespace characters delimited by whitespace, so there it would yield Tech.</Organization> which is in.
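To make the difference between the two cases concrete, here is a minimal Python sketch that applies the pattern above to both inputs from the question. The helper name four_tokens_before is made up for illustration, and Python's re module is an assumption on my part; the question does not say which regex engine is in use.

```python
import re

# The pattern from this answer: four whitespace-delimited tokens
# immediately before the <Location> tag, captured in group 1.
PATTERN = re.compile(r"((?:\S+\s+){4}\s*)<Location>")

case1 = "I used to live in <Location>London</Location>."
case2 = ("I work for <Organization> Microsoft Tech.</Organization>\n"
         "which is in <Location>London</Location>")


def four_tokens_before(text):
    """Hypothetical helper: return the four captured tokens as a list, or None."""
    m = PATTERN.search(text)
    return m.group(1).split() if m else None


print(four_tokens_before(case1))  # ['used', 'to', 'live', 'in']
print(four_tokens_before(case2))  # ['Tech.</Organization>', 'which', 'is', 'in']
```

In the second case the first token is Tech.</Organization> rather than </Organization>, because nothing but whitespace separates tokens for this pattern.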

Joey
  • I already have a regexp which does the first case. I was looking for one which would also work in the second case. :( Anyway, thanks for the effort taken! – user1264228 Mar 19 '12 at 16:51

While this is crazy and I would not recommend using it, you can do something like this with awk (GNU awk, since gensub is a gawk extension):

awk '/<Location>/ {n=gensub("(.*)<Location>.*","\\1","g",$0); print gensub(".*[ .]([^ .]+ [^ .]+ [^ .]+ [^ .]+) *$","\\1","g",n)} ' INPUTFILE

You might want to modify the [^ .] parts to properly decide what is part of a word.

  1. This operates on lines containing <Location>.
  2. It saves the part of the line up to <Location>.
  3. It prints the four words it found. (Note that without a match it prints the saved line part unchanged.)
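
The three steps above can also be sketched in Python, which may be easier to adapt than the one-liner. This is only a rough transliteration under my own assumptions, not the awk script itself: the function name words_before_tag is hypothetical, and input is taken as a list of lines instead of a file.

```python
import re

TAG = "<Location>"

# Same idea as the awk pattern: four runs of characters that are neither
# space nor period, separated by single spaces, preceded by a space or
# period and followed only by optional trailing spaces.
LAST_FOUR = re.compile(r".*[ .]([^ .]+ [^ .]+ [^ .]+ [^ .]+) *$")


def words_before_tag(lines):
    """Hypothetical helper mirroring the three steps of the awk one-liner."""
    results = []
    for line in lines:
        # Step 1: only look at lines that contain the tag.
        if TAG not in line:
            continue
        # Step 2: keep the part of the line before the last occurrence of
        # the tag (the greedy .* in the awk version behaves the same way).
        prefix = line.rpartition(TAG)[0]
        # Step 3: take the last four words; without a match, fall back to
        # the saved prefix unchanged.
        m = LAST_FOUR.match(prefix)
        results.append(m.group(1) if m else prefix)
    return results


print(words_before_tag(["I used to live in <Location>London</Location>."]))
print(words_before_tag([
    "I work for <Organization> Microsoft Tech.</Organization>",
    "which is in <Location>London</Location>",
]))
```

Because the input is processed line by line, the second case only sees "which is in " before the tag, so the fallback in step 3 applies and the closing </Organization> on the previous line is not picked up.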
Zsolt Botykai

The regex you need has to be based on a positive lookahead. For your 2 cases the following works:

/(?:[<>\/\w]+\s*){4}(?=<Location>)/s

Let me know if you need a demo using the above regex.
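
A small demo of the lookahead pattern, written as a Python sketch. Python's re module is an assumption here, since the question does not name a language; the /.../s delimiters are dropped, the escaped \/ is written as a plain /, and the helper name before_location is made up for illustration.

```python
import re

# The lookahead pattern from this answer. re.S stands in for the trailing
# "s" flag; the pattern contains no ".", so the flag has nothing to affect.
PATTERN = re.compile(r"(?:[<>/\w]+\s*){4}(?=<Location>)", re.S)

case1 = "I used to live in <Location>London</Location>."
case2 = ("I work for <Organization> Microsoft Tech.</Organization>\n"
         "which is in <Location>London</Location>")


def before_location(text):
    """Hypothetical helper: return the four matched tokens as a list, or None."""
    m = PATTERN.search(text)
    return m.group(0).split() if m else None


print(before_location(case1))  # ['used', 'to', 'live', 'in']
print(before_location(case2))  # ['</Organization>', 'which', 'is', 'in']
```

The period after Tech is not in the character class, so a match cannot extend across it. That is why the second case starts at </Organization> instead of Tech.</Organization>.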

anubhava
  • I tried your expression at http://myregextester.com/index.php but I am not getting the required output. Can you please provide more info? Thanks a lot! – user1264228 Mar 19 '12 at 16:48
  • I also tried it on the same page and it works fine there as well. Make sure you enter the regex without `/` there. So enter `(?:[<>\/\w]+\s*){4}(?=<Location>)` in `MATCH PATTERN` and check the `s` flag (it's a checkbox). Then enter either of the 2 texts above and click Submit. – anubhava Mar 19 '12 at 16:53