1

Is anyone handy with regular expressions?

I'm running the following regex:

<body>.*</body>    

On the following text:

<text>initial text</text>
 <comment>
   <user>
     6
   </user>
   <date>
     635277984371174139
   </date>
   <body>
     Recorded clinical data: 0132.00 Managing director dawd
   </body>
 </comment>
 <comment>
   <user>
     6
   </user>
   <date>
     635277984559612059
   </date>
   <body>
     Recorded clinical data: 0132.00 Managing director ii
   </body>
 </comment>
 <comment>
   <type>
     Completed
   </type>
   <user>
     6
   </user>
   <date>
     635277984668163579
   </date>
   <body>
     kkk
   </body>
 </comment>

However, this results in only one match, whereas I would expect three. Does anyone have any idea why?

brothers28
  • What language are you using? Perl? Javascript? R? ;) You could edit the question and provide the language as a tag. It might help get the exact answer. (See my comment about using the `/g` flag on your regex) – Jess Feb 17 '14 at 19:58
  • You are not capturing anything, so if there is at least one occurrence, then it will return it. – mrres1 Feb 17 '14 at 20:00
  • Good point about capturing. @user3320546, do you want to include the `body` tags in your results or just the "inner HTML"? – Jess Feb 17 '14 at 20:01
  • So, either `(.*?)` or `(.*?)` – mrres1 Feb 17 '14 at 20:04
  • Hi, thanks for the help guys. I've also added a c# tag. – user3320546 Feb 18 '14 at 10:19

2 Answers

3
  1. You shouldn't parse HTML with regex (unless it is a trivial, constant snippet of HTML); you risk weird bugs. See: RegEx match open tags except XHTML self-contained tags

  2. Your regex is failing because * is a greedy quantifier: it "eats" as much as possible, so the pattern matches from the first <body> to the last </body>, including every tag pair in between. What you want is

    <body>.*?</body> 
    

    The ? makes the quantifier non-greedy (lazy): it matches as little as possible, so each match stops at the nearest closing </body>.

  3. You should edit your question, as your HTML is currently unreadable.
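
To make the difference in point 2 concrete, here is a minimal sketch in JavaScript (the language used in the comments on this page). The `sample` string is a made-up stand-in for the question's text, collapsed onto one line so that `.` can reach across the whole input; the variable names are purely illustrative. In C#, the same two patterns could presumably be compared with `Regex.Matches`.

```javascript
// Hypothetical input shaped like the question's text, collapsed onto one line.
const sample =
  "<comment><body>first</body></comment>" +
  "<comment><body>second</body></comment>" +
  "<comment><body>third</body></comment>";

// Greedy: .* runs to the end of the input, then backtracks to the LAST </body>.
const greedyMatches = sample.match(/<body>.*<\/body>/g);

// Lazy: .*? grows one character at a time and stops at the NEAREST </body>.
const lazyMatches = sample.match(/<body>.*?<\/body>/g);

console.log(greedyMatches.length); // 1
console.log(lazyMatches.length);   // 3
console.log(lazyMatches[0]);       // <body>first</body>
```

The single greedy match spans everything from the first `<body>` to the last `</body>`, which is the "one match" reported in the question.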

Community
Robin
1

Your expression is greedy. .* will match everything up to the end of the input and then backtrack to the last point where <\/body> is found.

You need to make your regex lazy, like this:

<body>.*?<\/body>

Demonstrated here

Kamehameha
  • +1 You may also need to add the global switch like this: `htmlString.match(/<body>.*?<\/body>/g);` (This is JavaScript.) – Jess Feb 17 '14 at 19:56
  • Yep, in the link that I've added, I am using the global flag (PHP), but since no language was mentioned, I didn't bring it up. – Kamehameha Feb 17 '14 at 19:58
  • Escaping `/` isn't useful unless you enclose your regex in forward slashes. You may want to show that in your regex: `/<body>.*?<\/body>/` – Robin Feb 17 '14 at 20:03
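
The comments on this page also raise the global flag and capturing, so here is a rough JavaScript sketch combining both with the lazy quantifier. It assumes the input is spread over several lines, as in the formatted text in the question, which means `.` has to be allowed to match newlines (the `s` flag here; in C# that would presumably be `RegexOptions.Singleline`). The `xmlText` sample and the variable names are made up for illustration.

```javascript
// Hypothetical multi-line input, shaped like the question's formatted text.
const xmlText = `
<comment>
  <body>
    Recorded clinical data: 0132.00 Managing director dawd
  </body>
</comment>
<comment>
  <body>
    kkk
  </body>
</comment>`;

// g: find every match, not just the first.
// s: let . match newlines too (without it, this pattern finds nothing here).
// (.*?): lazily capture only the text between the tags.
const bodyPattern = /<body>(.*?)<\/body>/gs;

// m[1] is the captured inner text; trim() drops the surrounding whitespace.
const bodies = Array.from(xmlText.matchAll(bodyPattern), (m) => m[1].trim());

console.log(bodies);
```

Reading `m[1]` instead of `m[0]` returns just the inner text, which addresses the question of whether the `body` tags themselves should be part of the result.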