I want to match the inner text of the span tags whose id is guaranteed to start with blk.

How can I match this with Groovy?

Example:

<p>I wanted to try to <span id="blk1">match</span> the inner part of the string<span id="blk2"> between </span>the span tags <span>where</span> it is guaranteed that the id of this span tags <span id="blk3">starts</span> with blk.</p>

According to the example above, I want to get:

   match
   between
   starts

I tried the following, but it returns null:

 def html='''<p>I wanted to try to <span id="blk1">match</span> the inner part of the string<span id="blk2"> between </span>the span tags <span>where</span> it is guaranteed that the id of this span tags <span id="blk3">starts</span> with blk.</p>''' 

 html=html.findAll(/<span id="blk(.)*">(.)*<\/span>/).join();
 println html;

Abdennour TOUMI
  • Here's the obligatory warning against parsing HTML with regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. Save your sanity and use a proper parser! – ataylor May 03 '13 at 14:53

2 Answers

Rather than messing around with Regular Expressions, why not just parse the HTML and then extract the nodes from it?

@Grab( 'net.sourceforge.nekohtml:nekohtml:1.9.18' )
import org.cyberneko.html.parsers.SAXParser

def html = '''<p>
             |  I wanted to try to <span id="blk1">match</span> the inner part
             |  of the string<span id="blk2"> between </span> the span tags <span>where</span>
             |  it is guaranteed that the id of this span tags <span id="blk3">starts</span>
             |  with blk.
             |</p>'''.stripMargin()

def content = new XmlSlurper( new SAXParser() ).parseText( html )

List<String> spans = content.'**'.findAll { it.name() == 'SPAN' && it.@id?.text()?.startsWith( 'blk' ) }*.text()
tim_yates

You seem to have span on one side and strong on the other.

In addition, you should be careful with using .* alone: regex quantifiers are greedy by default, so it will match as much of the string as it can in one go. You should usually make it lazy by using .*?
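
To illustrate the difference between greedy and lazy matching, here is a minimal sketch. The two-span `sample` string is a made-up input, not the HTML from the question, and the pattern uses `\d+` on the assumption that the id suffix is numeric.

```groovy
// Hypothetical two-span input, used only for this illustration.
def sample = '<span id="blk1">a</span> x <span id="blk2">b</span>'

// Greedy: .* runs on to the LAST </span>, so both spans collapse into one match.
def greedy = sample.findAll(/<span id="blk\d+">.*<\/span>/)

// Lazy: .*? stops at the FIRST </span>, giving one match per span.
def lazy = sample.findAll(/<span id="blk\d+">.*?<\/span>/)

assert greedy.size() == 1
assert lazy.size() == 2
println greedy
println lazy
```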

When you use (.)* to match the text between tags, you will not get the actual text from that group, only the last character that was matched. You need to put the quantifier inside the capturing group: (.*)
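
The effect of the quantifier's position can be checked with a small sketch like the one below. It uses the plain `java.util.regex.Matcher` API through Groovy's `=~` operator; the one-span `sample` string is an assumption made for the illustration.

```groovy
// Hypothetical single-span input, used only for this illustration.
def sample = '<span id="blk1">match</span>'

// Quantifier outside the group: the group is overwritten on every repetition,
// so only the last character it matched is left in group 1.
def outside = (sample =~ /<span id="blk\d+">(.)*<\/span>/)
assert outside.find()
assert outside.group(1) == 'h'

// Quantifier inside the group: group 1 holds the whole run of characters.
def inside = (sample =~ /<span id="blk\d+">(.*?)<\/span>/)
assert inside.find()
assert inside.group(1) == 'match'
```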

Using [^<>]+ is a much better way to match text between HTML tags. It is similar to .* except for a few points:

  1. It will match any character, except "<" and ">"
  2. It will require to match at least one character, so it will not match an empty span.
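
A short sketch of the two points above, using made-up one-span inputs. The third case is an extra assumption added here to show a consequence of the first point: a span that contains a nested tag is not matched either.

```groovy
// Character-class pattern: at least one character that is neither '<' nor '>'.
def pattern = /<span id="blk\d+">[^<>]+<\/span>/

// Plain text between the tags is matched.
def plain = '<span id="blk1">match</span>'.findAll(pattern)

// An empty span is not matched, because + requires at least one character.
def empty = '<span id="blk2"></span>'.findAll(pattern)

// A span whose content contains another tag is not matched either.
def nested = '<span id="blk3"><b>x</b></span>'.findAll(pattern)

assert plain.size() == 1
assert empty.isEmpty()
assert nested.isEmpty()
```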

Furthermore, if you can ensure that what follows "blk" will always be an integer, I recommend using \d+ to match it.

html=html.findAll(/<span id="blk\d+">([^<>]+)<\/span>/).join();

That being said, I have little experience with Groovy. If you want a list containing just those three words to be printed, the following regex extracts only the text from the HTML.

html=html.findAll(/(?<=span id="blk\d">)([^<>]+)(?=<\/span>)/).join();
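
As an alternative sketch that avoids lookbehind, the captured text can be pulled out with the closure form of `findAll`, which passes the whole match followed by each capture group. The names `words`, `whole` and `inner` are chosen for this illustration, and the `trim()` call is an assumption that the surrounding spaces in " between " are unwanted. It is still a regex over HTML, so it stays as fragile as any other pattern here.

```groovy
def html = '''<p>I wanted to try to <span id="blk1">match</span> the inner part of the string<span id="blk2"> between </span>the span tags <span>where</span> it is guaranteed that the id of this span tags <span id="blk3">starts</span> with blk.</p>'''

// The closure receives the whole match first, then capture group 1.
// \d+ allows ids such as blk10, which a one-digit lookbehind would miss.
def words = html.findAll(/<span id="blk\d+">([^<>]+)<\/span>/) { whole, inner ->
    inner.trim()
}

assert words == ['match', 'between', 'starts']
println words.join(', ')
```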
melwil
  • Sorry for all the edits, but I kept finding more things to correct. – melwil May 03 '13 at 07:09
  • Thank you for your answer. I made a mistake when I wrote it; it's – Abdennour TOUMI May 03 '13 at 09:25
  • When I execute your instruction, I get 'match between starts'. However, I want the inner HTML, that is: match, between, starts. If you can modify the pattern, I will be thankful for my whole life. – Abdennour TOUMI May 03 '13 at 09:28
  • You need to extract the matching groups somewhere. I'm not familiar with Groovy closures, and hoped it would automatically do that. – melwil May 03 '13 at 09:34
  • I have updated my answer with an updated regex which will do what you want, but be aware that it is fragile, since it will only allow one digit to follow "blk". The reason for this is that lookbehind does not support quantifiers. Another solution to your problem would be to just strip all html tags from the list that you produced with the first regex, as that one was more flexible and less taxing on resources. – melwil May 03 '13 at 09:43
  • No problem. While this is an elegant one line solution, parsing it using tim_yates' solution is safer and will also work if the "blk" values go beyond 1 digit. – melwil May 03 '13 at 09:55
  • You are right; it is not safe. In fact, when I have, for example, dfgfdg, the pattern does not match it. – Abdennour TOUMI May 03 '13 at 14:40
  • There is a reason why people do not recommend parsing html with regex, it's too fragile. – melwil May 03 '13 at 14:44