
I have a string which contains multiple occurrences of "<p class=a> ... </p>", where ... is different text each time.

I am using the regex pattern "<p class=a>(.*)</p>" to split the text into chunks, but this is not working. What would be the correct regex for this?

P.S. The same regex pattern works on iOS using NSRegularExpression, but does not work on Android using Pattern.

To explain my problem in more detail, I am doing the following:

Pattern regex3 = Pattern.compile("(?s)<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE);
String[] result = regex3.split(str);

The result array contains only 1 item, and it is the whole string.

The following is a portion of the file that I am reading:

<BODY>
    <SYNC Start=200>
      <P Class=ENCC><i>Cerita, Watak, Adegan dalam</i><br/><i>Drama Ini Rekaan Semata-Mata.</i></P>
    </SYNC>
    <SYNC Start=2440>
      <P Class=ENCC>&nbsp;</P>
    </SYNC>
    <SYNC Start=2560>
      <P Class=ENCC><i>Kami Tidak Berniat</i><br/><i>Melukakan Hati Sesiapa.</i></P>
    </SYNC>
    <SYNC Start=4560>
      <P Class=ENCC>&nbsp;</P>
    </SYNC>
    <SYNC Start=66160>
      <P Class=ENCC>Hai kawan-kawan.<br/>Inilah bandaraya Banting.</P>
    </SYNC>

UPDATE:

Hi everybody, I have found the problem. It was actually the encoding of the file that I was reading: the file was UTF-16 (Little Endian) encoded, and that was what kept the regex from matching. I changed it to UTF-8 and everything started working. Thanks, everybody, for your support.
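
As a rough illustration of why the encoding matters: if UTF-16LE bytes are decoded with a single-byte charset, every ASCII character ends up followed by a NUL character, so the literal text of the pattern never appears contiguously. The sketch below assumes the file's bytes are already in memory; the class and method names (`EncodingDemo`, `containsParagraph`) are made up for this example, and decoding with the correct charset is shown as an alternative to converting the file to UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class EncodingDemo {
    // Same idea as the pattern in the question, with the flags passed explicitly.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: decode the raw bytes with the given charset,
    // then report whether the pattern finds at least one paragraph.
    static boolean containsParagraph(byte[] fileBytes, Charset charset) {
        String text = new String(fileBytes, charset);
        return PARAGRAPH.matcher(text).find();
    }

    public static void main(String[] args) {
        // Simulate the contents of a UTF-16 (Little Endian) file.
        byte[] fileBytes = "<P Class=ENCC>Hai kawan-kawan.</P>"
                .getBytes(StandardCharsets.UTF_16LE);

        // Wrong charset: each character is followed by '\0', so nothing matches.
        System.out.println(containsParagraph(fileBytes, StandardCharsets.ISO_8859_1)); // false

        // Correct charset: the pattern matches.
        System.out.println(containsParagraph(fileBytes, StandardCharsets.UTF_16LE)); // true
    }
}
```

When reading from a stream, the same charset can be passed to `InputStreamReader` instead of relying on the platform default.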

g.revolution

4 Answers


Parsing HTML with regular expressions is not really a good idea. What you should use instead is an HTML parser.

That being said, your issue is most likely that the * operator is greedy. Your question only says that it is not working, so my guess is that the pattern matches from the first <p class=a> to the very last </p>. Making the regular expression non-greedy, like so: <p class=a>(.*?)</p> (notice the extra ? after the *), should solve the problem, assuming it is the one I have described.

Still, I would really recommend that you ditch the regular expression approach and use an appropriate HTML parser.
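
The greedy versus non-greedy difference described in this answer can be checked with a small sketch like the one below. The class and helper names (`GreedyDemo`, `firstMatch`) and the sample string are invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Hypothetical helper: return the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<p class=a>first</p><p class=a>second</p>";

        // Greedy: .* runs to the last </p>, so both paragraphs form a single match.
        System.out.println(firstMatch("<p class=a>(.*)</p>", input));
        // prints <p class=a>first</p><p class=a>second</p>

        // Non-greedy: .*? stops at the first </p>.
        System.out.println(firstMatch("<p class=a>(.*?)</p>", input));
        // prints <p class=a>first</p>
    }
}
```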

npinti
  • This is not actually an HTML file (although it uses HTML tags). It is a sort of custom subtitle file that uses HTML tags, and it does not validate either (because of other non-HTML content in the file). I have also tried the non-greedy (.*?), and it is not working either. – g.revolution Jun 26 '12 at 09:12
  • @g.revolution: If that is the case, then I would recommend you provide more information, such as what you actually have, what you are after and what you are actually getting. – npinti Jun 26 '12 at 09:17

EDIT:

Now that you've posted the code and the text you're matching against, one thing immediately leaps to mind:

You're matching <p class..., but your string contains <P Class.... Regexes are case-sensitive by default.

Then, . does not match newlines by default. And it's quite likely that your paragraphs do contain newlines.

Therefore, try "(?si)<p class=a>(.*?)</p>". The (?s) modifier allows the dot to match newlines, too, and the (?i) modifier makes the regex case-insensitive.
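
A minimal sketch of how the two inline flags behave, assuming the class name ENCC from the sample in the question rather than the placeholder class a. The class and method names (`ParagraphExtractor`, `extract`) are made up for this example, and it collects the captured groups with `Matcher.find()` instead of calling `split()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphExtractor {
    // (?s) lets . match line terminators, (?i) ignores case, .*? is non-greedy.
    private static final Pattern PARAGRAPH =
            Pattern.compile("(?si)<p class=encc>(.*?)</p>");

    // Hypothetical helper: collect the text between each opening and closing tag.
    static List<String> extract(String input) {
        List<String> chunks = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(input);
        while (m.find()) {
            chunks.add(m.group(1)); // group 1 is the part captured by (.*?)
        }
        return chunks;
    }

    public static void main(String[] args) {
        String sample =
                "<SYNC Start=2560>\n"
              + "  <P Class=ENCC><i>Kami Tidak Berniat</i><br/>\n"
              + "<i>Melukakan Hati Sesiapa.</i></P>\n"
              + "</SYNC>\n"
              + "<SYNC Start=66160>\n"
              + "  <P Class=ENCC>Hai kawan-kawan.</P>\n"
              + "</SYNC>";
        for (String chunk : extract(sample)) {
            System.out.println("[" + chunk + "]");
        }
    }
}
```

The lowercase pattern still matches the uppercase tags because of (?i), and the first paragraph is found even though it spans two lines because of (?s).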

Tim Pietzcker
  • I have tried the expression you mentioned too, and it is not working. I am using the Pattern class like this: I compile the pattern with the (?s) flag and a non-greedy (.*?) group into Pattern p, then call String[] result = p.split(str); The result contains only 1 item, and it is the whole string. That is what I am getting. – g.revolution Jun 26 '12 at 09:29
  • I have tried with Pattern.CASE_INSENSITIVE and it is still not working. – g.revolution Jun 26 '12 at 09:49
  • It works fine here. Are you aware that when using `split()`, the regex match will *not* be part of the result? – Tim Pietzcker Jun 26 '12 at 10:08
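
To illustrate the point about `split()` in the last comment, here is a small sketch with an invented input string; the class and helper names (`SplitDemo`, `splitOn`) are hypothetical. It also shows the symptom from the question: when the pattern matches nothing, `split()` returns a one-element array holding the whole input.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo {
    // Hypothetical helper: compile the regex case-insensitively with DOTALL and split on it.
    static String[] splitOn(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
                .split(input);
    }

    public static void main(String[] args) {
        String input = "before<P Class=ENCC>one</P>middle<P Class=ENCC>two</P>after";

        // The matches act as delimiters, so "one" and "two" are discarded.
        System.out.println(Arrays.toString(splitOn("<p class=encc>(.*?)</p>", input)));
        // prints [before, middle, after]

        // A pattern that never matches leaves the input in one piece.
        System.out.println(splitOn("<p class=a>(.*?)</p>", input).length);
        // prints 1
    }
}
```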

I guess the problem is that your pattern is greedy. You should use this instead:

"<p class=a>(.*?)</p>"

If you have this string:

"<p class=a>first</p><p class=a>second</p>"

Your pattern ("<p class=a>(.*)</p>") will match all of it:

"<p class=a>first</p><p class=a>second</p>"

While "<p class=a>(.*?)</p>" only matches:

"<p class=a>first</p>"
flec

The .* may also match <. You can try the following, which only matches paragraphs that contain no nested tags:

<p class=a>([^<]*)</p>
Arcadien