
I'm working on an HTML parser for a client, and I have just started experimenting with regular expressions. I'm quite new to them but am learning quickly! In this part, I need to extract all of the text in the document that is set in 18.0pt. Here is the first regex I tried (using a real-time regex tester):

<p.*?><span.*?style='.*?font-size:1

Here is my test text:

<p class=MsoNormal><span style='font-size:14.0pt;font-family:"Comic Sans MS"'>3<sup>rd</sup>
Sunday in Lent - 2013c<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:14.0pt;font-family:"Comic Sans MS"'>Old
Testament – Isaiah 55:1-9<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:14.0pt;font-family:"Comic Sans MS"'>New
Testament – Luke 13:1-9<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:18.0pt;font-family:"Comic Sans MS"'><o:p>&nbsp;</o:p>
</span></p>

It works correctly and highlights each line separately up to the 1. The problem is that as soon as I change 1 to 18, instead of highlighting just the line with font-size:18, it highlights everything from the first line through the 18. I would like to grab only the line with the 18pt font. Thank you, and any help is appreciated! :)

  • I think I understand why: I put a non-greedy dot in front of it, so it matches until it finds the 18. But I kind of need that there as well, because in some cases there is extra styling and in some cases there isn't. How can I get around this? –  Mar 09 '13 at 00:11
  • While it may be fun to learn parsing with regex, there are better ways to get what you want. Take a look at this earlier answer http://stackoverflow.com/questions/292926/robust-and-mature-html-parser-for-php to get some ideas. Once the html is parsed, finding what you want is easier... – Floris Mar 09 '13 at 00:12

2 Answers


Here's a better regexp:

<p[^>]*>[ \t\r\n]*<span[^>]* style='[^']*font-size:18

Yours is doing exactly what you told it to: finding <p, then any number of arbitrary characters, then ><span, then more arbitrary characters, then font-size:18. So it finds the first <p, then all the arbitrary characters up to font-size:18. You were just lucky in the first example that all your spans had font-size specified.

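This version is less permissive: [^>]* stops at the first >, so a match cannot run past the end of a tag, and [^']* keeps the search inside the style attribute. To make it more robust, I also allowed whitespace between the <p> and <span>.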
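To make the difference concrete, here is a minimal sketch using Python's re module as a stand-in for the asker's regex tester (an assumption; the thread does not say which engine the tester uses). The test text is abbreviated from the question, and re.S plays the role of the s modifier that the asker enabled.

```python
import re

# Two paragraphs abbreviated from the question's test text.
html = (
    "<p class=MsoNormal><span style='font-size:14.0pt;font-family:\"Comic Sans MS\"'>Old\n"
    "Testament - Isaiah 55:1-9<o:p></o:p></span></p>\n"
    "\n"
    "<p class=MsoNormal><span style='font-size:18.0pt;font-family:\"Comic Sans MS\"'><o:p>&nbsp;</o:p>\n"
    "</span></p>\n"
)

# Original pattern; re.S lets "." match newlines, as with the s modifier.
loose = re.compile(r"<p.*?><span.*?style='.*?font-size:18", re.S)
# Pattern from this answer: [^>]* stays inside a tag, [^']* inside the style value.
strict = re.compile(r"<p[^>]*>[ \t\r\n]*<span[^>]* style='[^']*font-size:18")

loose_match = loose.search(html)
strict_match = strict.search(html)

# The loose pattern starts at the very first <p and runs through the 14pt paragraph.
print(loose_match.start(), "font-size:14" in loose_match.group())
# The strict pattern starts at the paragraph that actually carries font-size:18.
print(strict_match.start() == html.index("<p", 1))
```

The lazy quantifiers in the loose pattern do not help here: the engine still takes the first <p it can start from and then stretches the last .*? as far as needed to reach font-size:18.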

Dave
  • Just one question: where you have the [ \t\r\n]*, there is a possibility that there is another tag in there (such as ). How can I let it match anything in between those <p> and <span> tags? –  Mar 09 '13 at 01:19
  • Do you need to match the <p> at all? It's quite complicated to match arbitrary tags, because regular expressions have no concept of recursion or DOM structure. If you need it to be that complex, you should use a library which is DOM-aware. – Dave Mar 09 '13 at 01:29
  • You're right.. I realized that after I finished rewriting it. This is what I came up with: /(.*?)(?:)?(?:<\/o:p>)?<\/span>/si –  Mar 09 '13 at 02:33
  • "Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems." https://groups.google.com/forum/?hl=en&fromgroups=#!msg/alt.religion.emacs/DR057Srw5-c/Co-2L2BKn7UJ – Dave Mar 09 '13 at 02:44
  • The regexp you posted has the same problem as your original: anywhere you use `.*`, you will match vast swathes of the text that you didn't expect. (Incidentally, making it lazy with `.*?` does not prevent this, because a lazy quantifier still expands as far as it must for the rest of the pattern to match.) – Dave Mar 09 '13 at 02:46
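As a rough illustration of the "use something that understands tags" advice in the comments above, here is a minimal sketch built on Python's standard html.parser module. It is not the PHP parser linked under the question; the class name SizedTextCollector and the sample markup are made up for this example, and real Word exports may need more care (for instance, nested spans with different sizes are all collected here).

```python
from html.parser import HTMLParser


class SizedTextCollector(HTMLParser):
    """Collect text inside <span> elements whose style sets the wanted font-size."""

    def __init__(self, size="18.0pt"):
        super().__init__(convert_charrefs=True)
        self.needle = "font-size:" + size
        self.stack = []   # one bool per open <span>: does it carry the wanted size?
        self.chunks = []  # text pieces seen while inside a matching <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            # Drop spaces so "font-size: 18.0pt" is recognised as well.
            self.stack.append(self.needle in style.replace(" ", ""))

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if any(self.stack):
            self.chunks.append(data)


# Hypothetical sample in the style of a Word export, with an extra <b> wrapper
# and a stray newline inside the text.
sample = (
    "<p class=MsoNormal><span style='font-size:14.0pt'>Old\nTestament</span></p>\n"
    "<p class=MsoNormal><b><span style='font-size:18.0pt;font-family:\"Comic Sans MS\"'>"
    "Sermon\nTitle<o:p></o:p></span></b></p>\n"
)

collector = SizedTextCollector()
collector.feed(sample)
collector.close()

# Collapse the stray newlines into single spaces.
text = " ".join("".join(collector.chunks).split())
print(text)  # Sermon Title
```

Because the parser tracks which tags are open, extra wrappers between <p> and <span> or newlines inside the text no longer matter.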

Instead of matching "any character" (with the dot), if you match "any character except newline" you will make sure not to go beyond the end of a line:

<p.*?><span[^\n]*?style='[^\n]*?font-size:18

Now usually the . doesn't match newline unless certain flags are set (which depends on your environment) - in particular, the s flag. Could it be that's the default for your regex tester?
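A quick way to check the flag's effect, sketched with Python's re module (an assumption; in Python the flag is spelled re.S or re.DOTALL rather than a trailing s modifier). The input string is a made-up fragment with a newline inside the tag.

```python
import re

# A tag with a newline between "<span" and "style", as a Word export might produce.
text = "<span\nstyle='font-size:18.0pt'>"
pattern = r"<span.*?style='.*?font-size:18"

without_s = re.search(pattern, text)     # "." refuses to cross the newline
with_s = re.search(pattern, text, re.S)  # with DOTALL, "." matches "\n" too

print(without_s)           # None
print(with_s is not None)  # True
```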

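Another thought would be to limit the number of characters you expect to match with {} - for example

<p.{0,20}>

which will work as long as there are no more than 20 characters between <p and the closing > of your opening <p> tag. (The lower bound is written out because some engines, PCRE among them, have not always accepted the {,20} form.)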
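A small sketch of the bounded quantifier, again assuming Python's re module; the two tags are made-up examples, one under and one over the 20-character limit.

```python
import re

bounded = re.compile(r"<p.{0,20}>")

short_tag = "<p class=MsoNormal>"                     # 16 characters between "<p" and ">"
long_tag = "<p class=MsoNormal style='margin:0cm'>"   # 35 characters, over the limit

short_match = bounded.search(short_tag)
long_match = bounded.search(long_tag)

print(short_match.group())  # <p class=MsoNormal>
print(long_match)           # None
```

Note that the dot can still match a > inside those 20 characters, so the bound limits how far a match can run rather than guaranteeing that it stops at the end of the tag.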

Floris
  • Hey Floris. Thanks for the response! The thing is, this HTML page is exported from a Word document, and I've realized that it sometimes adds random \n characters, unfortunately, so I've purposely enabled the /s modifier. Also, sometimes there are whole paragraphs of text, so I can't really estimate the number of characters. –  Mar 09 '13 at 00:47