
I know parsing HTML with regex is discouraged, but I know for a fact it is the best option under controlled circumstances. In my case, I need a regex to find all of the pre-formatted text (<pre>) elements on an HTML page. This seems easy enough to simply google, but I find no results. In addition, the <pre> element needs to contain a string, in my case "gisformat". In other words, this regex needs to return all of the pre-formatted text elements in an HTML file that contain "gisformat". I know it goes something like this, but I'm not sure what to put in the middle: /<pre>(what should I put here)</pre>/

EDIT 1: I am using PHP, and yes, I have seen this post, including answer #2: RegEx match open tags except XHTML self-contained tags

  • Why not use a DOM parser? You could get the collection of `<pre>` elements very easily with that and then search within those items in the collection. This could probably be done without regex at all, as it seems you only need a simple string search within the elements. Also, what language are you using?
    – Mike Brant Dec 19 '13 at 18:34
  • Discouraged is a light way of putting it – scrblnrd3 Dec 19 '13 at 18:35
  • *"I know for a fact it is the best option under controlled circumstances"*. This would be an opinion, and a bad one. Using [Html Agility Pack](http://htmlagilitypack.codeplex.com/) would be a much better way to go. – Harrison Dec 19 '13 at 18:36
  • A DOM parser might work, I didn't know about that...is there a package to install for PHP? – user3048179 Dec 19 '13 at 18:43
  • @user3048179: You don't need to install anything to use DOMDocument or XPath with PHP. – Casimir et Hippolyte Dec 19 '13 at 19:10

3 Answers


A regex way:

preg_match_all('~<pre\b[^>]*>(?>[^<g]+|g(?!isformat)|<(?!/pre))*gisformat(?>[^<]+|<(?!/pre>))*</pre>~', $subject, $matches);
print_r($matches[0]);

A DOM way:

$doc = new DOMDocument();
@$doc->loadHTML($subject); // @ suppresses warnings about malformed HTML
$preNodes = $doc->getElementsByTagName('pre');

$result = array(); // initialize so print_r() works when nothing matches
foreach ($preNodes as $preNode) {
    if (strpos($preNode->nodeValue, 'gisformat') !== false) {
        $result[] = $preNode->ownerDocument->saveXML($preNode);
    }
}
print_r($result);
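
The same filtering can also be pushed into an XPath query, which was mentioned in the comments on the question. What follows is only a sketch under assumptions: the markup in $subject is made up for illustration, and libxml_use_internal_errors() is swapped in for the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical sample input: only the second <pre> contains "gisformat".
$subject = '<html><body>'
    . '<pre>plain text</pre>'
    . '<pre class="x">a <b>gisformat</b> block</pre>'
    . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($subject);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// contains(., "gisformat") tests the text content of each <pre> element
$nodes = $xpath->query('//pre[contains(., "gisformat")]');

$result = array();
foreach ($nodes as $node) {
    $result[] = $doc->saveHTML($node); // serialize the matching element
}
print_r($result);
```

Like the strpos() check on nodeValue, this looks only at the text inside the element, so tag names and attribute values are ignored.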

Pattern details:

 # the opening tag #
<pre\b [^>]* >

 # content before the first "gisformat" #
(?>
    [^<g]+         # all that is not a "<" or a "g"
  |               # OR
    g(?!isformat)  # a "g" not followed by "isformat"
  |               # OR
    <(?!/pre)      # a "<" not followed by "/pre" 
)*                # repeat the group zero or more times

 # target #
gisformat

 # content until the closing tag #
(?>[^<]+|<(?!/pre>))*

 # closing tag #
</pre>
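
To see the pattern at work, here is a small sketch that runs it against a made-up string. The sample input is an assumption chosen to cover three cases: a block without the keyword, a block with attributes and a nested tag, and a block that starts with the keyword.

```php
<?php
$pattern = '~<pre\b[^>]*>(?>[^<g]+|g(?!isformat)|<(?!/pre))*gisformat(?>[^<]+|<(?!/pre>))*</pre>~';

// Hypothetical sample input, three <pre> blocks back to back.
$subject = '<pre>nothing here</pre>'
    . '<pre class="a">geo <b>gisformat</b> data</pre>'
    . '<pre>gisformat at the start</pre>';

preg_match_all($pattern, $subject, $matches);

// The first block is skipped because "<" followed by "/pre" stops the
// first group before any "gisformat" is reached.
print_r($matches[0]);
```

With this input, the second and third blocks should be returned and the first one left out.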
Casimir et Hippolyte
  • Wow, that regex is longer than I expected. Could you edit and break the regex into each group based on what it does please? I'm fairly new to regexes and that is somewhat hard to comprehend. Thanks! – user3048179 Dec 19 '13 at 18:48
  • @Marabunta: To make it work with an online tester, you must remove the ~ delimiters and escape all the slashes of the pattern, because slashes are often the default delimiters in online testers. – Casimir et Hippolyte Dec 20 '13 at 00:48

Provided you have read the previous posts and understand the shortcomings of trying to use RegEx on HTML, we can provide a BASIC regex that should get the job done. Try using

<pre>[^<]*?gisformat[^<]*?</pre>

This will NOT work if there are other tags nested within the <pre> area, but it is as specific as I can easily get without seeing your sample. As stated in the post you reference above, RegEx match open tags except XHTML self-contained tags, "it's sometimes appropriate to parse a limited, known set of HTML".

If you have sample HTML, use my regex as a starter and use http://regexpal.com/ for tweaking.
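
A quick way to check where this basic pattern stops working is to try it in PHP against a few inputs. The three sample strings below are assumptions made up for illustration, not taken from the question.

```php
<?php
$pattern = '~<pre>[^<]*?gisformat[^<]*?</pre>~';

// Hypothetical inputs: plain text, a nested tag, and an attribute on <pre>.
$plain  = '<pre>layer: gisformat v2</pre>';
$nested = '<pre>layer: <b>gisformat</b> v2</pre>';
$attrs  = '<pre class="x">gisformat</pre>';

$plainHit  = preg_match($pattern, $plain);  // plain text only: matches
$nestedHit = preg_match($pattern, $nested); // [^<] cannot cross the <b> tag
$attrsHit  = preg_match($pattern, $attrs);  // the literal "<pre>" excludes attributes

var_dump($plainHit, $nestedHit, $attrsHit);
```

Only the first input matches, so the pattern is a fit only when the <pre> tags are bare and contain no markup.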

mrfish
  • So, if I understand correctly, if there are no HTML tags between the pre tags and the known HTML is guaranteed to be formatted in a known way, this should always work. Correct? – user3048179 Dec 19 '13 at 19:06
  • That should always work. Again, I would test on real HTML using regexpal, but I don't see any reason it wouldn't – mrfish Dec 22 '13 at 08:16

If the tags are balanced and the engine is PCRE compatible, this will find gisformat in the middle somewhere.

Processed by RegexFormat4.

 # '~(?s)(?&PRE)(?(DEFINE)(?<PRE><pre>(?:(?&Core)gisformat(?&Core))</pre>)(?<Core>(?:(?>(?:(?!</?pre>|gisformat).)*)|(?&PRE)|(?&POST))*)(?<POST><pre>(?:(?:(?>(?:(?!</?pre>).)*)|(?&POST))*)</pre>))~'

 # -----------------------------------
 (?s)           # Dot-All modifier
 (?&PRE)        # Main function call
 # -----------------------------------

                # Subroutines

 (?(DEFINE)
      (?<PRE>
           <pre>
           (?:
                (?&Core) 
                gisformat
                (?&Core) 
           )
           </pre>
      )
      (?<Core>
           (?:
                (?>
                     (?:
                          (?! </?pre> | gisformat )
                          . 
                     )*
                )
             |  
                (?&PRE) 
             |  
                (?&POST) 
           )*
      )
      (?<POST>
           <pre>
           (?:
                (?:
                     (?>
                          (?:
                               (?! </?pre> )
                               . 
                          )*
                     )
                  |  
                     (?&POST) 
                )*
           )
           </pre>
      )
 )
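
For PHP, the one-line form of the pattern can be passed to preg_match_all() as is. The sketch below uses made-up input with flat, non-nested blocks only. Results on nested <pre> input may depend on the PCRE version behind your PHP build, because PCRE releases before PCRE2 10.30 treat subroutine calls as atomic.

```php
<?php
$pattern = '~(?s)(?&PRE)(?(DEFINE)(?<PRE><pre>(?:(?&Core)gisformat(?&Core))</pre>)'
    . '(?<Core>(?:(?>(?:(?!</?pre>|gisformat).)*)|(?&PRE)|(?&POST))*)'
    . '(?<POST><pre>(?:(?:(?>(?:(?!</?pre>).)*)|(?&POST))*)</pre>))~';

// Hypothetical sample input: two flat blocks, one with the keyword.
$subject = "<pre>none</pre>\n<pre>x gisformat\ny</pre>";

preg_match_all($pattern, $subject, $matches);
print_r($matches[0]); // index 0 holds the full matches
```

The (?s) modifier lets the dot cross the line break inside the second block, so that block is returned whole.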