
I have the following HTML:

<!-- START: .paragraph-content -->
    <div class="paragraph-content">


            <div class="container"><div class="row"><div class="col-sm-10">

                <!-- START: .paragraph-columns -->
                <div class="paragraph-columns">


                        <div class="field-wysiwyg">
                                <div data-quickedit-field-id="paragraph/167/field_mt_body/en/default" class="field field--name-field-mt-body field--type-text-long field--label-hidden field__items">
                <div class="field__item">
        <h2> </h2>
<h2> </h2>
<h2>INNOVATION.</h2>
<p> </p>
<p> </p>
<p> </p>
<p> </p>

            </div>
          </div>

                        </div>


                </div>
                <!-- END: .paragraph-columns -->

            </div></div></div>


    </div>
    <!-- END: .paragraph-content -->

I want to capture the block that begins with `<div class="paragraph-content">`.

Within that block, I want to change the `<h2>` that has content to an `<h1>`,

so that the end result looks like this:

<!-- START: .paragraph-content -->
    <div class="paragraph-content">


            <div class="container"><div class="row"><div class="col-sm-10">

                <!-- START: .paragraph-columns -->
                <div class="paragraph-columns">


                        <div class="field-wysiwyg">
                                <div data-quickedit-field-id="paragraph/167/field_mt_body/en/default" class="field field--name-field-mt-body field--type-text-long field--label-hidden field__items">
                <div class="field__item">
        <h2> </h2>
<h2> </h2>
<h1>INNOVATION.</h1>
<p> </p>
<p> </p>
<p> </p>
<p> </p>

            </div>
          </div>

                        </div>


                </div>
                <!-- END: .paragraph-columns -->

            </div></div></div>


    </div>
    <!-- END: .paragraph-content -->

I have tried the following regex pattern, but it does not work:

'/(?:<h2((?!\s").*?)?>)(.*?)(?:<\/h2>)/si'
unixmiah
  • Regex might not be the right tool for this. [Ref](https://stackoverflow.com/a/1732454) – Matt.G Apr 02 '19 at 13:29
  • Use [DOMDocument](https://www.php.net/manual/en/class.domdocument.php) instead of regex. – nice_dev Apr 02 '19 at 13:30
  • There are 3 `h2` tags, and you've changed only one in the expected output. Is this a mistake? If you want to change only this one, you need to explain why. Is it because it's exactly the 3rd, because it has content and the others don't, or for some other reason? – shudder Apr 02 '19 at 14:08
  • @shudder The one with the content is the one I want to replace. – unixmiah Apr 02 '19 at 14:09

2 Answers


First, load the HTML page into a string variable:

$fileStr = file_get_contents('HTML_FILE.htm');

You can then find the start of the section you are after by searching for the text `<!-- START: .paragraph-content -->`, and the end of the section by searching for the text `<!-- END: .paragraph-content -->`.

With the start and end positions known, we can extract the portion of `$fileStr` that we want to run our regular expression against.
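A minimal sketch of that extraction step. It assumes each marker comment appears once in the file, and it uses an inline sample string in place of `file_get_contents()`. The helper name `extractSection` is hypothetical, not part of any library.

```php
<?php
// Hypothetical helper: returns [offset, length] of the section between the
// two marker comments (markers included), or null if either one is missing.
function extractSection(string $html, string $startMarker, string $endMarker): ?array
{
    $start = strpos($html, $startMarker);
    if ($start === false) {
        return null;
    }
    // Search for the end marker only after the start marker.
    $end = strpos($html, $endMarker, $start);
    if ($end === false) {
        return null;
    }
    return [$start, $end + strlen($endMarker) - $start];
}

// Sample input standing in for file_get_contents('HTML_FILE.htm').
$fileStr = "<h2>Outside</h2>\n"
    . "<!-- START: .paragraph-content -->\n"
    . "<h2>INNOVATION.</h2>\n"
    . "<!-- END: .paragraph-content -->\n";

$range = extractSection(
    $fileStr,
    '<!-- START: .paragraph-content -->',
    '<!-- END: .paragraph-content -->'
);
$section = $range === null ? '' : substr($fileStr, $range[0], $range[1]);
```

Returning the offset and length rather than just the substring means the modified section can later be written back into `$fileStr` with `substr_replace()`.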

The regular expression required to find the string you want to change is:

<h2>.{2,}<\/h2>

The difficulty is that you have to replace the `<h2>` and `</h2>` with `<h1>` and `</h1>` whilst retaining everything in between them.

There isn't a simple, neat solution for that. I would write a loop that looks for each `<h2>`, checks whether there are any alphanumeric characters between it and the closing `</h2>`, and, if there are, extracts the contents between the two and replaces the tags accordingly.
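One way the loop described above might be sketched, using `preg_replace_callback()` in place of a hand-written loop. The helper name `promoteNonEmptyH2` is hypothetical, and the sketch assumes the `<h2>` tags carry no attributes and are not nested.

```php
<?php
// Hypothetical helper: promotes <h2> to <h1> only when the heading contains
// at least one letter or digit; blank headings are returned unchanged.
function promoteNonEmptyH2(string $section): string
{
    $result = preg_replace_callback(
        '/<h2>(.*?)<\/h2>/si',
        function (array $m): string {
            // Decode entities so that "&nbsp;" does not count as letters.
            $text = html_entity_decode(strip_tags($m[1]));
            // \p{L} and \p{N} match Unicode letters and digits (needs /u).
            if (preg_match('/[\p{L}\p{N}]/u', $text) === 1) {
                return '<h1>' . $m[1] . '</h1>';
            }
            return $m[0];
        },
        $section
    );
    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $section;
}

$section = "<h2> </h2>\n<h2> </h2>\n<h2>INNOVATION.</h2>\n<p> </p>\n";
$result = promoteNonEmptyH2($section);
```

With the sample input, only the heading containing `INNOVATION.` is rewritten; the two blank headings keep their `<h2>` tags.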

Whilst not providing you with code to cut and paste, I hope I've given you something to ponder.

Jim Grant
  • Can you provide an example? – unixmiah Apr 02 '19 at 14:05
  • How is the HTML generated? Just that if you are only wanting to replace the third `<h2>` tag or any `<h2>` with content, the solution will be different. I'm thinking that if the `<h2>` tag to be substituted is in the same place, relative to the HTML section, then it becomes a lot easier. – Jim Grant Apr 03 '19 at 07:30

A regex works as a finite state machine, so it has no way to parse recursive structures such as XML tags that may contain other XML tags.

Basically, you can't match exactly the closing tag that belongs to a given opening tag, because that requires recursion, which is not possible in a finite state machine. (The Python `regex` module and some other implementations do support recursion, but they are not true regex.)

For exactly your problem, you need either a full top-down recursive parser or a tool that works with XML/HTML specifically.
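As an illustration of the "tool that works with HTML" route, here is a rough sketch using PHP's `DOMDocument` and `DOMXPath`. The sample markup is a shortened stand-in for the question's HTML. Note that XPath's `normalize-space()` does not treat a non-breaking space as whitespace, so headings containing only `&nbsp;` would need an extra check. Attributes on the original `<h2>` are not copied here.

```php
<?php
// Shortened sample; the wrapper div uses the class from the question.
$html = '<div class="paragraph-content">'
    . '<h2> </h2><h2> </h2><h2>INNOVATION.</h2><p> </p>'
    . '</div>'
    . '<h2>Outside the block</h2>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
// Wrap the fragment in a full document with an explicit charset.
$doc->loadHTML(
    '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>'
    . $html . '</body></html>'
);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Non-blank <h2> elements anywhere inside the paragraph-content div.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " paragraph-content ")]'
    . '//h2[normalize-space(.) != ""]';

// Copy the node list to an array first, since nodes are replaced in the loop.
foreach (iterator_to_array($xpath->query($query)) as $h2) {
    $h1 = $doc->createElement('h1');
    while ($h2->firstChild !== null) {
        // appendChild() moves the child, so inline markup is preserved.
        $h1->appendChild($h2->firstChild);
    }
    $h2->parentNode->replaceChild($h1, $h2);
}

// Serialize only the children of <body>, dropping the wrapper document.
$result = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
```

Because the query is scoped to the `paragraph-content` div, the `<h2>` outside that block is left alone.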

That said, simply replacing every `h2` tag with `h1` in the whole matched string is as easy as replacing `<(/?)h2>` with `<$1h1>`.
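In PHP, that blanket replacement might look like the sketch below. It converts every attribute-free `<h2>` in the string, including the blank ones, so it only suits the case where all headings in the extracted section should change.

```php
<?php
$section = "<h2> </h2>\n<h2>INNOVATION.</h2>\n";

// (\/?) captures the optional slash, so one pattern handles both the
// opening and the closing tag; ${1} re-inserts it in the replacement.
$result = preg_replace('/<(\/?)h2>/i', '<${1}h1>', $section);
```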

necauqua