
An example describes it better. Suppose you have a structure like this:

<h1>TITLE OF HEAD 1</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 1, AFTER HEAD 1</td>
        </tr>
        <tr>
            <td class="one">ITEM 2, AFTER HEAD 1</td>
        </tr>
    </tbody>
</table>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 3, AFTER HEAD 1</td>
        </tr>
        <tr>
            <td class="one">ITEM 4, AFTER HEAD 1</td>
        </tr>
        <tr>
            <td class="one">ITEM 5, AFTER HEAD 1</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 2</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 6, AFTER HEAD 2</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 3</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 7, AFTER HEAD 3</td>
        </tr>
        <tr>
            <td class="one">ITEM 8, AFTER HEAD 3</td>
        </tr>
        <tr>
            <td class="one">ITEM 9, AFTER HEAD 3</td>
        </tr>
        <tr>
            <td class="one">ITEM 10, AFTER HEAD 3</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 4</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 11, AFTER HEAD 4</td>
        </tr>
        <tr>
            <td class="one">ITEM 12, AFTER HEAD 4</td>
        </tr>
    </tbody>
</table>

And with regex, the outcome should be:

<table>
    <tbody>
        <tr>
            <td class="one">ITEM 1, AFTER HEAD 1</td>
            <td class="two">TITLE OF HEAD 1</td>
        </tr>
        <tr>
            <td class="one">ITEM 2, AFTER HEAD 1</td>
            <td class="two">TITLE OF HEAD 1</td>
        </tr>
    </tbody>
</table>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 3, AFTER HEAD 1</td>
            <td class="two">TITLE OF HEAD 1</td>
        </tr>
        <tr>
            <td class="one">ITEM 4, AFTER HEAD 1</td>
            <td class="two">TITLE OF HEAD 1</td>
        </tr>
        <tr>
            <td class="one">ITEM 5, AFTER HEAD 1</td>
            <td class="two">TITLE OF HEAD 1</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 2</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 6, AFTER HEAD 2</td>
            <td class="two">TITLE OF HEAD 2</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 3</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 7, AFTER HEAD 3</td>
            <td class="two">TITLE OF HEAD 3</td>
        </tr>
        <tr>
            <td class="one">ITEM 8, AFTER HEAD 3</td>
            <td class="two">TITLE OF HEAD 3</td>
        </tr>
        <tr>
            <td class="one">ITEM 9, AFTER HEAD 3</td>
            <td class="two">TITLE OF HEAD 3</td>
        </tr>
        <tr>
            <td class="one">ITEM 10, AFTER HEAD 3</td>
            <td class="two">TITLE OF HEAD 3</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 4</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 11, AFTER HEAD 4</td>
            <td class="two">TITLE OF HEAD 4</td>
        </tr>
        <tr>
            <td class="one">ITEM 12, AFTER HEAD 4</td>
            <td class="two">TITLE OF HEAD 4</td>
        </tr>
    </tbody>
</table>

What I've tried so far:

Getting the string inside each <h1> is easy:

find: (<h1>)(.*?)(</h1>) replace: $2

Then I tried:

find: (<h1>)(.*?)(</h1>)(\n|.)*?(<td class="one">.*?</td>) replace: $5<td class="two">$2</td>

which works, but the other tags are removed as well, so I've modified it:

find: (<h1>)(.*?)(</h1>)((\n|.)*?)(<td class="one">.*?</td>) replace: $4$6<td class="two">$2</td>

The text of each h1 should be used for every td that follows it, until the next h1 occurs, whose text is then used instead. The problem is that this only works for the first td after each h1, not for all of them.
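To make the limitation concrete, here is a minimal TypeScript sketch that runs the last find/replace pair on a shortened sample. The sample string and the variable names (`html`, `attempt`, `result`) are made up for illustration, and the `/` in the closing tags is escaped only because the pattern is written as a regex literal.

```typescript
// Shortened, hypothetical sample: one heading followed by two rows.
const html = [
  "<h1>HEAD 1</h1>",
  "<table>",
  '<tr><td class="one">ITEM 1</td></tr>',
  '<tr><td class="one">ITEM 2</td></tr>',
  "</table>",
].join("\n");

// The pattern from the last attempt above.
const attempt = /(<h1>)(.*?)(<\/h1>)((\n|.)*?)(<td class="one">.*?<\/td>)/g;

// Every match has to start at an <h1>, and a match consumes that <h1>,
// so each heading can take part in only one match.
const result = html.replace(attempt, '$4$6<td class="two">$2</td>');

// Only the ITEM 1 row gains a class="two" cell; the ITEM 2 row is unchanged.
console.log(result);
```

Because the heading is part of the matched text, the search resumes after the first td and finds no further `<h1>` to start from before the second row.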

Could somebody tell me what needs to be added to the regex for this to work?

Thank you!

stst
  • You can't match the same substring several times, which is why your approach doesn't work. A workaround is to use a lookbehind, which doesn't consume the substring you are interested in. You can do that: https://regex101.com/r/nwQtZM/1 but you need a second pass (with a simpler pattern) to remove the h1 tags. – Casimir et Hippolyte Jun 08 '22 at 18:31
  • Also possible in one pass: https://regex101.com/r/nwQtZM/2 – Casimir et Hippolyte Jun 08 '22 at 19:22
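The two-pass idea from the first comment can be sketched in TypeScript roughly as follows. This is not the pattern behind the regex101 links; the patterns, the sample string, and the names `addCell`, `pass1` and `pass2` are assumptions made for illustration. It relies on variable-length lookbehind, which ECMAScript 2018 engines support.

```typescript
// Hypothetical sample: two headings, each followed by a table.
const html = [
  "<h1>HEAD 1</h1>",
  "<table>",
  '<tr><td class="one">ITEM 1</td></tr>',
  '<tr><td class="one">ITEM 2</td></tr>',
  "</table>",
  "<h1>HEAD 2</h1>",
  "<table>",
  '<tr><td class="one">ITEM 3</td></tr>',
  "</table>",
].join("\n");

// Pass 1: the lookbehind finds the nearest preceding <h1> and captures its
// text in group 1 without consuming it, so every td can reuse the heading.
const addCell = /(?<=<h1>([^<]*)<\/h1>[\s\S]*?)(<td class="one">.*?<\/td>)/g;
const pass1 = html.replace(addCell, '$2<td class="two">$1</td>');

// Pass 2: a simpler pattern removes the headings themselves.
const pass2 = pass1.replace(/<h1>[^<]*<\/h1>\n?/g, "");

console.log(pass2);
```

The lazy `[\s\S]*?` inside the lookbehind is what makes each td pick up the closest heading above it rather than the first heading in the file.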

1 Answer

Use

<h1>([^<]*)<\/h1>\s*\n([\w\W]*?)(([^\n\S]*)<td\s.*?<\/td>(\n))(?=\s*<\/tr>)|(?<=<h1>([^<]*)<\/h1>[\w\W]*?)(([^\n\S]*)<td\s.*?<\/td>(\n))(?=\s*<\/tr>)

See regex proof.

Replace with (the replacement string ends at $9):

$2$3$4$7$8<td class="two">$1$6</td>$5$9

EXPLANATION

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  <h1>                     '<h1>'
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  </h1>                    '</h1>'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \n                       '\n' (newline)
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    [\w\W]*?                 any character of: word characters (a-z,
                             A-Z, 0-9, _), non-word characters (all
                             but a-z, A-Z, 0-9, _) (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    (                        group and capture to \4:
--------------------------------------------------------------------------------
      [^\n\S]*                 any character except: '\n' (newline),
                               non-whitespace (all but \n, \r, \t,
                               \f, and " ") (0 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )                        end of \4
--------------------------------------------------------------------------------
    <td                      '<td'
--------------------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
    </td>                    '</td>'
--------------------------------------------------------------------------------
    (                        group and capture to \5:
--------------------------------------------------------------------------------
      \n                       '\n' (newline)
--------------------------------------------------------------------------------
    )                        end of \5
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    </tr>                    '</tr>'
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
    <h1>                     '<h1>'
--------------------------------------------------------------------------------
    (                        group and capture to \6:
--------------------------------------------------------------------------------
      [^<]*                    any character except: '<' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \6
--------------------------------------------------------------------------------
    </h1>                    '</h1>'
--------------------------------------------------------------------------------
    [\w\W]*?                 any character of: word characters (a-z,
                             A-Z, 0-9, _), non-word characters (all
                             but a-z, A-Z, 0-9, _) (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  (                        group and capture to \7:
--------------------------------------------------------------------------------
    (                        group and capture to \8:
--------------------------------------------------------------------------------
      [^\n\S]*                 any character except: '\n' (newline),
                               non-whitespace (all but \n, \r, \t,
                               \f, and " ") (0 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )                        end of \8
--------------------------------------------------------------------------------
    <td                      '<td'
--------------------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
    </td>                    '</td>'
--------------------------------------------------------------------------------
    (                        group and capture to \9:
--------------------------------------------------------------------------------
      \n                       '\n' (newline)
--------------------------------------------------------------------------------
    )                        end of \9
--------------------------------------------------------------------------------
  )                        end of \7
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    </tr>                    '</tr>'
--------------------------------------------------------------------------------
  )                        end of look-ahead
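As a quick check outside the editor, the pattern and replacement above can be run through a plain ECMAScript engine. The following TypeScript is a minimal sketch under assumptions: the shortened sample, LF-only line endings, and the names `html`, `pattern` and `result` are made up for illustration.

```typescript
// Hypothetical, shortened sample with LF line endings and indentation.
const html = [
  "<h1>HEAD 1</h1>",
  "<table>",
  "    <tr>",
  '        <td class="one">ITEM 1</td>',
  "    </tr>",
  "    <tr>",
  '        <td class="one">ITEM 2</td>',
  "    </tr>",
  "</table>",
  "<h1>HEAD 2</h1>",
  "<table>",
  "    <tr>",
  '        <td class="one">ITEM 3</td>',
  "    </tr>",
  "</table>",
].join("\n");

// The pattern from this answer, unchanged, with the global flag.
const pattern =
  /<h1>([^<]*)<\/h1>\s*\n([\w\W]*?)(([^\n\S]*)<td\s.*?<\/td>(\n))(?=\s*<\/tr>)|(?<=<h1>([^<]*)<\/h1>[\w\W]*?)(([^\n\S]*)<td\s.*?<\/td>(\n))(?=\s*<\/tr>)/g;

// Groups that did not take part in a match expand to the empty string,
// so one replacement string serves both alternatives.
const result = html.replace(pattern, '$2$3$4$7$8<td class="two">$1$6</td>$5$9');

console.log(result);
```

On this sample, the first alternative handles the first row after each heading and consumes the `<h1>` line, so the headings do not appear in the output; the second alternative handles the remaining rows through the lookbehind.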
Ryszard Czech
  • The above regex doesn't handle newlines correctly for VS Code. See https://regex101.com/r/Xi8MKz/1. Generally you just have to add `\r?` or `\r` in front of each `\n`. Compare the newline handling to Casimir's, which is correct. – Mark Jun 10 '22 at 19:23
  • I have tested in my Visual Studio Code and it works. – Ryszard Czech Jun 10 '22 at 21:21
  • @PoulBak I see [it works](https://imgur.com/a/KP27Lge). Note that VSCode does not use the .NET regex engine in the search and replace feature; it uses an ECMAScript 2018 compliant engine. – Wiktor Stribiżew Jun 14 '22 at 15:01
  • My bad, I thought VS and VS Code used the same engine. – Poul Bak Jun 14 '22 at 20:30
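The newline point raised in the comments can be illustrated on a plain string. The TypeScript below is a sketch under assumptions: it uses only the tail of the answer's pattern, a made-up one-row sample, and hypothetical names (`crlf`, `lf`, `bareNewline`, `optionalCr`). It shows how an ECMAScript engine treats a raw CRLF string, not how the VS Code editor itself handles line endings in search and replace.

```typescript
// The same row with Windows (CRLF) and Unix (LF) line endings.
const crlf = '<td class="one">ITEM 1</td>\r\n</tr>';
const lf = '<td class="one">ITEM 1</td>\n</tr>';

// Tail of the pattern as written in the answer: a bare \n after </td>.
const bareNewline = /<td\s.*?<\/td>(\n)(?=\s*<\/tr>)/;

// Variant with an optional carriage return in front of the \n.
const optionalCr = /<td\s.*?<\/td>(\r?\n)(?=\s*<\/tr>)/;

console.log(bareNewline.test(lf));   // true
console.log(bareNewline.test(crlf)); // false: \r sits between </td> and \n
console.log(optionalCr.test(lf));    // true
console.log(optionalCr.test(crlf));  // true
```

Writing `\r?\n` keeps the pattern usable on both kinds of input when the text is processed as a raw string.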