Regular expression to split one big HTML table into several tables of 5 rows

Question

I'm trying to get to grips with regular expressions and ran into a problem. I have an HTML file with plain text and exactly one table. Text can appear before and after the table, and the table contains no <thead>, <tbody>, <tfoot>, rowspan and so on. I need to split this table into several tables of 5 rows each (the last one may have 5 or fewer), repeating the first row of the original table in each output table. For example:

<table>
  <tr>
   <td>A</td><td>B</td>
  </tr>
  <tr>
   <td>A1</td><td>B1</td>
  </tr>
  <tr>
   <td>C</td><td>D</td>
  </tr>
  <tr>
   <td>E</td><td>F</td>
  </tr>
  <tr>
   <td>E1</td><td>F1</td>
  </tr>
  <tr>
   <td>E2</td><td>F2</td>
  </tr>
  <tr>
   <td>E3</td><td>F3</td>
  </tr>
  <tr>
   <td>E4</td><td>F4</td>
  </tr>
</table>

Should become:

<table>
  <tr>
   <td>A</td><td>B</td><--!!!(not needed to be in code)-->
  </tr>
  <tr>
   <td>A1</td><td>B1</td>
  </tr>
  <tr>
   <td>C</td><td>D</td>
  </tr>
  <tr>
   <td>E</td><td>F</td>
  </tr>
  <tr>
   <td>E1</td><td>F1</td>
  </tr>
</table>
<table>
  <tr>
   <td>A</td><td>B</td><--!!!(not needed to be in code)-->
  </tr>
  <tr>
   <td>E2</td><td>F2</td>
  </tr>
  <tr>
   <td>E3</td><td>F3</td>
  </tr>
  <tr>
   <td>E4</td><td>F4</td>
  </tr>
</table>

I need to do this using PCRE in PHP, including arrays of patterns and replacements, and I'm having trouble implementing it. So far I can find the first row with <table>\s*?(<tr>(?:\s|.)*?<\/tr>) and up to 4 consecutive rows with (<tr>(?:\s|.)*?<\/tr>\s*){1,4}, but I can't work out how to find all occurrences of the second pattern so I can use them later, or how to stop searching at the closing </table> tag. Please help.

EDIT

The question has been answered, so the next step is to allow <thead>, <tbody> and <tfoot> tags in the original table. The output tables should reproduce the structure of the original table: if the first row of the original table was inside <thead>, it should be inside <thead> in all output tables.

Maybe just going with [DOMDocument](http://php.net/manual/en/class.domdocument.php) would be easier at this point? :) — IncredibleHat, Dec 17 '17 at 16:48
Ah, its the principle of the matter... must ... make ... it ... work... !!! — IncredibleHat, Dec 17 '17 at 16:50
And also I know that its not like to be used in real project just for me — lolmosk, Dec 17 '17 at 16:50
Oh I'm terrible with regex. Need a regexpert to pop in here and show their magic. — IncredibleHat, Dec 17 '17 at 16:53
@lolmosk Because of the typo (which you have fixed by now), I have completely misunderstood your comment, reading something unacceptably rude into it. I now see the harmless meaning and apologise for any inconvenience I may have caused. — Yunnosch, Dec 17 '17 at 17:26
Since I happen to be half of such a wizard (I know Perl-compatible regex, but not PHP), I would like to help you. Do I understand correctly that you want to split one table into two and start the second one with the same row as the first, augmented by a comment saying that it is not needed in code, so that the first table has four of the following input lines and the second table has all the remaining lines (however many there are)? — Yunnosch, Dec 17 '17 at 17:35
Since I lack PHP knowledge, can you (or anybody else) provide the PHP line for applying a regex `s/find/replace/` in a loop until it does not replace anything anymore? Note that this is probably not achieved with a `s///g`. — Yunnosch, Dec 17 '17 at 17:38
However, HTML/XML manipulation via regex is tedious, risky and usually unsatisfying in the long run. Did you consider using a dedicated tool? — Yunnosch, Dec 17 '17 at 17:39
Would something like this work? `$result = preg_replace('%((\s*.*?){5})%sm', "\$1\r\n\r\n", $subject);` — Lieven Keersmaekers, Dec 17 '17 at 17:46
@Yunnosch yeah u understood right, but there is no limit on the number of tables; the quantity is determined by the condition that output tables should be only 5 rows long. So u take the first row of the original table (copy it into all output tables, without the comment, it was just for the example), then take the 4 next rows and make a new table like 1 2 3 4 5 (numbers of rows), then another table like 1 6 7 8 9, and so on, 1 10 11 12 13, till the original table ends. — lolmosk, Dec 17 '17 at 17:46
@LievenKeersmaekers it works nearly the way that is needed, but the first row of the original table isn't copied into the other tables; tables on output should be like `(first row of original table)(4 more rows)` — lolmosk, Dec 17 '17 at 17:57
Don't do this. [Just don't](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Even if you can make it to work... sooner or later you'll bump in the next issue it has with some particular HTML. Think of HTML comments, unexpected attribute values, `rowspan` and `colspan` attributes, nested tags like `span`, `b`, `i`, ... — trincot, Dec 17 '17 at 18:09
Do you want the code to produce the `<--!!!(not needed to be in code)-->` in the output, or is that just your comment for the purpose of the question? — trincot, Dec 17 '17 at 18:45
@trincot I do not want it to be produced its only for the purpose of the question to highlight this string — lolmosk, Dec 17 '17 at 18:49

trincot · Accepted Answer · 2017-12-18T19:56:15.287

You can achieve this by performing a loop where each iteration will add the next "table break" with preg_replace (but see disclaimer at the end). The proposed regular expression will find the following groups:

The last occurrence of the <table> tag together with the first row that follows it, or, if there is a thead and/or tbody tag, up until the closing </thead> tag, including the opening <tbody> tag if there is one.
The 4 next rows that follow that. There must be 4 of them.

It then also looks ahead to ensure that there is at least one more row present.

With that information a single "table break" can be injected into the HTML string.

If the table has a tfoot part (which should then also be repeated in each partition of the table), we will not yet have that information, as it occurs at the very end of the input. For that reason a separate parsing step is necessary, before the loop starts, to extract the footer.

This is the code assuming the input is in the variable $html:

// Extract the footer part (if there is one) and the closing table tag.
// The HTML element is <tfoot>, so that is the tag name to look for.
preg_match("#(\s*(</tbody|<tfoot).*?)?</table>#s", $html, $tableEnd);
$tableEnd = $tableEnd[0];

// Add a table break in each iteration as long as the last partition has more than 4 rows:
while (true) {
    $res = preg_replace("#(<table(?!.*<table).*?/tr>(?:.*?/thead>)?(?:.*?<tbody>)?)((?:.*?/tr>(?=\s*<tr)){4})#s", 
                        "$1$2$tableEnd\n$1", $html);
    if (strlen($res) === strlen($html)) break;
    $html = $res;
}

echo $res;

See it run on eval.in.
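
To make the footer-extraction step more concrete, here is a minimal sketch of what that first `preg_match` call captures. The two sample inputs are made up for illustration, and the pattern is written with the standard element name `<tfoot`:

```php
<?php
// Hypothetical sample inputs; only the pattern comes from the answer.
$pattern = '#(\s*(</tbody|<tfoot).*?)?</table>#s';

// A table with a body and a footer section.
$withFooter = "<table>\n<tbody>\n<tr><td>1</td></tr>\n</tbody>\n"
            . "<tfoot><tr><td>sum</td></tr></tfoot>\n</table>";
preg_match($pattern, $withFooter, $m);
$tableEnd = $m[0];   // starts at the whitespace before </tbody>, ends with </table>

// A flat table without tbody/tfoot: only the closing tag is captured.
$flat = "<table><tr><td>1</td></tr></table>";
preg_match($pattern, $flat, $m2);
$flatEnd = $m2[0];   // "</table>"

var_dump($tableEnd, $flatEnd);
```

Whatever ends up in `$tableEnd` is what gets inserted at every table break, which is why the footer is repeated in each partition.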

Explanation of the main regex

Here are a few highlights in the main regex:

#: I used this as regex delimiter instead of / to avoid the need to escape / inside the regex itself. If you need to use / as delimiter then escape each / as \\/: one backslash is for the regex and one more to escape that backslash in the context of the string literal.
(?!.*<table): makes sure that there is no other <table> tag following the one we are about to match. It is a negative look ahead.
((?:.*?/tr>(?=\s*<tr)){4}): grabs 4 rows, and with positive look ahead ((?= )) requires that each row is immediately followed by another row. The (?: ) pattern does not make a capture group, but the outer parentheses do create one.
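
The two look-ahead constructs can be tried in isolation. The following is a small sketch on made-up input (the `id` attributes and row contents are illustrative only), not part of the original solution:

```php
<?php
// Negative look-ahead: match only a <table that has no other <table after it.
$twoTables = '<table id="a"><tr><td>1</td></tr></table>'
           . '<table id="b"><tr><td>2</td></tr></table>';
preg_match('#<table(?!.*<table)[^>]*>#s', $twoTables, $m);
$lastOpeningTag = $m[0];   // opening tag of the last table

// Positive look-ahead: a row only counts if another <tr follows it.
$rows = "<tr>a</tr>\n<tr>b</tr>\n<tr>c</tr>";
preg_match_all('#.*?/tr>(?=\s*<tr)#s', $rows, $all);
$rowsWithSuccessor = count($all[0]);   // the final row has no successor

var_dump($lastOpeningTag, $rowsWithSuccessor);
```

Because the last row of a table never satisfies the positive look-ahead, the `{4}` group can only match when at least one more row remains for the next partition.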

The replacement

If the replacement only re-inserted the two captured groups (i.e. $1$2), nothing would change. The additional $tableEnd\n$1 closes the table (with the footer, if there is one) and starts the next one by reusing the first capture group, which contains the opening <table> tag together with the first row and/or table header.
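
As an illustration of a single pass, here is a sketch that applies the replacement once to a made-up flat table with one header row and six data rows. The sample data and the variable names `$once` and `$twice` are assumptions for this example:

```php
<?php
// Hypothetical input: header row "H" followed by data rows 1..6.
$html = "<table>\n<tr><td>H</td></tr>";
for ($i = 1; $i <= 6; $i++) {
    $html .= "\n<tr><td>$i</td></tr>";
}
$html .= "\n</table>";

$tableEnd = '</table>';   // what the footer-extraction step yields for a flat table
$pattern  = '#(<table(?!.*<table).*?/tr>(?:.*?/thead>)?(?:.*?<tbody>)?)'
          . '((?:.*?/tr>(?=\s*<tr)){4})#s';

// \$1 and \$2 stay literal back-references; $tableEnd is interpolated by PHP.
$once  = preg_replace($pattern, "\$1\$2$tableEnd\n\$1", $html);

// A second pass finds fewer than 4 rows with a successor, so nothing changes.
$twice = preg_replace($pattern, "\$1\$2$tableEnd\n\$1", $once);

echo $once;
```

After the first pass there are two tables, each starting with the header row; the second pass leaves the string unchanged, which is the condition the loop uses to stop.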

Disclaimer

Although the above may work in many cases, it is quite possible to break it, as regular expressions are not the ideal way to parse or interpret HTML. You should really use a DOM API for that, and PHP has one: DOMDocument.
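
For comparison, a DOM-based version could look roughly like the sketch below. It is not the method used in this answer: the function name `splitTable()` and its `$rowsPerTable` parameter are hypothetical, and it assumes a single flat table as in the original question (no thead/tbody/tfoot handling).

```php
<?php
// Hypothetical helper, for illustration only.
function splitTable(string $html, int $rowsPerTable = 5): string
{
    if ($rowsPerTable < 2) {
        throw new InvalidArgumentException('Need room for the header plus one row.');
    }

    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $table = $doc->getElementsByTagName('table')->item(0);
    if ($table === null) {
        return $html;                   // no table found, nothing to split
    }

    // Copy the live node list into a plain array before moving nodes around.
    $rows = iterator_to_array($table->getElementsByTagName('tr'), false);
    if (count($rows) <= $rowsPerTable) {
        return $html;                   // already small enough
    }
    $header = array_shift($rows);

    // Each new table gets a copy of the header row plus the next chunk of rows.
    foreach (array_chunk($rows, $rowsPerTable - 1) as $chunk) {
        $part = $table->cloneNode(false);            // <table> without children
        $part->appendChild($header->cloneNode(true));
        foreach ($chunk as $row) {
            $part->appendChild($row);                // moves the row to the new table
        }
        $table->parentNode->insertBefore($part, $table);
    }
    $table->parentNode->removeChild($table);

    // Serialize only the children of the <body> element the parser adds.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $node) {
        $out .= $doc->saveHTML($node);
    }
    return $out;
}

// Usage with made-up data: one header row (r0) and seven data rows.
$sample = 'before<table>';
for ($i = 0; $i < 8; $i++) {
    $sample .= "<tr><td>r$i</td></tr>";
}
$sample .= '</table>after';
$result = splitTable($sample);
echo $result;
```

The parser may normalize the markup around the table (for example, by wrapping bare text in a paragraph), and whitespace between rows is not preserved, so the output is equivalent HTML rather than a byte-for-byte edit of the input.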

Thanks alot it works as should. One more little question: is it possible to include usage of `tfoot>` without `rowspan, colspan, b, i` and other tags just those 3. — lolmosk, Dec 17 '17 at 19:31
Sure, but maybe it would be good if you tried it first yourself, now that you have the idea how to approach this. Also, it would be good to add an "##edit" section to your question, and add the input and desired output *incuding* those extra tags. — trincot, Dec 17 '17 at 20:19
Made some edits to question so gimme some hours to try it myself:) — lolmosk, Dec 17 '17 at 20:45
didnt cape with it for today the idea is in these 4 rows we need to find we look for one of tags like submask and then we replace it and close the tag, cannot still get how to get the right place to close the tag:as I think if we can find both opening and closing tags everything is simple but if we have only one-side tag I dunno how to get the right place — lolmosk, Dec 17 '17 at 23:08
for example speaking about opening tag: we did find the closing one and logically want to open it in the begining of the table but we can take part in first-row construction, simply speaking we can get: `
Lorem ipsum
dolor sit amet` which is wrong as w3c validator says — lolmosk, Dec 17 '17 at 23:11
`$res = preg_replace("#()(?(?=)())((?:.*?/tr>){4})(?=.*/tr>)#s", "$1$2$3\n

\n$1", $html);` this code is my limit of dummyness so this work only when whole table is in `` i don't know how to fix this, tried alot but no results:( — lolmosk, Dec 18 '17 at 14:38
OK, I have updated my answer to support optional `thead`, `tbody` and `tfoot` tags. — trincot, Dec 18 '17 at 19:56
hello, friend, would you mind helping me again with one similar task has no time for real to break my brain, please:) — lolmosk, Dec 25 '17 at 21:00
I'm not available during this holiday period. Ask a new question and others will try to help you out. — trincot, Dec 27 '17 at 17:25
