
Can you help me? I have tried many options but still can't get this working. I have this regex:

/(?<startSearch>24510<\/td>.+\n).+(\|a(?<title>.+)){0,1}(\|b(?<subtitle>.+)){0,1}(\|c(?<author_info>.+)){0,1}(?<endSearch><\/td>)/u

and the web page stored in a variable, which contains the following:

<tr valign=top> 
  <td class=td1 id=bold width="10%" nowrap>24510</td> 
  <td class=td1>|a První kroky z deprese / |c Sue Atkinson ; [z angličtiny přeložil Jindřich Kotvrda]</td> 
</tr>

As you can see in the test at the link above, I'm able to capture the startSearch and endSearch groups, but not the others (title, subtitle, author_info), each of which sits inside an optional group (agroup, bgroup, cgroup).

The expected output for this example is:

$match['title'] = ' První kroky z deprese / '
// $match['subtitle'] is not in here, because it doesn't exist in the example
$match['author_info'] = ' Sue Atkinson ; [z angličtiny přeložil Jindřich Kotvrda]'

Can you find where the problem is and show me a solution, please?

EDIT: OK, I rewrote it to use SimpleHtmlDom and now get the text by DOM traversal. But the main issue with the capturing groups is still there. I updated the link with the new text I got from the DOM and updated the regex syntax, but it is still not working: it takes the whole text as agroup.

  • You should *never* parse HTML with regex. Use [a PHP DOM parser](http://simplehtmldom.sourceforge.net/) instead. – Jay Blanchard Jan 02 '20 at 12:48
  • Jay, yep, I got it, but I still need to use regex on the final string. Look, I added an edit. – Petr Kateřiňák Jan 02 '20 at 14:38
  • Hold on a sec, working on it for you. – Jay Blanchard Jan 02 '20 at 14:48
  • With simplified regex http://sandbox.onlinephpfunctions.com/code/b12891cada82758badf566e3adc5f9f587c5a3bc and https://regex101.com/r/C9cil2/7 – Jay Blanchard Jan 02 '20 at 14:54
  • Updated regex https://regex101.com/r/C9cil2/9 provided they will have this format. – Jay Blanchard Jan 02 '20 at 14:57
  • Hey, thanks, but it is not working for me. I've added a few more results to try: https://regex101.com/r/C9cil2/10, but it is not capturing the groups by | + letter. As you can see, group1 is only "|a", group2 is group1 and group2 together, and group3 is almost OK but includes the group letter ('c'). I know I can do it with strpos and then take the substring between |x and the next | or the end, but I want to know whether regex can work here, as it should be quicker. – Petr Kateřiňák Jan 02 '20 at 15:20
  • Try this one https://regex101.com/r/C9cil2/11 – Jay Blanchard Jan 02 '20 at 15:38
  • Or this one https://regex101.com/r/C9cil2/12 – Jay Blanchard Jan 02 '20 at 15:39
  • Here is the result after updating your latest https://regex101.com/r/C9cil2/13 – Jay Blanchard Jan 02 '20 at 15:41
  • http://sandbox.onlinephpfunctions.com/code/78d5935bf860834b9b487f27f52a1f88d0cfe61c – Jay Blanchard Jan 02 '20 at 16:09
  • Jay, there is still a and b together... [2] => Grafický design : |b základní pravidla a způsoby jejich porušování / – Petr Kateřiňák Jan 02 '20 at 17:10
  • So they are not the same format each time? That will make things very difficult. – Jay Blanchard Jan 02 '20 at 17:12
  • https://regex101.com/r/C9cil2/15 but it doesn't work for all text that you have. Perhaps you can do the first one, test for the other condition and do the other one? – Jay Blanchard Jan 02 '20 at 17:18
  • I know, it is not easy, because the end of the first group is also the start of the next group and all of them are optional. I solved it with a function using strpos, substr and strlen: http://sandbox.onlinephpfunctions.com/code/a74984aae9e7c832984889869c39524eb1db807d – Petr Kateřiňák Jan 02 '20 at 17:39
  • http://sandbox.onlinephpfunctions.com/code/7a3c80e7bbe81a6cd57cc695a6be46259ac82742 with one `strpos()` and switching patterns. New pattern https://regex101.com/r/C9cil2/16 – Jay Blanchard Jan 02 '20 at 18:04
  • BTW, all of this should've been in the original question. You had multiple patterns to compare and there is no way we could've known until you added your comments. – Jay Blanchard Jan 02 '20 at 18:07
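
For reference, the point raised in the comments (the end of one group is also the start of the next, and every group is optional) can be handled in a single pattern by replacing the greedy `.+` with `[^|]*`, which cannot run past the next pipe. The following is only a minimal sketch, not a tested solution for every record: the function name `parseSubfields` is hypothetical, and it assumes the string has already been extracted from the `<td>`, that the subfields appear in the order |a, |b, |c, and that the values themselves never contain a literal `|`.

```php
<?php
// Hypothetical helper (not from the original post): splits a string such as
// "|a Title / |c Author" into its optional |a, |b and |c subfields.
function parseSubfields(string $text): array
{
    // [^|]* stops at the next pipe, so one group cannot swallow the following one.
    // Each (?: ... )? wrapper makes the whole subfield optional.
    $pattern = '/^\s*(?:\|a(?<title>[^|]*))?(?:\|b(?<subtitle>[^|]*))?(?:\|c(?<author_info>[^|]*))?/u';

    $result = [];
    if (preg_match($pattern, $text, $match)) {
        foreach (['title', 'subtitle', 'author_info'] as $name) {
            // A group that did not take part in the match is either absent
            // or an empty string in $match, so both cases are skipped here.
            if (isset($match[$name]) && $match[$name] !== '') {
                $result[$name] = $match[$name];
            }
        }
    }
    return $result;
}

// Sample string taken from the question.
$fields = parseSubfields('|a První kroky z deprese / |c Sue Atkinson ; [z angličtiny přeložil Jindřich Kotvrda]');
print_r($fields);
```

On the sample from the question this fills `title` and `author_info` and leaves `subtitle` unset, matching the expected output listed there. Records whose subfields come in a different order, or that use other letters, would need a different approach, for example `preg_match_all` over a generic `\|([a-z])([^|]*)` pattern.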
