Regex Match Html Tag and Inner Html Pattern

Question

I've scraped a web page and I'm trying to extract data from a td that has no class or id. Let's say I have the following HTML:

<table>
    <tr>
        <td>Title</td>
        <td>The Harsh Face of Mother Nature</td>
    </tr>
    .
    .
    .
</table>

I'm trying to do a preg_match:

$title = preg_match("\(>Title)(.*?)(?=<\/td\>{2})\", $html);

My pattern starts with >Title and the ending is the 2nd occurrence of </td>.

I've been working with https://regex101.com/ to try to figure this out, but regex is really tough! Especially with the obscure stuff I'm trying to accomplish. Any help, please? Thanks!

(edit below:) The goal is to get a string like: </td><td>The Harsh Face of Mother Nature, then to do another preg_match to remove the first part, leaving the final result: The Harsh Face of Mother Nature

Are you OK with parsing HTML files with regex? What would be your expected output? – Avinash Raj, Mar 25 '15 at 03:48
What are you trying to capture? Just two td tags or more than two? Do you know for sure the first contains Title? – Moishe Lipsker, Mar 25 '15 at 04:00

score 1 · Accepted Answer · answered Mar 25 '15 at 06:52


Try another technique: >Title.*?(?=<td>)<td>\K.*?(?=<\/td>)

$re = "/>Title.*?(?=<td>)<td>\\K.*?(?=<\\/td>)/s";
$str = "<table> \n <tr>\n <td>Title</td>\n <td>The Harsh Face of Mother Nature</td>\n <td>The Harsh Face of Mother Nature</td>\n </tr>\n .\n .\n .\n</table>";

preg_match_all($re, $str, $matches);
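When only one title is expected, the same pattern can be used with preg_match. The following is a minimal sketch, assuming the sample markup from the question; the variable names $html and $title are illustrative. Because \K discards everything matched before it, the whole match in $m[0] is already the cell text, so no second preg_match is needed.

```php
<?php
// Pattern from the answer above. \K drops everything matched so far
// from the reported match, so $m[0] holds only the second cell's text.
$re = '/>Title.*?(?=<td>)<td>\K.*?(?=<\/td>)/s';

// Sample markup modeled on the question (an assumption, not the real page).
$html = "<table>\n<tr>\n<td>Title</td>\n<td>The Harsh Face of Mother Nature</td>\n</tr>\n</table>";

// preg_match returns 1 on a match, 0 on no match, and false on error.
$title = null;
if (preg_match($re, $html, $m) === 1) {
    $title = trim($m[0]);
}

echo $title, PHP_EOL;
```

Checking the return value against 1 keeps a failed match (0) and a regex error (false) from being mistaken for a result.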



Ahosan Karim Asik


This seems to work in the regex101, thank you. Sadly though, I have a problem making my $crawler object a string, so I cannot apply this method until I have that problem solved. (http://stackoverflow.com/questions/29267492/domcrawler-not-dumping-data-properly-for-parsing) – Kenny Mar 25 '15 at 22:11

score 0 · Answer 2 · answered Mar 25 '15 at 04:03

You could use the regex below with preg_match or preg_match_all:

>Title.*?<\/td>.*?<td>\K.*?(?=<\/td>)


$re = "/>Title.*?<\\/td>.*?<td>\\K.*?(?=<\\/td>)/s";
$str = "<table> \n <tr>\n <td>Title</td>\n <td>The Harsh Face of Mother Nature</td>\n </tr>\n .\n .\n .\n</table>";
preg_match_all($re, $str, $matches);
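Since the page reportedly contains many tables, preg_match_all may be the more useful of the two calls. The sketch below assumes a hypothetical page with two tables, each holding a Title row; the second title and the extra row are made-up sample data. It collects every title into one array.

```php
<?php
// Pattern from the answer above, written as a single-quoted PHP string.
$re = '/>Title.*?<\/td>.*?<td>\K.*?(?=<\/td>)/s';

// Hypothetical page with two tables; only the first title comes from the question.
$html = <<<HTML
<table>
    <tr><td>Title</td><td>The Harsh Face of Mother Nature</td></tr>
    <tr><td>Hypothetical label</td><td>Hypothetical value</td></tr>
</table>
<table>
    <tr><td>Title</td><td>A Second Hypothetical Title</td></tr>
</table>
HTML;

// $matches[0] holds every full match; \K means each one is just the cell text.
preg_match_all($re, $html, $matches);
$titles = array_map('trim', $matches[0]);

print_r($titles);
```

Rows whose first cell is not Title are skipped, because the pattern only starts matching at >Title.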

Moishe Lipsker · Answer 3 · 2015-03-25T04:34:14.630

0

You could try this regex:

.*\<table\>\s*\<tr\>\s*\s*\<td\>title\<\/td>\s*\<td\>((\w*\s*\w*)*)<\/td>.*

It captures, in the first group, the content of the <td> that follows <td>title</td> after a <table> tag. Note that title is lowercase in the pattern, so the case-insensitive flag (i) is needed to match the sample's Title.
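The capture-group idea can be sketched in PHP as follows. This is a simplified variant rather than the exact regex above: it assumes a lazy (.*?) group in place of ((\w*\s*\w*)*), drops the <table> and <tr> prefix, and adds the i and s flags. The sample markup is taken from the question.

```php
<?php
// Simplified variant (an assumption, not the answer's exact regex):
// group 1 captures the cell that follows the Title cell.
// i = case-insensitive, s = let . match newlines.
$re = '/<td>\s*title\s*<\/td>\s*<td>(.*?)<\/td>/is';

// Sample markup modeled on the question.
$html = "<table>\n<tr>\n<td>Title</td>\n<td>The Harsh Face of Mother Nature</td>\n</tr>\n</table>";

// $m[0] is the whole match; $m[1] is the first capture group.
$title = preg_match($re, $html, $m) === 1 ? trim($m[1]) : null;

var_dump($title);
```

Reading $m[1] gives the title directly, which avoids the second preg_match the question planned to use for stripping the leading </td><td>.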

edited Mar 25 '15 at 04:34

answered Mar 25 '15 at 04:13

Moishe Lipsker


score 0 · Answer 4 · edited Mar 25 '15 at 08:08


Use the :nth-child selector in JavaScript (jQuery) to get it:

$( "table tr td:nth-child(2)" )

edited Mar 25 '15 at 08:08

Peter Bowers


answered Mar 25 '15 at 06:13

Neethu George


I can't. The web page has many, many tables, and each table is filled dynamically, so there's no knowing how many rows are in it. – Kenny Mar 25 '15 at 06:15
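A selector keyed on the label text rather than on row position might sidestep that problem while staying in PHP. The sketch below is not taken from any answer in this thread: it swaps in PHP's built-in DOMDocument and DOMXPath classes, and it assumes the label cell contains exactly the text Title. The second row is made-up sample data.

```php
<?php
// Hypothetical markup: the Title row may sit anywhere in any table.
$html = <<<HTML
<table>
    <tr><td>Hypothetical label</td><td>Hypothetical value</td></tr>
    <tr><td>Title</td><td>The Harsh Face of Mother Nature</td></tr>
</table>
HTML;

// Collect libxml parse warnings internally instead of emitting them,
// since scraped markup is often not well-formed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the first td sibling after any td whose trimmed text is "Title".
$xpath = new DOMXPath($doc);
$cells = $xpath->query("//td[normalize-space(.)='Title']/following-sibling::td[1]");

$titles = [];
foreach ($cells as $cell) {
    $titles[] = trim($cell->textContent);
}

print_r($titles);
```

Because the XPath expression matches on the cell's text, it does not depend on how many tables or rows the page contains.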
