I've scraped a web page and I'm trying to extract data from a td that has no class or ids. Let's say I have the following html:

<table> 
    <tr>
        <td>Title</td>
        <td>The Harsh Face of Mother Nature</td>
        </tr>
        .
        .
        .
</table>

I'm trying to do a preg_match:

$title = preg_match("\(>Title)(.*?)(?=<\/td\>{2})\", $html);

My pattern starts with >Title and the ending is the 2nd occurrence of </td>.

I've been working with https://regex101.com/ to try to figure this out, but regex is really tough! Especially with the obscure stuff I'm trying to accomplish. Any help, please? Thanks!

(edit below:) The goal is to get a string like: </td><td>The Harsh Face of Mother Nature, then to run a second preg_match to strip the leading part, leaving the final result: The Harsh Face of Mother Nature

Kenny

4 Answers

Try another technique: >Title.*?(?=<td>)<td>\K.*?(?=<\/td>)

$re = "/>Title.*?(?=<td>)<td>\\K.*?(?=<\\/td>)/s";
$str = "<table> \n <tr>\n <td>Title</td>\n <td>The Harsh Face of Mother Nature</td>\n <td>The Harsh Face of Mother Nature</td>\n </tr>\n .\n .\n .\n</table>";

preg_match_all($re, $str, $matches);
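For reading the result: because of \K, everything matched before it is dropped from the reported match, so the cell text ends up in $matches[0] rather than in a numbered group. A minimal sketch, assuming the single-row markup from the question ($titles is a name introduced here for illustration):

```php
<?php
// Single quotes keep the backslashes literal, so no double escaping is needed.
$re = '/>Title.*?(?=<td>)<td>\K.*?(?=<\/td>)/s';
$str = "<table>\n <tr>\n <td>Title</td>\n <td>The Harsh Face of Mother Nature</td>\n </tr>\n</table>";

preg_match_all($re, $str, $matches);

// $matches[0] holds every full match; \K discarded the ">Title ... <td>" prefix.
$titles = $matches[0];
print_r($titles);
```

If the page contains several Title rows, each one adds another entry to $titles.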

Ahosan Karim Asik
  • This seems to work in regex101, thank you. Sadly though, I have a problem making my $crawler object a string, so I cannot apply this method until I have that problem solved. (http://stackoverflow.com/questions/29267492/domcrawler-not-dumping-data-properly-for-parsing) – Kenny Mar 25 '15 at 22:11

You could use the regex below with preg_match or preg_match_all:

>Title.*?<\/td>.*?<td>\K.*?(?=<\/td>)

$re = "/>Title.*?<\\/td>.*?<td>\\K.*?(?=<\\/td>)/s";
$str = "<table> \n <tr>\n <td>Title</td>\n <td>The Harsh Face of Mother Nature</td>\n </tr>\n .\n .\n .\n</table>";
preg_match_all($re, $str, $matches);
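If only the first title is needed, a capture group with preg_match is a possible variant. Note that preg_match returns 1, 0 or false rather than the matched text, so assigning its return value to $title (as in the question) will not yield the string; the match arrives through the third argument. A sketch under the assumption that the two cells are separated only by whitespace ($pattern and $m are illustrative names):

```php
<?php
// Sample markup copied from the question.
$html = "<table>\n <tr>\n  <td>Title</td>\n  <td>The Harsh Face of Mother Nature</td>\n </tr>\n</table>";

// Using ~ as the delimiter avoids escaping the slash in </td>.
// The s modifier lets . match newlines inside the captured cell.
$pattern = '~>Title</td>\s*<td>(.*?)</td>~s';

$title = null;
// preg_match() fills $m: $m[0] is the whole match, $m[1] the first group.
if (preg_match($pattern, $html, $m) === 1) {
    $title = trim($m[1]);
}

echo $title, PHP_EOL;
```

This also removes the need for the second preg_match described in the question, since the group already excludes the leading </td><td>.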
Avinash Raj

You could try this regex: .*\<table\>\s*\<tr\>\s*\s*\<td\>title\<\/td>\s*\<td\>((\w*\s*\w*)*)<\/td>.* It captures, in the first group, the content of the <td> tag that follows <td>title</td>, which comes after a <table> tag. The pattern spells title in lowercase, so add the i modifier for it to match Title.
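A possible way to run a pattern of this shape from PHP is sketched below. It is a simplified variant, not the exact regex above: the nested (\w*\s*\w*)* group is swapped for [^<]*, an assumption made here so that punctuation in the cell is accepted and the engine does less backtracking.

```php
<?php
$html = "<table> \n    <tr>\n        <td>Title</td>\n        <td>The Harsh Face of Mother Nature</td>\n    </tr>\n</table>";

// i modifier: the lowercase "title" in the pattern also matches "Title".
// [^<]* captures everything up to the next tag (simplification of this sketch).
$pattern = '~<table>\s*<tr>\s*<td>title</td>\s*<td>([^<]*)</td>~i';

$title = null;
if (preg_match($pattern, $html, $m) === 1) {
    $title = trim($m[1]); // group 1 is the second cell's text
}

echo $title, PHP_EOL;
```

Because the pattern is anchored to <table><tr>, it only finds a Title row that is the first row of its table.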

Moishe Lipsker

Use jQuery's :nth-child selector to get it:

$( "table tr td:nth-child(2)" )
Peter Bowers
Neethu George
  • I can't, the web page has many, many tables, and each table is filled dynamically, so there's no knowing how many rows are in it. – Kenny Mar 25 '15 at 06:15
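The selector idea can also be tried on the PHP side, keyed on the label cell instead of a row position, so the number of tables or rows should not matter. This is a sketch using PHP's bundled DOM extension, not something proposed in the thread; the XPath expression and variable names are assumptions for illustration.

```php
<?php
$html = "<table>\n <tr>\n  <td>Title</td>\n  <td>The Harsh Face of Mother Nature</td>\n </tr>\n</table>";

// Scraped pages are often not valid HTML; collect parser warnings quietly.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Find the cell whose text is "Title", then take the next <td> in the same row.
$cells = $xpath->query("//td[normalize-space(.)='Title']/following-sibling::td[1]");

$title = $cells->length > 0 ? trim($cells->item(0)->textContent) : null;
echo $title, PHP_EOL;
```

If several tables have a Title row, $cells contains one node per row and can be iterated with foreach.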