
I am trying to extract data from a string (the whole source of a website, fetched with cURL):

<tr>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>

I would like to collect the text of every 3-character anchor into an array, for example AAL and AAT (there are more).

What I have is:

$subject = curl_exec($ch);        
$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">[0-9A-Z]{3}</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);
print_r($matches);

As a result I get

Array ( [0] => Array ( ) ) 

Could you give me any advice on how to resolve this?

Giacomo1968
andrewpo
  • Have you considered using a DOM tool (DOMDocument) or similar? – Mike Brant Jun 09 '14 at 20:52
  • I haven't. I thought it would be more convenient this way (regex). – andrewpo Jun 09 '14 at 20:53
  • Regex isn't great for HTML. There's a rather funny answer [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/) – user184994 Jun 09 '14 at 20:55
  • I have to give it a try - thank you – andrewpo Jun 09 '14 at 20:56
  • @andrewpo Typically, using regex to extract data from an HTML/DOM structure is an exercise in frustration. HTML is not a markup style that is very conducive to regex, unless you have a document structure that is well known and not subject to change and you are looking for something very simple from it. You would probably be best suited to learn DOM manipulation tools, which are drastically more useful for getting data from such documents. – Mike Brant Jun 09 '14 at 20:56
  • Your `preg_match_all` code is working fine for me. I'm getting 2 anchors, the AAL, and the AAT. Make sure of the content you're getting from the `curl_exec`. – bloodyKnuckles Jun 09 '14 at 21:11
  • For your pattern, try `/(?<=>)(.{3})(?=<)/`. – OnlineCop Jun 09 '14 at 21:34
  • @bloodyKnuckles It works well with the sample code, but fails with the actual page code. Check out my answer. I redid the regex to work with the actual HTML content. – Giacomo1968 Jun 09 '14 at 23:15

2 Answers


You could use a DOMDocument object to build your array like this:

// $str is the HTML to parse, e.g. the string returned by curl_exec().
$doc = new DOMDocument();
$doc->loadHTML($str);

$matches = array();
foreach($doc->getElementsByTagName('a') as $a) {
    $text = $a->nodeValue;
    if(strlen($text) === 3) $matches[] = $text;
}

This will iterate through all of the anchor elements in your HTML string and build this array:

Array
(
    [0] => AAL
    [1] => AAT
)
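
As a rough sketch of a slightly stricter variant, the same idea can be combined with `DOMXPath` so that only links inside table cells are inspected. The `$html` sample is copied from the question, and the assumption that ticker links always contain `/karta_spolki/` in their `href` is mine, based only on that sample; `$tickers` is just an illustrative name.

```php
<?php
// Sample markup copied from the question; in practice this would be the curl response.
$html = <<<EOT
<table>
<tr>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
</tr>
</table>
EOT;

// Collect parser warnings instead of printing them; real pages are rarely valid HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$tickers = array();
// Assumption: the links of interest are those whose href contains "/karta_spolki/".
foreach ($xpath->query('//td/a[contains(@href, "/karta_spolki/")]') as $a) {
    $text = trim($a->nodeValue);
    // Keep only values made of exactly three uppercase letters or digits.
    if (preg_match('/^[0-9A-Z]{3}$/', $text)) {
        $tickers[] = $text;
    }
}
print_r($tickers);
```

The character-class check mirrors the `[0-9A-Z]{3}` part of the original pattern, so a three-letter lowercase link text would be skipped, unlike with the plain `strlen` test.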
Tom Fenech
  • `DOMDocument` is useful if one can actually **fetch** the content via a `curl` call. Check my answer. Core content on the page the original poster is trying to fetch is being loaded via AJAX. So the DOM structure is just an empty frame before the AJAX. Which means the regex will fail & DOM document will fail. – Giacomo1968 Jun 09 '14 at 22:06
  • @JakeGould fair enough, I didn't go as far as to visit the pages that were linked to as the apparent content was supplied in the question. It seems like you've spotted the cause of the problem but is there a solution? – Tom Fenech Jun 09 '14 at 22:19
  • So I thought it was AJAX at first. But then I believe I found the actual URL being used. And the regex indeed fails. So I created my own `curl` call & I redid the regex to work with the actual HTML content. Works now! – Giacomo1968 Jun 09 '14 at 23:17

I just tried your example & your regex works as expected with the small sample provided:

$subject = <<<EOT
<tr>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
EOT;

$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">[0-9A-Z]{3}</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);

echo '<pre>';
print_r($matches);
echo '</pre>';

The results:

Array
(
    [0] => Array
        (
            [0] => AAL
            [1] => AAT
        )

)
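
One thing worth noting: that pattern has no capture group, so `$matches[0]` actually holds the full matched markup (the whole `<td>…</td>` element), which a browser presumably renders as bare link text. A minimal sketch of isolating just the three characters, assuming the same sample input, is to wrap that part of the pattern in parentheses and read `$matches[1]`:

```php
<?php
$subject = <<<EOT
<tr>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
EOT;

// Same pattern as above, with parentheses around the three-character anchor text.
$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">([0-9A-Z]{3})</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);

// $matches[0] holds the full matched markup; $matches[1] holds only the captured text.
print_r($matches[1]);
```

With `PREG_PATTERN_ORDER`, index 1 of the result collects the first capture group across all matches, so no post-processing such as `strip_tags` is needed.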

But that said, I actually dug up what I believe is your source URL for the curl request, and it fails when I test it. So I adjusted the regex to this:

/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/is

With that pattern, things seem to work in my code, which attempts to recreate the curl request you are making:

// Set the URL.
$url="http://www.gpw.pl/lista_spolek_en";

// The actual curl request.
$curl_timeout = 5;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Note: these two options disable TLS certificate checks; avoid them outside of quick tests.
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $curl_timeout);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$subject = curl_exec($ch);
curl_close($ch);

// Set the regex pattern.
$pattern = '/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/is';

// Run the preg match all command with the regex pattern.
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);

// Return the results.
echo '<pre>';
print_r($matches);
echo '</pre>';

And the output from that looks correct to me:

Array
(
    [0] => Array
        (
            [0] => AAL
            [1] => AAT
            [2] => ABC
            [3] => ABE
            [4] => ABM
            [5] => ABS
            [6] => ACE
            [7] => ACG
            [8] => ACP
            [9] => ACS
            [10] => ACT
            [11] => ADS
            [12] => AGO
            [13] => AGT
            [14] => ALC
            [15] => ALM
            [16] => ALR
            [17] => ALT
            [18] => AMB
            [19] => AMC
            [20] => APL
            [21] => APN
            [22] => APT
            [23] => ARC
            [24] => ARR
            [25] => ASB
            [26] => ASE
            [27] => ASG
            [28] => AST
            [29] => ATC
            [30] => ATD
            [31] => ATG
            [32] => ATL
            [33] => ATM
            [34] => ATP
            [35] => ATR
            [36] => ATS
            [37] => AWB
            [38] => AWG
            [39] => EAT
            [40] => ACP
            [41] => ALR
            [42] => BZW
            [43] => EUR
            [44] => JSW
            [45] => KER
            [46] => KGH
            [47] => LPP
            [48] => LTS
            [49] => LWB
            [50] => MBK
            [51] => OPL
            [52] => PEO
            [53] => PGE
            [54] => PGN
            [55] => PKN
            [56] => PKO
            [57] => PZU
            [58] => SNS
            [59] => TPE
        )

)
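
Note that a few codes appear twice in that list (ACP and ALR, for example), presumably because the page shows them in more than one table. If each code is only needed once, a small sketch like the following could be applied to `$matches[0]`; the `$codes` array here is a shortened stand-in for it, and the variable names are only illustrative.

```php
<?php
// A shortened stand-in for $matches[0] from the output above.
$codes = array('AAL', 'ACP', 'ALR', 'ATS', 'ACP', 'ALR', 'BZW');

// array_unique() keeps the first occurrence of each value and preserves keys,
// so array_values() is used to reindex the result from zero.
$unique = array_values(array_unique($codes));
print_r($unique);
```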
Giacomo1968
  • Well done for getting it working but I still don't think that using a regular expression is the way to go here. If you want all anchor tags whose content is three characters long, you may as well ask for that directly. – Tom Fenech Jun 10 '14 at 08:06