Starting from this html page:

https://www.sports-reference.com/olympics/summer/1896/ATH/

I'm trying to get some information with the following script:

<?php
include_once ('C:\moduli\simple_html_dom.php');


function getTextBetweenTags($url, $tagname) {
    $values = array();
    $html = file_get_html($url);
    foreach($html->find($tagname) as $tag) {

        //echo $tag;

        foreach($tag->find('a') as $a) {

            //echo $a;

            $values[] = $a->innertext. '<br>';
            //echo $values[0];

        }
        print_r ($values);
        unset($values);
    }

    //$result=explode("'s",$values[0]);
    //array_pop($result);
    //return $result;

}

$output = getTextBetweenTags('https://www.sports-reference.com/olympics/summer/1896/ATH/', 'tr  class=""');
//echo '<pre>';

?>

What I get from the print_r array inside the loop is the following (only first rows):

Array ( ) Array ( [0] => Men's 100 metres
[1] => Tom Burke
[2] => Fritz Hofmann
[3] => Alajos Szokoly
[4] => Frank Lane
) Array ( [0] => Men's 400 metres
[1] => Tom Burke
[2] => Herbert Jamison
[3] => Charles Gmelin
) Array ( [0] => Men's 800 metres
[1] => Teddy Flack
[2] => Nándor Dáni
[3] => Dimitrios Golemis
) Array ( [0] => Men's 1,500 metres
[1] => Teddy Flack
[2] => Arthur C. Blake
[3] => Albin Lermusiaux

I'd like to store the following in separate variables (for example, for the 100 metres):

100 metres
Men
Tom Burke
USA --> (this one taken from "alt" attribute inside html)
Gold --> (static parameter for the first athlete)

then reset everything and, on the second pass of the loop, get:

100 metres
Men
Fritz Hofmann
GER --> (this one taken from "alt" attribute inside html)
Silver --> (static parameter for the second athlete)

The last two athletes both won bronze, so for them I'd like to get:

100 metres
Men
Alajos Szokoly
HUN --> (this one taken from "alt" attribute inside html)
Bronze --> (static parameter for the third athlete)

and

100 metres
Men
Frank Lane
USA --> (this one taken from "alt" attribute inside html)
Bronze --> (static parameter for the fourth athlete)

The last two athletes are recognizable because in the HTML they share the same td element with the align="left" attribute.

How can I get that? Thank you.

Idro

1 Answer

This should work for you:

function getTextBetweenTags($url, $tagname)
{
    $values = array();              // all rows, returned to the caller
    $html = file_get_html($url);
    foreach ($html->find($tagname) as $tag)
    {
        $row = array();             // cells of the current table row
        foreach ($tag->find('td') as $td)
        {
            $a_tags = $td->find('a');
            if (count($a_tags) == 0)
            {
                // No link in this cell, e.g. a medal that was not awarded
                $val = "";
            }
            elseif (count($a_tags) == 1)
            {
                $val = $a_tags[0]->innertext . '<br>';
            }
            else
            {
                // Several links in one cell, e.g. a tie for a medal
                $val = array();
                foreach ($a_tags as $a)
                {
                    $val[] = $a->innertext . '<br>';
                }
            }
            $row[] = $val;
        }
        print_r($row);
        $values[] = $row;
    }
    return $values;
}

This outputs the array in this format:

Array
(
    [0] => Men's 100 metres<br>
    [1] => Tom Burke<br>
    [2] => Fritz Hofmann<br>
    [3] => Array
        (
            [0] => Alajos Szokoly<br>
            [1] => Frank Lane<br>
        )

)
Array
(
    [0] => Men's 400 metres<br>
    [1] => Tom Burke<br>
    [2] => Herbert Jamison<br>
    [3] => Charles Gmelin<br>
)
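
To get from one of these row arrays to one record per athlete with a medal attached, something along the following lines might work. This is only a sketch: `rowToMedals` is a hypothetical helper name, and it assumes that index 0 holds the event and indexes 1 to 3 hold the gold, silver and bronze cells in that order, as in the output above. It does not read the country code from the `alt` attribute.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): flattens one row array
// into one record per athlete. Assumes index 0 is the event and indexes
// 1, 2, 3 are the gold, silver and bronze cells.
function rowToMedals(array $row): array
{
    $medals = array(1 => 'Gold', 2 => 'Silver', 3 => 'Bronze');

    // Strip the '<br>' suffix that the scraping code appends to each value.
    $clean = function ($s) {
        return trim(str_replace('<br>', '', $s));
    };

    $event = $clean($row[0]);
    $records = array();
    foreach ($medals as $i => $medal) {
        if (!isset($row[$i]) || $row[$i] === '') {
            continue; // empty cell: this medal was not awarded
        }
        // A tie is stored as a nested array, so treat every cell as a list.
        foreach ((array) $row[$i] as $name) {
            $records[] = array(
                'event'   => $event,
                'athlete' => $clean($name),
                'medal'   => $medal,
            );
        }
    }
    return $records;
}

$row = array(
    "Men's 100 metres<br>",
    'Tom Burke<br>',
    'Fritz Hofmann<br>',
    array('Alajos Szokoly<br>', 'Frank Lane<br>'),
);
print_r(rowToMedals($row));
```

Casting each cell with `(array)` lets a single name and a tie go through the same loop, so both bronze medallists end up with the `Bronze` label.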
Amit Joshi
  • It's ok but there is a particular case in which it does not work. When an athlete did not win a medal, I get the next event as the third element of the previous array, whereas it should start a new array. – Idro Aug 10 '17 at 20:49
  • Can you give example of that HTML? – Amit Joshi Aug 10 '17 at 21:02
  • The HTML is the same as above. I mean, if you look at the output, the Men's 110 metres Hurdles event has only two athletes, who won the gold and silver medals; no one won bronze. Well, the following event, Men's High Jump, starts at the third element of the Men's 110 metres Hurdles array and not as a new array like the other ones. I hope my explanation is clear. – Idro Aug 10 '17 at 21:15
  • The code works just fine for the situation you described. Check out the array it creates here: https://prnt.sc/g72aox . You can see that a new array is created for each row, and if there is no entry for a medal, there is an empty array element for it. For example, for `Men's 110 metres Hurdles`, `$array[3]` is empty, then the next array begins with `Men's Pole Vault` in its 0 position. Each array begins with the name of the sport. – Amit Joshi Aug 11 '17 at 03:09
  • Yes, you're right, I'm sorry. I indented the output wrongly. One more thing: since the 0 position is always something like Men's Pole Vault or Men's 110 metres Hurdles, is it possible to explode it and create a sub-array composed of two elements: the first being, for example, Pole Vault or 110 metres Hurdles, and the second being Men? – Idro Aug 11 '17 at 04:44
  • @AmitJoshi Please take the time to explain your code only answer. The OP is not the only person who will read this page. Please do your best to educate future SO readers about your approach. – mickmackusa Aug 20 '17 at 13:00
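
For the follow-up about splitting the element at position 0 into the event and the gender, a minimal sketch could look like the one below. `splitEvent` is a hypothetical helper name. It assumes the label has the form `Men's <event>`, optionally followed by the `<br>` suffix added by the code above, and that the apostrophe is either a plain ASCII one or an HTML entity.

```php
<?php
// Hypothetical helper: splits a label such as "Men's Pole Vault<br>" into
// the event name and the gender. Assumes the form "<gender>'s <event>".
function splitEvent(string $label): array
{
    // Drop the '<br>' suffix and decode entities such as &#39; if present.
    $label = trim(str_replace('<br>', '', $label));
    $label = html_entity_decode($label, ENT_QUOTES);

    // Split on the first "'s " only, so the rest of the event name stays intact.
    $parts = explode("'s ", $label, 2);
    if (count($parts) < 2) {
        // Label does not match the assumed form: keep it whole as the event.
        return array('event' => $label, 'gender' => '');
    }
    return array('event' => $parts[1], 'gender' => $parts[0]);
}

print_r(splitEvent("Men's Pole Vault<br>"));
print_r(splitEvent("Men's 110 metres Hurdles<br>"));
```

Passing a limit of 2 to `explode` keeps the split to the first separator, which matters if an event name ever contains `'s ` itself.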