Extracting IDs from txt file using regex and php

Question

I have spent over 2 hours trying to get this to work. I want to extract the values between ":" and ","eng_data& from a text file.

The txt is here: http://fdguirhgeruih.x10.mx/html.txt

The output should be a list of over 300 IDs, but I only get one when I run the script:

http://fdguirhgeruih.x10.mx/extract.php

    <?php

    // First, open the file. Change your filename.
    $file = "http://fdguirhgeruih.x10.mx/html.txt";
    $word1 = '&quot;:&quot;';
    $word2 = '&quot;,&quot;eng_data&';

    $contents = file_get_contents($file);

    // Takes the text from the first $word1 (inclusive) up to the first $word2 only.
    $between = substr($contents, strpos($contents, $word1), strpos($contents, $word2) - strpos($contents, $word1));

    echo $between;

    ?>

score 3 · Answer 1 · answered Oct 30 '11 at 23:41

This looks like a standard XML file. Use SimpleXML to parse it instead of a regexp.

Itay Moav -Malimovka

score 1 · Answer 2 · answered Oct 30 '11 at 23:45

The content is HTML, not XML as the first answer suggested. Use the Simple HTML DOM Parser.
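
As a rough sketch of the parser-based route, here is how it might look with PHP's built-in DOMDocument rather than the third-party Simple HTML DOM Parser. The structure of html.txt is not shown in the question, so the markup, the `data-info` attribute, the `id` key, and the `extractIds` helper are all assumptions made up for illustration. The `&quot;` delimiters in the question suggest entity-encoded JSON, which is what this sample imitates.

```php
<?php
// Made-up markup standing in for html.txt. The data-info attribute and the
// "id" key are assumptions for illustration only.
$html = '<div data-info="{&quot;id&quot;:&quot;111&quot;,&quot;eng_data&quot;:1}"><img src="a.png"></div>'
      . '<div data-info="{&quot;id&quot;:&quot;222&quot;,&quot;eng_data&quot;:2}"></div>';

// Hypothetical helper: returns the "id" value of every element that carries
// a data-info attribute holding JSON.
function extractIds(string $html): array
{
    // Collect parse warnings quietly instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $ids = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@data-info]') as $node) {
        // getAttribute returns the value with &quot; already decoded to ".
        $data = json_decode($node->getAttribute('data-info'), true);
        if (is_array($data) && isset($data['id'])) {
            $ids[] = $data['id'];
        }
    }
    return $ids;
}

print_r(extractIds($html));
```

With the sample above this collects 111 and 222. The HTML parser accepts the unclosed img tag, and because the entities are decoded by the parser, the values can be read with json_decode instead of string offsets.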

davidethell


+1, but the native PHP DOM library would be a better option. I've seen a lot of [negative reviews](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html-with-php/3577662#3577662) of Simple Html DOM Parser – Phil Oct 30 '11 at 23:50
Yes, native DOM could be better than the Simple HTML DOM Parser. I don't know how well it is maintained as I haven't needed it in a while. – davidethell Oct 30 '11 at 23:54
@itay not always true. XHTML is XML but if you look at his source document it has many invalid tags as far as XML is concerned. For example, the img tags have no closure as is required in valid XML. – davidethell Oct 31 '11 at 02:21
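
If a regex is still preferred, the script in the question returns one result because strpos only locates the first occurrence of each marker. A minimal sketch that collects every occurrence with preg_match_all could look like the following. The `extractBetween` helper and the sample string are made up for illustration, and the pattern assumes the IDs themselves never contain an `&`.

```php
<?php
// Hypothetical helper (not from the original post): collect every value that
// sits between the two markers instead of only the first one.
function extractBetween(string $contents, string $start, string $end): array
{
    // preg_quote escapes regex metacharacters in the literal markers.
    // [^&]+ assumes a value never contains "&", so a capture cannot run
    // across another &quot; entity into a neighbouring field.
    $pattern = '/' . preg_quote($start, '/') . '([^&]+)' . preg_quote($end, '/') . '/';
    preg_match_all($pattern, $contents, $matches);
    return $matches[1];
}

// Made-up sample input standing in for html.txt.
$sample = '{&quot;name&quot;:&quot;foo&quot;,&quot;id&quot;:&quot;111&quot;,&quot;eng_data&quot;:1} '
        . '{&quot;id&quot;:&quot;222&quot;,&quot;eng_data&quot;:2}';

$ids = extractBetween($sample, '&quot;:&quot;', '&quot;,&quot;eng_data&');
print_r($ids);
```

On this sample the result is 111 and 222. The `foo` value is skipped because it is not followed by the `eng_data` marker. For the real file, `$sample` would be replaced by the string returned from file_get_contents.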
