-1

I have a problem parsing words from an HTML table. I need to separate the words (the "lemma" column) from the other content:

The original version of the page in Russian - http://hsu.su/st2

English (Google Translate) - http://hsu.su/155

I have heard of PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/), but I cannot figure out how to solve this problem with it.

Bill the Lizard
user1103744

2 Answers

1
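<?php
    // Load the Simple HTML DOM library; adjust the path to where you copied it.
    include_once('simplehtmldom/simple_html_dom.php');

    $html = file_get_html('http://dict.ruslang.ru/freq.php?act=show&dic=freq_news_comp&title=%D1%EB%EE%E2%E0%F0%FC%20%E7%ED%E0%F7%E8%EC%EE%E9%20%E3%E0%E7%E5%F2%ED%EE-%ED%EE%E2%EE%F1%F2%ED%EE%E9%20%EB%E5%EA%F1%E8%EA%E8');
    if (!$html) {
        die("can't load page");
    }

    $myFile = "file.txt";
    $fh = fopen($myFile, 'w') or die("can't open file");

    // find() indexes from 0, so index 1 selects the second <table> on the page.
    $table = $html->find('table', 1);
    if (!$table) {
        die("can't find table");
    }

    // Write the text of every cell, one cell per line.
    foreach ($table->find('td') as $td) {
        fwrite($fh, $td->plaintext . PHP_EOL);
    }

    fclose($fh);
    ?>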

Download simplehtmldom from the link you provided.

Copy it into the same folder as the script.

Make sure the path included in the code points to the right file.

Create a file.txt file in the same folder (fopen() with mode 'w' will also create it if the folder is writable).

Then run the code.
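The code above writes every cell of the table. If only one column is wanted (the "lemma" column from the question), a rough sketch using PHP's built-in DOMDocument and DOMXPath could look like the following. The sample markup, the `extractColumn` helper and the column number 2 are assumptions made for illustration; check the real page to see which column actually holds the lemma.

```php
<?php
// Hypothetical sample markup; the real page's column order may differ.
$sample = '<table>'
    . '<tr><td>1</td><td>year</td><td>noun</td></tr>'
    . '<tr><td>2</td><td>say</td><td>verb</td></tr>'
    . '</table>';

/**
 * Return the trimmed text of one column (1-based) from every table row.
 * Illustrative helper, not part of any library.
 */
function extractColumn(string $html, int $column): array
{
    $doc = new DOMDocument();
    // Suppress the warnings that non-validating real-world HTML tends to trigger.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $words = [];
    // td[N] matches the N-th <td> child of each <tr>.
    foreach ($xpath->query("//table//tr/td[{$column}]") as $cell) {
        $words[] = trim($cell->textContent);
    }
    return $words;
}

// Assumption: the lemma is in the second column.
$lemmas = extractColumn($sample, 2);
print_r($lemmas);
```

Selecting by column position keeps the row structure, so the words stay separate from the rank and part-of-speech cells.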

The output will contain extra

 '&nbsp;'

entities, which you can remove with PHP string functions.
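One possible way to do that cleanup is with str_replace() and trim(). This is only a sketch: `cleanCell` is a made-up helper name, and the second search string assumes the text is UTF-8 (drop it if the page is in another encoding).

```php
<?php
/**
 * Replace non-breaking spaces with ordinary spaces and trim the result.
 * Illustrative helper; assumes UTF-8 input.
 */
function cleanCell(string $text): string
{
    // Cover both the literal entity and the decoded U+00A0 character
    // (bytes C2 A0 in UTF-8).
    $text = str_replace(['&nbsp;', "\xC2\xA0"], ' ', $text);
    return trim($text);
}

echo cleanCell('&nbsp;year&nbsp;'), "\n";
```

You would call it on each `$td->plaintext` value before writing it to the file.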

Rajat Singhal
-1

Check out the PHP function strip_tags().

Jeremy Harris
  • `strip_tags` will remove the tags. This would leave the OP still with the problem of how to get the data from the - now unstructured - text. – Gordon Jan 07 '12 at 15:31
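To illustrate the point in that comment, here is a small sketch on a made-up two-row table (the markup is an assumption, not taken from the real page). strip_tags() drops the cell boundaries along with the tags, so the values run together.

```php
<?php
// Hypothetical two-row table, used only for illustration.
$sample = '<table><tr><td>1</td><td>year</td></tr>'
    . '<tr><td>2</td><td>say</td></tr></table>';

// All tags are removed, and with them the row and column boundaries.
$flat = strip_tags($sample);
var_dump($flat);
```

After this there is no reliable way to tell where one cell ends and the next begins, which is why a DOM parser is the better fit for this task.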