The question title says it all: after a bit of Googling and several days of tinkering with code, I cannot figure out how to download the plain text of a webpage.

Using strip_tags() removes the tags but still leaves the JavaScript and CSS code behind, and trying to clean that up with regex causes issues of its own.

Is there any way (simple or complicated) to download a webpage (say, a Wikipedia article) as plain text using PHP?

I downloaded the page using PHP's file_get_contents(), like this:

$homepage = file_get_contents('http://www.example.com/');

As I said, I tried strip_tags() and similar functions, but I can't get the plain text.

I've also tried using http://millkencode.googlecode.com/svn/trunk/htmlxtractor/ContentExtractor.php to get the main content, but it doesn't seem to work.

user115422

2 Answers

This is not nearly as easy as it seems. I'd recommend looking at something like PHP Simple HTML DOM Parser. JavaScript and CSS are hard to remove (and regex is not the right tool for parsing HTML), and even once they are gone there could still be inline styling and similar leftovers.

This, of course, depends on how complex the HTML is. strip_tags() could be sufficient in some cases.
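As an illustration of why a parser helps here, below is a minimal sketch that uses PHP's bundled DOM extension (DOMDocument) rather than Simple HTML DOM Parser, so it is a swapped-in technique and not the library recommended above. The helper name html_to_plain_text() and the sample HTML string are made up for the example. It assumes the DOM extension is enabled and that the input is non-empty.

```php
<?php
// Sketch: plain text from HTML using PHP's bundled DOM extension.
// html_to_plain_text() is a hypothetical helper name, not a library function.
function html_to_plain_text(string $html): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove elements whose contents are code, not page text.
    foreach (['script', 'style', 'noscript'] as $tag) {
        // Copy the live node list to an array before removing nodes from it.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    $root = $doc->documentElement;
    if ($root === null) {
        return '';
    }
    // textContent concatenates the remaining text nodes; collapse whitespace runs.
    return trim((string) preg_replace('/\s+/u', ' ', $root->textContent));
}

// Made-up sample input; a real page would come from file_get_contents().
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><p>Hello <b>world</b></p><script>alert(1);</script></body></html>';
echo html_to_plain_text($html), "\n";
```

Note that loadHTML() guesses the character encoding when the page does not declare one, so non-ASCII pages may need extra handling that this sketch leaves out.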

federico-t
  • I had that same link in the answer box and was about to hit Enter. Needless to say, I agree with this answer. – theftprevention Aug 03 '13 at 05:46
  • Right, that makes sense, that's usually where I get caught up. But isn't Simple HTML DOM Parser a way to manipulate the elements? How would I extract all of the content that the user sees in a webpage and store it in a variable? I have tried another plugin, if you would like me to add it to my question. Thanks for the answer! – user115422 Aug 03 '13 at 05:47
  • @user115422 I _think_ what you are looking for is echoing `file_get_html('http://www.example.com')->plaintext;`. (Of course, using the [Simple HTML DOM Parser](http://simplehtmldom.sourceforge.net/)). – federico-t Aug 03 '13 at 05:51
  • @campari `Fatal error: Call to a member function plaintext() on a non-object in /var/www/tests/dload-2.php on line 1` – user115422 Aug 03 '13 at 05:55
  • @user115422 Without the `()`. It's a variable, not a function (if I recall correctly). – federico-t Aug 03 '13 at 05:57
  • Aha! I found it. In fact, the answer was in the plugin I was trying to use all along (http://millkencode.googlecode.com/svn/trunk/htmlxtractor/ContentExtractor.php). The only problem was that the class was named incorrectly in the code, which I simply edited. It turns out that this plugin is capable of extracting the main body of the webpage, but there are still some simple tags to iron out. Would strip_tags suffice? – user115422 Aug 03 '13 at 05:59
  • @campari Notice: Trying to get property of non-object in /var/www/tests/dload-2.php on line 1 – user115422 Aug 03 '13 at 06:00
  • @user115422 Sorry to ask, but have you installed it correctly? – federico-t Aug 03 '13 at 06:02
  • @campari, my bad, I got confused with file_get_contents. But Yones got the answer down for me. I really appreciate your help though. Thanks again! – user115422 Aug 03 '13 at 06:06

Use this code:

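require_once 'simple_html_dom.php';

// Fetch the page and parse it into a DOM object.
$content = file_get_html('http://en.wikipedia.org/wiki/FYI');
// plaintext is a property (no parentheses) holding the element's text without tags.
$title = $content->find('#firstHeading', 0)->plaintext;
$text = $content->find('#bodyContent', 0)->plaintext;
echo $title . $text;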

http://simplehtmldom.sourceforge.net
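The "non-object" errors that come up in the comments are what happens when a lookup finds nothing and the code reads a property from the result anyway. The following is a rough sketch of the same select-by-id idea with a guard for missing elements. It uses PHP's bundled DOMDocument and DOMXPath instead of Simple HTML DOM Parser so that it runs without extra files. The helper name text_by_id() and the sample HTML are invented for illustration; only the ids firstHeading and bodyContent come from the code above.

```php
<?php
// Sketch: read the text of one element by id, returning null when it is missing.
// text_by_id() is a hypothetical helper name, not a library function.
function text_by_id(DOMDocument $doc, string $id): ?string
{
    $xpath = new DOMXPath($doc);
    // Assumption: $id is a simple literal; ids containing quote characters are not handled.
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Made-up stand-in for a downloaded page.
$html = '<html><body><h1 id="firstHeading">FYI</h1>'
      . '<div id="bodyContent"><p>Sample body text.</p></div></body></html>';

$doc = new DOMDocument();
// Collect parser warnings instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$title = text_by_id($doc, 'firstHeading');
$body = text_by_id($doc, 'bodyContent');
$missing = text_by_id($doc, 'noSuchId'); // null rather than a fatal error

echo ($title ?? '(no title)'), "\n", ($body ?? '(no body)'), "\n";
```

With Simple HTML DOM Parser the equivalent guard would be to check the result of find() before reading plaintext from it.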

ops
  • I've tried http://millkencode.googlecode.com/svn/trunk/htmlxtractor/ContentExtractor.php. Is there anything I can work on from there? The extractor doesn't work for me. – user115422 Aug 03 '13 at 05:52
  • Yep, that's what I'm looking for. I can finally start parsing the plain text of a webpage! Thanks a million! Also (this may be a bit cheap), does my question deserve a +1? I've been auto-blocked before because of a lack of positive votes, and I just don't want that happening again. – user115422 Aug 03 '13 at 06:05