11

So, I'm working on a PHP script, and part of it needs to be able to query a website, then get text from it.

First, I need to request a specific website URL; then I need to extract text from the response and return that text from the function.

How would I query the website and get the text from it?

Alper

7 Answers

14

The easiest way:

file_get_contents()

That will get you the source of the web page.

You probably want something more complete, though, so look into cURL for better error handling, setting the user agent, and similar options.

From there, if you want the text only, you are going to have to parse the page. For that, see: How do you parse and process HTML/XML in PHP?
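As a rough illustration of the "parse the page" step, here is a minimal sketch that reduces an HTML string to its visible text with PHP's built-in DOM extension. The helper name `html_to_text` is made up for this example, and the sketch assumes the markup is ASCII or declares its own charset (without a declaration, `loadHTML()` falls back to ISO-8859-1).

```php
<?php
// Hypothetical helper (not from the answer): reduce an HTML string to its visible text.
function html_to_text(string $html): string
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove <script> and <style> elements, whose contents are code rather than page text.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    $root = $doc->documentElement;
    if ($root === null) {
        return '';
    }
    // textContent concatenates all remaining text nodes; collapse runs of whitespace.
    return trim(preg_replace('/\s+/', ' ', $root->textContent));
}

// Usage (performs a network request; the URL is a placeholder):
// $html = file_get_contents('http://www.example.org');
// if ($html !== false) { echo html_to_text($html); }
```

Keeping the fetch and the parse in separate steps means the fetching part can later be swapped for cURL without touching the text extraction.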

Brad
9

I would do a DOM search; take a look at http://www.php.net/manual/es/domdocument.load.php. DOMXPath might be very useful too: http://php.net/manual/en/class.domxpath.php

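$doc = new DOMDocument;
// loadHTMLFile() parses HTML; load() expects well-formed XML and fails on most web pages.
libxml_use_internal_errors(true); // keep warnings about sloppy markup out of the output
$doc->loadHTMLFile("http://mysite.com");
$xpath = new DOMXPath($doc);
// "//" searches the whole document for the div with this id.
$elements = $xpath->query("//div[@id='yourTagIdHere']");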
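The query only returns a list of nodes, so one more step is needed to get text out of it. The following sketch shows one way to do that; the helper name `xpath_texts` and the sample markup are invented for illustration, and it parses an in-memory string so that it runs without a network connection.

```php
<?php
// Hypothetical helper: return the trimmed text of every node matched by an XPath expression.
function xpath_texts(string $html, string $expression): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($expression);
    if ($nodes === false) {
        throw new InvalidArgumentException("Invalid XPath expression: $expression");
    }

    $texts = [];
    foreach ($nodes as $node) {
        // textContent includes the text of all descendant elements.
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Sample markup, made up for this example.
$html = '<html><body><div id="intro"> Hello </div><div id="other">Bye</div></body></html>';
print_r(xpath_texts($html, "//div[@id='intro']"));
```

For a live page, the string passed in would come from `file_get_contents()` or cURL.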
Erick Martinez
0

Can this be done by fetching all of the content from the webpage with the methods already listed above, and then using a regex to remove all characters between opening and closing angle brackets?

A page that looks like this:

<html><style> h1 { font-style:... }</style><h1>stuff in here</h1></html>

Would then become this after regex:

h1 { font-style:... }stuff in here

And because we also want to remove the contents of certain elements, such as <style>, we would first use a regex to remove all characters between <style> and </style>, and only then strip the remaining tags, so that we are just left with:

stuff in here

Would this work then? Please reply if you think it would or if you foresee errors as I would like to create a tool with this parsing.
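A minimal sketch of that two-pass idea in PHP, assuming reasonably simple markup (the helper name `strip_markup` is invented here). It should handle the example above, but regular expressions are fragile on real pages: a `>` inside an attribute value, HTML comments, or unclosed tags can all trip it up, which is why the DOM-based answers are usually the safer route. PHP also ships a built-in `strip_tags()` that covers the second pass.

```php
<?php
// Hypothetical helper implementing the two-pass idea with regular expressions.
function strip_markup(string $html): string
{
    // Pass 1: remove <style> and <script> elements together with their contents.
    $text = preg_replace('#<(style|script)\b[^>]*>.*?</\1\s*>#is', '', $html);
    // Pass 2: remove every remaining tag, i.e. anything between < and >.
    $text = preg_replace('/<[^>]*>/', '', $text);
    // Decode entities such as &amp; and tidy up the whitespace.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Prints: stuff in here
echo strip_markup('<html><style> h1 { font-style: italic; }</style><h1>stuff in here</h1></html>');
```

The order of the passes matters: stripping the tags first would leave the CSS rules behind as plain text, exactly as in the intermediate result shown above.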

Michael d
0

You can use file_get_contents or, if you need a little more control (e.g. to submit POST requests or to set the user agent string), you may want to look at cURL.

file_get_contents Example:

$content = file_get_contents('http://www.example.org');

Basic cURL Example:

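$ch = curl_init('http://www.example.org');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3');
// Without CURLOPT_RETURNTRANSFER, curl_exec() prints the response and returns true.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$content = curl_exec($ch);

curl_close($ch);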
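Since the main reason to pick cURL here is the extra control, the following sketch wraps the request in a function with basic error handling. The function name `fetch_with_curl`, the timeout values, and the `MyScript/1.0` user agent are placeholders chosen for this example, not anything the answer prescribes.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and return the body, or throw on failure.
function fetch_with_curl(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 5,               // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole request
        CURLOPT_USERAGENT      => 'MyScript/1.0',  // placeholder user agent
    ]);

    $body   = curl_exec($ch);
    $error  = curl_error($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // The handle is released automatically when $ch goes out of scope.

    if ($body === false) {
        throw new RuntimeException("cURL request failed: $error");
    }
    if ($status >= 400) {
        throw new RuntimeException("Server responded with HTTP status $status");
    }
    return $body;
}

// Usage (performs a network request; the URL is a placeholder):
// $content = fetch_with_curl('http://www.example.org');
```

Throwing an exception is only one possible design; returning `null` or `false` on failure would work too, as long as the caller checks for it.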
Francois Deschenes
0

If you have cURL installed, use it. Otherwise:

$website = file_get_contents('http://google.com');

Then you need to search through the string for the text you want. How you do that depends on the website, and the text you're trying to read.
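As one illustration of "searching through the string", the sketch below pulls out whatever sits between two known markers using plain string functions. The helper name `extract_between` and the sample page are made up; which markers to use depends entirely on the site being read.

```php
<?php
// Hypothetical helper: return the text between the first $start marker
// and the next $end marker, or null if either marker is missing.
function extract_between(string $haystack, string $start, string $end): ?string
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null; // start marker not found
    }
    $from += strlen($start); // skip past the start marker itself

    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return null; // end marker not found after the start marker
    }
    return substr($haystack, $from, $to - $from);
}

// Sample page, made up for this example.
$page = '<html><head><title>Example Domain</title></head><body>...</body></html>';
var_dump(extract_between($page, '<title>', '</title>'));
```

This works well when the surrounding markup is stable; if the page layout changes often, an HTML parser tends to be less brittle.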

Paul
0

You need to use cURL. You can get some samples here.

TheTechGuy
0

If you want more control, use cURL. Otherwise, use file_get_contents:

$url  = "http://www.example.com/test.php";  // Site URL.
$site = file_get_contents($url);             // Gets site response.
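Note that `file_get_contents()` returns `false` when the request fails, and it can only open URLs when `allow_url_fopen` is enabled in the PHP configuration. A slightly more defensive sketch, with an invented function name `fetch_page` and placeholder timeout and user agent values, might look like this:

```php
<?php
// Hypothetical helper: fetch a URL with file_get_contents() and fail loudly on errors.
function fetch_page(string $url, int $timeoutSeconds = 10): string
{
    // The stream context carries HTTP options; these values are placeholders.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'MyScript/1.0',
        ],
    ]);

    // @ suppresses the PHP warning so the failure is reported as an exception instead.
    $site = @file_get_contents($url, false, $context);
    if ($site === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $site;
}

// Usage (performs a network request; the URL is a placeholder):
// $site = fetch_page('http://www.example.com/test.php');
```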
Mingle