PHP - manage curl output

Question

Based on my last question, I sent a request to a website and it shows me the output. But the output shows the full website. I want to get only some data, like a link, from the curl output.

$url = 'http://site1.com/index.php';
$data = ["send" => "Test"];
$ch = curl_init($url);

curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch);
curl_close($ch);
var_dump($response);

This code shows me what I want, but the output contains the full website. I just want to get some data and show it in the output.

What do you mean by "I just want get some data"? Where do you filter the output by anything? – Nico Haase, Apr 01 '20 at 15:41
@vivek_23 The response shows me the whole website with the data. I just want the data, not the full website shown on screen. – William, Apr 01 '20 at 15:42
If the website is yours, you can create a separate endpoint that returns whatever you need (e.g. http://site1.com/index-curl.php). If the website is not yours, you will have to use a web scraping script to filter the response. The following link might help you write a scraper: https://stackoverflow.com/questions/9813273/web-scraping-in-php – Thushan, Apr 01 '20 at 15:42
@NicoHaase I don't know how to filter. The data that I want is in some HTML class. – William, Apr 01 '20 at 15:45
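
A minimal sketch of the scraping approach suggested in the comments above, using PHP's built-in DOM extension to pull links out of elements that carry a given class. The sample markup, the class name `result`, and the example.com URLs are assumptions made for illustration; the real page's markup is not shown in the question.

```php
<?php
// Hypothetical sample standing in for the curl response; the real markup is unknown.
$response = <<<HTML
<html><body>
<div class="menu"><a href="/home">Home</a></div>
<div class="result"><a href="https://example.com/file/1">First</a></div>
<div class="result extra"><a href="https://example.com/file/2">Second</a></div>
</body></html>
HTML;

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($response);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select href attributes of links inside any element whose class list
// contains the word "result" (an assumed class name).
$query = '//*[contains(concat(" ", normalize-space(@class), " "), " result ")]//a/@href';

$links = [];
foreach ($xpath->query($query) as $attr) {
    $links[] = $attr->nodeValue;
}

print_r($links);
```

With a real request, `$response` would be the string returned by `curl_exec()` once `CURLOPT_RETURNTRANSFER` is set on the same handle.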

S. Imp · Answer 1 · 2020-04-01T16:27:18.523

You can use preg_match_all and a carefully constructed pattern. This modified version of your code should give you a list of all the image urls in the HTML that you retrieve:

        $url = 'http://site1.com/index.php';
        $data = ["send" => "Test"];
        $ch = curl_init($url);

        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

        $response = curl_exec($ch);
        curl_close($ch);


        $matches = NULL;
        $pattern = '/<img[^>]+src=\"([^"]+)"[^>]*>/';
        $img_count = preg_match_all($pattern, $response, $matches);

        var_dump($matches[1]);

If you'd like to fetch all the links instead, you can change $pattern to this:

        $pattern = '/<a[^>]+href=\"([^"]+)"[^>]*>/';

I have tested this code on an html file that looks like this:

<html>
<body>
<div><img src="WANT-THIS"></div>
</body>
</html>

And the output is this:

array(1) {
  [0]=>
  string(9) "WANT-THIS"
}

EDIT 2: In response to additional questions from the OP, I have also tried the script on this html file:

<html>
<body>
<div1>CODE</div><div2>CODE</div><div3>CODE</div><div4>CODE</div><div5>CODE</div><div6>CODE</div><img src="IMAGE">
</body>
</html>

And it produces this result:

array(1) {
  [0]=>
  string(5) "IMAGE"
}

If this doesn't solve your problem, you'll need to provide additional detail -- either an example url that you are fetching, some HTML that you want to search, or extra detail about how you might know which image in the HTML you want to grab -- does it have some special id? Is it always the first image? The second image? Is there any characteristic by which we know which image to grab?
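
If the distinguishing characteristic turns out to be the image host, one option is to keep the same pattern and filter the matched URLs afterwards. The sketch below assumes the wanted image is served from imgur.com or one of its subdomains; the sample HTML and file names are hypothetical.

```php
<?php
// Hypothetical HTML standing in for the curl response.
$response = '<div><img src="https://site1.com/logo.png"></div>'
          . '<div><img src="https://i.imgur.com/abc123.png"></div>';

$pattern = '/<img[^>]+src="([^"]+)"[^>]*>/';
preg_match_all($pattern, $response, $matches);

// Keep only URLs whose host is imgur.com or a subdomain of it.
$imgur = array_values(array_filter($matches[1], function ($src) {
    $host = parse_url($src, PHP_URL_HOST);
    return is_string($host) && preg_match('/(^|\.)imgur\.com$/i', $host) === 1;
}));

var_dump($imgur);
```

Checking the parsed host rather than searching the whole URL for "imgur.com" avoids false positives when that text appears in a path or query string.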

edited Apr 01 '20 at 16:27

answered Apr 01 '20 at 15:51

S. Imp


I wouldn't recommend regex matching on HTML response. – nice_dev Apr 01 '20 at 15:54
why not? It gives you a list of images in the page. – S. Imp Apr 01 '20 at 15:54
See this https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – nice_dev Apr 01 '20 at 15:56
@vivek_23 Not sure that long and sanctimonious post deserves so many upvotes in the first place. Secondly, i'm not writing a 'parser' -- the objective is very limited here. – S. Imp Apr 01 '20 at 15:58
@S.Imp Thank you. It shows me some image links, but I can't find the image link that I want. The image is inside [...] and the URL is for: imgur – William Apr 01 '20 at 16:01
You should not use regex for this as it's a bad practice regardless of how big or small the objective is. Also, in your regex, you are expecting the src to be under double quotes, it could be under single quotes as well. Also, if I am not mistaken, it could collide with `data-src` attribute as well. – nice_dev Apr 01 '20 at 16:06
@William, you'll need to be more specific about your url or the HTML you are searching and the image you want to grab. – S. Imp Apr 01 '20 at 16:07
@vivek_23 you are correct that my regex assumes double quotes, but that's what the OP asked for. You are also correct that it might miss a div tag with data-src, but that is *not* what the OP asked for. It may be a 'bad practice' according to some, but regex can be very effective in extracting information from html, especially when the HTML is poorly formed. – S. Imp Apr 01 '20 at 16:09
@S.Imp I see you edited. What is the pattern for "[...]"? I want to test whether it shows me or not. – William Apr 01 '20 at 16:12
@William The code above is the code I used. I simply created an html file on my workstation and changed the url to point to that: `$url = 'http://localhost/file.html';` – S. Imp Apr 01 '20 at 16:14
@S.Imp This pattern only shows a tag without any div. The tag that I want is used after 6 [...]. I mean: CODE CODE CODE CODE CODE CODE – William Apr 01 '20 at 16:23
@William I have also tried the exact same code on your new HTML example and it should still grab IMAGE from your example. Perhaps you should inspect the HTML source you are searching more closely. The markup coming from the remote site may look very different than what you are asking for. It may not even use an IMG tag, as vivek_23 points out. – S. Imp Apr 01 '20 at 16:28
@S.Imp Thank you. I see at the end of [...] tag, [...] tag inside [...] tag, like this: [...]. Do you have any idea? – William Apr 01 '20 at 16:34
@S.Imp My bad. There are 6 [...] inside together. The full code is like this: [...]. – William Apr 01 '20 at 16:46
@William it shouldn't make any difference how many div tags are in there, whether they are nested inside each other or displayed in sequence. If it's "not working" then you probably need to supply the actual HTML you are searching and you need to explain what is "not working." – S. Imp Apr 01 '20 at 16:52
@S.Imp Thank you. 1. In your opinion, why didn't it work? 2. Can I search for a specific link? The image contains a specific URL: it is on the imgur.com website. Do you have any idea? – William Apr 01 '20 at 17:17
@William you haven't even explained what "didn't work" means in this case. Did you get an error? What was the output? Was it the wrong image? If you want a different image, can you tell us more about that image so we know how to distinguish them? What output did the script make when you ran it? Have you inspected the markup and verified that the image is, in fact, displayed with an IMG tag? Or is it displayed via CSS as a background image on a div? I cannot possibly speculate without more information. – S. Imp Apr 01 '20 at 19:21
@S.Imp The code I tried returns all the images of the website except the one I want. The image is inside [...]. Before the [...], there are also multiple div tags. Each div tag has a style like: [...] – William Apr 01 '20 at 19:42
@William as I've said a few times, you'll need to provide more precise detail about what HTML code you are searching. It may be that the image IMG tag is generated and loaded by Javascript. You should post the raw HTML retrieved by your curl statement (or supply the url so we can fetch it ourselves) and specifically identify which image you are trying to isolate. If that image is not easily distinguishable from the other images, you'll need to figure out some way to determine which image to grab. – S. Imp Apr 02 '20 at 02:02
@vivek_23 So you would prefer no solution than a solution that might be tweaked to work? – Ringo Apr 06 '20 at 18:16
@Ringo I would prefer a solution done in the right way. – nice_dev Apr 06 '20 at 21:42
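
The two regex concerns raised in the comments (single-quoted attributes and the collision with `data-src`) can be illustrated with a slightly more tolerant pattern. This is only a sketch on hypothetical sample markup; it is still a regex and will not cope with every form of HTML a parser would handle.

```php
<?php
// Hypothetical sample covering the cases raised in the comments.
$html = '<img src="double.png">'
      . "<img src='single.png'>"
      . '<img data-src="lazy.png" src="real.png">'
      . '<img data-src="only-lazy.png">';

// \s before "src" prevents a match inside "data-src";
// (["\']) captures the opening quote and \1 requires the same closing quote.
$pattern = '/<img\b[^>]*?\ssrc\s*=\s*(["\'])(.*?)\1[^>]*>/i';
preg_match_all($pattern, $html, $matches);

// Group 1 holds the quote character, group 2 the URL.
var_dump($matches[2]);
```

The last sample tag has only a `data-src` attribute, so it produces no match.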
