
I've succeeded in scraping images from direct Twitter links using the Simple HTML DOM library.

Eg1: 'https://twitter.com/{userName}' 
Eg2: 'https://twitter.com/{userName}/media'

The problem arises with Twitter API links:

Eg: 'https://twitter.com/i/profiles/show/{userName}/timeline'

The last link returns JSON, and I couldn't figure out how to retrieve the data from it. I tried this:

include_once('simple/simple_html_dom.php');
header('Content-type: application/json');
$html = file_get_html('https://twitter.com/i/profiles/show/{userName}/timeline'); 
echo $html;

That prints the JSON. Now I need to retrieve the image links from attributes of the form

'data-image-url="img_link"'

The Simple HTML DOM call $html->find('[data-image-url]') doesn't work on this response.

I also experimented with DOMDocument, like this:

$dom = new DOMDocument();
$dom->loadHTMLFile('https://twitter.com/i/profiles/show/{userName}/timeline');
$str = $dom->saveHTML();
preg_match_all('/(data-image-url=)/', $str, $matches);
$imgs = [];
foreach ($matches[1] as $key) {
   array_push($imgs, $key);
}
echo count($imgs);

This works and counts the occurrences, but changing the regex to capture the image URL doesn't work:

'/data-image-url=\\"([^\"]+)/'

I tried more variations of the regex, but none of them work, maybe because of the line breaks and escaped quotes. The string looks like this:

e-photo \"\n data-image-url=\"https:\/\/pbs.twimg.com\/media\/DzTW5ooWkAI7vQm.jpg\"\n \n \n data-element-

Is there another way to achieve what I am trying to do? Or how can I modify the regex to get the links?
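
One possible way around, as a minimal sketch under assumptions: the backslashes in the sample (\" and \/) suggest the markup is still JSON-escaped, so a pattern that expects a plain quote right after data-image-url= would not match it. Decoding the JSON first and running the regex on the decoded HTML might avoid that. The key name items_html, the sample payload (modelled on the excerpt above, with a made-up class name), and the helper extractImageUrls are assumptions for illustration, not confirmed details of the endpoint.

```php
<?php
// Hypothetical helper: decode the JSON first, then match on the unescaped HTML.
// The key "items_html" is an assumption about where the markup lives in the response.
function extractImageUrls(string $json, string $key = 'items_html'): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data[$key]) || !is_string($data[$key])) {
        return []; // not the payload shape this sketch expects
    }

    // After json_decode() the \" and \/ escapes are gone, so a plain pattern works.
    preg_match_all('/data-image-url="([^"]+)"/', $data[$key], $matches);

    return $matches[1];
}

// Sample payload modelled on the excerpt in the question (class name is made up).
// In a single-quoted PHP string, \" \n and \/ stay as literal JSON escapes.
$json = '{"items_html":"<div class=\"sample-photo \"\n data-image-url=\"https:\/\/pbs.twimg.com\/media\/DzTW5ooWkAI7vQm.jpg\"\n><\/div>"}';

$imgs = extractImageUrls($json);
print_r($imgs);
```

With a live response, the string passed in would come from something like file_get_contents() on the timeline URL rather than from file_get_html() or DOMDocument::loadHTMLFile(), since the body is JSON and not an HTML document.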

Jeeva Raam
  • You know Twitter offers real APIs so you don't have to try and scrape HTML? – Phil Feb 14 '19 at 07:12
  • Isn't the link `https://twitter.com/i/profiles/show/{userName}/timeline` a real Twitter API? I know JSON, and I have got hold of the string. It's the regex I am struggling with to retrieve the image links. – Jeeva Raam Feb 14 '19 at 09:39
