I've succeeded in scraping images from direct Twitter links using the Simple HTML DOM library.
Eg1: 'https://twitter.com/{userName}'
Eg2: 'https://twitter.com/{userName}/media'
The problem arises with Twitter API links:
Eg: 'https://twitter.com/i/profiles/show/{userName}/timeline'
The last link returns JSON, and I couldn't figure out how to retrieve the data from it. I tried this:
include_once('simple/simple_html_dom.php');
header('Content-type: application/json');
$html = file_get_html('https://twitter.com/i/profiles/show/{userName}/timeline');
echo $html;
That outputs the JSON. Now I need to retrieve the value of every attribute of the form
'data-image-url="img_link"'
but the Simple HTML DOM call
$html->find('[data-image-url]')
doesn't work.
I also experimented with DOMDocument, like this:
$dom = new DOMDocument();
$dom->loadHTMLFile('https://twitter.com/i/profiles/show/{userName}/timeline');
$str = $dom->saveHTML();
preg_match_all('/(data-image-url=)/', $str, $matches);
$imgs = [];
foreach ($matches[1] as $key) {
    array_push($imgs, $key);
}
echo count($imgs);
This works (it counts the occurrences), but changing the regex to capture the image URL doesn't:
'/data-image-url=\\"([^\"]+)/'
I tried more variations of the regex, but none of them work, maybe because of the line breaks and escaped quotes. The string looks like this:
e-photo \"\n data-image-url=\"https:\/\/pbs.twimg.com\/media\/DzTW5ooWkAI7vQm.jpg\"\n \n \n data-element-
Is there another way to achieve what I am trying to do? Or how can I modify the regex to get the links?
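One direction I have only sketched and not verified against the live endpoint: since the backslashes in the string above look like JSON escaping, decoding the JSON first and then parsing the embedded HTML might sidestep the regex problem entirely. In the sketch below, the function name extractImageUrls, the sample string, and especially the 'items_html' key are assumptions for illustration; the actual key holding the HTML in the response may differ.

```php
<?php
// Minimal sketch: decode the JSON, then read data-image-url attributes
// from the HTML fragment it contains.
// ASSUMPTION: the HTML sits under a top-level key; 'items_html' is a guess.
function extractImageUrls(string $json, string $htmlKey = 'items_html'): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data[$htmlKey])
        || !is_string($data[$htmlKey]) || $data[$htmlKey] === '') {
        return []; // not JSON, or the assumed key is missing or empty
    }

    // Suppress warnings about the fragment's non-standard markup.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($data[$htmlKey]);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every element that carries a data-image-url attribute.
    $xpath = new DOMXPath($dom);
    $urls = [];
    foreach ($xpath->query('//*[@data-image-url]') as $node) {
        $urls[] = $node->getAttribute('data-image-url');
    }
    return $urls;
}

// Hypothetical sample shaped like the excerpt above (not a real response).
$sample = '{"items_html":"<div class=\"photo\"\n data-image-url=\"https:\/\/pbs.twimg.com\/media\/DzTW5ooWkAI7vQm.jpg\"\n><\/div>"}';

print_r(extractImageUrls($sample));
```

Because json_decode turns \" into " and \/ into /, the attribute value should come out as a plain URL with no further unescaping. With the real endpoint, the JSON string would presumably come from something like file_get_contents() rather than from a hard-coded sample.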