I am trying to extract image URLs from a list of HTML page URLs with the following curl/grep/sed combination (wget fails with a 403, but curl fetches the source code correctly):
curl -K "C:\urls.txt" | "C:\GnuWin32\bin\grep.exe" -o '(http[^\s]+(jpg|png|webp)\b)' | sed 's/\?.*//' > imglinks.txt
But I get the error: The command "png" is either misspelled or could not be found.
The regex itself should be correct: https://regex101.com/r/Qk6A0Z/1/
How could this code be improved?
Edit: the source code of a single URL from my list can be seen by running curl https://watchbase.com/sellita
The snippet from which I want to get the image URLs looks like this:
<picture>
<source type="image/webp" data-srcset="https://cdn.watchbase.com/caliber/md/origin:png/sellita/sw200-1-bd.webp" srcset="https://assets.watchbase.com/img/FFFFFF-0.png" />
<img class="lazyload" data-src="https://cdn.watchbase.com/caliber/md/sellita/sw200-1-bd.png" src="https://assets.watchbase.com/img/FFFFFF-0.png" alt="Sellita caliber SW200-1"/>
</picture>
The expected output is a file with all image URLs, including those from data-src and data-srcset.
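To make the expected result concrete, here is a rough sketch in Python (not the curl/grep/sed pipeline above) of the extraction I am after, applied to the snippet. The names `SAMPLE_HTML`, `IMG_URL` and `extract_image_urls` are made up for illustration, and the pattern is only my approximation of the regex linked above; dropping duplicates is an assumption on my part.

```python
import re

# Sample markup copied from the snippet above.
SAMPLE_HTML = '''<picture>
<source type="image/webp" data-srcset="https://cdn.watchbase.com/caliber/md/origin:png/sellita/sw200-1-bd.webp" srcset="https://assets.watchbase.com/img/FFFFFF-0.png" />
<img class="lazyload" data-src="https://cdn.watchbase.com/caliber/md/sellita/sw200-1-bd.png" src="https://assets.watchbase.com/img/FFFFFF-0.png" alt="Sellita caliber SW200-1"/>
</picture>'''

# An http(s) URL that ends in .jpg, .png or .webp.
# Quotes and angle brackets are excluded so a match cannot run past the attribute value.
# The match stops at the extension, so any ?query suffix is left out.
IMG_URL = re.compile(r'https?://[^\s"\'<>]+?\.(?:jpg|png|webp)\b')


def extract_image_urls(html):
    """Return image URLs in order of first appearance, without duplicates."""
    urls = []
    for url in IMG_URL.findall(html):
        if url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    # One URL per line, the same layout I want in imglinks.txt.
    print("\n".join(extract_image_urls(SAMPLE_HTML)))
```

For the snippet above this should yield three distinct URLs: the .webp from data-srcset, the placeholder .png from srcset/src, and the .png from data-src.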