
I am trying to get image URLs from a list of HTML URLs with the following curl/grep/sed combination (with wget I fail with a 403, but curl gets the source code correctly):

curl -K "C:\urls.txt" | "C:\GnuWin32\bin\grep.exe" -o '(http[^\s]+(jpg|png|webp)\b)' | sed 's/\?.*//' > imglinks.txt

But I get the error `The command "png" is either misspelled or could not be found`.

The regex should be correct: https://regex101.com/r/Qk6A0Z/1/

How could this code be improved?

Edit: the source code of a single URL from my list can be seen by running `curl https://watchbase.com/sellita`

The snippet from which I want to get the image URLs looks like:

<picture>
<source type="image/webp" data-srcset="https://cdn.watchbase.com/caliber/md/origin:png/sellita/sw200-1-bd.webp" srcset="https://assets.watchbase.com/img/FFFFFF-0.png" />
<img class="lazyload" data-src="https://cdn.watchbase.com/caliber/md/sellita/sw200-1-bd.png" src="https://assets.watchbase.com/img/FFFFFF-0.png" alt="Sellita caliber SW200-1"/>
</picture>

The expected output is a file with all image URLs, including those from data-src and data-srcset.
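For the snippet above, that means imglinks.txt should end up containing lines such as (all taken from the snippet itself):

https://cdn.watchbase.com/caliber/md/origin:png/sellita/sw200-1-bd.webp
https://assets.watchbase.com/img/FFFFFF-0.png
https://cdn.watchbase.com/caliber/md/sellita/sw200-1-bd.png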

Evgeniy
  • You are doing something very wrong; why use `curl -K 'C:\urls.txt' | grep -o pattern`? You can simply use `"C:\GnuWin32\bin\grep.exe" -oP "http[^?\s]+(jpg|png|webp)\b" "C:\urls.txt" > imglinks.txt` – Wiktor Stribiżew Jun 21 '21 at 15:07
  • That way I get only an empty file imglinks.txt. If I use a single URL instead of the file list, like `"C:\GnuWin32\bin\grep.exe" -oP "http[^?\s]+(jpg|png|webp)\b" https://watchbase.com/sellita > imglinks.txt`, I get `no such file or directory` – Evgeniy Jun 21 '21 at 15:20
  • Can you show the output of your `curl` command, and also your expected final output? – anubhava Jun 21 '21 at 15:27
  • Just tried, and `"C:\GnuWin32\bin\grep.exe" -oE "http[^?[:space:]]+(jpg|png|webp)\b" "C:\urls.txt"` works well. Same as `"C:\GnuWin32\bin\grep.exe" -oP "http[^?\s]+(jpg|png|webp)\b" "C:\urls.txt"` – Wiktor Stribiżew Jun 21 '21 at 15:31
  • At any rate, the `'png' is not recognized as an internal or external command, operable program or batch file` issue is due to the use of single quotation marks. Use double. – Wiktor Stribiżew Jun 21 '21 at 15:36

3 Answers


You may try this xargs+curl+grep pipeline:

xargs -n 1 curl < "C:\urls.txt" | "C:\GnuWin32\bin\grep.exe" -Eo "http[^[:blank:]?'\"]+(jpe?g|png|gif|bmp|ico|tiff|webp)\b" > imglinks.txt
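If the URLs in C:\urls.txt have Windows CRLF line endings, a stray \r can ride along into what xargs hands to curl. A minimal pre-processing sketch, assuming the GnuWin32 sed build is available:

sed "s/\r$//" "C:\urls.txt" | xargs -n 1 curl | "C:\GnuWin32\bin\grep.exe" -Eo "http[^[:blank:]?'\"]+(jpe?g|png|gif|bmp|ico|tiff|webp)\b" > imglinks.txt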
anubhava
  • This line worked: `xargs curl < "C:\urls.txt" | "C:\GnuWin32\bin\grep.exe" -Eo "http[^?[:space:]]+(jpg|png|webp)\b" > imglinks.txt`, but only 10 of the 33,000 URLs contained in the file `urls.txt` were processed. – Evgeniy Jun 21 '21 at 17:01
  • Sure, I have. From here: http://gnuwin32.sourceforge.net/packages/findutils.htm – Evgeniy Jun 21 '21 at 17:03
  • Both failed: `C:\GnuWin32\bin>sed 's/\r//$' "C:\urls.txt" | xargs curl | "C:\GnuWin32\bin\grep.exe" -Eo "http[^?[:space:]]+(jpg|png|webp)\b" > imglinks.txt` gives `sed: -e expression #1, char 1: unknown command: `''` and `curl: try "curl --help" for more information`, while `C:\GnuWin32\bin>sed "s/\r//$" "C:\urls.txt" | xargs curl | "C:\GnuWin32\bin\grep.exe" -Eo "http[^?[:space:]]+(jpg|png|webp)\b" > imglinks.txt` gives `sed: -e expression #1, char 7: unknown option f` and `curl: try "curl --help" for more information` – Evgeniy Jun 21 '21 at 17:06
  • `C:\GnuWin32\bin>curl -n 1 < "C:\urls.txt" | "C:\GnuWin32\bin\grep.exe" -Eo "http[^?[:space:]]+(jpg|png|webp)\b" > imglinks.txt` prints curl's progress meter and then `curl: (6) Could not resolve host: 1` – Evgeniy Jun 21 '21 at 17:15
  • Try this one: `xargs -n 1 curl < "C:\urls.txt" | "C:\GnuWin32\bin\grep.exe" -Eo "http[^?[:space:]]+(jpg|png|webp)\b" > imglinks.txt` – anubhava Jun 21 '21 at 17:16
  • It began to work, and worked longer than previously, but stopped after around 100 URLs :( :( :( oh my bad luck... – Evgeniy Jun 21 '21 at 17:22
  • It should work, as it runs one URL at a time. Maybe there is some resource issue on your Windows machine. – anubhava Jun 21 '21 at 17:25
  • To be exact, it stopped after 82 URLs. While running something like `wget.exe -i imagelist.txt` there were no issues at all. – Evgeniy Jun 21 '21 at 17:27
  • You can try to replace `curl` with `wget` – anubhava Jun 21 '21 at 17:29
  • using `wget` I get 403 :( – Evgeniy Jun 21 '21 at 17:30
  • So if you get 403 with `wget` then how come there are no issues with `wget -i`? – anubhava Jun 21 '21 at 17:31
  • Anyway, there is no restriction in the `xargs` command that would make it stop after 82 or 100 lines unless there are some external restrictions on that system. I have used `xargs` to process millions of lines. – anubhava Jun 21 '21 at 17:33
  • I was using `wget` to access image URLs directly. It seems the website prevents direct access to its HTML pages with `wget`, but `curl` can access them. – Evgeniy Jun 21 '21 at 17:36
  • `while` doesn't work on my Windows. I tested `xargs -n 1 curl < "C:\urls.txt" | "C:\GnuWin32\bin\grep.exe" -Eo "http[^?[:space:]]+(jpg|png|webp)\b" > imglinks.txt` on another machine in another network; it stopped again after exactly 82 URLs. – Evgeniy Jun 21 '21 at 18:02
  • Test it on a non-Windows machine if you can – anubhava Jun 21 '21 at 18:09

You can use

curl "https://watchbase.com/sellita"  | "C:\GnuWin32\bin\grep.exe" -oE "http[^?[:space:]]+(jpg|png|webp)\b"  > imglinks.txt

The `'png' is not recognized as an internal or external command, operable program or batch file` issue is due to the use of single quotation marks: cmd.exe does not treat single quotes as quoting characters, so the `|` characters inside `(jpg|png|webp)` are parsed as pipe operators and cmd tries to run `png` as a command. You should use double quotation marks with grep on Windows.

To read all URLs from a file and process them, you may use

FOR /F %i in (C:\urls.txt) DO curl %i | "C:\GnuWin32\bin\grep.exe" -oP "http[^?\s]+(jpg|png|webp)\b" >> imglinks.txt
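Note that the %i form works in an interactive cmd session only; inside a .bat file the loop variable must be doubled. A sketch of the batch-file equivalent:

FOR /F %%i in (C:\urls.txt) DO curl %%i | "C:\GnuWin32\bin\grep.exe" -oP "http[^?\s]+(jpg|png|webp)\b" >> imglinks.txt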
Wiktor Stribiżew
  • Yes, it works indeed! Is there a way to use a file with a URL list instead of a single URL? I tried `curl "C:\urls.txt"` with and without `-K`, but no luck either way... – Evgeniy Jun 21 '21 at 16:00
  • @Evgeniy I have already shared that: `"C:\GnuWin32\bin\grep.exe" -oE "http[^?[:space:]]+(jpg|png|webp)\b" "C:\urls.txt" > imglinks.txt`. You should pass the file path to `grep` directly. – Wiktor Stribiżew Jun 21 '21 at 16:04
  • This way I get only an empty `imglinks.txt` – Evgeniy Jun 21 '21 at 16:10
  • @Evgeniy What is inside `"C:\urls.txt"`? Is it the text containing the URLs to extract, or a list of Web pages that you need to scrape? – Wiktor Stribiżew Jun 21 '21 at 16:11
  • URLs like in your answer, separated with `\r\n` – Evgeniy Jun 21 '21 at 16:12
  • @Evgeniy You can read a file line by line using `FOR /F %i in (C:\urls.txt) DO your_command` – Wiktor Stribiżew Jun 21 '21 at 18:08
  • @Evgeniy So, did you try this in the `cmd`? Or, are you running it from a batch file? – Wiktor Stribiżew Jun 22 '21 at 09:16
  • I still struggle with the syntax; I can't get it to run from cmd – Evgeniy Jun 22 '21 at 09:18
  • @Evgeniy See [this demo screenshot](https://imgur.com/a/21FCqiR), it works fine. – Wiktor Stribiżew Jun 22 '21 at 09:29
  • For a single URL I've already managed it. Now my last problem is reading the URLs from a txt file and saving the image URLs into another txt file. – Evgeniy Jun 22 '21 at 09:34
  • @Evgeniy Ok, so that means the current question is answered, right? Now, you want to save the URLs you extract from certain Web pages into separate TXT files, don't you? Sorry, it is not clear what your problem is, I already posted how to read a TXT file line by line and extract the pattern matches into a TXT file. – Wiktor Stribiżew Jun 22 '21 at 09:35
  • Hmm, my initial question is about reading URLs from a txt file and writing the image URLs into another txt file. The problem hasn't changed during the discussion. Do you see it differently? – Evgeniy Jun 22 '21 at 10:06
  • @Evgeniy "into another txt file" - this is already solved. `FOR /F %i in (C:\1\1.txt) DO ("c:\Program Files\Git\mingw64\bin\curl.exe" %i | "c:\Program Files (x86)\GnuWin32\bin\grep.exe" -oP "http[^?\s]+(jpg|png|webp)\b" >> imglinks.txt)` (that I execute on my end) writes all the links found on the Web pages (listed in the `C:\1\1.txt` file) in the `imglinks.txt` file. – Wiktor Stribiżew Jun 22 '21 at 10:14
  • After pressing Enter with `FOR /F %i in (C:\urls.txt) DO ("C:\GnuWin32\bin\grep.exe" %i | "c:\Program Files (x86)\GnuWin32\bin\grep.exe" -oP "http[^?\s]+(jpg|png|webp)\b" >> imglinks.txt` I just get `More?`. It looks like in this screenshot: https://easycaptures.com/fs/uploaded/1473/9830351270.png – Evgeniy Jun 22 '21 at 11:11
  • @Evgeniy You are passing the `urls.txt` to grep, why? You say you have URLs there, so pass the `urls.txt` to **curl**. – Wiktor Stribiżew Jun 22 '21 at 11:13

It's really bad practice trying to parse HTML with RegEX! And to see senior members even encouraging this really makes me want to cry. This way the constant flood of these questions will never end.

To parse HTML, please use a proper HTML parser, such as xidel, which is used below!

<picture>
<source type="image/webp" data-srcset="https://cdn.watchbase.com/caliber/md/origin:png/sellita/sw200-1-bd.webp" srcset="https://assets.watchbase.com/img/FFFFFF-0.png" />
<img class="lazyload" data-src="https://cdn.watchbase.com/caliber/md/sellita/sw200-1-bd.png" src="https://assets.watchbase.com/img/FFFFFF-0.png" alt="Sellita caliber SW200-1"/>
</picture>

https://assets.watchbase.com/img/FFFFFF-0.png is just a 1-pixel white placeholder (for lazy loading) and appears in every single <picture>-node. So I'm going to assume you just want the attributes data-srcset and data-src.

xidel -s "https://watchbase.com/sellita" -e "//picture/(source/@data-srcset,img/@data-src)"

You can also use xidel (with just one invocation) to process the URLs you have in "C:\urls.txt" (assuming they all have the same <picture>-nodes as https://watchbase.com/sellita).

xidel -s "C:\urls.txt" -e "for $url in x:lines($raw) return doc($url)//picture/(source/@data-srcset,img/@data-src)" > imglinks.txt

or

xidel -se "for $url in file:read-text-lines('C:\urls.txt') return doc($url)//picture/(source/@data-srcset,img/@data-src)" > imglinks.txt

If your goal is to download all images from 'imglinks.txt', then xidel can do this too.

xidel -s "C:\urls.txt" -f "for $url in x:lines($raw) return doc($url)//picture/(source/@data-srcset,img/@data-src)" --download "."

or

xidel -s --xquery "for $url in file:read-text-lines('C:\urls.txt') for $img in doc($url)//picture/(source/@data-srcset,img/@data-src) return file:write-binary(tokenize($img,'/')[last()],string-to-base64Binary(x:request($img)/raw))"

Or, the same command spread over multiple lines in cmd with ^ line continuations:

xidel -s --xquery ^"^
  for $url in file:read-text-lines('C:\urls.txt')^
  for $img in doc($url)//picture/(source/@data-srcset,img/@data-src)^
  return^
  file:write-binary(^
    tokenize($img,'/')[last()],^
    string-to-base64Binary(x:request($img)/raw)^
  )^
"
Reino