Extract pdf from html using sed

Question

I am writing a bash script that extracts the links to PDF files from an HTML page and downloads them. Here is the pipeline that does the extraction:

curl -s https://info.uqam.ca/\~privat/INF1070/ |
    sed 's/.*href="//' |
    sed 's/".*//' |
    sed '/^[^\.]/d' |
    sed '/\.[^p][^d][^f]$/d' |
    sed '/^$/d' |
    sed '/\/$/d'

Result:

./07b-reseau.pdf
./07a-reseau.pdf
./06b-script.pdf
./06a-script.pdf
./05-processus.pdf
./04b-regex.pdf
./181-quiz1-g1-sujet.pdf
./03b-fichiers-solution.pdf
./04a-regex.pdf
./03d-fichiers.pdf
./03c-fichiers.pdf
./03b-fichiers.pdf
./03a-fichiers.pdf
./02-shell.pdf
./01-intro.pdf
./01-intro.pdf
./02-shell.pdf
./03a-fichiers.pdf
./03b-fichiers.pdf
./03b-fichiers-solution.pdf
./03c-fichiers.pdf
./03d-fichiers.pdf
./04a-regex.pdf
./04b-regex.pdf
./05-processus.pdf
./06a-script.pdf
./06b-script.pdf
./07a-reseau.pdf
./07b-reseau.pdf
./181-quiz1-g1-sujet.pdf

It works fine, but I was wondering whether there is a better way to do this (still using sed) with fewer sed commands.

Thank you.

Yes, there is. You could build a regex that does all those steps in one sed command. Whether that is worth doing is a cost/benefit call. - Nic3500, Dec 12 '18 at 23:45

score 1 · Answer 1 · answered Dec 13 '18 at 00:14

Your question comes down to "How to output only captured groups with sed?". This one-liner should do the trick for you:

curl -s https://info.uqam.ca/\~privat/INF1070/ | sed -rn 's/.*href="(.*\.pdf)".*$/\1/p'

which produces the desired output.

Here, -r enables extended regular expressions, so the capture group parentheses need no backslashes. The combination of the -n option (suppress automatic printing) and the p flag (print the line if a substitution was made) prints only the lines where the substitution takes place, based on the regex .*href="(.*\.pdf)".*$. The value of the href attribute (the capture group in parentheses) is back-referenced with \1, so the whole line is replaced with it.
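To try this without hitting the network, here is a minimal sketch that runs the same idea on a small sample. The HTML in `html` and the function name `extract_pdfs` are made up for illustration. Two deliberate variations from the one-liner above: -E is used as the spelling of -r, and the capture group uses [^"]* instead of .* so that it cannot run past the closing quote of the href.

```shell
# Hypothetical sample standing in for the curl output (not the real page).
html='<li><a href="./01-intro.pdf">Intro</a></li>
<li><a href="./labs/">Labs</a></li>
<li><a href="./02-shell.pdf">Shell</a></li>
<p>No link here</p>'

# -n suppresses automatic printing; the p flag prints only lines
# where the substitution succeeded; \1 is the captured href value.
extract_pdfs() {
  sed -En 's/.*href="([^"]*\.pdf)".*/\1/p'
}

single_result=$(printf '%s\n' "$html" | extract_pdfs)
printf '%s\n' "$single_result"
```

On this sample it prints ./01-intro.pdf and ./02-shell.pdf. Note that because the leading .* is greedy, a line containing several PDF links would yield only the last one, so this assumes one link per line.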

score 0 · Answer 2 · answered Dec 14 '18 at 10:27

This might work for you (GNU sed):

sed -r '/\n/!s/href="(\.[^"]*\.pdf)"/\n\1\n/g;/\`[^\n]*\.pdf$/MP;D' file

This puts each PDF link on its own line inside the pattern space (so a single input line can be split into several lines) and prints only those lines that end in .pdf.

potong
