I am writing a bash script that extracts the PDF links from an HTML page and downloads the files. Here is the code that does the extraction:

    curl -s https://info.uqam.ca/\~privat/INF1070/ |
        sed 's/.*href="//' |
        sed 's/".*//' |
        sed '/^[^\.]/d' |
        sed '/\.[^p][^d][^f]$/d' |
        sed '/^$/d' |
        sed '/\/$/d'

Result:

./07b-reseau.pdf
./07a-reseau.pdf
./06b-script.pdf
./06a-script.pdf
./05-processus.pdf
./04b-regex.pdf
./181-quiz1-g1-sujet.pdf
./03b-fichiers-solution.pdf
./04a-regex.pdf
./03d-fichiers.pdf
./03c-fichiers.pdf
./03b-fichiers.pdf
./03a-fichiers.pdf
./02-shell.pdf
./01-intro.pdf
./01-intro.pdf
./02-shell.pdf
./03a-fichiers.pdf
./03b-fichiers.pdf
./03b-fichiers-solution.pdf
./03c-fichiers.pdf
./03d-fichiers.pdf
./04a-regex.pdf
./04b-regex.pdf
./05-processus.pdf
./06a-script.pdf
./06b-script.pdf
./07a-reseau.pdf
./07b-reseau.pdf
./181-quiz1-g1-sujet.pdf

It works fine, but I was wondering whether there is a better way (still using sed) to do this with fewer sed commands.

Thank you.

  • Yes, there is. You could build a regex that does all those steps in a single sed command. Whether it is worth doing comes down to cost versus benefit. – Nic3500 Dec 12 '18 at 23:45

2 Answers

Your question boils down to "How to output only captured groups with sed?". This one-liner should do the trick:

curl -s https://info.uqam.ca/\~privat/INF1070/ | sed -rn 's/.*href="(.*\.pdf)".*$/\1/p'

which produces the desired output.

The -n option suppresses sed's automatic printing, and the p flag prints the pattern space only when the substitution succeeds, so together they output only the lines that match the regex .*href="(.*\.pdf)".*$. The capture group in parentheses holds the value of the href attribute and is back-referenced with \1, so the whole line is replaced by that value.
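To try the command without touching the network, here is a minimal sketch that feeds it a made-up HTML snippet. The sample markup and the function name extract_pdf_links are assumptions for illustration only; the real page may be laid out differently. It also assumes a sed that accepts -r (GNU sed does).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads HTML on stdin and prints every href value
# that ends in .pdf (at most one per input line).
extract_pdf_links() {
    sed -rn 's/.*href="(.*\.pdf)".*$/\1/p'
}

# Made-up sample input, one link per line.
sample_html='<ul>
<li><a href="./01-intro.pdf">Intro</a></li>
<li><a href="./02-shell.pdf">Shell</a></li>
<li><a href="./tp/">Labs</a></li>
<li><a href="index.html">Home</a></li>
</ul>'

# Prints ./01-intro.pdf and ./02-shell.pdf, one per line.
printf '%s\n' "$sample_html" | extract_pdf_links
```

Because the leading .* is greedy, a line that holds several links yields only the last one, so this approach fits pages that put one link on each line.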

marcell

This might work for you (GNU sed):

sed -r '/\n/!s/href="(\.[^"]*\.pdf)"/\n\1\n/g;/\`[^\n]*\.pdf$/MP;D' file

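On the first pass over an input line, the substitution surrounds every href value that ends in .pdf with newlines, so each link sits on its own line inside the pattern space. P then prints the first line of the pattern space only if it ends in .pdf, and D deletes that first line and restarts the script on what remains without reading new input. The /\n/! guard keeps the substitution from running again on those later passes.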

potong