0

Using the following code, I am getting all the URLs in a site:

while( $html =~ m/<A HREF=\"(.*?)\"/g ) {    
      print "$1\n";  
  }

This gives me all the URLs, but I want to extract only the URLs that end with

1) .pdf

or

2) .doc

for example

http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf

Can anyone help me? Thanks.

Toto
backtrack

3 Answers

1
m/<A HREF=\"(.*?(\.pdf|\.doc))\"/g

It's working at my place:

> cat temp
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf">bwfjbwej</A>
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.xls">bwfjbwej</A>
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc">bwfjbwej</A>

> perl -lne 'print $1 if(/<A HREF=\"(.*?(\.pdf|\.doc))\"/g)' temp
http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf
http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc
>
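
One detail worth checking in patterns like this is whether the dot before the extension is escaped. An unescaped `.` matches any character, so `.pdf` would also accept names such as `notes_pdf`. Here is a minimal sketch of the difference; the file names in `@names` are made up purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical file names, used only to illustrate the difference.
my @names = ( 'CSO.pdf', 'notes_pdf', 'CSO.doc', 'mydoc' );

# Unescaped dot: "." stands for any single character before pdf/doc.
my @loose = grep { /(.pdf|.doc)$/ } @names;

# Escaped dot: a literal "." is required before pdf/doc.
my @escaped = grep { /(\.pdf|\.doc)$/ } @names;

print "loose:   @loose\n";
print "escaped: @escaped\n";
```

With these names the loose pattern should accept all four entries, while the escaped one should keep only `CSO.pdf` and `CSO.doc`.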
Vijay
1

If your grouping (.*?) matches all URLs, you should go with:

while( $html =~ m/<A HREF=\"(.*?(\.pdf|\.doc))\"/g ) {    
      print "$1\n";  
  }

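Be aware that this also matches a bare .pdf (with nothing before the extension), which might not be what you are looking for. The pattern .*? is non-greedy, but it is still quite dangerous imo: if a link without a matching extension is followed by a .pdf or .doc link on the same line, the match can run past the first closing quote and swallow both.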
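
To make the risk with `.*?` concrete, here is a small sketch. The sample `$html` string and the helper names `links_lazy` and `links_bounded` are made up for illustration, and `[^"]*` is one possible alternative rather than the only fix.

```perl
use strict;
use warnings;

# Hypothetical sample: two links on one line, only the second is a PDF.
my $html = '<A HREF="http://example.com/a.xls">x</A> <A HREF="http://example.com/b.pdf">y</A>';

# Lazy .*? keeps expanding until it finds .pdf" or .doc" later on the line,
# so the capture can start inside the first HREF and end inside the second.
sub links_lazy {
    my ($text) = @_;
    my @found;
    while ( $text =~ m/<A HREF="(.*?(?:\.pdf|\.doc))"/g ) {
        push @found, $1;
    }
    return @found;
}

# [^"]* cannot cross a double quote, so each capture stays inside one HREF.
sub links_bounded {
    my ($text) = @_;
    my @found;
    while ( $text =~ m/<A HREF="([^"]*(?:\.pdf|\.doc))"/g ) {
        push @found, $1;
    }
    return @found;
}

print "lazy:    $_\n" for links_lazy($html);
print "bounded: $_\n" for links_bounded($html);
```

On this input the lazy version should return a single capture that begins at the .xls link and ends at the .pdf link, while the bounded version should return only the .pdf URL.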

/edit

I tried it on http://regexpal.com/

\b(.*(\.pdf|\.doc))\b

for

http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf
http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc
http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdd
.pdf
http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdfawd

It matches just the first two URLs.
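
The same check can be reproduced in Perl instead of an online tester. This is only a sketch: the helper name `has_doc_extension` is made up, and the pattern and the five sample lines are the ones shown above.

```perl
use strict;
use warnings;

# The five sample lines from the test above.
my @samples = (
    'http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf',
    'http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc',
    'http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdd',
    '.pdf',
    'http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdfawd',
);

# Returns 1 if the line contains something ending in .pdf or .doc,
# delimited by word boundaries on both sides; 0 otherwise.
sub has_doc_extension {
    my ($line) = @_;
    return $line =~ m/\b(.*(\.pdf|\.doc))\b/ ? 1 : 0;
}

for my $line (@samples) {
    printf "%-5s %s\n", ( has_doc_extension($line) ? 'match' : 'no' ), $line;
}
```

The leading `\b` is what rejects the bare `.pdf` line, and the trailing `\b` is what rejects `.pdfawd`.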

EverythingRightPlace
  • What are you observing? It matches all URLs once again? – EverythingRightPlace Aug 22 '13 at 07:26
  • @bashophil I saw a blank black screen – backtrack Aug 22 '13 at 07:28
  • You could try `(.*\.pdf|.*\.doc)`. Btw I would suggest to add boundaries around your pattern: `\b` – EverythingRightPlace Aug 22 '13 at 07:30
  • +1, you gave me a correct answer, but your answer is case sensitive – backtrack Aug 22 '13 at 07:47
  • @Backtrack the case-insensitive option will just additionally match extensions like `PDF` or `pDf`. I am not sure, but I guess this is irrelevant. The stuff before `pdf` will match regardless, because `.` matches any character in either case. Btw `.*?` can match the same things as `.*` (anything, zero or more times); the `?` only makes it prefer the shortest match. – EverythingRightPlace Aug 22 '13 at 07:53
  • @bashophil in that page both "A HREF" and "a href" are present, and in that case yours will fail (mine too, actually), so that is why I accepted M42's answer – backtrack Aug 22 '13 at 09:01
  • No problem, but we can't guess what's important for your specific problem. You have to describe it. Nonetheless, like I wrote, remember to escape the dot as `\.`, otherwise it matches any character. Besides that I would suggest using word boundaries (see my examples). – EverythingRightPlace Aug 22 '13 at 10:05
1

I guess you need to search case-insensitively:

while( $html =~ m/<A HREF="(.*?\.(?:pdf|doc))"/ig ) {    
    print "$1\n";  
}
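
As a quick check of what the `i` flag buys here, this sketch runs the same pattern over input that mixes `A HREF` and `a href` as well as upper- and lower-case extensions. The `example.com` links and the helper name `doc_links` are hypothetical, added only for illustration.

```perl
use strict;
use warnings;

# Hypothetical input mixing tag case and extension case, one link per line.
my $html = <<'HTML';
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf">one</A>
<a href="http://example.com/report.DOC">two</a>
<a href="http://example.com/sheet.xls">three</a>
HTML

# Collect every HREF value ending in .pdf or .doc, ignoring case
# in both the tag name and the extension.
sub doc_links {
    my ($text) = @_;
    my @found;
    while ( $text =~ m/<A HREF="(.*?\.(?:pdf|doc))"/ig ) {
        push @found, $1;
    }
    return @found;
}

print "$_\n" for doc_links($html);
```

With one link per line as above, this should print the `.pdf` and `.DOC` URLs and skip the `.xls` one.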
Toto