Extract URLs from a list of URLs in Perl

Question

Using the following code, I am getting all the URLs on a site:

while( $html =~ m/<A HREF=\"(.*?)\"/g ) {    
      print "$1\n";  
  }

This gives me all the URLs, but I want to extract only the URLs that end with

1) .pdf

or

2) .doc

For example:

http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf

Can anyone help me? Thanks.

I assume you understand all the standard caveats about not parsing HTML with regular expressions, and have a good reason for ignoring them :-) — Dave Cross, Aug 22 '13 at 09:44
There's a really good explanation in the accepted answer here - http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not — Dave Cross, Aug 22 '13 at 14:07

Vijay · Answer 1 · 2013-08-22T07:32:16.920

1

 m/<A HREF=\"(.*?(.pdf|.doc))\"/g

It's working on my machine:

> cat temp
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf">bwfjbwej</A>
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.xls">bwfjbwej</A>
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc">bwfjbwej</A>

> perl -lne 'print $1 if(/<A HREF=\"(.*?(.pdf|.doc))\"/g)' temp
http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf
http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc
>

edited Aug 22 '13 at 07:32

answered Aug 22 '13 at 07:18

Vijay


You have to escape the `.`; otherwise it matches any character rather than a literal dot before pdf|doc. – EverythingRightPlace Aug 22 '13 at 07:45
+1 You gave me a correct answer, but your answer is case sensitive. – backtrack Aug 22 '13 at 07:46
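
The point about escaping the dot can be checked with a small sketch. The helper names `match_unescaped` and `match_escaped` and the example.com link are made up for illustration; the two patterns are the unescaped one from this answer and the same pattern with `\.`.

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern from this answer, dots NOT escaped.
sub match_unescaped {
    my ($html) = @_;
    return $html =~ m/<A HREF="(.*?(.pdf|.doc))"/ ? $1 : undef;
}

# Hypothetical helper: the same pattern with literal dots.
sub match_escaped {
    my ($html) = @_;
    return $html =~ m/<A HREF="(.*?(\.pdf|\.doc))"/ ? $1 : undef;
}

# Made-up link whose target merely ends in the letters "pdf".
my $tag = '<A HREF="http://example.com/files/notapdf">x</A>';

for my $result ( match_unescaped($tag), match_escaped($tag) ) {
    print defined $result ? "matched: $result\n" : "no match\n";
}
```

With this input the unescaped pattern reports a match (the `.` consumes the `a` in `notapdf`), while the escaped pattern reports no match.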

EverythingRightPlace · Answer 2 · 2013-08-22T07:35:47.827

1

If your grouping (.*?) matches all URLs, you should go with:

while( $html =~ m/<A HREF=\"(.*?(\.pdf|\.doc))\"/g ) {    
      print "$1\n";  
  }

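Be aware that this also matches a bare .pdf, which might not be what you are looking for. The pattern .*? is lazy rather than greedy, but it will still run past a closing quote to reach a later .pdf or .doc on the same line, so it is quite dangerous imo.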

/edit

I tried it on http://regexpal.com/

\b(.*(\.pdf|\.doc))\b

for

http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf
http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc
http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdd
.pdf
http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdfawd

It matches just the first two URLs.
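
The same check can be reproduced in Perl instead of regexpal. This is only a sketch: it applies the pattern to each of the five strings above separately, and the variable names `@candidates` and `@matched` are my own.

```perl
use strict;
use warnings;

# The five test strings listed above, one per element.
my @candidates = (
    'http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf',
    'http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc',
    'http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdd',
    '.pdf',
    'http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdfawd',
);

# Keep only the strings the word-boundary pattern matches.
my @matched = grep { /\b(.*(\.pdf|\.doc))\b/ } @candidates;

print "$_\n" for @matched;
```

This prints only the `CSO.pdf` and `CSO.doc` URLs. The bare `.pdf` fails because there is no word boundary before its leading dot, and `CSO.pdfawd` fails because there is no boundary between `pdf` and `awd`.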

edited Aug 22 '13 at 07:35

answered Aug 22 '13 at 07:21

EverythingRightPlace


What are you observing? It matches all URLs once again? – EverythingRightPlace Aug 22 '13 at 07:26
@bashophil I saw a blank black screen. – backtrack Aug 22 '13 at 07:28
You could try `(.*\.pdf|.*\.doc)`. Btw I would suggest to add boundaries around your pattern: `\b` – EverythingRightPlace Aug 22 '13 at 07:30
+1 You gave me a correct answer, but your answer is case sensitive. – backtrack Aug 22 '13 at 07:47
@Backtrack The case-insensitivity option will just additionally match extensions such as `PDF` or `pDf`. I am not sure, but I guess this is irrelevant. The stuff before `pdf` will match regardless, because `.*` accepts any characters. Btw `.*?` accepts the same characters as `.*` (anything, zero up to infinitely many times); it just prefers the shortest match where `.*` prefers the longest. – EverythingRightPlace Aug 22 '13 at 07:53
@bashophil On that page both "A HREF" and "a href" are present, in which case your pattern fails (mine too, actually), which is why I accepted M42's answer. – backtrack Aug 22 '13 at 09:01
No problem, but we can't guess what's important for your specific problem. You have to describe it. Nonetheless, like I wrote, remember to escape the dot as `\.`; otherwise it matches any character. Besides that, I would suggest using word boundaries (see my examples). – EverythingRightPlace Aug 22 '13 at 10:05
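
Since `.*` and `.*?` come up repeatedly in these comments, here is a small sketch of how they differ when two links share a line. The sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Made-up input: two links on a single line.
my $html = '<A HREF="a.pdf">one</A> <A HREF="b.pdf">two</A>';

# Greedy: .* grabs as much as possible, then backs off to the LAST .pdf".
my ($greedy) = $html =~ m/<A HREF="(.*\.pdf)"/;

# Lazy: .*? grows one character at a time and stops at the FIRST .pdf".
my ($lazy) = $html =~ m/<A HREF="(.*?\.pdf)"/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture spans from `a.pdf` through `b.pdf`, including the markup in between, while the lazy capture is just `a.pdf`. Both quantifiers accept the same characters; they differ only in which of the possible matches they prefer.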

score 1 · Accepted Answer · answered Aug 22 '13 at 07:31

1

I guess you need to search case-insensitively:

while( $html =~ m/<A HREF="(.*?\.(?:pdf|doc))"/ig ) {    
    print "$1\n";  
}
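
A possible way to package this, as a sketch only: the sub name `document_links` and the example.com sample are hypothetical, and I have swapped `.*?` for `[^"]*` (my change, not part of the answer above) so that a capture cannot run past the closing quote of one link into the next.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every HREF value ending in .pdf or .doc.
# /i makes both the tag and the extension case-insensitive; /g finds all links.
sub document_links {
    my ($html) = @_;
    my @links;
    # [^"]* (an assumption, replacing .*?) keeps the capture inside one
    # quoted attribute value.
    while ( $html =~ m/<A HREF="([^"]*\.(?:pdf|doc))"/ig ) {
        push @links, $1;
    }
    return @links;
}

# Made-up sample mixing upper- and lower-case tags and extensions.
my $html = join "\n",
    '<A HREF="http://example.com/a.pdf">a</A>',
    '<a href="http://example.com/b.DOC">b</a>',
    '<a href="http://example.com/c.xls">c</a> <a href="http://example.com/d.pdf">d</a>';

print "$_\n" for document_links($html);
```

For this sample it prints the `a.pdf`, `b.DOC` and `d.pdf` URLs. With `.*?` in place of `[^"]*`, the third line would instead yield a single capture that starts at the `c.xls` URL and ends at `d.pdf`.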

answered Aug 22 '13 at 07:31

Toto

