Extracting .zip filenames from given URL using regex in Python

Question

I want to extract the .zip filenames from the URLs in a given piece of HTML. Here is my code:

import re

print(re.findall(r'href=[\'"]?([^\'" >]+)','<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>'))

For example:

Input -<a href="http://www.example.com/files/world_data1.zip">World Data Part 1</a> <a href="http://www.example.com/files/world_data2.zip">World Data Part 2</a>

Expected Output - world_data1.zip,world_data2.zip.

I tried using .zip$ in various forms, but I get an empty list. Can anyone help me with this?

Why are you parsing HTML with regex to begin with? Couldn't you use something like [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)? — G_M, Feb 24 '18 at 20:53
[This may be of some help to you](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — mustachioed, Feb 24 '18 at 23:20
I did know we can use BeautifulSoup library. But I am trying to solve it with regex approach. Thanks for the input by the way. — sqlfirst, Feb 25 '18 at 16:58

score 0 · Answer 1 · answered Feb 24 '18 at 20:51

You could use

import re

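html = """&nbsp;<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>"""

# Group 1 captures the opening quote, group 2 the href value;
# the backreference \1 requires the same quote character to close it.
rx = re.compile(r"""href=(["'])(.*?)\1""")

# Keep the last path segment of each href, but only if it ends with .zip.
links = [filename
    for m in rx.finditer(html)
    for filename in [m.group(2).split('/')[-1]]
    if filename.endswith('.zip')]
print(links)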

Yielding

['world_data1.zip', 'world_data2.zip']

The idea is to get the href attributes first, split each one on / and keep the last part only if it ends with .zip.
However, consider using an HTML parser instead, such as BeautifulSoup (with CSS selectors) or lxml (with XPath queries).
See a demo on regex101.com for the expression.
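
To illustrate the parser route, here is a minimal sketch that uses only the standard library's html.parser rather than BeautifulSoup (a swapped-in choice, made so the example has no third-party dependency). The names ZipLinkParser and zip_filenames are hypothetical, made up for this example, and it assumes the links of interest are always in the href of an a tag.

```python
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import urlsplit


class ZipLinkParser(HTMLParser):
    """Collect the file names of .zip links found in <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.filenames = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Look only at the path part of the URL, so that a query
                # string or fragment does not end up in the file name.
                filename = basename(urlsplit(value).path)
                if filename.endswith(".zip"):
                    self.filenames.append(filename)


def zip_filenames(html):
    """Return the .zip file names linked from the given HTML string."""
    parser = ZipLinkParser()
    parser.feed(html)
    parser.close()
    return parser.filenames


html = ('<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> '
        '<a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>')
print(zip_filenames(html))  # ['world_data1.zip', 'world_data2.zip']
```

Unlike the regex, this does not depend on how the attribute is quoted, because the parser hands over the attribute value already unquoted.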

Interesting. I have never used this finditer but I will try using this. Thanks for the help. — sqlfirst, Feb 25 '18 at 17:02

score 0 · Answer 2 · DDGG · 2018-02-24T21:16:47.553

You can try this:

import re

s = '&nbsp;<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>'

print(re.findall(r'href="[^"]+?/([^/"]+\.zip)"', s))
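
For readers who want to see what each part of that pattern does, here is the same expression spelled out with re.VERBOSE so that every piece can carry a comment. This is only an annotated sketch; the name ZIP_HREF is made up for the example. Note that the pattern, as written, only matches double-quoted href values.

```python
import re

s = ('&nbsp;<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> '
     '<a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>')

# Same pattern as the one-liner, split across lines; in VERBOSE mode
# whitespace and #-comments inside the pattern are ignored.
ZIP_HREF = re.compile(r'''
    href="          # literal start of a double-quoted href attribute
    [^"]+?/         # one or more non-quote characters, then a slash
    ([^/"]+\.zip)   # group 1: last path segment, ending in .zip
    "               # the closing quote must follow .zip directly
''', re.VERBOSE)

# With exactly one capturing group, findall returns the group's text.
print(ZIP_HREF.findall(s))  # ['world_data1.zip', 'world_data2.zip']
```

Because group 1 excludes both / and ", it can only match the final path segment, which is why no separate split on / is needed here.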

Or, more rigorously, use an HTML parser such as pyquery:

import os

from pyquery import PyQuery as pq

doc = pq(s)  # `s` is the HTML string defined above.
a_list = doc('a[href]')  # Get all `a` elements that have an `href` attribute.
hrefs = [os.path.basename(a.attrib['href']) for a in a_list]
print([name for name in hrefs if name.endswith('.zip')])

edited Feb 24 '18 at 21:16

answered Feb 24 '18 at 20:51


I have tried the second approach you mentioned. Thanks for the help. Appreciate it! – sqlfirst Feb 25 '18 at 17:00
