I want to extract the .zip filenames from the URLs in a given piece of HTML. Here is my code:

import re

print(re.findall(r'href=[\'"]?([^\'" >]+)','<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>'))

For example:

Input: <a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>

Expected output: world_data1.zip, world_data2.zip

I tried adding .zip$ to the pattern in various forms, but I always get an empty list. Can anyone help me with this?
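A likely cause of the empty list, sketched here under the assumption that the attempt looked roughly like the first pattern below (the exact pattern is not given in the question): `$` only matches at the end of the whole string, and the input ends with `</a>`, not with `.zip`. Dropping the anchor and capturing only what follows the last slash avoids that. The variable names are illustrative.

```python
import re

html = ('<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> '
        '<a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>')

# Hypothetical reconstruction of the failing attempt: `$` anchors the match
# to the end of the string, which is "</a>", so nothing can match.
anchored = re.findall(r'href=[\'"]?([^\'" >]+\.zip)$', html)
print(anchored)  # []

# Without the anchor: skip everything up to the last slash of the URL,
# then capture the final path segment if it ends in .zip.
filenames = re.findall(r'href=[\'"]?[^\'" >]*/([^\'" >/]+\.zip)', html)
print(filenames)  # ['world_data1.zip', 'world_data2.zip']
```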

asherber
sqlfirst
  • Why are you parsing HTML with regex to begin with? Couldn't you use something like [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)? – G_M Feb 24 '18 at 20:53
  • [This may be of some help to you](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – mustachioed Feb 24 '18 at 23:20
  • I did know that we can use the BeautifulSoup library, but I am trying to solve this with a regex approach. Thanks for the input, by the way. – sqlfirst Feb 25 '18 at 16:58

2 Answers


You could use

import re

html = """&nbsp;<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>"""

# Capture the quoted href value; \1 matches the same quote character that opened it.
rx = re.compile(r"""href=(["'])(.*?)\1""")
# Keep the last path segment of each href, but only if it ends with .zip.
links = [filename 
    for m in rx.finditer(html) 
    for filename in [m.group(2).split('/')[-1]]
    if filename.endswith('.zip')]
print(links)

Yielding

['world_data1.zip', 'world_data2.zip']


The idea is to get the href attributes first, split by / and check if the last part ends with .zip.
However, consider using an HTML parser instead, such as BeautifulSoup (which supports CSS selectors) or lxml (which supports XPath queries).
See a demo on regex101.com for the expression.
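For comparison, here is a minimal sketch of the parser-based route. It uses the standard library's `html.parser` rather than BeautifulSoup, purely so that it runs without extra installs; that substitution and the class name `ZipLinkCollector` are assumptions made for illustration, not part of the approach above.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlparse


class ZipLinkCollector(HTMLParser):
    """Collects the file names of .zip links found in <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.filenames = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Take the last segment of the URL path, ignoring any query string.
                filename = posixpath.basename(urlparse(value).path)
                if filename.endswith(".zip"):
                    self.filenames.append(filename)


html = ('&nbsp;<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> '
        '<a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>')

collector = ZipLinkCollector()
collector.feed(html)
print(collector.filenames)  # ['world_data1.zip', 'world_data2.zip']
```

Unlike the regex, this does not depend on how the attribute is quoted or on where `href` appears inside the tag.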
Jan

You can try this:

import re

s = '&nbsp;<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>'

print(re.findall(r'href="[^"]+?/([^/"]+\.zip)"', s))
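To make the pieces of that pattern easier to follow, here is the same expression written with `re.VERBOSE`, with one comment per part. This is only an annotated restatement; the name `pattern` is illustrative.

```python
import re

s = ('&nbsp;<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> '
     '<a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>')

pattern = re.compile(r'''
    href="          # start of a double-quoted href attribute
    [^"]+?/         # lazily consume the URL up to a slash
    ([^/"]+\.zip)   # capture a slash-free segment ending in .zip
    "               # the closing quote, so the segment must be the last one
''', re.VERBOSE)

print(pattern.findall(s))  # ['world_data1.zip', 'world_data2.zip']
```

Note that this pattern only handles double-quoted attributes; single-quoted or unquoted `href` values would need a different expression.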

Or, for a more rigorous approach, parse the HTML with pyquery:

import os

from pyquery import PyQuery as pq

doc = pq(s)
a_list = doc('a[href]')  # Get all `a` elements that have a `href` attrib.
hrefs = [os.path.basename(a.attrib['href']) for a in a_list]
print(list(filter(lambda x: x.endswith('.zip'), hrefs)))
DDGG