
I would like to extract multiple URLs from a node and place them into a string array. Currently I'm saving all the text from the desired node into a string:

imgsUrl = value.text

then I am parsing the string and pulling out the correct URL:

imgsUrl[imgsUrl.find("http://"):imgsUrl.find(".JPG")+4]

My issue with this is that there could be anywhere from 1 to 200 URLs in imgsUrl, and this only gives me one of them. Is there a good solution for placing all of them into an array that would be less tedious?
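For reference, the slicing above could be repeated by passing a start offset to find(), but that stays tedious. A rough sketch, assuming every URL starts with "http://" and ends in ".JPG":

# Walk the string, advancing past the end of each match
urls = []
pos = 0
while True:
    start = imgsUrl.find("http://", pos)
    if start == -1:
        break
    end = imgsUrl.find(".JPG", start)
    if end == -1:
        break
    urls.append(imgsUrl[start:end + 4])
    pos = end + 4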

sample input:

sampleStr="<ul><li><a href="http://website/abc/vcd/HHD00300.JPG">HHD00300.JPG</a></li>
<li><a href="http://website/abc/vcd//HHD003002.jpg">HHD003002.jpg</a></li></ul>"

output:

print imgUrlSubString
outputs this: http://website/abc/vcd//HHD003000.JPG

expected output:

['http://website/abc/vcd//HHD003000.JPG','http://website/abc/vcd//HHD003002.JPG',....]
BFlint
  • Can you post a sample input and the expected output? – vikramls Nov 20 '14 at 18:31
  • Regex should do the trick. See this answer: http://stackoverflow.com/a/6883094/447599 – Jules G.M. Nov 20 '14 at 18:51
  • @vikramls alright sample input with corresponding output has been included – BFlint Nov 20 '14 at 20:27
  • possible duplicate of [Python xml ElementTree from a string source?](http://stackoverflow.com/questions/647071/python-xml-elementtree-from-a-string-source) – ivan_pozdeev Nov 20 '14 at 20:38
  • @Julius This seems to work great. Is this a similar approach that niroyb mentioned below? If so, I'd like to mark one of these as the answer. thanks! – BFlint Nov 20 '14 at 21:07
  • vikramls's answer is the better practice. However, the one I mentioned and niroyb's are the same, and definitely would have worked in that context; I really think that a true software engineer should know regex really well – Jules G.M. Nov 21 '14 at 04:30

3 Answers


You can use the re.findall method. It returns all non-overlapping matches of a regular expression directly in a list.

import re

# re.IGNORECASE also catches the lowercase ".jpg" in the sample input
print(re.findall(r"http://.*?\.JPG", imgsUrl, re.IGNORECASE))

Using ".*?" instead of ".*" is important in this case because there can be multiple urls so you want the non greedy match.

The best way to go, though, is to use an HTML parser. For Python, BeautifulSoup and lxml are popular choices; both are demonstrated in the answers below.


niroyb
  • Read http://stackoverflow.com/a/1732454/648265 at once, and again each time you think of providing such an answer. – ivan_pozdeev Nov 20 '14 at 20:34

Here's my answer. I used lxml.html to parse the HTML, because it is generally a bad idea to use regexes to parse HTML (see the answer @ivan_pozdeev linked in the comments above).

import lxml.html

sampleStr='<ul><li><a href="http://website/abc/vcd/HHD00300.JPG">HHD00300.JPG</a></li><li><a href="http://website/abc/vcd//HHD003002.jpg">HHD003002.jpg</a></li></ul>'
html = lxml.html.fromstring(sampleStr)
# XPath: collect the href attribute of every <a> element as a list of strings
print(html.xpath('//a/@href'))

The code uses an XPath expression to retrieve the href attribute of every a tag in the string sampleStr.

Sample output:

['http://website/abc/vcd/HHD00300.JPG', 'http://website/abc/vcd//HHD003002.jpg']
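Since xpath() returns a plain Python list of strings, it can be filtered afterwards if, say, only the image links are wanted, for example:

# Keep only hrefs that end in ".jpg" or ".JPG"
href_list = html.xpath('//a/@href')
jpg_urls = [h for h in href_list if h.lower().endswith('.jpg')]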
vikramls
  • Is it still possible to access html like an array, for example...print html[0] would print 'http://website/abc/vcd//HHD003000.JPG' – BFlint Nov 20 '14 at 21:27
  • Yes, you would store the expression like this: `href_list = html.xpath('//a/@href')` and you now have a list `href_list` which you can iterate over or access directly using `href_list[0]`. – vikramls Nov 20 '14 at 21:28

You can use BeautifulSoup to parse this string.

from bs4 import BeautifulSoup

# Passing an explicit parser avoids bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(sampleStr, "html.parser")
links = soup.find_all("a")
output = []
for link in links:
    output.append(link["href"])

And here's the output:

print(output)
['http://website/abc/vcd/HHD00300.JPG', 'http://website/abc/vcd//HHD003002.jpg']
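The loop can also be written as a one-line list comprehension:

# Same result, one line
output = [link["href"] for link in soup.find_all("a")]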
xbb
  • Thanks, this method also works for my problem. Not sure if there's a better choice, but both work. Thanks a lot! – BFlint Nov 20 '14 at 21:40