string function in Python to extract between two characters

Question

I have the below string and I want to extract everything from <img... to the closing " after .jpg.

I tried the code below, but instead of stopping at the first " after the start marker, it runs all the way to the last " in the string.

Can anyone help?

start = 'img src="'
end = '"'
print(string[string.find(start)+len(start):string.rfind(end)])

STRING:

 <p><a href="https://news.yahoo.com/us-ambassador-takes-post-united-nations-141833297.html"><img src="http://l1.yimg.com/uu/api/res/1.2/1f8jyGM.NfkxLb_.OgMaIQ--/YXBwaWQ9eXRhY2h5b247aD04Njt3PTEzMDs-/http://media.zenfs.com/en_us/News/afp.com/f5bbc19135065fcfff40e6ece9650f4ab225fa97.jpg" width="130" height="86" alt="New US ambassador takes up post at United Nations" align="left" title="New US ambassador takes up post at United Nations" border="0" ></a>US Ambassador Kelly Craft took up her post at the United Nations on Thursday, vowing to defend America's values and interests nine months after the departure of her high-profile predecessor Nikki Haley. Craft, 57, served previously as US ambassador to Canada where she was involved in negotiations on a new US Mexico Canada free trade agreement.<p><br clear="all">
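For reference, here is a minimal sketch of how the attempt above could be adjusted. `rfind` searches from the right, which is why the slice runs to the last quote; passing a start index to `find` instead looks for the first quote after the marker. The `text` value is a shortened, hypothetical stand-in for the string in the question, and the names `begin`, `stop` and `url` are only illustrative.

```python
# Shortened, hypothetical stand-in for the string in the question.
text = (
    '<p><a href="https://example.com/story.html">'
    '<img src="http://example.com/images/photo.jpg" width="130" height="86" '
    'alt="caption" border="0" ></a>Some text.<p><br clear="all">'
)

start = 'img src="'
end = '"'

# Index of the first character after the start marker.
begin = text.find(start) + len(start)

# Look for the closing quote from that index onward,
# instead of searching from the right with rfind.
stop = text.find(end, begin)

url = text[begin:stop]
print(url)
```

This sketch assumes both markers are present. `find` returns -1 when a marker is missing, so real input would need a check before slicing.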

Are you trying to parse/scrape HTML? There are libraries like BeautifulSoup for this kind of stuff. – Nf4r, Sep 12 '19 at 20:57
No, it comes from an XML RSS feed, it just happens to have the HTML tags still in it :) – kikee1222, Sep 12 '19 at 21:00
You can try to find a working regex for this case. It is good to test it here: https://regex101.com/ – Nf4r, Sep 12 '19 at 21:06
@kikee1222 How are you originally obtaining this information? – Life is complex, Sep 13 '19 at 11:35

Shivaraj · Answer 1 · 2019-09-12T21:25:32.303

0

You can use a regex like this, if you are sure the input will always have the same shape.

<img.*?jpg\"

I put it on Regex101 so you can try it out. You can tweak it depending on your requirements. Regex is the right tool for this, rather than string find, len and all that.
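As a rough sketch, the pattern above could be applied in Python with the standard `re` module. The `text` value is a shortened, hypothetical stand-in for the string in the question. The second pattern, which captures only the `src` value, is an assumption about what is ultimately wanted and is not part of the original answer.

```python
import re

# Shortened, hypothetical stand-in for the string in the question.
text = (
    '<p><a href="https://example.com/story.html">'
    '<img src="http://example.com/images/photo.jpg" width="130" height="86" '
    'alt="caption" border="0" ></a>Some text.<p><br clear="all">'
)

# The pattern from the answer: lazily match from "<img" to the first 'jpg"'.
tag_match = re.search(r'<img.*?jpg"', text)
tag_prefix = tag_match.group(0) if tag_match else None
print(tag_prefix)

# Illustrative variant: capture only the value of the src attribute.
src_match = re.search(r'<img\s[^>]*?src="([^"]*)"', text)
src = src_match.group(1) if src_match else None
print(src)
```

Both searches return `None` when nothing matches, so the result should be checked before calling `group`. The first pattern also only works for images whose URL ends in `jpg`.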

edited Sep 12 '19 at 21:25

answered Sep 12 '19 at 21:07


No, generally, regex is not the right tool for parsing XML. An XML parser is. – juanpa.arrivillaga Sep 12 '19 at 21:09
Doesn't the question say string? Sorry if I missed it. – Shivaraj Sep 12 '19 at 21:10
Yes, the string is XML. – juanpa.arrivillaga Sep 12 '19 at 21:11
Then he can definitely use an XML parser for it. If it's only this part he is interested in, he can give this a try. I answered as per the question :P – Shivaraj Sep 12 '19 at 21:12
See this famous question/answer: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags Now, don't get me wrong, people use regex to extract information from HTML/XML etc. all the time, but it is lazy and bad practice. – juanpa.arrivillaga Sep 12 '19 at 21:12
Yeah, for sure. I have seen people using regex to parse everything. Thanks for the link though. – Shivaraj Sep 12 '19 at 21:13
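To illustrate the parser route discussed in these comments, here is a minimal sketch using only the standard library's `html.parser`. It is one possible approach, not something posted in the thread. The fragment in the question has unclosed tags, so this sketch assumes a lenient HTML parser is a better fit for the embedded markup than a strict XML parser. The class name `ImgSrcCollector` and the shortened `text` value are hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Shortened, hypothetical stand-in for the string in the question.
text = (
    '<p><a href="https://example.com/story.html">'
    '<img src="http://example.com/images/photo.jpg" width="130" height="86" '
    'alt="caption" border="0" ></a>Some text.<p><br clear="all">'
)

collector = ImgSrcCollector()
collector.feed(text)
collector.close()
print(collector.sources)
```

Unlike the string and regex approaches, this does not depend on attribute order or on the URL ending in `.jpg`, and it returns every image in the fragment as a separate list entry.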

score 0 · Answer 2 · edited Sep 12 '19 at 21:38

You could just use the .split() method if you don't want to use a regex.

# Named "text" rather than "str" so the built-in str type is not shadowed.
text = """<p><a href="https://news.yahoo.com/us-ambassador-takes-post-united-nations-141833297.html"><img src="http://l1.yimg.com/uu/api/res/1.2/1f8jyGM.NfkxLb_.OgMaIQ--/YXBwaWQ9eXRhY2h5b247aD04Njt3PTEzMDs-/http://media.zenfs.com/en_us/News/afp.com/f5bbc19135065fcfff40e6ece9650f4ab225fa97.jpg" width="130" height="86" alt="New US ambassador takes up post at United Nations" align="left" title="New US ambassador takes up post at United Nations" border="0" ></a>US Ambassador Kelly Craft took up her post at the United Nations on Thursday, vowing to defend America's values and interests nine months after the departure of her high-profile predecessor Nikki Haley. Craft, 57, served previously as US ambassador to Canada where she was involved in negotiations on a new US Mexico Canada free trade agreement.<p><br clear="all">"""


# final should just be the URL: keep what follows 'img src="',
# then cut at the '" width=' that comes right after it.
final = text.split('img src="')[1].split('" width=')[0]

print(final)

Output:

http://l1.yimg.com/uu/api/res/1.2/1f8jyGM.NfkxLb_.OgMaIQ--/YXBwaWQ9eXRhY2h5b247aD04Njt3PTEzMDs-/http://media.zenfs.com/en_us/News/afp.com/f5bbc19135065fcfff40e6ece9650f4ab225fa97.jpg

This outputs _all_ links in a single string, probably not ideal. – wpercy, Sep 12 '19 at 21:39
True, but calling split("http://") and then adding it back to each piece gives you an array of the URLs. Also, the question was how to get the string between the two characters, which this code does. – Parcevel, Sep 12 '19 at 21:42
