How can I extract a <p> containing <br> elements with a regex?

Question

Let me start off by showing the three different types of strings I will be dealing with:

"<h1>Money Shake</h1><p>Money<br>Money<br>MORE MONEY</p><p>Take money and stuff in blender.</p><p>Blend.</p>"

"<h1>Money Shake</h1><p>Posted by Gordon Gekko</p><p>Money<br>Money<br>MORE MONEY</p><p>Take money and stuff in blender.</p><p>Blend.</p>"

"<h1>Money Shake</h1><p>Posted by Gordon Gekko</p><p>They're great</p><p>Yield: KA-CHING</p><p>Money<br>Money<br>MORE MONEY</p><p>Take money and stuff in blender.</p><p>Blend.</p>"

Essentially, what I wish to do is to rip out the chunk that has the ingredients:

"<p>Money<br>Money<br>MORE MONEY</p>"

This is the regex that I am using:

re.search(r'<p>[^</p>](.*)<br>(.*?)</p>', string, re.I)

When I use this on the first and second strings, it does exactly what I want and matches this:

"<p>Money<br>Money<br>MORE MONEY</p>"

But when I use this on the third string, it matches this instead:

"<p>They're great</p><p>Yield: KA-CHING</p><p>Money<br>Money<br>MORE MONEY</p>"

What am I screwing up?

@Blender

Hi Blender, this is what I came up with in grabbing the chunks I want. I'm sure there is a better way, but consider that I'm 2 weeks into Python / programming:

def get_ingredients(soup):
    # Return the first <p> that contains a <br>, i.e. the ingredients block.
    for p in soup.find_all('p'):
        if p.find('br'):
            return p

ingredients = get_ingredients(soup)
p_list = soup.find_all('p')
ingredient_index = p_list.index(ingredients)

# Everything before the ingredients is junk; everything after is instructions.
junk = p_list[:ingredient_index]
instructions = p_list[ingredient_index + 1:]

"What am I screwing up?" I don't want to be judgmental, but the consensus here is that HTML and regular expressions don't mix. Even in the blender. — Joe, Aug 21 '13 at 16:07
Take a look at http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Joe, Aug 21 '13 at 16:08
`[^</p>]` is not really what you think it is. It is "not any of `<>/p` symbols". I think you need `(?!</p>)` — mishik, Aug 21 '13 at 16:08
@Joe - Well... you [*can* use regexes to parse HTML](http://stackoverflow.com/a/4234491/211627), but it is not advisable. The [tag:regex] [wiki](http://stackoverflow.com/tags/regex/info) specifically addresses this type of question, though. — JDB, Aug 21 '13 at 16:20
I am not trolling. Granted I am no expert like you guys, but that was what I came up with. I have seen 1732348 but that's not very helpful in this case to just throw that at me. I understand you might think I am trolling because of the content in the html tags. Ok, I have been assigned to go through a database of recipes and to isolate the ingredients and the instructions. This is why I am doing this. I didn't want to put up a proper recipe, and hence made this Money Shake example. That is all. Please stop bashing a newbie. I am trying my best to learn. — shabalong, Aug 21 '13 at 16:20
possible duplicate of [Regular expression pattern not matching anywhere in string](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string) — JDB, Aug 21 '13 at 16:21
Sometimes being new to StackOverflow is a bit like being new to skateboarding... you are going to get bruised if you aren't careful, or maybe even if you are, and it just takes practice to get it right. But for many, it's worth it. — JDB, Aug 21 '13 at 16:25
Folks, I think you might have jumped the gun on this one. This is not who you think this is. — Brad Larson, Aug 21 '13 at 16:46
Please make your question easier to parse, so that I can revert my downvote... — Antti Haapala -- Слава Україні, Aug 21 '13 at 16:55
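
The over-match on the third string can be reproduced with a few lines of Python. This is only a sketch to illustrate the character-class point raised in the comments, not a recommendation to parse HTML with regex. The names `third`, `original`, and `tempered` are illustrative, and the `tempered` pattern is an assumption about what a lookahead-based fix could look like, not something taken from the original post.

```python
import re

third = ("<h1>Money Shake</h1><p>Posted by Gordon Gekko</p><p>They're great</p>"
         "<p>Yield: KA-CHING</p><p>Money<br>Money<br>MORE MONEY</p>"
         "<p>Take money and stuff in blender.</p><p>Blend.</p>")

# The pattern from the question. [^</p>] is a character class: it matches ONE
# character that is not '<', '/', 'p' or '>'. With re.I it also rejects 'P',
# so "<p>Posted ..." is skipped and the match starts at "<p>They're great".
# The greedy (.*) then runs across </p><p> boundaries up to the last <br>.
original = re.compile(r'<p>[^</p>](.*)<br>(.*?)</p>', re.I)
print(original.search(third).group(0))
# <p>They're great</p><p>Yield: KA-CHING</p><p>Money<br>Money<br>MORE MONEY</p>

# Illustrative alternative: (?:(?!</p>).) consumes one character at a time,
# but only while the text ahead is not a closing </p>, so the match can
# never leave the paragraph it started in.
tempered = re.compile(r'<p>(?:(?!</p>).)*?<br>(?:(?!</p>).)*</p>', re.I)
print(tempered.search(third).group(0))
# <p>Money<br>Money<br>MORE MONEY</p>
```

The tempered version still assumes bare `<p>` tags with no attributes and no newlines inside the paragraph, which is one reason a real HTML parser is the safer choice.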

score 3 · Answer 1 · answered Aug 21 '13 at 16:38

Just use a proper HTML parser. It'll be more intuitive than regex and will actually work:

# May need to install it:
# pip install BeautifulSoup4

from bs4 import BeautifulSoup

soup = BeautifulSoup("""
    <h1>Money Shake</h1>
    <p>Posted by Gordon Gekko</p>
    <p>They're great</p>
    <p>Yield: KA-CHING</p>
    <p>
        Money
        <br>
        Money
        <br>
        MORE MONEY
    </p>
    <p>Take money and stuff in blender.</p>
    <p>Blend.</p>
""", "html.parser")

def get_ingredients(soup):
    for p in soup.find_all('p'):
        if p.find('br'):
            return p.find_all(text=True)

Blender


Hi Blender, this is excellent! Thank you! I'm not sure how to turn this into what I was trying to do though. Ok, so what I had intended to do is take the string, and then write a csv file with columns: 'Header', 'Junk', 'Ingredients', 'Instructions', and pull out the chunks into their respective columns, i.e. "<h1>Money Shake</h1>" under 'Header', "<p>Posted by Gordon Gekko</p><p>They're great</p><p>Yield: KA-CHING</p>" under 'Junk', etc. Sorry for being such a hassle. I appreciate the help. – shabalong Aug 21 '13 at 19:11
That I am a noob is clear from how I can't even format my response here in the comment, or maybe there isn't a way to do so for what I wanted to reply with. @Blender I am going to edit my question to show you what I came up with. – shabalong Aug 21 '13 at 19:55
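
The CSV layout described in the comment above could be sketched roughly as follows, assuming BeautifulSoup 4 is installed. The helper name `split_recipe`, the choice to join text fragments with newlines, and the rule that the first `<p>` containing a `<br>` is the ingredients block are all assumptions made for illustration; real recipe data may need a different rule.

```python
import csv
import io

from bs4 import BeautifulSoup


def split_recipe(html):
    """Split one recipe string into the four CSV columns (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.h1.get_text(strip=True) if soup.h1 else ""
    paragraphs = soup.find_all("p")
    # Assumption: the first <p> containing a <br> is the ingredients block.
    # If no such <p> exists, every paragraph ends up under 'Junk'.
    idx = next((i for i, p in enumerate(paragraphs) if p.find("br")),
               len(paragraphs))

    def lines(tags):
        # One text fragment per line, surrounding whitespace stripped.
        return "\n".join(s for tag in tags for s in tag.stripped_strings)

    return {
        "Header": header,
        "Junk": lines(paragraphs[:idx]),
        "Ingredients": lines(paragraphs[idx:idx + 1]),
        "Instructions": lines(paragraphs[idx + 1:]),
    }


recipes = [
    "<h1>Money Shake</h1><p>Posted by Gordon Gekko</p><p>They're great</p>"
    "<p>Yield: KA-CHING</p><p>Money<br>Money<br>MORE MONEY</p>"
    "<p>Take money and stuff in blender.</p><p>Blend.</p>",
]

# io.StringIO keeps the sketch self-contained; to write a real file, use
# open("recipes.csv", "w", newline="") instead (the file name is hypothetical).
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["Header", "Junk", "Ingredients", "Instructions"])
writer.writeheader()
writer.writerows(split_recipe(r) for r in recipes)
print(out.getvalue())
```

Cells that contain newlines are quoted by the `csv` module automatically, so multi-line ingredient lists stay inside a single column.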
