Let me start by showing the three different types of strings I will be dealing with:

"<h1>Money Shake</h1><p>Money<br>Money<br>MORE MONEY</p><p>Take money and stuff in blender.</p><p>Blend.</p>"

"<h1>Money Shake</h1><p>Posted by Gordon Gekko</p><p>Money<br>Money<br>MORE MONEY</p><p>Take money and stuff in blender.</p><p>Blend.</p>"

"<h1>Money Shake</h1><p>Posted by Gordon Gekko</p><p>They're great</p><p>Yield: KA-CHING</p><p>Money<br>Money<br>MORE MONEY</p><p>Take money and stuff in blender.</p><p>Blend.</p>"

Essentially, I want to pull out the chunk that contains the ingredients:

"<p>Money<br>Money<br>MORE MONEY</p>"

This is the regex that I am using:

re.search(r'<p>[^</p>](.*)<br>(.*?)</p>', string, re.I)

When I use this on the first and second strings, it does exactly what I want and matches this:

"<p>Money<br>Money<br>MORE MONEY</p>"

But when I use it on the third string, it matches this instead:

"<p>They're great</p><p>Yield: KA-CHING</p><p>Money<br>Money<br>MORE MONEY</p>"

What am I screwing up?


@Blender

Hi Blender, this is what I came up with to grab the chunks I want. I'm sure there is a better way, but bear in mind that I'm 2 weeks into Python / programming:

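# `soup` is the BeautifulSoup object built in Blender's answer.
def get_ingredients(soup):
    # Return the first <p> that contains a <br> (the ingredients paragraph).
    for p in soup.find_all('p'):
        if p.find('br'):
            return p

ingredients = get_ingredients(soup)
p_list = soup.find_all('p')
ingredient_index = p_list.index(ingredients)

# Paragraphs before the ingredients are junk; paragraphs after are instructions.
junk = p_list[:ingredient_index]
instructions = p_list[ingredient_index + 1:]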
Andrew Barber
shabalong
  • "What am I screwing up?" I don't want to be judgmental, but the consensus here is that HTML and regular expressions don't mix. Even in the blender. – Joe Aug 21 '13 at 16:07
  • Take a look at http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Joe Aug 21 '13 at 16:08
  • `[^...]` is not really what you think it is. `[^</p>]` means "any single character that is not one of the symbols `<`, `>`, `/`, `p`". I think you need a negative lookahead, `(?!...)` – mishik Aug 21 '13 at 16:08
  • @Joe - Well... you [*can* use regexes to parse HTML](http://stackoverflow.com/a/4234491/211627), but it is not advisable. The [tag:regex] [wiki](http://stackoverflow.com/tags/regex/info) specifically addresses this type of question, though. – JDB Aug 21 '13 at 16:20
  • I am not trolling. Granted I am no expert like you guys, but that was what I came up with. I have seen 1732348 but that's not very helpful in this case to just throw that at me. I understand you might think I am trolling because of the content in the html tags. Ok, I have been assigned to go through a database of recipes and to isolate the ingredients and the instructions. This is why I am doing this. I didn't want to put up a proper recipe, and hence made this Money Shake example. That is all. Please stop bashing a newbie. I am trying my best to learn. – shabalong Aug 21 '13 at 16:20
  • possible duplicate of [Regular expression pattern not matching anywhere in string](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string) – JDB Aug 21 '13 at 16:21
  • Sometimes being new to StackOverflow is a bit like being new to skateboarding... you are going to get bruised if you aren't careful, or maybe even if you are, and it just takes practice to get it right. But for many, it's worth it. – JDB Aug 21 '13 at 16:25
  • Folks, I think you might have jumped the gun on this one. This is not who you think this is. – Brad Larson Aug 21 '13 at 16:46
  • Yes, not the one who I thought this is. – Antti Haapala -- Слава Україні Aug 21 '13 at 16:52
  • Please make your question easier to parse, so that I can revert my downvote... – Antti Haapala -- Слава Україні Aug 21 '13 at 16:55
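
To make the character-class point from the comments concrete, here is a minimal sketch using only the standard `re` module. It reproduces the over-long match on the third string and shows why the second string only worked by accident: with `re.I`, `[^</p>]` also rejects an uppercase `P`, so `<p>Posted` can never start a match, while `<p>They're` can. The `tempered` pattern is a hypothetical alternative built on the lookahead hint above, not something from the thread. It assumes bare `<p>` tags without attributes and no line breaks inside the string, and a real HTML parser is still the more robust choice.

```python
import re

third = (
    "<h1>Money Shake</h1><p>Posted by Gordon Gekko</p>"
    "<p>They're great</p><p>Yield: KA-CHING</p>"
    "<p>Money<br>Money<br>MORE MONEY</p>"
    "<p>Take money and stuff in blender.</p><p>Blend.</p>"
)

original = r'<p>[^</p>](.*)<br>(.*?)</p>'

# [^</p>] is a set of single characters, not "anything but the tag </p>".
# With re.I it also rejects 'P', so a match cannot begin at "<p>Posted".
starts_at_posted = re.match(original, "<p>Posted</p><p>a<br>b</p>", re.I)

# On the third string the match begins at "<p>They're", and the greedy (.*)
# then runs across paragraph boundaries up to the last <br>.
greedy_match = re.search(original, third, re.I).group(0)

# Hypothetical alternative: (?:(?!</?p>).)* consumes one character at a time
# but refuses to step over an opening or closing <p> tag.
tempered = r'<p>(?:(?!</?p>).)*<br>(?:(?!</?p>).)*</p>'
fixed_match = re.search(tempered, third, re.I).group(0)

print(starts_at_posted)
print(greedy_match)
print(fixed_match)
```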

1 Answer

Just use a proper HTML parser. It'll be more intuitive than regex and will actually work:

# May need to install it:
# pip install BeautifulSoup4

from bs4 import BeautifulSoup

soup = BeautifulSoup("""
    <h1>Money Shake</h1>
    <p>Posted by Gordon Gekko</p>
    <p>They're great</p>
    <p>Yield: KA-CHING</p>
    <p>
        Money
        <br>
        Money
        <br>
        MORE MONEY
    </p>
    <p>Take money and stuff in blender.</p>
    <p>Blend.</p>
""", "html.parser")  # name the parser explicitly so bs4 does not have to guess

def get_ingredients(soup):
    for p in soup.find_all('p'):
        if p.find('br'):
            return p.find_all(text=True)
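
For completeness, a minimal usage sketch of the function above, run against the third string from the question. It assumes `beautifulsoup4` is installed and swaps in `stripped_strings` (a standard bs4 attribute) so that the pieces come back without surrounding whitespace; the empty-list fallback is an illustrative choice, not part of the code above.

```python
from bs4 import BeautifulSoup

html = (
    "<h1>Money Shake</h1><p>Posted by Gordon Gekko</p>"
    "<p>They're great</p><p>Yield: KA-CHING</p>"
    "<p>Money<br>Money<br>MORE MONEY</p>"
    "<p>Take money and stuff in blender.</p><p>Blend.</p>"
)

soup = BeautifulSoup(html, "html.parser")

def get_ingredients(soup):
    # Return the text pieces of the first <p> that contains a <br>.
    for p in soup.find_all("p"):
        if p.find("br"):
            return list(p.stripped_strings)
    return []  # assumption: no such paragraph means no ingredients

ingredients = get_ingredients(soup)
print(ingredients)
```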
Blender
  • Hi Blender, this is excellent! Thank you! I'm not sure how to turn this into what I was trying to do though. Ok, so what I had intended to do is take the string, and then write a csv file with columns: 'Header', 'Junk', 'Ingredients', 'Instructions', and pull out the chunks into their respective columns, i.e. "Money Shake" under 'Header', "Posted by Gordon Gekko", "They're great" and "Yield: KA-CHING" under 'Junk', etc. Sorry for being such a hassle. I appreciate the help. – shabalong Aug 21 '13 at 19:11
  • That I am a noob is clear from how I can't even format my response here in the comment, or maybe there isn't a way to do so for what I wanted to reply with. @Blender I am going to edit my question to show you what I came up with. – shabalong Aug 21 '13 at 19:55
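
The CSV layout described in the comments could be sketched roughly as follows. This is only one possible approach, under the same assumption as the answer (the ingredients are the first `<p>` containing a `<br>`). The helper name `split_recipe`, the `" | "` separator used to join several paragraphs into one cell, and the in-memory buffer are all illustrative choices rather than anything from the thread.

```python
import csv
import io

from bs4 import BeautifulSoup

COLUMNS = ["Header", "Junk", "Ingredients", "Instructions"]

def split_recipe(html):
    """Split one recipe string into the four columns (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("h1")
    paragraphs = soup.find_all("p")
    # Assumption: the ingredients are the first <p> that contains a <br>.
    index = next((i for i, p in enumerate(paragraphs) if p.find("br")), None)
    if index is None:
        raise ValueError("no <p> containing <br> found")
    join = " | ".join  # arbitrary separator for multi-paragraph cells
    return {
        "Header": header.get_text(strip=True) if header else "",
        "Junk": join(p.get_text(strip=True) for p in paragraphs[:index]),
        "Ingredients": join(paragraphs[index].stripped_strings),
        "Instructions": join(p.get_text(strip=True) for p in paragraphs[index + 1:]),
    }

recipe = (
    "<h1>Money Shake</h1><p>Posted by Gordon Gekko</p>"
    "<p>They're great</p><p>Yield: KA-CHING</p>"
    "<p>Money<br>Money<br>MORE MONEY</p>"
    "<p>Take money and stuff in blender.</p><p>Blend.</p>"
)
row = split_recipe(recipe)

# Written to an in-memory buffer here; for a real file, use
# open("recipes.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Calling `split_recipe` once per database record and passing each result to `writer.writerow` would give one CSV row per recipe.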