Let me start off by showing the three different types of strings I will be dealing with:
"<h1>Money Shake</h1><p>Money<br>Money<br>MORE MONEY</p><p>Take money and stuff in blender.</p><p>Blend.</p>"
"<h1>Money Shake</h1><p>Posted by Gordon Gekko</p><p>Money<br>Money<br>MORE MONEY</p><p>Take money and stuff in blender.</p><p>Blend.</p>"
"<h1>Money Shake</h1><p>Posted by Gordon Gekko</p><p>They're great</p><p>Yield: KA-CHING</p><p>Money<br>Money<br>MORE MONEY</p><p>Take money and stuff in blender.</p><p>Blend.</p>"
Essentially, what I wish to do is to rip out the chunk that has the ingredients:
"<p>Money<br>Money<br>MORE MONEY</p>"
This is the regex that I am using:
re.search(r'<p>[^</p>](.*)<br>(.*?)</p>', string, re.I)
When I use this on the first and second strings, it does exactly what I want and returns this match:
"<p>Money<br>Money<br>MORE MONEY</p>"
But when I use this on the third string, it returns this match:
"<p>They're great</p><p>Yield: KA-CHING</p><p>Money<br>Money<br>MORE MONEY</p>"
What am I screwing up?
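A likely explanation, offered as a sketch rather than a definitive answer: `[^</p>]` is a character class, so it rejects a single character that is `<`, `/`, `p` or `>` (and `P` under `re.I`) instead of rejecting the sequence `</p>`. That would be why "Posted" is skipped but "They're" is not, and the greedy `(.*)` then runs across paragraph boundaries up to the last `<br>`. The pattern below is one possible alternative that refuses to cross a `</p>`; the names `INGREDIENTS_RE` and `find_ingredients` are hypothetical and not from the original post.

```python
import re

# The third sample string from the post.
s3 = ("<h1>Money Shake</h1><p>Posted by Gordon Gekko</p><p>They're great</p>"
      "<p>Yield: KA-CHING</p><p>Money<br>Money<br>MORE MONEY</p>"
      "<p>Take money and stuff in blender.</p><p>Blend.</p>")

# The pattern from the post, kept only to reproduce the problem.
ORIGINAL_RE = re.compile(r'<p>[^</p>](.*)<br>(.*?)</p>', re.I)

# Assumed alternative: "(?:(?!</p>).)*?" consumes one character at a time,
# but only while the text ahead is not "</p>", so the match stays inside
# a single paragraph.
INGREDIENTS_RE = re.compile(r'<p>(?:(?!</p>).)*?<br>(?:(?!</p>).)*?</p>', re.I)


def find_ingredients(html):
    """Return the first <p>...</p> that contains a <br>, or None."""
    match = INGREDIENTS_RE.search(html)
    return match.group(0) if match else None


print(ORIGINAL_RE.search(s3).group(0))  # starts at "<p>They're great</p>"
print(find_ingredients(s3))             # <p>Money<br>Money<br>MORE MONEY</p>
```

This only holds for flat markup like the samples above; nested or attribute-laden `<p>` tags are better handled by an HTML parser.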
@Blender
Hi Blender, this is what I came up with for grabbing the chunks I want. I'm sure there is a better way, but bear in mind that I'm two weeks into Python / programming:
def get_ingredients(soup):
    # Return the first <p> that contains a <br>, or None if there is none.
    for p in soup.find_all('p'):
        if p.find('br'):
            return p

ingredients = get_ingredients(soup)
p_list = soup.find_all('p')
ingredient_index = p_list.index(ingredients)
# Everything before the ingredients paragraph is junk; everything after is instructions.
junk = p_list[:ingredient_index]
instructions = p_list[ingredient_index + 1:]
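The find-then-slice step above can also be done in a single pass. The sketch below shows the idea with plain strings standing in for the BeautifulSoup tags, so it runs without any third-party package; `split_around` is a hypothetical helper name, and the `"<br>" in p` test is an assumed stand-in for `p.find('br')`.

```python
def split_around(items, is_pivot):
    """Split items into (before, pivot, after) at the first item where
    is_pivot(item) is true. If nothing matches, pivot is None."""
    for i, item in enumerate(items):
        if is_pivot(item):
            return items[:i], item, items[i + 1:]
    return list(items), None, []


# Plain strings standing in for the <p> tags of the second sample string.
paragraphs = [
    "Posted by Gordon Gekko",
    "Money<br>Money<br>MORE MONEY",
    "Take money and stuff in blender.",
    "Blend.",
]

junk, ingredients, instructions = split_around(paragraphs, lambda p: "<br>" in p)
print(junk)          # ['Posted by Gordon Gekko']
print(ingredients)   # Money<br>Money<br>MORE MONEY
print(instructions)  # ['Take money and stuff in blender.', 'Blend.']
```

One design point: returning `None` as the pivot when no paragraph qualifies avoids the `ValueError` that `list.index` raises when the item is not found.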