I am trying to write a program that takes any recipe made with WPRM (WordPress Recipe Maker) and rescales it automatically. It is supposed to take the HTML of the print version of such a recipe (see for example this cookie recipe) and extract the relevant information (amount, unit, name) for each ingredient using regular expressions. Each relevant part is enclosed between something along the lines of ...ingredient-attribute"> and </span>. However, the closing delimiter is sometimes not recognized.

This is the relevant part of the code:

ingredients = re.split(r'<li', recipestr_cut)

for x in ingredients:

    out = []  # initialize a single element of the recipe as a list

    amount_m = re.search(r'ingredient-amount"\>(.+)\<\/span>', x)
    if amount_m:
        out.append(amount_m.group(1))
    unit_m = re.search(r'ingredient-unit"\>(.+)\<\/span>', x)
    if unit_m:
        out.append(unit_m.group(1))
    ingredient_m = re.search(r'ingredient-name"\>(.+)\<\/span>', x)
    if ingredient_m:
        out.append(ingredient_m.group(1))

    if len(out) > 0:
        recipe_readable.append(out)

For the recipe in the example before, this works neatly and returns

[['3', 'cups', '(380 grams) all-purpose flour'], ['1', 'teaspoon', 'baking soda'], ['1', 'teaspoon', 'fine sea salt'], ['2', 'sticks (227 grams) unsalted butter, at cool room temperature (67°F)'], ['1/2', 'cup', '(100 grams) granulated sugar'], ['1 1/4', 'cups', '(247 grams) lightly packed light brown sugar'], ['2', 'teaspoons', 'vanilla'], ['2', 'large eggs, at room temperature'], ['2', 'cups', '(340 grams) semisweet chocolate chips']]

However, if one instead uses a somewhat more complicated recipe, for example this pork bun recipe, it apparently no longer recognizes the </span> delimiter, because the output looks like this:

[['2/3</span>&#32;<span class="wprm-recipe-ingredient-unit">cup</span>&#32;<span class="wprm-recipe-ingredient-name">heavy cream', 'cup</span>&#32;<span class="wprm-recipe-ingredient-name">heavy cream</span>&#32;<span class="wprm-recipe-ingredient-notes wprm-recipe-ingredient-notes-faded">(at room temperature)', ...

This clearly still includes the </span> and does not seem to split the string in the right places. Instead, I would expect something like

[['2/3','cup','heavy cream'],...

How could this be fixed? I have only very recently learned about regular expressions, so this is still scary to me.

Wiktor Stribiżew
paulina
    Obligatory link: [You can't parse \[X\]HTML with regex](https://stackoverflow.com/a/1732454). The site you linked to has elements with semantically-named classes, so it shouldn't be too hard to query them using CSS selector or XPath. – InSync May 08 '23 at 16:25
    You need to at least use non-greedy regexp: `.*?` – Barmar May 08 '23 at 16:27
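The non-greedy suggestion can be illustrated with a small sketch. The `fragment` string below is an assumption modelled on the output quoted in the question, and `extract` is a hypothetical helper, not part of the asker's program. A greedy `.+` runs to the last `</span>` on the line, while `.+?` stops at the first one:

```python
import re

# Assumed sample: one ingredient shaped like the WPRM markup quoted in the question.
fragment = (
    '<span class="wprm-recipe-ingredient-amount">2/3</span>&#32;'
    '<span class="wprm-recipe-ingredient-unit">cup</span>&#32;'
    '<span class="wprm-recipe-ingredient-name">heavy cream</span>&#32;'
    '<span class="wprm-recipe-ingredient-notes">(at room temperature)</span>'
)

def extract(attribute, text, greedy=False):
    """Return the text of the first ingredient-<attribute> span, or None."""
    quantifier = '.+' if greedy else '.+?'
    match = re.search(rf'ingredient-{attribute}">({quantifier})</span>', text)
    return match.group(1) if match else None

# Greedy: the capture swallows every following span up to the last </span>.
greedy_amount = extract('amount', fragment, greedy=True)

# Non-greedy: the capture ends at the first </span> after the opening tag.
lazy = [extract(a, fragment) for a in ('amount', 'unit', 'name')]
print(lazy)  # ['2/3', 'cup', 'heavy cream']
```

This only addresses the greediness problem; nested tags or unusual attribute order would still trip up the pattern, which is why the first comment recommends a parser.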

2 Answers


Instead of regex, use an HTML parser, like BeautifulSoup:

from bs4 import BeautifulSoup

def get_ingredients_info_from_html(html):
    soup = BeautifulSoup(html, 'html.parser')

    ingredients = []
    class_name = 'wprm-recipe-ingredient'

    # Each ingredient is an element carrying the base class.
    for ingredient in soup.select(f'.{class_name}'):
        # Every field defaults to None until a matching element is found.
        record = dict.fromkeys(['amount', 'unit', 'name', 'notes'])

        for info in record:
            # Search only within the current ingredient element.
            element = ingredient.select_one(f'.{class_name}-{info}')

            if element:
                record[info] = element.text

        ingredients.append(record)

    return ingredients

Try it:

import requests

url = 'https://thewoksoflife.com/wprm_print/31141'

print(get_ingredients_info_from_html(requests.get(url).content))

'''
[
    {
        'amount': '2/3',
        'unit': 'cup',
        'name': 'heavy cream',
        'notes': '(at room temperature)'
    },
    {
        'amount': '1',
        'unit': 'cup',
        'name': 'milk',
        'notes': '(whole milk preferred, but you can use 2%, at room temperature)'
    },
    ...
]
'''
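Since the question's stated goal is rescaling, here is a minimal sketch of one way the `amount` strings in records like these could be scaled. It assumes amounts look like '3', '1/2' or '1 1/4' (as in the outputs quoted in the question); `parse_amount`, `format_amount` and `rescale` are hypothetical helper names, and amounts in any other form are left untouched.

```python
from fractions import Fraction

def parse_amount(text):
    """Turn '3', '1/2' or '1 1/4' into an exact Fraction."""
    return sum((Fraction(part) for part in text.split()), Fraction(0))

def format_amount(value):
    """Render a Fraction as a whole number, a fraction, or a mixed number."""
    whole, remainder = divmod(value.numerator, value.denominator)
    if remainder == 0:
        return str(whole)
    fraction = f'{remainder}/{value.denominator}'
    return f'{whole} {fraction}' if whole else fraction

def rescale(records, factor):
    """Return copies of the records with every parseable amount multiplied."""
    scaled = []
    for record in records:
        new = dict(record)
        if record.get('amount'):
            try:
                value = parse_amount(record['amount']) * Fraction(factor)
                new['amount'] = format_amount(value)
            except (ValueError, ZeroDivisionError):
                pass  # assumption: unparseable amounts are kept as they are
        scaled.append(new)
    return scaled

doubled = rescale([{'amount': '2/3', 'unit': 'cup', 'name': 'heavy cream'}], 2)
print(doubled[0]['amount'])  # 1 1/3
```

Using `Fraction` rather than `float` keeps results such as 1 1/3 exact, which tends to read more naturally in a recipe than 1.3333.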
InSync
    What would happen if `wprm-recipe-ingredient-amount` were missing in the record? Would the other record elements populate? In the HTML, each record is contained within its `<li>` tags. I'm not an HTML expert, but if a record element is missing, what's to stop it from going outside of this record to find the next available element? – sln May 08 '23 at 17:28
    @sln I'm not sure I understand what you mean. `record`'s values default to `None`. If an element is not found, the `if element:` block won't run and the corresponding value remains `None`. `.select_one()` (and `.select()`, for that matter) only returns descendants of the node it's called on, so other `<li>`s' info wouldn't mess with the current one's. – InSync May 08 '23 at 18:21
    Not sure I understand what you said except for `other <li>s' info wouldn't mess with the current one`, which was my question. If I ever try to do this I still won't know how. In fact that didn't explain anything. – sln May 08 '23 at 18:36
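The scoping behaviour discussed in these comments can be checked with a small sketch. The markup below is made up for illustration (it is not taken from either linked recipe): the first `<li>` has no amount span and the second one does. Because `.select_one()` is called on each `<li>` and only searches that element's descendants, the missing amount stays `None` instead of picking up the next ingredient's value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <li> lacks an amount span, the second has one.
html = '''
<ul>
  <li class="wprm-recipe-ingredient">
    <span class="wprm-recipe-ingredient-name">salt</span>
  </li>
  <li class="wprm-recipe-ingredient">
    <span class="wprm-recipe-ingredient-amount">2</span>
    <span class="wprm-recipe-ingredient-name">eggs</span>
  </li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')
records = []
for li in soup.select('.wprm-recipe-ingredient'):
    # Both lookups are limited to descendants of the current <li>.
    amount = li.select_one('.wprm-recipe-ingredient-amount')
    name = li.select_one('.wprm-recipe-ingredient-name')
    records.append({
        'amount': amount.text if amount else None,
        'name': name.text if name else None,
    })

print(records)
# [{'amount': None, 'name': 'salt'}, {'amount': '2', 'name': 'eggs'}]
```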