
I'm trying to parse HTML in Python that has an inline script in it. I need to find a string inside of the script, then extract the value. I've been trying to do this in regex for the past few hours, but I'm still not convinced this is the correct approach.

Here is a sample:

['key_to_search_for']['post_date'] = '10 days ago';

The result I want to extract is: 10 days ago

This regex gets me part of the way, but I can't figure out the full match:

^\[\'key_to_search_for\'\]\[\'post_date\'\] = '(\d{1,2})+( \w)

However, even once the regex matches, I'm not sure of the best way to get only the value. I was thinking of just replacing the key prefix with an empty string, like .replace("['key_to_search_for']['post_date'] = '", ''), but that seems inefficient.

Should I be matching the regex then replacing? Is there a better way to handle this?

mikebmassey
  • Parsing HTML with regex is wrong, but please show more context, including the surrounding HTML. As posted, it looks like just a regular string, and that could be "regexed". – RomanPerekhrest Aug 18 '19 at 16:04
  • You can extract the value using a single capturing group `^\['key_to_search_for'\]\['post_date'\] = '(\d{1,2} \w+ \w+)';$` See https://regex101.com/r/ee60zU/1 – The fourth bird Aug 18 '19 at 16:04
  • @RomanPerekhrest I'm using beautiful soup to parse the HTML, but beautiful soup doesn't handle inline scripts. https://stackoverflow.com/questions/38547569/how-to-use-beautiful-soup-to-extract-string-in-script-tag – mikebmassey Aug 18 '19 at 16:05

1 Answer

You can extract the value using a single capturing group, matching the digits and then each of the two words with \w+.

The value is in capture group 1.

^\['key_to_search_for'\]\['post_date'\] = '(\d{1,2} \w+ \w+)';$
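
As a minimal sketch of how this pattern might be used from Python, the snippet below reads capture group 1 directly, so no follow-up .replace() call is needed. The variable names (line, pattern, post_date) are illustrative and not from the question.

```python
import re

# Sample line from the question.
line = "['key_to_search_for']['post_date'] = '10 days ago';"

# Same pattern as above; a raw string keeps the backslashes literal.
pattern = re.compile(
    r"^\['key_to_search_for'\]\['post_date'\] = '(\d{1,2} \w+ \w+)';$"
)

match = pattern.search(line)
# group(1) holds only the text inside the parentheses.
post_date = match.group(1) if match else None
print(post_date)  # 10 days ago
```

Checking for None before calling group(1) avoids an AttributeError when the line is missing from the script.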

Or use a negated character class that matches any character except a single quote, which captures the value whatever its format:

^\['key_to_search_for'\]\['post_date'\] = '([^']+)';$
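
If the script text spans several lines, one possible way to apply the negated-class pattern is sketched below. The helper name extract_value, its parameters, and the sample script_text are hypothetical. The re.MULTILINE flag and the leading \s* (to tolerate indentation) are assumptions about how the script is laid out, not something the question confirms.

```python
import re

def extract_value(script_text, key, field):
    """Return the quoted value assigned to [key][field], or None if absent."""
    pattern = re.compile(
        # re.escape guards against regex metacharacters in the key names.
        r"^\s*\['" + re.escape(key) + r"'\]\['" + re.escape(field)
        + r"'\] = '([^']+)';$",
        re.MULTILINE,  # let ^ and $ match at every line of the script
    )
    match = pattern.search(script_text)
    return match.group(1) if match else None

# Made-up script body for illustration only.
script_text = """
['other_key']['post_date'] = '3 weeks ago';
['key_to_search_for']['post_date'] = '10 days ago';
['key_to_search_for']['title'] = 'Example';
"""

print(extract_value(script_text, "key_to_search_for", "post_date"))  # 10 days ago
```

The script_text argument would be whatever string the HTML parser returns for the script tag's contents.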

The fourth bird