
I have a big chunk of text that I'm checking for a specific pattern, which looks essentially like this:

     unique_options_search = new Set([
            "updates_EO_LTB",
            "us_history",
            "uslegacy",

etc., etc., etc.

        ]);

      $input.typeahead({
        source: [...unique_options_search],
        autoSelect: false,
        afterSelect: function(value) 

My text variable is named 'html_page' and my start and end points look like this:

start = "new Set(["
end = "]);"

I thought I could find what I want with this one-liner:

r = re.findall("start(.+?)end",html_page,re.MULTILINE)

However, it's not returning anything at all. What is wrong here? I saw other examples online that worked fine.

ASH

1 Answer


There are multiple problems here.

  1. As mentioned by @EthanK in the comments, `"start(.+?)end"` in Python is just a string. As a regex it matches the literal text `start`, then some characters, then the literal text `end`; the variables `start` and `end` play no part in it. You probably meant to write `start + "(.+?)" + end` instead.
  2. By default, `.` does not match newlines. `re.MULTILINE` does not change that; it only changes the behavior of `^` and `$` (see docs). Use `re.DOTALL` instead (see docs).
  3. The values of `start` and `end` contain characters with special meaning in regex (e.g. `(` and `[`). You have to make sure they are not treated specially. You can either escape them manually with the right number of `\` or simply delegate that work to `re.escape`, which returns a regular expression that literally matches the given string.
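Each of the three points can be checked in isolation. The snippet below is only a minimal sketch: the sample `text` and the result variable names are made up for illustration and do not come from the question.

```python
import re

# Hypothetical, shortened stand-in for html_page.
text = 'new Set([\n  "a",\n  "b",\n]);'
start = "new Set(["
end = "]);"

# Point 1: the words "start" and "end" inside the string are literal text.
# They do not occur in `text`, so nothing is found.
literal_names = re.findall("start(.+?)end", text, re.DOTALL)

# Point 2: with re.MULTILINE the dot still stops at newlines, so there is no match.
# With re.DOTALL the dot also matches newlines.
escaped = re.escape(start) + "(.+?)" + re.escape(end)
with_multiline = re.findall(escaped, text, re.MULTILINE)
with_dotall = re.findall(escaped, text, re.DOTALL)

# Point 3: without escaping, "(" and "[" are parsed as regex syntax.
# This particular pattern still compiles, but it describes something else
# (a group containing a character class), so it does not find the block.
unescaped = re.findall(start + "(.+?)" + end, text, re.DOTALL)

print(literal_names)   # []
print(with_multiline)  # []
print(with_dotall)     # ['\n  "a",\n  "b",\n']
print(unescaped)       # []
```

Note that the unescaped pattern fails silently here rather than raising an error, which is one reason `re.escape` is the safer habit.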

Combining all that together:

import re
html_page = """
     unique_options_search = new Set([
            "oecd_updates_EO_LTB",
            "us_history",
            "us_legacy",

etc., etc., etc.

        ]);

      $input.typeahead({
        source: [...unique_options_search],
        autoSelect: false,
        afterSelect: function(value) 
"""

start = "new Set(["
end = "]);"
# r = re.findall("start(.+?)end",html_page,re.MULTILINE)  # Old version
r = re.findall(re.escape(start) + "(.+?)" + re.escape(end), html_page, re.DOTALL)  # New version
print(r)
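If the goal is the individual option names rather than the raw block, a second pass over the captured text could pull them out. This is a sketch under assumptions, not part of the original answer: `captured` is a hypothetical stand-in for `r[0]`, and the names are assumed to contain no escaped double quotes.

```python
import re

# Hypothetical captured block, shaped like r[0] from the code above.
captured = '\n            "oecd_updates_EO_LTB",\n            "us_history",\n            "us_legacy",\n'

# Collect everything between pairs of double quotes.
# Assumption: option names never contain an escaped quote (\").
options = re.findall(r'"([^"]*)"', captured)
print(options)  # ['oecd_updates_EO_LTB', 'us_history', 'us_legacy']
```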
yeputons
  • For number 3: you can use a raw string. Begin the string like this for the regexp: `r'myMatchThing...'` with a `r''` to escape. Raw strings are better to use because they are built into Python. – Eb946207 Dec 20 '18 at 21:48
  • @EthanK raw strings do not help with no. 3. Parsing of the regular expression happens inside the `re` library, and `r"hello(["` is exactly the same as `"hello(["` (`r` only changes the meaning of stuff like `\n`, which is processed by the Python parser). See [example](https://ideone.com/BvdhoV) – yeputons Dec 21 '18 at 08:26
  • Oh, right. Sorry. I got confused because I always add a raw string to my regexps. I didn’t see one so I thought one was needed. – Eb946207 Dec 21 '18 at 15:02