Greedy regex lookbehind

Question

I am writing a regex to grab the data between a pair of double quotes. The only issue I am running into is that the closing " is being captured. Regex:

  import re

  line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'
  capture_regex = re.compile(r'(?<=HREF=").*?"', re.IGNORECASE)
  m = capture_regex.search(line)

m.group() prints https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html". How do I write the regex so that it does not include the last quotation mark?

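Answered my own question. I replaced the trailing " in my regex with a lookahead: capture_regex = re.compile(r'(?<=HREF=").*?(?=")',re.IGNORECASE). The ? after * is what is called non-greedy and makes the match stop at the first ", while the lookahead (?=") checks for that quote without including it in the match.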
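A minimal sketch of the difference between the two patterns. The shortened example.com URL and the variable names are placeholders for illustration, not the original bookmark data:

```python
import re

# Placeholder line with the same shape as the bookmark entry in the question.
line = '<A HREF="https://example.com/page.html" ADD_DATE="1567455957">Example</A>'

# Original pattern: the closing quote is consumed, so it ends up in the match.
consuming = re.compile(r'(?<=HREF=").*?"', re.IGNORECASE)

# Revised pattern: the lookahead only asserts that a quote follows.
asserting = re.compile(r'(?<=HREF=").*?(?=")', re.IGNORECASE)

with_quote = consuming.search(line).group()
without_quote = asserting.search(line).group()

print(with_quote)     # https://example.com/page.html"
print(without_quote)  # https://example.com/page.html
```

Lookarounds are zero-width, so neither the HREF=" prefix nor the closing quote becomes part of m.group().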

You should avoid using regex to parse HTML files. `bs4` should be used instead. – DYZ, Sep 10 '19 at 04:22
`(?=")` looks for the closing `"`. `bs4` would work, but I am trying to improve my regex skills. – newdeveloper, Sep 10 '19 at 08:49

Emma · Answer 1 · 2019-09-10T04:18:47.780

Maybe find_all from bs4 would work:

from bs4 import BeautifulSoup

line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'
soup = BeautifulSoup(line, 'html.parser')

for l in soup.find_all('a', href=True):
    print(l['href'])

Output

https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html

If not, maybe an expression similar to

(?i)href="\s*([^\s"]*?)\s*"

with re.findall might work here:

import re

expression = r'(?i)href="\s*([^\s"]*?)\s*"'

string = """
<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>
<DT><A HREF=" https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html " ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>
"""

print(re.findall(expression, string))

Output

['https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html', 'https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html']

If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against some sample inputs.
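For readers who prefer to see the pieces inline, the same expression could be written with re.VERBOSE, roughly as sketched below. The name href_pattern and the sample string are made up for illustration:

```python
import re

# Same expression as above, split into commented parts.
# re.VERBOSE ignores whitespace and '#' comments inside the pattern.
href_pattern = re.compile(
    r'''
    href="      # attribute name and opening quote
    \s*         # optional whitespace right after the opening quote
    ([^\s"]*?)  # group 1: characters that are neither whitespace nor a quote
    \s*         # optional whitespace before the closing quote
    "           # closing quote
    ''',
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical input with padding spaces inside the quotes.
sample = '<A HREF=" https://example.com/a.html " ADD_DATE="1">x</A>'
matches = href_pattern.findall(sample)
print(matches)  # ['https://example.com/a.html']
```

Because the pattern has exactly one capturing group, re.findall returns the group contents rather than the whole match, which is why the surrounding quotes and spaces are dropped.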

Aleksandar · Accepted Answer · 2019-09-10T13:14:38.933


capture_regex = re.compile(r'(?<=HREF=").*?(?=")',re.IGNORECASE)


Edit: Adjusted the regex as it was too greedy. Thanks to @newdeveloper for pointing it out!
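To illustrate what "too greedy" means here, a small sketch follows. It assumes the earlier version of the pattern used a plain .* between the lookarounds (the thread does not show it), and it uses a shortened placeholder URL:

```python
import re

# Placeholder line with two quoted attributes, like the one in the question.
line = '<A HREF="https://example.com/page.html" ADD_DATE="1567455957">Example</A>'

# Greedy: .* grabs as much as it can, then backtracks to the LAST quote.
greedy = re.search(r'(?<=HREF=").*(?=")', line, re.IGNORECASE).group()

# Non-greedy: .*? grows one character at a time and stops at the FIRST quote.
lazy = re.search(r'(?<=HREF=").*?(?=")', line, re.IGNORECASE).group()

print(greedy)  # https://example.com/page.html" ADD_DATE="1567455957
print(lazy)    # https://example.com/page.html
```

The greedy result has the same shape as the over-capture reported in the comments: it runs through to the quote that closes ADD_DATE.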

edited Sep 10 '19 at 13:14

answered Sep 10 '19 at 01:26



This is capturing more data than needed. `https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957` – newdeveloper Sep 10 '19 at 08:39
@newdeveloper: How stupid I missed that. It is corrected now... – Aleksandar Sep 10 '19 at 13:15
Yeah I figured it out thanks. – newdeveloper Sep 10 '19 at 19:43

Charif DZ · Answer 3 · 2019-09-10T08:25:40.607

This will work:

import re

line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'

capture_regex = re.compile(r'(?<=HREF=")([^"]*)(?:")',re.IGNORECASE)
# capture_regex = re.compile(r'(?:HREF=")([^"]*)(?:")',re.IGNORECASE) this will work too
print(capture_regex.search(line).groups())
# print(capture_regex.findall(line))  # if your text contains more than one HREF

Output:

  ('https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html',)
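The calls used in the snippet return different shapes, which is easy to trip over. A short sketch, assuming a made-up two-link input (the example.com URLs are placeholders):

```python
import re

# Hypothetical text containing more than one HREF.
text = (
    '<A HREF="https://example.com/one.html">One</A>\n'
    '<A HREF="https://example.com/two.html">Two</A>'
)

# Same pattern as in the answer: [^"]* cannot run past a quote,
# so no non-greedy modifier is needed.
capture_regex = re.compile(r'(?<=HREF=")([^"]*)(?:")', re.IGNORECASE)

first = capture_regex.search(text)
first_url = first.group(1)             # str: the first captured URL
first_groups = first.groups()          # tuple with one element
all_urls = capture_regex.findall(text) # list with every captured URL

print(first_url)     # https://example.com/one.html
print(first_groups)  # ('https://example.com/one.html',)
print(all_urls)      # ['https://example.com/one.html', 'https://example.com/two.html']
```

search(...).groups() gives a tuple for the first match only, while findall gives a list covering every match in the text.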
