I am writing a regex to grab the data between the double quotes ("") of an HREF attribute. The only issue I am running into is that the closing " is being captured. Regex:
import re

line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'
capture_regex = re.compile(r'(?<=HREF=").*?"', re.IGNORECASE)
m = capture_regex.search(line)
print(m.group())
This prints https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" (with the trailing quotation mark). How do I write the regex so that it does not include the closing quotation mark?
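Answered my own question. The non-greedy quantifier (the ? after *) was already in my regex; what I changed was the trailing ", which I replaced with a lookahead, (?=").
capture_regex = re.compile(r'(?<=HREF=").*?(?=")', re.IGNORECASE)
The ? after * makes the match stop at the first " rather than the last one, and the lookahead (?=") checks that the quote is there without including it in the match.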
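For comparison, here is a small sketch of two ways to leave the closing quote out of the result: the lookaround pattern from this post, and a capturing group, which is a common alternative. The shortened example.com URL and the names lookaround, grouped, url_a, and url_b are made up for illustration and are not from the original code.

```python
import re

# Hypothetical, shortened bookmark line used only for this illustration.
line = '<DT><A HREF="https://example.com/page.html" ADD_DATE="1567455957">Example</A>'

# Lookbehind/lookahead: both quotes are checked but are not part of the match.
lookaround = re.compile(r'(?<=HREF=").*?(?=")', re.IGNORECASE)
url_a = lookaround.search(line).group()

# Capturing group: the quotes are matched, but only the group is returned.
grouped = re.compile(r'HREF="([^"]*)"', re.IGNORECASE)
url_b = grouped.search(line).group(1)

print(url_a)
print(url_b)
```

Both patterns should give the same URL here. The capturing-group version uses [^"]* so the match cannot run past the first closing quote, which means it does not need a non-greedy quantifier.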