1

I am writing a regex to grab the data between double quotes (""). The only issue I am running into is that the last " is being captured. Regex:

  import re

  line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'
  capture_regex = re.compile(r'(?<=HREF=").*?"', re.IGNORECASE)
  m = capture_regex.search(line)

m.group() prints https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html". How can I write the regex so that it does not include the last quotation mark?

Answered my own question. I changed the regex to capture_regex = re.compile(r'(?<=HREF=").*?(?=")', re.IGNORECASE). The ? after * makes the match non-greedy, so it stops at the first ", and the lookahead (?=") checks for that closing quote without including it in the match.
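The difference between consuming the closing quote and only asserting that it follows can be sketched as below. This is a minimal illustration using the sample line from the question; the variable names (consuming, asserting, with_quote, without_quote) are made up for the example.

```python
import re

# Sample input taken from the question.
line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'

# The trailing " is part of the pattern, so it becomes part of the match.
consuming = re.compile(r'(?<=HREF=").*?"', re.IGNORECASE)

# The lookahead (?=") only checks that a " follows; it is not consumed.
asserting = re.compile(r'(?<=HREF=").*?(?=")', re.IGNORECASE)

with_quote = consuming.search(line).group()
without_quote = asserting.search(line).group()

print(with_quote.endswith('"'))     # True
print(without_quote.endswith('"'))  # False
```

Both patterns are non-greedy, so both stop at the first closing quote; only the lookahead version leaves that quote out of the result.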

newdeveloper

3 Answers

2

Maybe find_all from bs4 (BeautifulSoup) would work here:

from bs4 import BeautifulSoup

line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'
soup = BeautifulSoup(line, 'html.parser')

for l in soup.find_all('a', href=True):
    print(l['href'])

Output

https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html

If not, an expression similar to

(?i)href="\s*([^\s"]*?)\s*"

with re.findall might work here:

import re

expression = r'(?i)href="\s*([^\s"]*?)\s*"'

string = """
<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>
<DT><A HREF=" https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html " ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>
"""

print(re.findall(expression, string))

Output

['https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html', 'https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html']

If you wish to explore, simplify, or modify the expression, regex101.com explains it in its top-right panel. There you can also see how it matches against some sample inputs.


Emma
1
capture_regex = re.compile(r'(?<=HREF=").*?(?=")',re.IGNORECASE)

working fiddle

Edit: Adjusted the regex as it was too greedy. Thanks to @newdeveloper for pointing it out!
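To illustrate what "too greedy" means here, the following sketch compares a greedy .* with a non-greedy .*? between the same lookbehind and lookahead. The shortened sample line and the example.com URL are hypothetical stand-ins, not taken from the original data.

```python
import re

# Hypothetical, shortened sample with two quoted attributes on one line.
line = '<A HREF="https://example.com/page.html" ADD_DATE="1567455957">'

# Greedy: .* runs as far as it can, then backs off to the LAST " on the line.
greedy = re.search(r'(?<=HREF=").*(?=")', line, re.IGNORECASE).group()

# Non-greedy: .*? grows one character at a time and stops at the FIRST ".
lazy = re.search(r'(?<=HREF=").*?(?=")', line, re.IGNORECASE).group()

print(greedy)  # https://example.com/page.html" ADD_DATE="1567455957
print(lazy)    # https://example.com/page.html
```

A negated character class such as [^"]* would avoid the problem in a different way, since it can never run past a quote.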

Aleksandar
1

This will work:

import re

line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'

capture_regex = re.compile(r'(?<=HREF=")([^"]*)(?:")',re.IGNORECASE)
# capture_regex = re.compile(r'(?:HREF=")([^"]*)(?:")',re.IGNORECASE) this will work too
print(capture_regex.search(line).groups())
# print(capture_regex.findall(line))  # if your text contains more than one HREF

Output:

  ('https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html',)
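As a rough sketch of how search(...).groups() and findall(...) differ when the pattern has one capturing group: the first returns a tuple for the first match only, the second returns a list with one string per match. The two-link sample text and its example.com URLs are hypothetical.

```python
import re

# Hypothetical input containing two links.
text = (
    '<DT><A HREF="https://example.com/first.html" ADD_DATE="1">First</A>\n'
    '<DT><A HREF="https://example.com/second.html" ADD_DATE="2">Second</A>\n'
)

# One capturing group: everything between the quotes after HREF=.
pattern = re.compile(r'HREF="([^"]*)"', re.IGNORECASE)

first_groups = pattern.search(text).groups()  # tuple of groups, first match only
first_url = pattern.search(text).group(1)     # the captured string itself
all_urls = pattern.findall(text)              # list of captured strings, every match

print(first_groups)  # ('https://example.com/first.html',)
print(first_url)     # https://example.com/first.html
print(all_urls)      # ['https://example.com/first.html', 'https://example.com/second.html']
```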
Charif DZ