0

Consider the string:

<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059

I want to get the result Uttam Nagar East using a regex function, but the output I'm getting is

Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1

I've tried using

print(re.findall(r'data-rlocation="(.*)["]',contents))

and

print(re.findall(r'data-rlocation="(.*)"',contents))
Zach Gates
saurabh
  • `.` matches any character except a newline, so `.*` runs straight past the closing quote. Try `print(re.findall(r'data-rlocation="([^"]*)"',contents))`. My change: `[^"]` matches any character except a double quote, so the match cannot extend past the end of your "Uttam Nagar East" string – liamdiprose Sep 12 '19 at 01:01

5 Answers

3

The group (.*) is greedy, so it runs past the attribute's closing quote and stops only at the last double quote on the line. Restrict the group to non-quote characters instead:

>>> re.findall(r'data-rlocation="([^"]*)"', contents)
['Uttam Nagar East']

Check out how it works here.

Zach Gates
1

By default, * is greedy, which means that it tries to consume as many characters as possible. If you'd rather match as few characters as possible, you can use the non-greedy qualifier *? instead:

print(re.findall(r'data-rlocation="(.*?)"',contents))

More information: https://docs.python.org/3.5/howto/regex.html#greedy-versus-non-greedy
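To make the difference concrete, here is a minimal sketch that runs both quantifiers against a shortened version of the question's markup. The variable names `html`, `greedy`, and `lazy` are illustrative only and do not come from the question.

```python
import re

# Shortened stand-in for the question's markup: two quoted attributes on one line.
html = '<p data-rlocation="Uttam Nagar East" id="citytt1">Delhi</p>'

# Greedy: .* grabs as much as it can, then backtracks to the LAST double quote.
greedy = re.findall(r'data-rlocation="(.*)"', html)

# Lazy: .*? grows one character at a time and stops at the FIRST double quote.
lazy = re.findall(r'data-rlocation="(.*?)"', html)

print(greedy)  # ['Uttam Nagar East" id="citytt1']
print(lazy)    # ['Uttam Nagar East']
```

The greedy result shows the same symptom as in the question: everything up to the final quote on the line ends up inside the group.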

mackorone
  • That being said, you probably shouldn't be using regex to parse HTML. See https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – mackorone Sep 12 '19 at 01:05
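As a rough illustration of the parser-based route, the sketch below uses only the standard library's `html.parser` module. The class name `RLocationCollector` and the trimmed `contents` string are made up for this example and are not part of the question.

```python
from html.parser import HTMLParser


class RLocationCollector(HTMLParser):
    """Collects the value of every data-rlocation attribute it encounters."""

    def __init__(self):
        super().__init__()
        self.locations = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        for name, value in attrs:
            if name == "data-rlocation":
                self.locations.append(value)


# Trimmed version of the question's markup (assumption: same attribute layout).
contents = ('<p class="sm clg" data-rlocation="Uttam Nagar East">'
            'Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24</span></p>')

collector = RLocationCollector()
collector.feed(contents)
collector.close()
print(collector.locations)  # ['Uttam Nagar East']
```

Because the parser tokenizes attributes itself, quoting style and attribute order stop mattering, which is the main advantage over a pattern match.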
1

You are using a greedy regex. Add '?' after the '*' to make it non-greedy:

import re
contents = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'
print(re.findall(r'data-rlocation="(.*?)"',contents))
Dev Khadka
1

A positive lookbehind and positive lookahead with a lazy match will do the trick.

Pattern: (?<=data-rlocation=").*?(?=")

Code: print(re.findall(r'(?<=data-rlocation=").*?(?=")',contents))

Demo on regex101

Explanation

  • (?<= open a positive lookbehind. Its text is not included in the result; it only asserts that this pattern appears immediately before the match.
    • data-rlocation=" this is the literal text that must precede the match
  • ) close the positive lookbehind
  • .* match any character (except a newline) of the string we want to return
  • ? make the * lazy (not greedy)
  • (?= open a positive lookahead to assert the closing pattern without including it in the result
    • " match the next double quote
  • ) close the positive lookahead
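A short sketch may help show what "not returned" means in practice: it compares the lookaround pattern with a plain capture group on a trimmed copy of the question's string. The names `lookaround`, `captured`, `m1`, and `m2` are illustrative assumptions.

```python
import re

# Trimmed copy of the question's markup.
html = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi</p>'

# Lookbehind/lookahead: the surrounding text is asserted but not consumed.
lookaround = re.compile(r'(?<=data-rlocation=").*?(?=")')

# Capture group: the surrounding text is consumed; the value sits in group 1.
captured = re.compile(r'data-rlocation="(.*?)"')

m1 = lookaround.search(html)
m2 = captured.search(html)

print(m1.group(0))  # Uttam Nagar East
print(m2.group(0))  # data-rlocation="Uttam Nagar East"
print(m2.group(1))  # Uttam Nagar East
```

With `re.findall` both patterns give the same list, since `findall` returns the group when there is exactly one. Note that Python's `re` module only accepts fixed-width lookbehinds, which this literal prefix satisfies.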
Aleksandar
0

Since the input is HTML, find_all from bs4 (BeautifulSoup) can read the attribute directly:

from bs4 import BeautifulSoup

line = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'
soup = BeautifulSoup(line, 'html.parser')

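# Only visit <p> tags that actually carry the attribute, so the lookup cannot raise KeyError.
for tag in soup.find_all('p', attrs={'data-rlocation': True}):
    print(tag['data-rlocation'])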

Output

Uttam Nagar East

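If a regex is preferred, the pattern

(?i)data-rlocation="([^\r\n"]*)"

used with re.findall is another option. The character class excludes double quotes and line breaks, so the match stays inside a single attribute value.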

import re

expression = r'(?i)data-rlocation="([^\r\n"]*)"'

string = """
<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059
"""

print(re.findall(expression, string))

Output

['Uttam Nagar East']

If you wish to explore, simplify, or modify the expression, regex101.com explains each part of it in its top right panel and shows how it matches against sample inputs.

Emma