end line is not parsing correctly with re library python

Question

Consider the string:

<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059

I want to get the result Uttam Nagar East using a regex function, but the output I'm getting is

Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1'

I've tried using

print(re.findall(r'data-rlocation="(.*)["]',contents))

and

print(re.findall(r'data-rlocation="(.*)"',contents))

`.` matches everything, so even the closing quote will be matched. Try `print(re.findall(r'data-rlocation="([^"]*)"',contents))` My change: `[^"]` matches everything except quotes, so it won't match past the end of your "Nagar East" string — liamdiprose, Sep 12 '19 at 01:01

score 3 · Answer 1 · answered Sep 12 '19 at 01:02


Because * is greedy, the group (.*) runs past the attribute's closing quote and stops only at the last double quote on the line. Match anything except a quote instead:

>>> re.findall(r'data-rlocation="([^"]*)"', contents)
['Uttam Nagar East']

Check out how it works here.
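To see the difference side by side, here is a minimal sketch that runs both patterns against the string from the question. The variable names greedy and bounded are only illustrative.

```python
import re

contents = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'

# Greedy: .* consumes the rest of the line, then backtracks to the LAST quote.
greedy = re.findall(r'data-rlocation="(.*)"', contents)

# Negated class: [^"]* cannot cross a quote, so it stops at the FIRST one.
bounded = re.findall(r'data-rlocation="([^"]*)"', contents)

print(greedy)
print(bounded)
```

The first result ends at citytt1, just before the last double quote in the string, which is the unwanted output described in the question. The second contains only the attribute value.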


Zach Gates


score 1 · Answer 2 · answered Sep 12 '19 at 01:04


By default, * is greedy, which means that it tries to consume as many characters as possible. If you'd rather match as few characters as possible, you can use the non-greedy qualifier *? instead:

print(re.findall(r'data-rlocation="(.*?)"',contents))

More information: https://docs.python.org/3.5/howto/regex.html#greedy-versus-non-greedy
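The effect is easiest to see when a line holds more than one attribute. The sketch below uses a made-up two-element input (the second location value is hypothetical, not from the question) and compares the greedy and non-greedy forms.

```python
import re

# Hypothetical input with two elements on one line, made up for illustration.
html = '<p data-rlocation="Uttam Nagar East">a</p><p data-rlocation="Janakpuri">b</p>'

# Greedy: one match that swallows everything up to the last quote on the line.
greedy_matches = re.findall(r'data-rlocation="(.*)"', html)

# Non-greedy: each match stops at the nearest closing quote.
lazy_matches = re.findall(r'data-rlocation="(.*?)"', html)

print(greedy_matches)
print(lazy_matches)
```

The greedy pattern returns a single long string spanning both elements, while the non-greedy pattern returns the two values separately.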


mackorone


That being said, you probably shouldn't be using regex to parse HTML. See https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – mackorone Sep 12 '19 at 01:05
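As one possible way to follow that advice without installing anything, here is a rough sketch using html.parser from the standard library. The class name RLocationCollector and its locations attribute are invented for this example.

```python
from html.parser import HTMLParser


class RLocationCollector(HTMLParser):
    """Collects the value of every data-rlocation attribute it sees."""

    def __init__(self):
        super().__init__()
        self.locations = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lower-cases names.
        for name, value in attrs:
            if name == 'data-rlocation':
                self.locations.append(value)


contents = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'

collector = RLocationCollector()
collector.feed(contents)
print(collector.locations)
```

Because the parser splits tags and attributes itself, quoting and greediness stop being a concern.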

score 1 · Answer 3 · answered Sep 12 '19 at 01:05

You are using a greedy regex. Add '?' after the '*' to make it non-greedy:

import re
contents = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'
print(re.findall(r'data-rlocation="(.*?)"',contents))

Aleksandar · Answer 4 · 2019-09-12T03:59:16.340

A positive lookbehind and positive lookahead with a lazy match will do the trick.

Pattern: (?<=data-rlocation=").*?(?=")

Code: print(re.findall(r'(?<=data-rlocation=").*?(?=")',contents))

Demo on regex101

Explanation

(?<= open a positive lookbehind. It is not part of the returned match; it only asserts that this pattern appears immediately before the match.
- data-rlocation=" the literal text that must precede the match
) close the positive lookbehind
.* match any characters (except a newline) that make up the string we want to return
? make the * lazy (not greedy), so it stops at the first position where the lookahead succeeds
(?= open a positive lookahead to check for the closing pattern without returning it
- " the next double quote
) close the positive lookahead
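The pieces above can be checked with a short sketch that compares the lookaround pattern with the capture-group pattern from the other answers. The names lookaround, grouped, m1 and m2 are only illustrative.

```python
import re

contents = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'

lookaround = re.compile(r'(?<=data-rlocation=").*?(?=")')
grouped = re.compile(r'data-rlocation="(.*?)"')

m1 = lookaround.search(contents)
m2 = grouped.search(contents)

# With lookarounds the surrounding text is asserted but not consumed,
# so the whole match is already the attribute value.
print(m1.group(0))

# With a capture group the whole match includes the attribute name and quotes;
# the value itself is in group 1.
print(m2.group(0))
print(m2.group(1))
```

Both approaches give the same value. The lookaround form is handy when the whole match must be the value, and the capture group is usually easier to read.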

Emma · Answer 5 · 2019-09-12T04:07:26.983

find_all from bs4 (Beautiful Soup) should return the desired output:

from bs4 import BeautifulSoup

line = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'
soup = BeautifulSoup(line, 'html.parser')

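# Select only <p> tags that actually carry the attribute, to avoid a KeyError.
for tag in soup.find_all('p', attrs={'data-rlocation': True}):
    print(tag['data-rlocation'])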

Output

Uttam Nagar East

Alternatively, this pattern with re.findall is another option:

(?i)data-rlocation="([^\r\n"]*)"

import re

expression = r'(?i)data-rlocation="([^\r\n"]*)"'

string = """
<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059
"""

print(re.findall(expression, string))

Output

['Uttam Nagar East']
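To illustrate what the two extra pieces of that expression do, here is a small sketch with two made-up sample strings (both hypothetical, not from the question): (?i) makes the attribute name case-insensitive, and excluding \r and \n from the class keeps a match from spanning a line break.

```python
import re

expression = r'(?i)data-rlocation="([^\r\n"]*)"'

# Hypothetical inputs, made up for illustration.
samples = [
    '<p DATA-RLOCATION="Uttam Nagar East">',  # upper-case attribute name
    '<p data-rlocation="Uttam\nNagar East">',  # value broken by a newline
]

results = [re.findall(expression, sample) for sample in samples]
for result in results:
    print(result)
```

The first sample still matches because of (?i). The second yields an empty list because the class refuses to cross the newline before reaching the closing quote.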

To explore, simplify, or modify the expression, see the explanation in the top-right panel of regex101.com, where you can also watch how it matches against sample inputs.
