Regex not returning full result

Question

I have a regex that tries to match an href value in HTML. The href sits inside a function within a script tag. I think the regex is correct, but the result is incomplete: it is cut off partway through.

I have tried the regex on multiple Python regex testing sites and they all give the correct result, but in my own script the result is truncated.

def gotoDownload(link):
    try:
        with requests.Session().get(link) as download:
            if isUrlOnline(download):
                soup = BeautifulSoup(download.content, 'html.parser')
                filtered = soup.find_all('script')
                print(re.search(r"\'http[\s=[\s\"\']*(.*?)[\"\']*.*?\'", filtered[17].text))

The expected result of a link should be: 'http://mediafile.cloud/b34b4f6720a31f73?pt=UkhBMmVHczFaRXA2Uld4ek1qYzVWME5DYzNodVFUMDlPampsTkQ5aFNpVWxQamVlZ3REQkpEdz0%3D'

But the output is: match="'http://mediafile.cloud/b34b4f6720a31f73?pt=UkhBM

It is cut off after =UkhBM for some reason.

Have you tested to make sure `filtered[17].text` is returning the correct text? — Matthew, Jun 03 '19 at 20:16
This is close to a typo: use `re.search(pattern, string).group()` or `.group(1)`, depending on which value you need, the whole match or Group 1 (if you defined it). — Wiktor Stribiżew, Jun 03 '19 at 20:50
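
To illustrate the comment above: printing a match object shows its repr, which abbreviates the matched text (CPython appears to cap it at about 50 characters, which is consistent with the output in the question), whereas `.group()` returns the whole string. This is only a minimal sketch; `script_text` is a hypothetical stand-in for `filtered[17].text`, whose real content is not shown in the question.

```python
import re

# Hypothetical stand-in for filtered[17].text; the real script body is not shown.
script_text = (
    "window.location.href = "
    "'http://mediafile.cloud/b34b4f6720a31f73?pt=UkhBMmVHczFaRXA2Uld4ek1qYzVWME5DYzNodVFUMDlPampsTkQ5aFNpVWxQamVlZ3REQkpEdz0%3D';"
)

match = re.search(r"'(http.*?)'", script_text)

shown = repr(match)     # what print(match) displays; the match= part is abbreviated
whole = match.group()   # the full matched text, including the surrounding quotes
url = match.group(1)    # only the captured URL, without the quotes

print(shown)
print(whole)
print(url)
```

The regex matched the full URL all along; only the printed summary of the match object is shortened.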

score 0 · Answer 1 · answered Jun 03 '19 at 20:17

If we just want to capture any single-quoted URL that starts with 'http', we can start with a simple expression such as:

('http.*?')

Test

import re

regex = r"('http.*?')"

test_str = ("'http://mediafile.cloud/b34b4f6720a31f73?pt=UkhBMmVHczFaRXA2Uld4ek1qYzVWME5DYzNodVFUMDlPampsTkQ5aFNpVWxQamVlZ3REQkpEdz0%3D'\n"
    "'https://mediafile.cloud/b34b4f6720a31f73?pt=UkhBMmVHczFaRXA2Uld4ek1qYzVWME5DYzNodVFUMDlPampsTkQ5aFNpVWxQamVlZ3REQkpEdz0%3D'\n"
    "'http://www.mediafile.cloud/b34b4f6720a31f73?pt=UkhBMmVHczFaRXA2Uld4ek1qYzVWME5DYzNodVFUMDlPampsTkQ5aFNpVWxQamVlZ3REQkpEdz0%3D'\n"
    "'https://www.mediafile.cloud/b34b4f6720a31f73?pt=UkhBMmVHczFaRXA2Uld4ek1qYzVWME5DYzNodVFUMDlPampsTkQ5aFNpVWxQamVlZ3REQkpEdz0%3D'")

# re.MULTILINE is not needed: the pattern uses neither ^ nor $.
for match_num, match in enumerate(re.finditer(regex, test_str), start=1):
    # match.group() returns the full matched text, not the abbreviated repr.
    print(f"Match {match_num} was found at {match.start()}-{match.end()}: {match.group()}")

    # Capture groups are numbered from 1; group 0 is the whole match.
    for group_num in range(1, len(match.groups()) + 1):
        print(f"Group {group_num} found at {match.start(group_num)}-{match.end(group_num)}: {match.group(group_num)}")

I went the complicated way for the regex, but the return is still match="'http://mediafile.cloud/b34b4f6720a31f73?pt=VEZCa instead of the full length when run in my code :/ — Steb, Jun 03 '19 at 20:24
Yeah, I updated it: https://regex101.com/r/RSJ5P9/2. It shows correctly there, but not in my own code. — Steb, Jun 03 '19 at 20:31

score 0 · Answer 2 · answered Jun 03 '19 at 20:49

For some reason, changing

re.match(r"('http.*?')", filtered[17].text)

to

re.findall(r"('http.*?')", filtered[17].text)

works :-O
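
A likely explanation, sketched here under assumptions: `re.findall` returns a plain list of strings, which prints in full, while `re.match` and `re.search` return a match object whose printed form is abbreviated. Note also that `re.match` only matches at the very start of the string, so it can return `None` where `re.search` finds a result. The `text` value below is a made-up sample, not the real page content.

```python
import re

pattern = r"('http.*?')"

# Made-up sample; the actual content of filtered[17].text is not shown.
text = "location.href = 'http://example.com/file?pt=abc';"

# re.match anchors at position 0, so it finds nothing in this sample.
at_start = re.match(pattern, text)

# re.findall returns a list of strings (the contents of the single capture group).
found = re.findall(pattern, text)

# re.search returns a match object; .group(1) extracts the same string.
first = re.search(pattern, text).group(1)

print(at_start)
print(found)
print(first)
```

If only the first URL is needed, `re.search(...).group(1)` avoids building a list; `re.findall` is the simpler choice when several URLs may appear.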

Steb

