I have a regex that tries to match an href value in HTML. The href sits inside a script tag, within a function. I think the regex is correct, but the result is incomplete: it is cut in half.

I have tried the regex on multiple Python regex testing sites and they all give the correct result, but when I run it in my own script, the result is truncated.

def gotoDownload(link):
    try:
        with requests.Session().get(link) as download:
            if isUrlOnline(download):
                soup = BeautifulSoup(download.content, 'html.parser')
                filtered = soup.find_all('script')
                print(re.search(r"\'http[\s=[\s\"\']*(.*?)[\"\']*.*?\'", filtered[17].text))

The expected result of a link should be: 'http://mediafile.cloud/b34b4f6720a31f73?pt=UkhBMmVHczFaRXA2Uld4ek1qYzVWME5DYzNodVFUMDlPampsTkQ5aFNpVWxQamVlZ3REQkpEdz0%3D'

But the output is: match="'http://mediafile.cloud/b34b4f6720a31f73?pt=UkhBM

It is cut in half and ends after =UkhBM for some reason.

Steb

2 Answers

If we just wish to match any quoted URL that starts with http, we can start with a simple expression such as:

('http.*?')

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"('http.*?')"

test_str = ("'http://mediafile.cloud/b34b4f6720a31f73?pt=UkhBMmVHczFaRXA2Uld4ek1qYzVWME5DYzNodVFUMDlPampsTkQ5aFNpVWxQamVlZ3REQkpEdz0%3D'\n"
    "'https://mediafile.cloud/b34b4f6720a31f73?pt=UkhBMmVHczFaRXA2Uld4ek1qYzVWME5DYzNodVFUMDlPampsTkQ5aFNpVWxQamVlZ3REQkpEdz0%3D'\n"
    "'http://www.mediafile.cloud/b34b4f6720a31f73?pt=UkhBMmVHczFaRXA2Uld4ek1qYzVWME5DYzNodVFUMDlPampsTkQ5aFNpVWxQamVlZ3REQkpEdz0%3D'\n"
    "'https://www.mediafile.cloud/b34b4f6720a31f73?pt=UkhBMmVHczFaRXA2Uld4ek1qYzVWME5DYzNodVFUMDlPampsTkQ5aFNpVWxQamVlZ3REQkpEdz0%3D'")

matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    # group() with no argument returns the whole matched text
    print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))

    # Capture groups are numbered from 1
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(group_num, match.start(group_num), match.end(group_num), match.group(group_num)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Emma
  • I went the complicated way with the regex, but the return is still match="'http://mediafile.cloud/b34b4f6720a31f73?pt=VEZCa instead of the full length when run in my code :/ – Steb Jun 03 '19 at 20:24
  • Yeah, I updated it https://regex101.com/r/RSJ5P9/2 and it shows correctly there, but not in my own code – Steb Jun 03 '19 at 20:31

For some reason changing

re.match(r"('http.*?')", filtered[17].text)

to

re.findall(r"('http.*?')", filtered[17].text)

works :-O
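
A likely explanation, offered as a sketch rather than a confirmed diagnosis: printing a match object shows its repr, and CPython shortens the match= preview in that repr to 50 characters. The match itself is complete, and .group() returns it in full. re.findall returns plain strings instead of match objects, which would explain why it prints the whole link. The script_text value below is a hypothetical stand-in for filtered[17].text, since the real script content is not shown in the question, and re.search is used as in the question's code.

```python
import re

# Hypothetical stand-in for filtered[17].text (the real script is not shown in the question).
script_text = (
    "function download() { window.location.href = "
    "'http://mediafile.cloud/b34b4f6720a31f73?pt=UkhBMmVHczFaRXA2Uld4ek1qYzVWME5DYzNodVFUMDlPampsTkQ5aFNpVWxQamVlZ3REQkpEdz0%3D'; }"
)

pattern = r"('http.*?')"

match = re.search(pattern, script_text)

# Printing the match object shows its repr, whose match= preview is shortened.
print(match)

# group(1) returns the complete captured text, quotes included.
full_link = match.group(1)
print(full_link)

# findall() returns the captured groups as plain strings, so nothing is shortened.
all_links = re.findall(pattern, script_text)
print(all_links)
```

If this is what happened, the original regex was never cut short; only the printed preview was. Calling .group() on the result of re.search should then give the same full link that re.findall does.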

Steb