
I have this URL:

http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/upload.rar

The number after 'download' is generated randomly, and the two directories after 'mysite.com' are random strings that differ for each file.

I have tried (\.rar$) to match the file extension, but the page contains other links whose URLs also end in .rar and are not the actual download link, so matching by extension alone does not help. I need a pattern like the one below.

http://download[random_number].mysite.com/[random_string_with_digits]/[another_random_string_with_digits]/[actual_file_with_random_name].rar

Andy
Zip

1 Answer


This regex will do what you want:

r'http://download\d+\.mysite\.com/\w+/\w+/upload\.rar'

\d matches digits and \w matches alphanumerics (including underscore); the + says to match one or more of the preceding item. We put a \ in front of each . (in .mysite, .com and .rar) so that the . is interpreted literally and not as the regex wildcard that matches any character.
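To make the pieces easier to see, here is a minimal sketch of the same pattern written with re.VERBOSE, which lets each part carry a comment. The variable names are only illustrative, and the URL is the one from the question.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# In verbose mode, whitespace and # comments inside the pattern are ignored.
pattern = re.compile(r'''
    http://download   # literal prefix
    \d+               # one or more digits (the random number)
    \.mysite\.com/    # literal host, with every dot escaped
    \w+/              # first random directory
    \w+/              # second random directory
    upload\.rar       # literal file name, with the dot escaped
''', re.VERBOSE)

url = 'http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/upload.rar'
print(pattern.match(url) is not None)
```

Because whitespace is ignored in this mode, a pattern that needs to match a literal space would have to escape it; that is not an issue for URLs like this one.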

test

import re

p = re.compile(r'http://download\d+\.mysite\.com/\w+/\w+/upload\.rar')

table = [
    'http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/upload.rar',
    'http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/upload.raw',
    'http://download123.mysite.com/456/789/upload.rar',
    'http://downloadabc.mysite.com/def/ghi/upload.rar',
    'http://download1234.mysite.com/def/ghi/upload.rar',
    'http://download1234.mysite.org/def/ghi/upload.rar',
]

for s in table:
    m = p.match(s)
    print(s, m is not None)

output

http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/upload.rar True
http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/upload.raw False
http://download123.mysite.com/456/789/upload.rar True
http://downloadabc.mysite.com/def/ghi/upload.rar False
http://download1234.mysite.com/def/ghi/upload.rar True
http://download1234.mysite.org/def/ghi/upload.rar False

If the actual file name varies then you can use

r'http://download\d+\.mysite\.com/\w+/\w+/\w+\.rar'

or

r'http://download\d+\.mysite\.com/\w+/\w+/[a-z]+\.rar'

if the name will always consist of lowercase letters only.
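As a rough illustration of how the two variants differ, the sketch below runs both against a few file names. The names other than upload.rar are made up for the example, as are the variable names.

```python
import re

# Variant 1: any run of word characters as the file name.
any_name = re.compile(r'http://download\d+\.mysite\.com/\w+/\w+/\w+\.rar')
# Variant 2: lowercase letters only as the file name.
lower_name = re.compile(r'http://download\d+\.mysite\.com/\w+/\w+/[a-z]+\.rar')

base = 'http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/'

# Only 'upload.rar' comes from the question; the other names are hypothetical.
for name in ['upload.rar', 'Backup_2015.rar', 'my-file.rar']:
    url = base + name
    print(name, any_name.match(url) is not None, lower_name.match(url) is not None)
```

Note that neither variant accepts a hyphen in the file name, since \w does not include it. If real file names can contain other characters, a character class such as [\w.-]+ may be a better fit.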


BTW, it's generally not a good idea to parse HTML with regex, but if the page format is fixed and fairly simple you may be able to get away with it.
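One way to keep the regex away from raw HTML is to let a parser pull out the href values first and apply the pattern only to those. The following is a minimal sketch using the standard library's html.parser; the LinkCollector class, the DOWNLOAD_RE name, and the sample page are all assumptions made for the example, not part of the original page.

```python
import re
from html.parser import HTMLParser

DOWNLOAD_RE = re.compile(r'http://download\d+\.mysite\.com/\w+/\w+/\w+\.rar')


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags that match DOWNLOAD_RE in full."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # fullmatch rejects URLs with anything before or after the pattern.
            if name == 'href' and value and DOWNLOAD_RE.fullmatch(value):
                self.links.append(value)


# Hypothetical page: one decoy .rar link and one real download link.
sample_html = '''
<a href="http://mysite.com/files/other.rar">mirror</a>
<a href="http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/upload.rar">download</a>
'''

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Using fullmatch rather than match is a deliberate choice here: match only anchors at the start, so a URL with extra text after .rar would otherwise still be accepted.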

PM 2Ring
  • I am using this with Python and BeautifulSoup to find the link: ```link2 = soup2.findAll(href=re.compile(''http://download\d+\.mysite\.com/\w+/\w+/[a-z]+\.rar''))``` but it did not find the link. – Zip Mar 22 '15 at 03:43