
I wrote a Python program to download a file from the internet:

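 import urllib2  # Python 2 standard library

 url = "http://download2163.mediafire.com/icum151v51zg/55rll9s5ioshz5n/Alcohol52_FE_2-0-3-6850.exe"
 file_name = 'file'
 u = urllib2.urlopen(url)
 # Read the whole response into memory, then write it out in binary mode.
 f = open(file_name, 'wb')
 buffer = u.read()
 f.write(buffer)
 f.close()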

It works correctly. The problem is that the link this program uses to download the file is not constant! The file I want to download was uploaded to MediaFire. I found that the link to this page (http://www.mediafire.com/download/55rll9s5ioshz5n/Alcohol52_FE_2-0-3-6850.exe) is constant, and on this page I found the link that I put in my program. In fact, by right-clicking the "download (6.77 MB)" button and choosing the option to copy the link, I got the direct link that I used in my program: http://download2163.mediafire.com/icum151v51zg/55rll9s5ioshz5n/Alcohol52_FE_2-0-3-6850.exe

But this second link, the direct link I really need, changes over time!

I found a way to get this changing direct link: using the first, constant link (http://www.mediafire.com/download/55rll9s5ioshz5n/Alcohol52_FE_2-0-3-6850.exe) I downloaded the HTML page, and inside that HTML file I found the direct link I needed.

The problem is that sometimes when my Python program downloads the HTML page it gets the right page, which contains the direct link, but sometimes it gets the wrong one, with a captcha! In that case the direct link cannot be found.

I'm looking for a way to avoid this captcha and to be sure that my Python program always downloads the correct HTML page with the direct link inside.

Any suggestions?


If there isn't any way, does anyone know how I can get a direct link for a file that I want to upload to the internet, so that my Python program can download it?

VinceLomba
  • You _could_ consider using their [API](https://www.mediafire.com/developers/core_api/unversioned/download/#direct_download_link)... – PM 2Ring Jul 24 '15 at 16:16
  • What do you mean by "You could consider using their API"? – VinceLomba Jul 25 '15 at 18:06
  • Web pages are intended to be accessed by humans. If lots of people scrape a site with scripts it can put a strain on the server. So like many Web sites, MediaFire use things like redirects & captcha to make it hard for such scripts. However, they provide an interface (a Web API) which approved software can use to access their data efficiently & fairly. Generally speaking, basic use of a site's API is free, but you can pay a fee to get high volume access for commercial usage. – PM 2Ring Jul 26 '15 at 06:20

2 Answers


You can use the Direct-Download module for MediaFire links.

Install the module with pip:

pip install Direct-Download

Then:

from Direct_Download import Direct

url = Direct()

url.mediafire('link')
CYCNO

You'll need to look for something in the page that you can use to always identify the link. For example, the download link is in a div element with class "download_link". You can parse the HTML for that div, then grab the link from its child element. There are other possibilities too. For example, you could look for something unique and constant in the URL of interest and use a regular expression to select it after grabbing all the links from the page.
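The second idea (grab every link, then filter with a regular expression) can be sketched with the standard library alone. This is a minimal Python 3 sketch, not MediaFire-specific code: the `LinkCollector` class, the `find_direct_links` helper, the `DIRECT_LINK` pattern and the sample page are all made up for illustration, and the pattern assumes direct links live on a numbered `download` host like the one in the question.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Assumption: direct links sit on a numbered "download" host.
DIRECT_LINK = re.compile(r"^https?://download\d+\.mediafire\.com/")


def find_direct_links(html):
    """Return every link in `html` that matches DIRECT_LINK."""
    collector = LinkCollector()
    collector.feed(html)
    return [link for link in collector.links if DIRECT_LINK.match(link)]


# Made-up page, for illustration only.
sample_page = """
<div class="download_link">
  <a href="http://download2163.mediafire.com/abc/xyz/file.exe">Download</a>
</div>
<a href="http://www.mediafire.com/help">Help</a>
"""

print(find_direct_links(sample_page))
```

An empty list means no candidate link was on the page, which is what you would expect from a captcha page.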

I'd highly recommend looking into the BeautifulSoup library, which will allow you to easily parse HTML.

EDIT: Okay, I didn't notice this because I initially looked at the page in my browser, but apparently MediaFire only populates the download div with JavaScript after the page has loaded, which makes scraping the link a lot harder. Thankfully, they still have to include the download link in the page source, and with an ugly, hideous little hack we can grab it:

First, you'll need this regex for URLs: http://daringfireball.net/2010/07/improved_regex_for_matching_urls

Then grab the page contents and parse them with Beautiful Soup like this:

soup = BeautifulSoup(page)
div_tag = soup.find_all(class_="download_link")[0]
script_tag = div_tag("script")[0]
# The regex has capture groups, so findall returns tuples; take the first group of the first match.
link = re.findall(regex, script_tag.contents[0])[0][0]

Here's my whole working code:

import requests
import re
from bs4 import BeautifulSoup

pre_regex = r"""(?xi)
\b
(                           # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                # URL protocol and colon
    (?:
      /{1,3}                        # 1-3 slashes
      |                             #   or
      [a-z0-9%]                     # Single letter or digit or '%'
                                    # (Trying not to match e.g. "URI::Escape")
    )
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." ... "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                           # One or more:
    [^\s()<>]+                      # Run of non-space, non-()<>
    |                               #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                           # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?]        # not a space or one of these punct chars
  )
)"""
regex = re.compile(pre_regex)

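url = "http://www.mediafire.com/download/raju14e8aq6azbo/Getting+Started+with+MediaFire.pdf"
s = requests.session()
result = s.get(url)
# Name the parser explicitly so Beautiful Soup does not have to guess one.
soup = BeautifulSoup(result.content, "html.parser")

# The download div holds a script tag whose text contains the direct link.
div_tag = soup.find_all(class_="download_link")[0]
script_tag = div_tag("script")[0]
# The regex has capture groups, so findall returns tuples; group 1 is the whole URL.
link = re.findall(regex, script_tag.contents[0])[0][0]

print(link)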
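If the page comes back without the `download_link` div (a captcha page, for instance), indexing with `[0]` raises an IndexError. A more defensive variant might search the raw page text and return None when nothing matches. This is only a Python 3 sketch under assumptions: `extract_direct_link`, the `DIRECT_LINK` pattern and both sample pages are hypothetical, and the pattern assumes the direct link points at a numbered `download` host as in the question.

```python
import re

# Assumed shape of a direct link; adjust if the real host differs.
DIRECT_LINK = re.compile(r"https?://download\d+\.mediafire\.com/[^\s'\"<>]+")


def extract_direct_link(page_text):
    """Return the first direct link found in the page text, or None."""
    match = DIRECT_LINK.search(page_text)
    return match.group(0) if match else None


# Made-up page fragments, for illustration only.
good_page = '<script>kNO = "http://download2163.mediafire.com/abc/xyz/file.exe";</script>'
captcha_page = '<form id="captcha">Please prove you are human</form>'

print(extract_direct_link(good_page))
print(extract_direct_link(captcha_page))
```

The caller can then check for None and decide what to do next (retry later, give up, or report the problem) instead of crashing.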
Ryan Murray
  • The hard part is waiting for the "Preparing download..." text to go away and reveal the download link ;) – heinst Jul 24 '15 at 16:16
  • It gives me an error: TypeError: 'module' object is not callable – VinceLomba Jul 24 '15 at 17:40
  • It works for me. I know the regex in that URL has some issues with non-ascii characters being mixed and has to be cleaned first if you copy/paste it, but the error you're describing sounds like you're not importing or calling something correctly. I imagine this is coming from BeautifulSoup, yes? You should be importing it as "from bs4 import BeautifulSoup" for that snippet to work. – Ryan Murray Jul 24 '15 at 18:31
  • 1. I don't understand what pre_regex is for. What is the bs4 module? I use: import BeautifulSoup. 2. It gives the error again: Traceback (most recent call last): File "C:\Users\Vincenzo\Desktop\down.py", line 37, in <module> soup = BeautifulSoup(result.content) TypeError: 'module' object is not callable – VinceLomba Jul 24 '15 at 18:52
  • The pre_regex is there because you first specify the regex, then compile it. I just call it pre_regex because it hasn't been compiled yet at that stage. As for why you need to import it using the from * import * syntax, read this http://stackoverflow.com/questions/9439480/from-import-vs-import The short of it is that you don't have to, but then you'd have to refer it as bs4.BeautifulSoup. The Beautiful Soup documentation says to go with "from bs4 import BeautifulSoup" so that's what I do. – Ryan Murray Jul 24 '15 at 19:04
  • I've solved my problems and it works for me, BUT why doesn't it work with this URL: http://www.mediafire.com/download/csklbff34i0rlqc/rzr-skrm.iso ? It says: `Traceback (most recent call last): File "C:\Users\Vincenzo\Desktop\beautifulsoup4-4.3.2\ehi !.py", line 39, in <module> div_tag = soup.find_all(class_="download_link")[0] IndexError: list index out of range` – VinceLomba Jul 25 '15 at 18:03