I'm trying to download the content of a website using Python's urllib, but I have a problem: the site has an ad-block filter, and the only thing I get back is text asking me to disable my ad blocker. Is there any way to trick this kind of filter? Thanks in advance. (:
Add some code in the post so that we can help you out. – squiroid Feb 22 '15 at 20:32
1 Answer
JavaScript Parsing
The issue you are running into is a JavaScript filter that loads the data after the page itself has loaded. The message warning you that you are using adblock is part of the raw HTML and is completely static. It is replaced once a JavaScript call has been able to check whether adblock is present. urllib only downloads that raw HTML and never executes JavaScript, so the warning is all it sees. There are several ways to get around this, but each of them requires some way of running the page's JavaScript.
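To make this concrete, here is a minimal sketch of what a non-JavaScript client such as urllib ends up with. The HTML string, the warning text, and the script path `/static/adcheck.js` are all made up for illustration; they are not taken from the site in the question. The parser only reads the markup, so the script that would normally replace the warning is listed but never run.

```python
from html.parser import HTMLParser

# Hypothetical page source, standing in for what urllib would download.
RAW_HTML = """
<html><body>
<div id="content">Please disable your ad blocker to view this page.</div>
<script src="/static/adcheck.js"></script>
</body></html>
"""


class StaticView(HTMLParser):
    """Collects visible text and script sources without running any script."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            # External scripts have a src attribute; inline ones do not.
            self.scripts.append(dict(attrs).get("src", "<inline>"))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Keep only non-empty text that is not script source code.
        if not self._in_script and data.strip():
            self.text.append(data.strip())


def static_view(html):
    """Return (visible_text, script_sources) for a raw HTML string."""
    parser = StaticView()
    parser.feed(html)
    return parser.text, parser.scripts


text, scripts = static_view(RAW_HTML)
print(text)     # the static warning is the only visible text
print(scripts)  # the script is referenced, but nothing here executes it
```

If the output of your own urllib call looks like this (a warning plus one or more script tags), the real content is most likely injected by those scripts.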
Solution(s)
There are several solutions to your problem. You can read more about them here.
- Embed a web browser within an application and simulate a normal user.
- Remotely connect to a web browser and automate it from a scripting language.
- Use special-purpose add-ons to automate the browser.
- Use a framework/library to simulate a complete browser.
As you can see, each one requires emulating a browser and its DOM objects in some way. Since there are several libraries that help you accomplish this, I highly recommend you look into the URL above.
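The following is a code example from the same page that shows how to retrieve the URLs on a page that generates them via JavaScript. It relies on HtmlUnit, a Java library from Gargoyle Software, so the script has to be run with Jython rather than the standard CPython interpreter.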
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
    webclient = WebClient(BrowserVersion.FIREFOX_3_6)  # create a new web client that emulates Firefox
    url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
    page = webclient.getPage(url)  # load the page, running its JavaScript
    articles = page.getByXPath("//table[@id='mqtable']//tr/td/a")  # collect all the hyperlinks in the table

if __name__ == '__main__':
    main()
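If you would rather stay in regular Python, the second option in the list (remotely driving a real browser) can be sketched with Selenium. This is only a sketch under assumptions: it assumes the third-party `selenium` package and Firefox are installed, and the warning text in `AD_WARNING` is a placeholder you would replace with whatever the site actually shows. The helper names `make_headless_firefox` and `fetch_rendered_html` are my own. It loads the page, then polls the rendered HTML until the static warning has been replaced.

```python
import time

AD_WARNING = "disable your ad blocker"  # hypothetical warning text, adjust for the site


def make_headless_firefox():
    """Start a headless Firefox through Selenium (imported lazily)."""
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def fetch_rendered_html(url, driver_factory=make_headless_firefox,
                        warning=AD_WARNING, timeout=10.0, poll=0.5):
    """Load url in a browser and return the HTML once the warning is gone.

    Raises TimeoutError if the warning is still present after `timeout` seconds.
    """
    driver = driver_factory()
    try:
        driver.get(url)
        deadline = time.monotonic() + timeout
        while True:
            html = driver.page_source  # DOM as currently rendered by the browser
            if warning not in html:
                return html
            if time.monotonic() >= deadline:
                raise TimeoutError("ad-block warning was never replaced")
            time.sleep(poll)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `fetch_rendered_html("http://example.com/")` would start the browser and return the rendered page source. The driver is passed in as a factory so that another browser can be swapped in without changing the function. Whether a given site serves its content to an automated browser is not guaranteed.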
However,
I am not sure why you are scraping a webpage, or which website you are scraping. However, automating this kind of data collection is against the terms and conditions of various sites, and I advise you to review those terms before you get yourself into any trouble.
Further Research
If you are looking for a more generic answer to your question (e.g. "How can I load JavaScript with Python?"), I highly recommend looking at previous answers on this site, because they offer some really good insight into the matter.
No problem. Feel free to accept it as the answer if it helped you. Helps people like me not get excited when we see an unanswered question :P – Daymon Schroeder Feb 23 '15 at 06:16