I'm trying to scrape a website and need to get at an embed element. Because I'm fetching the page with Python and lxml.html, the website correctly concludes that I do not have Flash installed, and instead of the embed element it shows me this:

<div>
    <font>
        <u>
            <b>
                <a href="http://get.adobe.com/flashplayer/">
                ATTENTION:<br>This video will not play. You currently do not have Adobe Flash installed on this computer. Please click here to download it (it's free!)
                </a>
            </b>
        </u>
    </font>
</div>

Obviously that is a problem, so I'm wondering whether it is at all possible to trick the website into thinking I have Flash installed even though I don't, so that I can retrieve the right element.

I hope someone can help!

Atheuz
  • Is that section replaced by some client-side JavaScript with the actual `<embed>` at load? – sarnold Jul 03 '12 at 23:47
  • Where s1 is: – Atheuz Jul 03 '12 at 23:53
  • you don't need to comment on your own question, you could [update it](http://stackoverflow.com/posts/11320687/edit) instead – jfs Jul 04 '12 at 00:14

2 Answers

I believe the following blog post answers your question well. The author had the same need, scraping Flash content with Python, and ran into the same problem. He realized that he just needed to instantiate a browser (even an in-memory one that never displays on screen) and then scrape its output. I think this approach could work for what you need, and he makes it easy to understand.

http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/

cdaddr
  • Is there a way to install pywebkitgtk on Windows? Because I'm not finding any, except versions that will not work on Python 2.7. – Atheuz Jul 04 '12 at 02:36
  • I'm going to accept your answer because it led me to something that is partly an answer, though it still doesn't work. Specifically, PyQt4's QtWebKit can run on Windows and render webpages in memory, but there are unrelated issues I need to resolve. – Atheuz Jul 04 '12 at 23:07
  • OK, thank you! Thanks for reporting back, and I hope you get the whole system you need working. Post about it if you do... – cdaddr Jul 05 '12 at 02:52

To get content generated by JavaScript, you could also try Selenium.
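As a rough illustration of the Selenium route (not the linked example, which is not reproduced here), the sketch below lets a real browser load the page and run its JavaScript, then searches the resulting markup for `embed` tags. The helper names `fetch_rendered_html` and `find_embeds` are made up for this sketch, and it assumes Firefox and its WebDriver are installed. The standard-library parser is used only to keep the sketch dependency-free; the returned string could just as well be handed to lxml.html. Whether the page actually emits the `embed` element still depends on how it detects Flash in that browser.

```python
from html.parser import HTMLParser


class EmbedCollector(HTMLParser):
    """Collect the attributes of every <embed> tag in an HTML string."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "embed" also matches <EMBED>.
        if tag == "embed":
            self.embeds.append(dict(attrs))


def find_embeds(html):
    """Return a list of attribute dicts, one per <embed> tag found."""
    collector = EmbedCollector()
    collector.feed(html)
    collector.close()
    return collector.embeds


def fetch_rendered_html(url):
    """Load url in a real browser and return the DOM after scripts have run.

    Hypothetical helper: assumes Selenium plus Firefox and its driver.
    """
    # Imported lazily so find_embeds works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Pages that swap in the embed late may need an explicit wait here.
        return driver.page_source
    finally:
        driver.quit()
```

Usage would look something like `find_embeds(fetch_rendered_html(page_url))`, where `page_url` is the address of the video page. An empty list means the page still served the "no Flash" fallback.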

jfs