
I want to scrape some data from the following URL using Python: http://www.hankyung.com/stockplus/main.php?module=stock&mode=stock_analysis_infomation&itemcode=078340

The page shows a summary of company information.

What I want to scrape is not shown on the first page. Clicking the tab named "재무제표" (financial statements) opens the financial statements, and clicking the tab named "현금흐름표" (cash flow statement) opens the cash flow data.

I want to scrape the "Cash Flow" data.

However, the cash flow data is loaded by JavaScript from a different, hidden URL: http://stock.kisline.com/compinfo/financial/main.action?vhead=N&vfoot=N&vstay=&omit=&vwidth=

The cash flow data is generated by submitting some option values and a cookie to this URL.

As you may have noticed, itemcode=078340 in the first link is the stock code. There are as many as 1,680 stocks whose cash flow data I want to gather, so I want to do this in a loop.

Is there a good way to scrape the cash flow data? I tried Scrapy, but it is hard to integrate with the other scraping code I already use.

trigger
  • Is the data pulled via AJAX from the server, or is it stored within the HTML somehow (e.g. in a JS variable or in `data-` attributes)? – Tadeck Apr 07 '12 at 07:25
  • Do terms of service allow you to do that? – luke14free Apr 07 '12 at 08:07
  • @Tadeck, the data is pulled from the server. – trigger Apr 07 '12 at 08:14
  • @luke14free It's a newspaper site, and the data is open to everyone for free. You don't even have to log in to use it. – trigger Apr 07 '12 at 08:20
  • is this the data that you need? http://stock.kisline.com/fchart_data/resultXML/financial01/078340_G1.xml?FCTime=117 – luke14free Apr 07 '12 at 08:27
  • @luke14free I think the data is probably at http://stock.kisline.com/compinfo/financial/financial03.action?stockcd=094970&comp=HANKYUNG&auth=1046277332331&klgubn=K&cgubun=G1, but accessing that link directly fails with an authentication error. – trigger Apr 07 '12 at 08:37
  • *Do terms of service allow* ... What??? Who gives a flying leap. – Kaz Apr 07 '12 at 16:46
  • Does this answer your question? [scrape html generated by javascript with python](https://stackoverflow.com/questions/2148493/scrape-html-generated-by-javascript-with-python) – user202729 Feb 12 '21 at 11:37

2 Answers


There's also dryscrape (a library written by me, so the recommendation is a bit biased, obviously :) which uses a fast WebKit-based in-memory browser to navigate around. It understands JavaScript, too, but is a lot more lightweight than Selenium.
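A minimal sketch of what fetching a rendered page might look like. It assumes dryscrape's `Session` object with its `visit` and `body` calls, so check the library's documentation before relying on it. The helper name `fetch_body` and the optional `session` argument are hypothetical and only there so the function can be exercised without a browser.

```python
def fetch_body(url, session=None):
    """Return the page HTML after the in-memory browser has loaded `url`.

    `session` is any object with visit(url) and body() methods. When it is
    omitted, a dryscrape session is created (assumed API, see lead-in).
    """
    if session is None:
        # Third-party import kept local so the module loads without dryscrape.
        import dryscrape
        session = dryscrape.Session()
    session.visit(url)       # load the page and let its JavaScript run
    return session.body()    # HTML of the document as the browser now sees it
```

The returned HTML can then be handed to whatever parser the rest of the scraping code already uses.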

Niklas B.

If you need to scrape page content that is updated with AJAX and you are not in control of that AJAX interface, I would use the Selenium browser automator for the task:

http://code.google.com/p/selenium/

  • Selenium has Python bindings

  • It launches a real browser instance so it can do and scrape 100% the same thing as you see with your own eyes

  • Get the HTML document content after the AJAX updates through the Selenium API

  • Use lxml + XPath / CSS selectors to parse the relevant parts out of the document
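
The steps above can be sketched roughly as follows. This is only a sketch under several assumptions: that the two tabs are plain links whose text matches the tab names in the question, that Firefox and its driver are installed, and that the table does not live inside an iframe (if it does, a `driver.switch_to.frame(...)` call would be needed first). The helper names `summary_url`, `fetch_rendered_html`, `TableRows`, `table_rows` and `scrape_all` are hypothetical. To keep the sketch dependency-free, the parsing step uses the standard library's `html.parser` as a stand-in for lxml + XPath.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

BASE_URL = "http://www.hankyung.com/stockplus/main.php"


def summary_url(itemcode):
    """Build the company-summary URL from the question for one stock code."""
    query = urlencode({
        "module": "stock",
        "mode": "stock_analysis_infomation",
        "itemcode": itemcode,
    })
    return f"{BASE_URL}?{query}"


def fetch_rendered_html(url, tab_labels=("재무제표", "현금흐름표"), timeout=10):
    """Open `url` in a real browser, click each tab, return the resulting HTML."""
    # Third-party imports kept local so the parsing helpers work without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        for label in tab_labels:
            # Assumption: each tab is an <a> element whose link text is `label`.
            tab = WebDriverWait(driver, timeout).until(
                EC.element_to_be_clickable((By.LINK_TEXT, label)))
            tab.click()
        # A real script should also wait for the cash flow table to appear
        # before reading the page source.
        return driver.page_source
    finally:
        driver.quit()


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return all table rows in `html` as lists of stripped cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def scrape_all(itemcodes):
    """Map each stock code to the table rows found on its rendered page."""
    return {code: table_rows(fetch_rendered_html(summary_url(code)))
            for code in itemcodes}
```

Calling `scrape_all(["078340"])` would start one browser per stock code. For all 1,680 codes it is probably worth reusing a single driver instead, and picking out only the cash flow table rather than every table on the page.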

Mikko Ohtamaa