Scraping javascript-generated data using Python

Question

I want to scrape some data from the following URL using Python: http://www.hankyung.com/stockplus/main.php?module=stock&mode=stock_analysis_infomation&itemcode=078340

The page shows a summary of company information.

The data I want is not shown on the first page. Clicking the tab named "재무제표" (financial statements) opens the financial statements, and clicking the tab named "현금흐름표" (cash flow statement) opens the cash flow data.

I want to scrape the "Cash Flow" data.

However, the cash flow data is generated by JavaScript from a different URL, which is hidden: http://stock.kisline.com/compinfo/financial/main.action?vhead=N&vfoot=N&vstay=&omit=&vwidth=

The cash flow data is generated by submitting some option values and a cookie to this URL.

As you may have noticed, itemcode=078340 in the first link is the stock code. There are as many as 1680 stocks whose cash flow data I want to gather, so I want to do this in a loop.

Is there a good way to scrape the cash flow data? I tried Scrapy, but it is difficult to integrate with the other scraping code I am already using.

Is the data pulled from the server by AJAX, or is it stored within the HTML somehow (e.g. in a JS variable or in `data-` attributes)? — Tadeck, Apr 07 '12 at 07:25
@luke14free it's a newspaper site, and the data is open to everyone for free; you don't even have to log in to use it — trigger, Apr 07 '12 at 08:20
is this the data that you need? http://stock.kisline.com/fchart_data/resultXML/financial01/078340_G1.xml?FCTime=117 — luke14free, Apr 07 '12 at 08:27
@luke14free I think http://stock.kisline.com/compinfo/financial/financial03.action?stockcd=094970&comp=HANKYUNG&auth=1046277332331&klgubn=K&cgubun=G1 is probably the data I need, but direct access to the link fails because of an authentication error — trigger, Apr 07 '12 at 08:37
*Do terms of service allow* ... What??? Who gives a flying leap. — Kaz, Apr 07 '12 at 16:46
Does this answer your question? [scrape html generated by javascript with python](https://stackoverflow.com/questions/2148493/scrape-html-generated-by-javascript-with-python) — user202729, Feb 12 '21 at 11:37

Niklas B. · Answer 1 · 2012-04-07T10:25:39.373

9

There's also dryscrape (a library written by me, so the recommendation is a bit biased, obviously :) which uses a fast WebKit-based in-memory browser to navigate around. It understands JavaScript, too, but is a lot more lightweight than Selenium.
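A minimal sketch of how the loop over stock codes from the question might look with dryscrape. The helper names (`summary_url`, `fetch_rendered_pages`) are hypothetical, and the XPath used to find the tabs is only a guess about the page markup; it assumes the tabs are `<a>` elements containing the tab label, which may not hold for the real page.

```python
# URL pattern taken from the question; only the item code varies.
SUMMARY_URL = (
    "http://www.hankyung.com/stockplus/main.php"
    "?module=stock&mode=stock_analysis_infomation&itemcode={code}"
)


def summary_url(code):
    """Build the summary-page URL for one stock code.

    Codes are kept as strings so that leading zeros ("078340") survive.
    """
    return SUMMARY_URL.format(code=code)


def fetch_rendered_pages(codes, tab_labels=("재무제표", "현금흐름표")):
    """Visit each stock page and return {code: HTML after JavaScript ran}."""
    import dryscrape  # third-party; imported here so the helpers above work without it

    session = dryscrape.Session()
    # Skipping images is optional; it only reduces load time.
    session.set_attribute("auto_load_images", False)

    pages = {}
    for code in codes:
        session.visit(summary_url(code))
        for label in tab_labels:
            # Assumption: each tab is a link whose text contains the label.
            tab = session.at_xpath('//a[contains(., "%s")]' % label)
            if tab is not None:
                tab.click()
        # The content may load asynchronously after a click, so an explicit
        # wait could be needed before reading the body.
        pages[code] = session.body()
    return pages
```

Calling `fetch_rendered_pages(["078340"])` would return the rendered HTML for that one stock; the full list of 1680 codes would be passed in the same way.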

edited Apr 07 '12 at 10:25

answered Apr 07 '12 at 10:20

Niklas B.

score 1 · Answer 2 · answered Apr 07 '12 at 10:16


If you need to scrape page content that is updated via AJAX and you do not control the AJAX interface, I would use the Selenium browser automation tool for the task:

http://code.google.com/p/selenium/

- Selenium has Python bindings.
- It launches a real browser instance, so it can do and scrape exactly what you see with your own eyes.
- Get the HTML document content after the AJAX updates through the Selenium API.
- Use lxml with XPath or CSS selectors to parse the relevant parts out of the document.
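The steps above could be sketched roughly as follows. This is an illustration under several assumptions, not a tested scraper for this site: the tabs are assumed to be clickable links with exactly the labels from the question, the "wait for a table" condition is a placeholder, and the function names are hypothetical. To keep the sketch free of extra dependencies, the standard-library `html.parser` stands in for the lxml + XPath step.

```python
from html.parser import HTMLParser

# URL pattern taken from the question; only the item code varies.
SUMMARY_URL = (
    "http://www.hankyung.com/stockplus/main.php"
    "?module=stock&mode=stock_analysis_infomation&itemcode={code}"
)


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Return all table rows in the HTML as lists of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_cash_flow_html(code, timeout=10):
    """Open the page in a real browser, click both tabs, return the HTML."""
    # Third-party imports are kept local so parse_rows works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(SUMMARY_URL.format(code=code))
        wait = WebDriverWait(driver, timeout)
        # Assumption: the tabs are links with exactly these labels.
        for tab_text in ("재무제표", "현금흐름표"):
            wait.until(EC.element_to_be_clickable((By.LINK_TEXT, tab_text))).click()
        # Placeholder condition: replace with a locator specific to the
        # cash-flow table once the real markup is known.
        wait.until(EC.presence_of_element_located((By.TAG_NAME, "table")))
        return driver.page_source
    finally:
        driver.quit()
```

With this in place, `parse_rows(fetch_cash_flow_html("078340"))` would give the table cells for one stock, and the same call can be repeated in a loop over the other stock codes. If the cash flow table is rendered inside a frame, the driver would also need to switch into that frame before reading the page source.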

Mikko Ohtamaa

Thanks a lot. I'm gonna try selenium. – trigger Apr 08 '12 at 07:50
Can I substitute jQuery for the lxml + XPath part at the end (and follow the rest of the steps)? – abbood Apr 01 '13 at 12:48
Selenium comes with its own CSS selector engine (which probably uses the underlying browser), so you need neither jQuery nor lxml anymore – Mikko Ohtamaa Apr 01 '13 at 18:47
