I am trying to crawl the tables from this link. I located the table in the page source using the browser's F12 inspector.
I used the following code, but it returns a None
result. Could someone help? Thanks.
import requests
from bs4 import BeautifulSoup
url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308894'
website_url = requests.get(url).text
soup = BeautifulSoup(website_url, 'lxml')
table = soup.find('table', {'class': 'gridview'})
#table = soup.find('table', {'class': 'criteria'})
print(table)
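When `find()` returns None, the table may not have the class you expected, or it may be injected by JavaScript and absent from the raw HTML. One quick check is to list every `<table>` the downloaded source actually contains and inspect its class attribute. A minimal sketch, using a small inline HTML string in place of the downloaded page:

```python
from bs4 import BeautifulSoup

# Illustrative HTML standing in for the downloaded page source
html = """
<html><body>
  <table class="criteria"><tr><td>filters</td></tr></table>
  <table class="gridview"><tr><td>data</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Print the class attribute of every <table> actually present in the HTML
classes = [t.get("class") for t in soup.find_all("table")]
print(classes)  # [['criteria'], ['gridview']]
```

If the class you are searching for does not appear in this list, the content is probably loaded dynamically and the data must be fetched from the underlying request (visible in the F12 Network tab) instead.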
Please also check this reference; I want to do something similar here, but the web structure seems different.
Updated: The following code works for one page, but I need to loop over the other pages as well.
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308894'
website_url = requests.get(url).text
soup = BeautifulSoup(website_url, 'lxml')
table = soup.find('table', {'class': 'gridview'})
#https://stackoverflow.com/questions/51090632/python-excel-export
df = pd.read_html(str(table))[0]
df.to_excel('test.xlsx', index = False)
Output:
序号 ... 竣工备案日期
0 1 ... 2020-01-22
1 2 ... 2020-01-22
2 3 ... 2020-01-22
3 4 ... 2020-01-22
4 5 ... 2020-01-22
[5 rows x 9 columns]
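The pagination mechanism of this portal is not shown above; a common pattern is a page-number parameter in the query string, which you can confirm in the F12 Network tab when clicking to the next page. A sketch along those lines, assuming a hypothetical `currentPage` parameter (the real parameter name must be taken from the site's actual requests):

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308894'

def page_url(base_url: str, page: int) -> str:
    # Hypothetical query parameter; check the real name in the F12 Network tab
    return f'{base_url}&currentPage={page}'

def scrape_pages(n_pages: int) -> pd.DataFrame:
    frames = []
    for page in range(1, n_pages + 1):
        html = requests.get(page_url(BASE_URL, page)).text
        table = BeautifulSoup(html, 'lxml').find('table', {'class': 'gridview'})
        if table is not None:
            frames.append(pd.read_html(str(table))[0])
    # Stack the per-page tables into one DataFrame
    return pd.concat(frames, ignore_index=True)

# Usage (requires network access):
# df = scrape_pages(3)
# df.to_excel('test.xlsx', index=False)
```

If every page returns the same rows, the parameter name is wrong or the site paginates via a POST request, in which case `requests.post` with the form data from the Network tab is needed instead.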
Related reference: