Iterate and extract tables from web saving as excel file in Python

Question

I want to iterate over the pages and extract the table from the link here, then save it as an Excel file.

How can I do that? Thank you.

My code so far:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = 'http://zjj.sz.gov.cn/ztfw/gcjs/xmxx/jgysba/'
res = requests.get(url)
soup = BeautifulSoup(res.content,'lxml')
print(soup)

New update:

import requests
import json
import pandas as pd
import numpy as np

headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
        "Referer": "http://zjj.sz.gov.cn/projreg/public/jgys/jgysList.jsp"}
dfs = []
#dfs = pd.DataFrame()

for page in range(0, 10):
    data = {"limit": 100, "offset": page * 100, "pageNumber": page + 1}
    json_arr = requests.post("http://zjj.sz.gov.cn/projreg/public/jgys/webService/getJgysLogList.json", headers = headers, data = data).text
    d = json.loads(json_arr)
    df = pd.read_json(json.dumps(d['rows']) , orient='list')
    dfs.append(df)
    print(dfs)

dfs = pd.concat(dfs)
#https://stackoverflow.com/questions/57842073/pandas-how-to-drop-rows-when-all-float-columns-are-nan
dfs = dfs.loc[:, ~dfs.replace(0, np.nan).isna().all()]
dfs.to_excel('test.xlsx', index = False)

It generates 10 pages and 1000 rows, but some column values are misplaced. Does anyone know what I did wrong? Thank you.

Did you write any code to get this data? If you did, please post it with the question, along with the problem you faced, if any. – ans2human, Jan 21 '20 at 06:12
Yes, as you may know, you can't get JavaScript-injected data via `bs4`. But if you look in the Network tab, there's a JSON API with all the paginated data. You can use that to get the data you want directly. – ans2human, Jan 21 '20 at 06:57

score 1 · Accepted Answer · answered Jan 21 '20 at 07:27

Using the JSON API that shows up under XHR in the Network tab, you can make a simple POST request with `requests` and get your data.

Two of the request parameters can be changed to get different volumes of data: `limit` is the number of objects returned per request, and `pageNumber` is the page counter.

from requests import post

url = 'http://zjj.sz.gov.cn/projreg/public/jgys/webService/getJgysLogList.json'
# limit: objects per request; pageNumber: which page to fetch
data = {'limit': '100', 'pageNumber': '1'}
response = post(url, data=data)
print(response.text)  # raw JSON text returned by the API
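
To walk through several pages, the form data can be generated in a loop. This is only a sketch: `page_payloads` is a hypothetical helper, and the `offset` field is taken from the code in the question, so whether the server actually needs it alongside `pageNumber` is an assumption.

```python
def page_payloads(n_pages, limit=100):
    """Yield the form data for pages 1..n_pages (hypothetical helper)."""
    for page in range(n_pages):
        yield {
            "limit": limit,            # objects per request
            "offset": page * limit,    # assumption: mirrors the question's code
            "pageNumber": page + 1,    # page counter, starting at 1
        }


payloads = list(page_payloads(3))
for payload in payloads:
    print(payload)
```

Each payload could then be sent with `post(url, data=payload)`, one request per page.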

From there you can use pandas to build a DataFrame or write an Excel file, as needed.
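
As a rough sketch of that last step: the code in the question reads the records from a top-level `rows` key, so the example below assumes every response body has that shape. `pages_to_frame` is a hypothetical helper, and the sample strings are made-up stand-ins for `response.text`; the real field names will differ.

```python
import json

import pandas as pd


def pages_to_frame(json_texts):
    """Combine several response bodies into one DataFrame.

    Assumes each body is a JSON object with a top-level "rows" list of records.
    """
    frames = [pd.DataFrame(json.loads(text)["rows"]) for text in json_texts]
    # ignore_index=True renumbers the rows 0..n-1 across all pages
    return pd.concat(frames, ignore_index=True)


# Made-up sample bodies standing in for response.text from two pages
sample_pages = [
    '{"rows": [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]}',
    '{"rows": [{"id": 3, "name": "C"}]}',
]
frame = pages_to_frame(sample_pages)
print(frame)

# Writing to Excel needs an engine such as openpyxl to be installed:
# frame.to_excel("test.xlsx", index=False)
```

Collecting one frame per page in a list and calling `pd.concat` once at the end is the usual pattern; it avoids growing a DataFrame inside the loop.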

ans2human

Thank you. I've updated the question with my complete code, but the result doesn't seem correct. Could you check what I'm doing wrong? – ah bon Jan 21 '20 at 09:41
Yeah, sure, let me check. – ans2human Jan 21 '20 at 10:24
Just wondering if you have found any problems in my code? @ans2human – ah bon Jan 22 '20 at 01:15
So I see you're getting the required data! What's not working? The DataFrame construction? – ans2human Jan 22 '20 at 04:16
It seems to work, but I'm not certain whether I should use `append` and `concat`. – ah bon Jan 22 '20 at 06:29
Can you create a new question for converting this data to a DataFrame and then to Excel? – ans2human Jan 22 '20 at 07:05
http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308894, why can't I find a `json` type URL for this website? Could you please take a look? – ah bon Jan 22 '20 at 09:24
https://stackoverflow.com/questions/59856766/iterate-append-json-and-save-as-dataframe-in-python I created a new question here, please check it @ans2human – ah bon Jan 22 '20 at 09:49
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/206428/discussion-between-ans2human-and-ahbon). – ans2human Jan 22 '20 at 10:04
