0

I want to iterate and extract table from the link here, then save as excel file.

How can I do that? Thank you.

xtra

My code so far:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = 'http://zjj.sz.gov.cn/ztfw/gcjs/xmxx/jgysba/'
res = requests.get(url)
soup = BeautifulSoup(res.content,'lxml')
print(soup)

New update:

from requests import post
import json
import pandas as pd
import numpy as np

headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
        "Referer": "http://zjj.sz.gov.cn/projreg/public/jgys/jgysList.jsp"}
dfs = []
#dfs = pd.DataFrame()

for page in range(0, 10):
    data = {"limit": 100, "offset": page * 100, "pageNumber": page + 1}
    json_arr = requests.post("http://zjj.sz.gov.cn/projreg/public/jgys/webService/getJgysLogList.json", headers = headers, data = data).text
    d = json.loads(json_arr)
    df = pd.read_json(json.dumps(d['rows']) , orient='list')
    dfs.append(df)
    print(dfs)

dfs = pd.concat(dfs)
#https://stackoverflow.com/questions/57842073/pandas-how-to-drop-rows-when-all-float-columns-are-nan
dfs = dfs.loc[:, ~dfs.replace(0, np.nan).isna().all()]
dfs.to_excel('test.xlsx', index = False)

It generates 10 pages and 1000 rows, but some columns values are misplaced, someone knows where did I do wrong? Thank you.

ah bon
  • 9,293
  • 12
  • 65
  • 148
  • 1
    Did you write any code to get this data? if you did, please post it with question along with the problem faced, if any. – ans2human Jan 21 '20 at 06:12
  • Yea, as must know you can't get javascript injected data via `bs4`. But if you look in Network tab there's a JSON API with all the paginated data. You can use that to directly get the data you want. – ans2human Jan 21 '20 at 06:57
  • Could you help modify the code? @ans2human – ah bon Jan 21 '20 at 07:04

1 Answers1

1

So, using the JSON API from XHR you make a simple python post request via requests and you have your data.

In the params you have two of them which you can change to get different volumes of data, limit is the nos of objects you get in a request. pageNumber is the paginated page counter.

from requests import post
import json

url = 'http://zjj.sz.gov.cn/projreg/public/jgys/webService/getJgysLogList.json'
data = { 'limit' : '100', 'pageNumber' : '1'}
response = post(url, data=d)
response.text

Further you can use pandas to create a data frame or create a excel as you want.

ans2human
  • 2,300
  • 1
  • 14
  • 29