
I have a task that I think my Python skills fall slightly short for. I have oil production data scraped from an open data website and I am looking to turn it into JSON.

At present, with some help, I have looped through the HTML table tags to build a list of headers and a list of row data. What I am struggling with is nesting some of this data within one JSON record.

Right now I have the JSON either by row or per header: one header with all of its column data under it, then the next header with its data, and so on (as below).

If possible I would like one of two things. The first option would be the headers with row 1's data assigned to them as one record, then the headers again with row 2's data as the next record, then row 3, and so on. The other option, which will make sense if you look at the table source, would be one record per oil field; a field might have multiple rows of production values for different years/months.

If possible I would like to capture all of that in one JSON record per field, with the multiple year/month rows, should they exist, nested inside that one record.

Essentially I think it needs a few loops over the distinct HTML table cells on top of the rest of the Python, which I believe is achievable but is slightly outside my Python capabilities. I hope this makes sense; I have tried to be quite descriptive.

JSON by Row

[
"\u00c5SGARD",
"2017",
"8",
"0.19441",
"0.81545",
"0.26954",
"0.00000",
"1.27940",
"0.07432",
"43765"
]

Per Headers (every data column under these fields)

{
 "Field (Discovery)": [cell1, cell2, cell3, cell4, ...],
 "Year": [cell1, cell2, cell3, cell4, ...],
 "Month": [cell1, cell2, cell3, cell4, ...],
 "Oil - saleable div[mill Sm3]": [cell1, cell2, cell3, cell4, ...],
 "Gas - saleable div[bill Sm3]": [cell1, cell2, cell3, cell4, ...],
 "NGL - saleable div[mill Sm3]": [cell1, cell2, cell3, cell4, ...],
 "Condensate - saleable div[mill Sm3]": [cell1, cell2, cell3, cell4, ...],
 "Oil equivalents - saleable div[mill Sm3]": [cell1, cell2, cell3, cell4, ...],
 "Water - wellbores div[mill Sm3]": [cell1, cell2, cell3, cell4, ...],
 "NPDID information carrier": [cell1, cell2, cell3, cell4, ...]
}
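
For the first option (headers plus one row's data as a single record), something like the following, using the headers_list and final variables from my script further down, is roughly what I have in mind:

# one dict per table row, keys taken from the header cells
records_by_row = [dict(zip(headers_list, row)) for row in final]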

Desired per record for the second option (year/month fields concatenated)

{
 "Field (Discovery)": "asset1",
 "Oil - saleable":
  [
   {"yearmonth": "201701", "unit": "mmboe", "value": "1234.456"},
   {"yearmonth": "201702", "unit": "mmboe", "value": "124.46"}],
 "Gas - saleable":
  [
   {"yearmonth": "201701", "unit": "bill Sm3", "value": "1234.456"},
   {"yearmonth": "201702", "unit": "mill Sm3", "value": "14.456"}],
 "NGL - saleable":
  [
   {"yearmonth": "201704", "unit": "mill Sm3", "value": "1.456"},
   {"yearmonth": "201706", "unit": "bill Sm3", "value": "14.456"}],
 "Condensate - saleable":
  [
   {"yearmonth": "201701", "unit": "mill Sm3", "value": "23.60"},
   {"yearmonth": "201608", "unit": "mill Sm3", "value": "4.4"}],
 "NPDID information carrier": "43765"
}
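
To illustrate the kind of loop I think is needed, something roughly along these lines (only two of the value columns shown, units left out, and again using the final list of 10-cell rows from my script below):

# rough idea only: one record per field, with the year/month rows nested inside it
records = {}
for row in final:
    field, year, month = row[0], row[1], row[2]
    rec = records.setdefault(field, {
        "Field (Discovery)": field,
        "Oil - saleable": [],
        "Gas - saleable": [],
        "NPDID information carrier": row[9],
    })
    yearmonth = year + month.zfill(2)
    rec["Oil - saleable"].append({"yearmonth": yearmonth, "value": row[3]})
    rec["Gas - saleable"].append({"yearmonth": yearmonth, "value": row[4]})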

Current Python script

import requests
from bs4 import BeautifulSoup
import json
import boto3
import botocore
from datetime import datetime
from collections import OrderedDict

starttime = datetime.now()

# Agent detail to prevent scraping bot detection
user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 '
              'Safari/537.36')

header = {'User-Agent': user_agent}

# Webpage connection
html = ("http://factpages.npd.no/ReportServer?/FactPages/TableView/"
        "field_production_monthly&rs:Command=Render&rc:Toolbar=false"
        "&rc:Parameters=f&Top100=True&IpAddress=108.171.128.174&CultureCode=en")

r = requests.get(html, headers=header)
c = r.content
soup = BeautifulSoup(c, "html.parser")

# Data cells and header cells are identified by their CSS classes
rows = soup.findAll('td', {
    'class': ['a61cl', 'a65cr', 'a69cr', 'a73cr', 'a77cr', 'a81cr',
              'a85cr', 'a89cr', 'a93c', 'a97cr']})

headers = soup.findAll('td', {
    'class': ['a20c', 'a24c', 'a28c', 'a32c', 'a36c', 'a40c', 'a44c',
              'a48c', 'a52c', 'a56c']})

headers_list = [item.getText('div') for item in headers]
rows_list = [item.getText('div') for item in rows]

# Chunk the flat list of data cells into rows of 10 columns each
final = [rows_list[item:item + 10] for item in range(0, len(rows_list), 10)]

# Build a dict of header -> list of every column value under that header
row_header = {}
for item in final:
    for indices in range(0, 10):
        if headers_list[indices] not in row_header:
            row_header[headers_list[indices]] = [item[indices]]
        else:
            row_header[headers_list[indices]].append(item[indices])

# Dump the per-header dict straight to file as JSON
with open('data.txt', 'wt') as outfile:
    json.dump(row_header, outfile, indent=4, sort_keys=True,
              separators=(',', ': '), ensure_ascii=False)


#Time
runtime = datetime.now() - starttime
print(runtime)
Chris

1 Answer


Without seeing the HTML it is hard to tell, but I am guessing parsing your table would be better done like this: python BeautifulSoup parsing table. That way you can be sure the headers and values match up.

I think it would be a disservice to your education to give the answer. So I hope the following helps:

If we have a list of items (rows!) like [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)] and we want to turn them into {'a': [1, 2, 3], 'b': [1, 2, 3]},

then we could just use a dictionary:

# group the second element of each pair under the first
d = {}
for v1, v2 in items:
    d.setdefault(v1, []).append(v2)

setdefault here is just a shortcut for using an empty list as the default value for each new key.
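
In other words, d.setdefault(v1, []).append(v2) is just a compact way of writing:

# equivalent to the setdefault call above
if v1 not in d:
    d[v1] = []
d[v1].append(v2)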

After this, d.items() will have the grouped values. The next step would then be to incorporate the headers. Perhaps start with just one column, Oil – saleable, at first and build from there.
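
For instance, a rough sketch for just that one column, assuming each entry of your final list is a 10-cell row in the same order as your headers, might be:

# group only the Oil - saleable values, one list per field name
oil_by_field = {}
for row in final:
    field, year, month, oil = row[0], row[1], row[2], row[3]
    oil_by_field.setdefault(field, []).append(
        {"yearmonth": year + month.zfill(2), "value": oil})

Once that works for one column, the others can be added the same way.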

typingduck
  • Thanks for the help, I would like to understand how to get it working myself so I appreciate the assistance. I'll do a bit of studying later to see if I can get some bs dictionaries implemented. – Chris Nov 15 '17 at 11:11
  • If it helps, feel free to ask for clarifying questions here :) – typingduck Nov 15 '17 at 13:36
  • Got a little more understanding of nested dictionaries, though I'm not entirely sure now whether I'm getting slightly lost. I have the following producing essentially what I want for one row; it just needs to loop for each row and loop internally to nest the multiple rows per `'fieldname'` record based on the `'yearmonth'`. [code](https://pyfiddle.io/fiddle/909beab3-d34a-4099-b8fc-04eaecfcd85b/?i=true) – Chris Nov 16 '17 at 14:30