I have a task that I think my Python skills fall slightly short of. I have oil production data scraped from an open data website and I am looking to turn it into JSON.
So far, with some help, I have worked through the HTML table tags to build a list of headers and a list of the row data. What I am struggling with is nesting some of this data within one JSON record.
Right now I can produce the JSON either by row or per header: one header with all of its column data, then the next header with its data, and so on (as below).
Ideally I would like one of two structures. The first is one record per row: the headers with row 1's data assigned to them as one record, then the headers again with row 2's data, then another record with row 3, and so on. The second option, which will make sense if you look at the table source, is one record per oil field; a field might have multiple rows for different year/month production values.
I would like to capture all of that in one JSON record per field: the multiple year/month rows, should they exist, nested within that single record.
Essentially I think it needs a few loops over the distinct HTML table cells, which should be achievable but is slightly outside my Python capabilities. I hope this makes sense; I have tried to be quite descriptive.
JSON by Row
[
"\u00c5SGARD",
"2017",
"8",
"0.19441",
"0.81545",
"0.26954",
"0.00000",
"1.27940",
"0.07432",
"43765"
]
Per Headers (every data column under these fields)
{
"Field (Discovery)": [cell1, cell2, cell3, cell4, ...],
"Year": [cell1, cell2, cell3, cell4, ...],
"Month": [cell1, cell2, cell3, cell4, ...],
"Oil - saleable div[mill Sm3]": [cell1, cell2, cell3, cell4, ...],
"Gas - saleable div[bill Sm3]": [cell1, cell2, cell3, cell4, ...],
"NGL - saleable div[mill Sm3]": [cell1, cell2, cell3, cell4, ...],
"Condensate - saleable div[mill Sm3]": [cell1, cell2, cell3, cell4, ...],
"Oil equivalents - saleable div[mill Sm3]": [cell1, cell2, cell3, cell4, ...],
"Water - wellbores div[mill Sm3]": [cell1, cell2, cell3, cell4, ...],
"NPDID information carrier": [cell1, cell2, cell3, cell4, ...]
}
Desired per record (year:month fields concatenated)
{
" Field (Discovery)":"asset1" ,
" Oil – saleable":
[
{ "yearmonth":"201701","unit":"mmboe","value":"1234.456"},
{ "yearmonth ":"201702","unit":"mmboe","value":"124.46"}],
"Gas - saleable":
[
{"yearmonth":"201701","unit":" bill Sm3","value":"1234.456"},
{" yearmonth ":"201702","unit":"mill Sm3","value":"14.456"}],
"NGL - saleable ":
[
{"yearmonth":"201704","unit":"mill Sm3","value":"1.456"},
{" yearmonth ":"201706","unit":" bill Sm3","value":"14.456"}],
"Condensate – saleable":
[
{"yearmonth":"201701","unit":" mill Sm3","value":"23.60"},
{"yearmonth ":"201608","unit":"mill Sm3","value":"4.4"}],
"NPDID information carrier":"43765"
}
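To give an idea of what I mean, here is a minimal sketch of the grouping loop, assuming the headers_list and final variables built in the script below, and assuming the column order and unit strings shown in the header list above (the record keys are chosen to match the desired example; adjust the indices and units if the table differs):
import json
from collections import OrderedDict

# Map each measure's column index to a (record key, unit) pair.
# NOTE: these indices and unit strings are assumptions read off the
# table headers above -- adjust if the column order differs.
measures = {
    3: ("Oil - saleable", "mill Sm3"),
    4: ("Gas - saleable", "bill Sm3"),
    5: ("NGL - saleable", "mill Sm3"),
    6: ("Condensate - saleable", "mill Sm3"),
}

records = OrderedDict()
for row in final:  # row = [field, year, month, oil, gas, ngl, cond, o.e., water, npdid]
    field = row[0]
    if field not in records:
        records[field] = OrderedDict()
        records[field]["Field (Discovery)"] = field
        for _idx, (name, _unit) in measures.items():
            records[field][name] = []
        records[field]["NPDID information carrier"] = row[9]
    # Concatenate year and zero-padded month, e.g. "2017" + "8" -> "201708"
    yearmonth = "%s%02d" % (row[1], int(row[2]))
    for idx, (name, unit) in measures.items():
        records[field][name].append(
            {"yearmonth": yearmonth, "unit": unit, "value": row[idx]})

with open('data_by_field.json', 'wt') as outfile:  # illustrative filename
    json.dump(list(records.values()), outfile, indent=4, ensure_ascii=False)
Each field then becomes one record, with its year/month rows nested under each measure, which is the shape shown in the desired example above.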
Current Python script
import requests
from bs4 import BeautifulSoup
import json
import boto3
import botocore
from datetime import datetime
from collections import OrderedDict
starttime = datetime.now()
# Agent detail to prevent scraping bot detection
user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 '
              'Safari/537.36')
header = {'User-Agent': user_agent}
# Webpage connection
url = ('http://factpages.npd.no/ReportServer?/FactPages/TableView/'
       'field_production_monthly&rs:Command=Render&rc:Toolbar=false'
       '&rc:Parameters=f&Top100=True&IpAddress=108.171.128.174&CultureCode=en')
r = requests.get(url, headers=header)
c = r.content
soup = BeautifulSoup(c, "html.parser")
# Data cells and header cells are identified by their CSS classes
rows = soup.findAll('td', {
    'class': ['a61cl', 'a65cr', 'a69cr', 'a73cr', 'a77cr', 'a81cr',
              'a85cr', 'a89cr', 'a93c', 'a97cr']})
headers = soup.findAll('td', {
    'class': ['a20c', 'a24c', 'a28c', 'a32c', 'a36c', 'a40c', 'a44c',
              'a48c', 'a52c', 'a56c']})
headers_list = [item.getText('div') for item in headers]
rows_list = [item.getText('div') for item in rows]
# Chunk the flat cell list into rows of 10 columns each
final = [rows_list[item:item + 10] for item in range(0, len(rows_list), 10)]
# Per-header structure: each header maps to all of its column values
row_header = {}
for item in final:
    for indices in range(0, 10):
        if headers_list[indices] not in row_header:
            row_header[headers_list[indices]] = [item[indices]]
        else:
            row_header[headers_list[indices]].append(item[indices])
result = json.dumps(row_header, indent=4, sort_keys=True,
                    ensure_ascii=False)
with open('data.txt', 'wt') as outfile:
    # result is already a JSON string, so write it out directly;
    # passing it through json.dump again would double-encode it
    outfile.write(result)
#Time
runtime = datetime.now() - starttime
print(runtime)
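For completeness, the simpler first option (one record per row) is a one-liner with zip, reusing headers_list and final from the script above (data_by_row.json is just an illustrative filename):
# One dict per table row: header -> that row's cell value
records_by_row = [dict(zip(headers_list, row)) for row in final]

with open('data_by_row.json', 'wt') as outfile:
    json.dump(records_by_row, outfile, indent=4, ensure_ascii=False)
On Python versions before 3.7 a plain dict does not preserve insertion order, so OrderedDict(zip(headers_list, row)) would keep the keys in table-column order.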