How can I convert the beautiful soup text to JSON object?

Question

I'm trying to convert the data I scrape from the URL into a JSON object.

import bs4 as bs
from urllib.request import Request, urlopen
import json

req = Request('https://www.worldometers.info/gdp/albania-gdp/',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = bs.BeautifulSoup(webpage, 'html.parser')

gdp = soup.select_one('span[style="margin-right:7px"]')
# print('gdp:', type(gdp.text))

gdp_growth_rate = soup.find('br').next_sibling
# print('gdp growth_rate', type(gdp_growth_rate.text))

gdp_historic = soup.find(
    'table', class_='table table-striped table-bordered table-hover table-condensed table-list')
# print('gdp historic:', type(gdp_historic.text), sep='\n')

The idea is to convert the data I get from the table to JSON. The purpose of this is to create an API.

HedgeHog · Answer 1 · 2022-04-06T12:30:50.137

In general

How can I convert the beautiful soup text to json object?

You can convert any JSON-serializable Python object (dict, list, tuple, str, ...) into a JSON string with the json.dumps() function:

json.dumps(
    dict(
        gdp = soup.select_one('span[style="margin-right:7px"]').text
    )
)

Output:

{"gdp": "$13,038,538,300"}
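
As a small illustration of json.dumps(), here is a minimal sketch that uses only the standard library. The sample dictionary and the Unserializable class are made up for illustration; only the "gdp" value is taken from the output above. It also shows that json.dumps() raises a TypeError for objects it has no JSON mapping for, which is presumably why .text is taken from the tag before dumping.

```python
import json

# Hypothetical sample value, mirroring the scraped text shown above
payload = {"gdp": "$13,038,538,300", "years": [2017, 2016], "pair": (1, 2)}

# dict -> JSON object, list/tuple -> JSON array, str -> JSON string
json_string = json.dumps(payload)
print(json_string)

# json.loads() is the inverse; note that the tuple comes back as a list
round_trip = json.loads(json_string)


class Unserializable:
    """Stand-in for any object without a JSON mapping (illustrative only)."""


try:
    json.dumps({"gdp": Unserializable()})
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

Passing plain strings and numbers keeps the call simple; for custom objects, json.dumps() also accepts a default= callable that converts them first.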

Table to JSON

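In my opinion, the best practice for scraping a basic table is pandas.read_html(). It can use BeautifulSoup under the hood (by default it tries lxml first and falls back to bs4 with html5lib) and returns DataFrames, which can be exported in several formats, e.g. with .to_json().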

It is not clear from your question which JSON format you expect.

DataFrame.to_json() takes an orient parameter that might be useful for getting the format you need. The default for a DataFrame is 'columns', which leads to a dict-like structure {column -> {index -> value}}.
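
To make the effect of orient concrete, here is a minimal sketch on a two-row DataFrame typed in by hand rather than scraped. The variable names are illustrative, and the results are parsed back with json.loads() only to make the structure easy to inspect.

```python
import json

import pandas as pd

# Two hand-typed rows standing in for the scraped table (illustrative only)
df = pd.DataFrame({"Year": [2017, 2016], "GDP change": ["3.84%", "3.35%"]})

# Default orient for a DataFrame is "columns": {column -> {index -> value}}
by_columns = json.loads(df.to_json())

# "records" gives one object per row, which is often convenient for an API
by_records = json.loads(df.to_json(orient="records"))

# "split" keeps column names, index and row values in separate lists
by_split = json.loads(df.to_json(orient="split"))

print(by_columns)
print(by_records)
print(by_split)
```

For an API that returns one object per year, orient="records" is probably the closest fit, but that depends on what the consumer expects.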

Example

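from io import StringIO

import pandas as pd
import requests

html = requests.get('https://www.worldometers.info/gdp/albania-gdp/',
                    headers={'User-Agent': 'Mozilla/5.0'}).text
# Newer pandas versions expect a file-like object rather than a raw HTML string
pd.read_html(StringIO(html))[1].to_json()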

Output

First 5 rows as sample.

{"Year":{"0":2017,"1":2016,"2":2015,"3":2014,"4":2013},"GDP Nominal (Current USD)":{"0":"$13,038,538,300","1":"$11,883,682,171","2":"$11,386,931,490","3":"$13,228,247,844","4":"$12,776,280,961"},"GDP Real (Inflation adj.)":{"0":"$13,986,932,579","1":"$13,470,274,302","2":"$13,033,647,123","3":"$12,750,584,155","4":"$12,528,823,971"},"GDP change":{"0":"3.84%","1":"3.35%","2":"2.22%","3":"1.77%","4":"1.00%"},"GDP per capita":{"0":"$4,850","1":"$4,667","2":"$4,509","3":"$4,402","4":"$4,315"},"Pop. change":{"0":"-0.08 %","1":"-0.14 %","2":"-0.20 %","3":"-0.26 %","4":"-0.35 %"},"Population":{"0":2884169,"1":2886438,"2":2890513,"3":2896305,"4":2903790}}

score 1 · Answer 2 · answered Apr 06 '22 at 08:28

The table extraction is mostly answered here, although not the column names.

I have used the same approach, but replaced urllib with requests, which is more convenient for this kind of download. I have also used pandas to put the extracted rows into a DataFrame and export them to JSON easily.

# These libraries are easiest
from bs4 import BeautifulSoup
import requests
import json
import pandas as pd

# Download page
req = requests.get('https://www.worldometers.info/gdp/albania-gdp/',
              headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(req.content, 'html.parser')

# Extract table
gdp_historic = soup.find(
    'table', class_='table table-striped table-bordered table-hover table-condensed table-list')

table_body = gdp_historic.find('tbody')
data = []
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values


# Extract column names
# find_all() is the current name; findAll() is a deprecated alias
colnames = [heading.text.strip() for heading in gdp_historic.find_all('th')]


# Convert to json
pd.DataFrame(data, columns=colnames).to_json()
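
If pandas is not wanted as a dependency, the same data and colnames lists can presumably be turned into JSON with the standard library alone. The sketch below is only an illustration: it uses hand-typed sample rows in place of the scraped ones and assumes every row has exactly one cell per column.

```python
import json

# Hand-typed stand-ins for the scraped `colnames` and `data` (illustrative only)
colnames = ["Year", "GDP Nominal (Current USD)", "GDP change"]
data = [
    ["2017", "$13,038,538,300", "3.84%"],
    ["2016", "$11,883,682,171", "3.35%"],
]

# Pair each cell with its column name: one dict per table row
records = [dict(zip(colnames, row)) for row in data]

# A list of dicts serializes to a JSON array of objects
records_json = json.dumps(records, indent=2)
print(records_json)
```

Note that zip() silently stops at the shorter sequence, so dropping empty cells as in the loop above could shift values into the wrong column if a row ever has a blank cell.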