Parsing text block in HTML with BeautifulSoup (IndustryAbout)

Question

I would like to parse entries for mines from IndustryAbout. In this example I'm working on the Kevitsa Copper Concentrator.

The interesting block in the HTML is:

<strong>Commodities: Copper, Nickel, Platinum, Palladium, Gold</strong><br /><strong>Area: Lappi</strong><br /><strong>Type: Copper Concentrator Plant</strong><br /><strong>Annual Production: 17,200 tonnes of Copper (2015), 8,800 tonnes of Nickel (2015), 31,900 tonnes of Platinum, 25,100 ounces of Palladium, 12,800 ounces of Gold (2015)</strong><br /><strong>Owner: Kevitsa Mining Oy</strong><br /><strong>Shareholders: Boliden AB (100%)</strong><br /><strong>Activity since: 2012</strong>

I've written a (basic) working parser (code below), which gives me:

<strong>Commodities: Copper, Nickel, Platinum, Palladium, Gold</strong>
<strong>Area: Lappi</strong>
<strong>Type: Copper Concentrator Plant</strong>
....

But I would like to get $commodities, $type, $annual_production, $shareholders and $activity as separate variables. How can I do this? With regular expressions?

import requests
from bs4 import BeautifulSoup
import re

page = requests.get("https://www.industryabout.com/country-territories-3/2199-finland/copper-mining/34519-kevitsa-copper-concentrator-plant") 
soup = BeautifulSoup(page.content, 'lxml')

rows = soup.select("strong")

for r in rows:
    print(r)

2nd version:

import requests
from bs4 import BeautifulSoup
import re
import csv

links = ["34519-kevitsa-copper-concentrator-plant", "34520-kevitsa-copper-mine", "34356-glogow-copper-refinery"]

for l in links:

    page = requests.get("https://www.industryabout.com/country-territories-3/2199-finland/copper-mining/"+l)
    soup = BeautifulSoup(page.content, 'lxml')
    rows = soup.select("strong")
    d = {}

    for r in rows:
        name, value, *rest = r.text.split(":")
        if not rest:
            d[name] = value
    print(d)

score 0 · Answer 1 · answered May 05 '18 at 20:41

Does this do what you want?

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.industryabout.com/country-territories-3/2199-finland/copper-mining/34519-kevitsa-copper-concentrator-plant")
soup = BeautifulSoup(page.content, 'html.parser')

rows = soup.select("strong")
d = {}
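for r in rows:
    parts = r.text.split(":")
    # Keep only "name: value" texts. Text without a ":" would otherwise raise
    # a ValueError; links or scripts have more ":" and are probably not
    # interesting for you.
    if len(parts) == 2:
        name, value = parts
        d[name.strip()] = value.strip()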
print(d)
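
The loop above keeps only texts that split into exactly two parts, so a value that itself contains a colon is dropped. If that matters, one alternative is to split on the first colon only with str.partition. The following is a minimal sketch under stated assumptions, not a tested scraper: the strings are copied from the HTML block in the question (standing in for r.text of each <strong> element), the "No colon here" entry is made up, and the names texts and to_fields are hypothetical.

```python
# Texts of the <strong> elements quoted in the question (what r.text would yield).
texts = [
    "Commodities: Copper, Nickel, Platinum, Palladium, Gold",
    "Area: Lappi",
    "Type: Copper Concentrator Plant",
    "Owner: Kevitsa Mining Oy",
    "Shareholders: Boliden AB (100%)",
    "Activity since: 2012",
    "No colon here",  # made-up entry to show that malformed text is skipped
]


def to_fields(texts):
    """Build {label: value} from 'Label: value' strings, splitting on the first colon only."""
    fields = {}
    for text in texts:
        label, sep, value = text.partition(":")
        label, value = label.strip(), value.strip()
        # sep is empty when the text has no colon at all
        if sep and label and value:
            fields[label] = value
    return fields


fields = to_fields(texts)

# Separate variables, as asked for in the question; .get() returns None for a missing label.
commodities = [c.strip() for c in fields.get("Commodities", "").split(",") if c.strip()]
plant_type = fields.get("Type")
shareholders = fields.get("Shareholders")
activity_since = fields.get("Activity since")

print(commodities, plant_type, shareholders, activity_since)
```

Note the trade-off: this variant also keeps texts from links or scripts that happen to contain a colon, which the exactly-two-parts check filters out. Picking known labels out of the dictionary with .get() limits the impact of that.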

answered May 05 '18 at 20:41

MegaIng

How can I add a CSV writer? It would make this easier for me and more complete for other people who would like to parse this page. – pickenpack May 05 '18 at 20:48
@pickenpack What do you mean by 'add a CSV writer'? – MegaIng May 05 '18 at 20:51
I tried to put the code here, but failed to do so. In my question above I've added a "2nd version", where I'm trying to make the parser fool-proof (even if an entry is missing) and add CSV output of the array. – pickenpack May 05 '18 at 21:01
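
For the CSV output asked about in these comments, one possible approach is csv.DictWriter from the standard library, which can leave a cell empty when a page lacks an entry. This is only a sketch under assumptions: FIELDNAMES and write_rows are hypothetical names, the column list is a guess at the labels of interest, the first record reuses values from the question, and the second record is made up to show a missing entry.

```python
import csv
import io

# Hypothetical column list; adjust it to the labels you care about.
FIELDNAMES = [
    "Commodities", "Area", "Type", "Annual Production",
    "Owner", "Shareholders", "Activity since",
]


def write_rows(records, out):
    """Write one CSV row per dict; missing labels become empty cells, unknown labels are ignored."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)


# Example records: one dict per parsed page, like the d built in the loop.
records = [
    {"Commodities": "Copper, Nickel, Platinum, Palladium, Gold", "Area": "Lappi", "Activity since": "2012"},
    {"Area": "Somewhere", "Unexpected label": "ignored"},  # made-up, incomplete entry
]

buffer = io.StringIO()  # in-memory stand-in for a file
write_rows(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open("mines.csv", "w", newline="") as out; the file name is arbitrary, and newline="" is what the csv module documentation recommends so that no extra blank lines appear on Windows.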
