Hi community,
I have recently picked up coding and am currently working on my first project using Python 3, urllib, BeautifulSoup, regex, and SQLite. My goal is to assemble a database of startups including their name, website URL, industry, and street address.
My approach so far: I feed in the link to a VC's portfolio page, extract the links to the portfolio companies, fetch the HTML of each company's imprint page, and save all of that into an SQLite database. Most of it works (apart from some edge cases). My problem is extracting the street addresses from the HTML. I have tried multiple regex approaches, but none of them worked reliably because the formatting is so diverse across the different websites. Additionally, this post made me feel like I am on the wrong path trying to use regex on HTML in the first place.
Any suggestions on how to get the job done?
(I know that there are a few existing databases out there (Crunchbase, ...) that I could use, but I'd prefer to use this opportunity to practice web scraping.)
Cheers, Marc
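import sqlite3
import urllib.error
import ssl
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Database that will eventually hold the scraped startup data
conn = sqlite3.connect('startupmap.sqlite')
cur = conn.cursor()

# Skip SSL certificate verification (fine for practice, unsafe in production)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Fetch the imprint page as raw bytes
document = urlopen('https://about.eslgaming.com/imprint/', context=ctx)
html = document.read()
print('Type html:', type(html))

# Parse the HTML and extract the visible text
soup = BeautifulSoup(html, 'html.parser')
html_text = soup.get_text()
print(html_text)
print('Type html_text:', type(html_text))
print(soup.prettify())

# Look for a 5-digit zip code followed by a city name
# (raw string so that \s is not treated as an invalid string escape)
address = re.findall(r'\s[0-9]{5}\s+[a-zA-ZäöüÄÖÜ]+', html_text)
print('Address:', address)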
And this is the piece of information (highlighted in yellow) I am trying to get:
[Screenshot: company address from the imprint page]
Some of the formats I came across (there are many more):
- Street, Nr., Zip, City in one line inside a paragraph with other information (https://n26.com/en-de/imprint)
- Company name, street + Nr., Zip + City, country on separate lines in a div block (https://www.getyourguide.com/legal)
- 'Address of company' title, street + Nr., Zip + City, country on separate lines inside a paragraph (https://www.wirecard.com/de/impressum)
The individual pages vary in terms of the information provided (company's legal name, street + nr., zip, city, country, legal identification numbers such as tax id, or other information), the styling (within text, in one line, on separate lines), and the type of container the address is in (div, paragraph, ...).
Sometimes the address is provided within the same xpath as an h1 or h2 with the text 'Imprint' or 'Contact'. Unfortunately, this is far from being the standard.
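To make the formats above more concrete, here is a rough sketch of a line-based idea that works on the extracted text rather than on the raw HTML. It assumes German-style addresses (a 5-digit zip followed by a city) and that the text was obtained with something like `soup.get_text(separator='\n')` so that block boundaries become line breaks. The names `find_addresses`, `ZIP_CITY`, and `STREET` and the sample company are made up for illustration, and the heuristic will certainly miss many real pages.

```python
import re

# Assumption: German-style zip (5 digits) followed by a city name.
ZIP_CITY = re.compile(r"\b(\d{5})\s+([A-Za-zÄÖÜäöüß][A-Za-zÄÖÜäöüß .\-]*)")
# Assumption: street name followed by a house number at the end of the text,
# e.g. "Musterstraße 12a". Commas separate it from a preceding company name.
STREET = re.compile(r"([^\d,]+?)\s+(\d+\s?[a-zA-Z]?)$")


def find_addresses(text):
    """Return a list of {'street', 'zip', 'city'} dicts found in plain text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    results = []
    for i, line in enumerate(lines):
        zip_match = ZIP_CITY.search(line)
        if not zip_match:
            continue
        # Format 1: the street sits on the same line, before the zip.
        before = line[:zip_match.start()].strip(" ,")
        # Formats 2 and 3: the street sits on the previous non-empty line.
        candidate = before if before else (lines[i - 1] if i > 0 else "")
        street_match = STREET.search(candidate)
        if street_match:
            street = f"{street_match.group(1).strip()} {street_match.group(2)}"
            results.append({
                "street": street,
                "zip": zip_match.group(1),
                "city": zip_match.group(2).strip(),
            })
    return results


# Hypothetical sample text, not taken from any real imprint page
sample = "Imprint\nExample GmbH\nMusterstraße 12a\n12345 Musterstadt\nGermany"
print(find_addresses(sample))
```

The zip line serves as the anchor because it is the most regular part of the address. The street is then looked up relative to it, which covers both the one-line and the separate-lines layouts without depending on the type of container.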