
community,

I have recently picked up coding and am currently working on my first project using Python 3, urllib, BeautifulSoup, regex, and SQLite. My goal is to assemble a database of startups including their name, website URL, industry, and street address.

My approach so far: I pass in the link to a VC's portfolio page, extract the links to the portfolio companies, fetch the HTML of each company's imprint page, and save all of that into an SQLite database. Most of it works (apart from some edge cases). My problem is extracting the street addresses from the HTML. I have tried multiple regex approaches, but none of them really worked because the formatting is so diverse across the different websites. Additionally, this post made me feel like I am on the wrong path trying to use regex on HTML in the first place.

Any suggestions on how to get the job done?

(I know that there are already a few existing databases out there (Crunchbase, ...) that I could be using, but I'd prefer to use this opportunity to practice web scraping.)

Cheers, Marc

import re
import sqlite3
import ssl
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Database that will hold the scraped startup data
conn = sqlite3.connect('startupmap.sqlite')
cur = conn.cursor()

# Note: this disables SSL certificate verification
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Fetch the raw HTML (bytes) of one example imprint page
document = urlopen('https://about.eslgaming.com/imprint/', context=ctx)
html = document.read()

print('Type html:', type(html))
soup = BeautifulSoup(html, 'html.parser')
html_text = soup.get_text()  # visible text only, tags stripped
print(html_text)
print('Type html_text:', type(html_text))

print(soup.prettify())
# Raw string so that \s is passed to the regex engine unchanged:
# whitespace, a 5-digit postal code, whitespace, then a city name
address = re.findall(r'\s[0-9]{5}\s+[a-zA-ZäöüÄÖÜ]+', html_text)
print('Address:', address)

And this is the piece of information (highlighted in yellow) I am trying to get:

Company address from the imprint page

Some of the formats I came across (there are many more):

The individual pages vary in terms of information provided (company's legal name, street + nr., zip, city, country, legal identification numbers such as tax id, or other information), styling (within text, in one line, on separate lines), and what type of container they are in (div, paragraph, ...)

Sometimes, the addresses are provided within the same XPath as an h1 or h2 with the text 'Imprint' or 'Contact'. Unfortunately, this is far from being the standard.
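
For the pages that do follow this heading pattern, a minimal sketch of how it could be exploited looks like the following. The sample markup, the `HEADING_RE` pattern, the list of sibling tag names, and the `text_after_heading` helper are all made up for illustration; real imprint pages will differ, and this does nothing for pages where the address is not next to such a heading.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a real imprint page
SAMPLE_HTML = """
<html><body>
<div>
  <h2>Imprint</h2>
  <p>Example GmbH<br>Musterstrasse 12<br>50667 Köln<br>Germany</p>
</div>
</body></html>
"""

# Assumed heading texts; extend as needed for other languages or wordings
HEADING_RE = re.compile(r"^\s*(imprint|impressum|contact|kontakt)\s*$", re.IGNORECASE)


def text_after_heading(html):
    """Return the text lines of the first container that follows an
    'Imprint'/'Contact' style h1 or h2, or an empty list if none is found."""
    soup = BeautifulSoup(html, "html.parser")
    for heading in soup.find_all(["h1", "h2"]):
        if HEADING_RE.match(heading.get_text()):
            # Assumption: the address sits in the next p/div/address sibling
            sibling = heading.find_next_sibling(["p", "div", "address"])
            if sibling is not None:
                # stripped_strings yields each text fragment without
                # surrounding whitespace, so <br>-separated lines come apart
                return list(sibling.stripped_strings)
    return []


lines = text_after_heading(SAMPLE_HTML)
print(lines)
```

The returned lines still have to be interpreted (which one is the street, which one the city), but the search space is much smaller than the full page text.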

  • Do you have any code example? please get through [example] – dboy Jul 10 '20 at 18:18
  • This might be too broad/vague. Please see [ask], [help/on-topic]. – AMC Jul 10 '20 at 18:30
  • Could you provide a couple of examples of the different formats you came across? It would be interesting to see whether the address is always nested inside a paragraph tag or whether it's always the n'th sibling of another - easier - identifiable tag, like the `h2` heading or email-address. – Gregor Jul 12 '20 at 09:12
  • Parsing HTML with regular expressions is indeed a bad idea in the general case (you would have to count opening and closing tags, account for void tags https://html.spec.whatwg.org/multipage/syntax.html#void-elements, and so on). In your case, however, you don't want to use regex to extract/parse the HTML structure (that is why you used the `BeautifulSoup` `html.parser` in the first place); you want to apply it to the text inside a specific tag (that you found using `BeautifulSoup`). – Gregor Jul 12 '20 at 09:19
  • Hi Marc, was wondering if you managed to solve this issue, because I have to do the same thing. Regards! – Lucas Mengual Jun 09 '21 at 10:19
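
A rough sketch of the split suggested in the comments above: let BeautifulSoup deal with the HTML structure and apply the regex only to the resulting text fragments. The sample markup, the `ZIP_CITY_RE` pattern, and the `find_addresses` helper are hypothetical. The sketch also assumes German-style addresses (5-digit postal code followed by the city) with the street on the preceding text line, which will not hold for every site.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a real imprint page
SAMPLE_HTML = """
<div>
  <p>Example GmbH</p>
  <p>Musterstrasse 12</p>
  <p>50667 Köln</p>
</div>
"""

# Assumed format: 5-digit postal code, whitespace, then a city name
ZIP_CITY_RE = re.compile(r"\b(\d{5})\s+([A-Za-zÄÖÜäöüß][A-Za-zÄÖÜäöüß .\-]*)")


def find_addresses(html):
    """Return a list of dicts with 'street', 'zip' and 'city' guesses."""
    soup = BeautifulSoup(html, "html.parser")
    # One entry per text fragment, regardless of the enclosing tag
    lines = list(soup.stripped_strings)
    results = []
    for i, line in enumerate(lines):
        match = ZIP_CITY_RE.search(line)
        if match:
            # Assumption: the street is the text fragment just before the zip line
            street = lines[i - 1] if i > 0 else None
            results.append({
                "street": street,
                "zip": match.group(1),
                "city": match.group(2).strip(),
            })
    return results


addresses = find_addresses(SAMPLE_HTML)
print(addresses)
```

Because the regex only ever sees plain text, it does not matter whether the address sits in a div, a paragraph, or a run of `<br>`-separated lines. Any 5-digit number followed by a word will match, though, so the results would still need filtering.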

0 Answers