
I have the following HTML:

<div class="col-sm-8"
                data-pdf-class="column8">
                <a target='_blank' href='https://datacvr.virk.dk/data/visenhed?enhedstype=person&id=4003893917'>Tove Kjeldsen</a><br/>Lundevangsvej 19<br/>2900 Hellerup<br/>Ejerandel: 5-9,99%<br/>Kapitalklasse: B<br/>Erhvervelsesdato: 30.06.1996        <br/><br/>
<a target='_blank' href='https://datacvr.virk.dk/data/visenhed?enhedstype=person&id=4004146416'>Inge Lise Klastrup</a><br/>Ærøgade 5<br/>8000 Aarhus C<br/>Ejerandel: 5-9,99%<br/>Kapitalklasse: B<br/>Erhvervelsesdato: 30.06.1996        <br/><br/>
<a target='_blank' href='https://datacvr.virk.dk/data/visenhed?enhedstype=person&id=4003886026'>Asta Johanne Kjeldsen</a><br/>Meldskiftet 9<br/>6950 Ringkøbing<br/>Ejerandel: 5-9,99%<br/>Stemmeandel: 33,33-49,99%<br/>Kapitalklasse: A, B<br/>Erhvervelsesdato: 30.06.1996        <br/><br/>
ASTA OG HENRY KJELDSENS FAMILIEFOND<br/>c/o Henry Kjeldsen<br/> Enghavevej 17<br/>6950 Ringkøbing<br/>Ejerandel: 25-33,32%<br/>Stemmeandel: 50-66,66%<br/>Kapitalklasse: A, B<br/>Erhvervelsesdato: 30.06.1996        <br/><br/>
<a target='_blank' href='https://datacvr.virk.dk/data/visenhed?enhedstype=person&id=4000019274'>Jens Lykke Kjeldsen</a><br/>Tranmose 2<br/>6950 Ringkøbing<br/>Ejerandel: 5-9,99%<br/>Kapitalklasse: A, B<br/>Erhvervelsesdato: 30.06.1996        <br/><br/>
<a target='_blank' href='https://datacvr.virk.dk/data/visenhed?enhedstype=person&id=4000271454'>Anne Birte Kjeldsen</a><br/>Enghavevej 13<br/>6950 Ringkøbing<br/>Ejerandel: 5-9,99%<br/>Kapitalklasse: B<br/>Erhvervelsesdato: 30.06.1996        <br/><br/>
HENRY KJELDSEN. RINGKØBING TØMMERHANDEL A/S<br/>Enghavevej 17<br/>6950 Ringkøbing<br/>Ejerandel: 33,33-49,99%<br/>Kapitalklasse: B<br/>Erhvervelsesdato: 30.06.1996        <br/><br/>
    </div>

I am trying to extract the names, but not all of them are wrapped in an 'a' tag. The output should be:

  • Tove Kjeldsen
  • Inge Lise Klastrup
  • Asta Johanne Kjeldsen
  • ASTA OG HENRY KJELDSENS FAMILIEFOND

and so on ...

Slavi
  • 2
    have you tried anything? – depperm Dec 16 '15 at 19:46
  • Getting all 'a' tags and then extracting the text would give me most of the names, that is, all but those that are not enclosed in an 'a' tag. I tried various ideas with regular expressions as well, but no luck, because the entries don't seem to follow the same "name pattern" when split on newlines – Slavi Dec 16 '15 at 19:49
  • 1
    You will not be able to do this programmatically unless you can explain how names are indicated. The only way I see of doing this right now is assuming that the text_content of the first element after successive br tags has the name; then it is easy, and really this question has already been asked in multiple forms – PyNEwbie Dec 16 '15 at 19:56
  • I can't read Danish, but isn't virk.dk (at least partially) about providing open access to public data? If so, perhaps there is a better way to access its content than scraping HTML? – Martin Valgur Dec 16 '15 at 20:04
  • Hi Martin, unfortunately not with this data. @PyNEwbie, I thought that as well, but in the example HTML the first element does not start with successive br tags, so it would be missed – Slavi Dec 16 '15 at 20:06
  • it can't be perfect, maybe scrape the first line and then follow on. You should be able to do some analytics to quickly tell. You have to have a rule and your rules will have to be heuristic it is what it is – PyNEwbie Dec 16 '15 at 20:09
  • The best I've come up with so far is to split on `<br/><br/>`, then check whether the item has an 'a' tag: if it does, get its text; if not, get the text up until the next `<br/>` tag. The issue with that is when the first item encountered doesn't have an 'a' tag, because the HTML provided has the top class data – Slavi Dec 16 '15 at 20:14
  • 1
    This is not a code-writing resource; please provide your code first – mrDinkelman Dec 16 '15 at 20:23
  • 2
    Possible duplicate of [Parsing HTML in Python](http://stackoverflow.com/questions/717541/parsing-html-in-python) – ivan_pozdeev Dec 17 '15 at 00:55

1 Answer


Although it is not entirely clear which names should be parsed from the HTML dump, I've found that this piece of code works well on it.

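import re

# Read the whole HTML dump into one string (UTF-8 encoding is assumed).
with open("/path/to/dump.html", "r", encoding="utf-8") as html_file:
    html = html_file.read()

# Names wrapped in a link: the text between <a ...> and </a>.
matches_temp1 = re.findall(r"<a[^>]*>([^<]+)</a>", html)
# Plain-text names: the first text after two consecutive <br/> tags.
matches_temp2 = re.findall(r"<br/><br/>\n?([^<]+)<br/>", html)
matches_result_total = matches_temp1 + matches_temp2

print(matches_result_total)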

For me, this yields the result:

['Tove Kjeldsen', 'Inge Lise Klastrup', 'Asta Johanne Kjeldsen', 'Jens Lykke Kjeldsen', 'Anne Birte Kjeldsen', 'ASTA OG HENRY KJELDSENS FAMILIEFOND', 'HENRY KJELDSEN. RINGKØBING TØMMERHANDEL A/S']

UPDATE:

As alecxe states, in most cases it is insane to use regex to parse HTML or any other complex structured language. However, if one knows how the HTML is structured, one can limit the reach of the regex to avoid dying a horrible death, as explained in the link alecxe provided. :)

Given the structure of this particular piece of HTML, I think regex should be safe to use with the small addition I made to my code below.

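import re

# Read the whole HTML dump into one string (UTF-8 encoding is assumed).
with open("/path/to/dump.html", "r", encoding="utf-8") as html_file:
    html = html_file.read()

# Linked names, now anchored to the two <br/> tags that end the previous record.
# The optional \n allows for the line break between records in the dump.
# The first record follows the opening <div>, so this pattern does not match it.
matches_temp1 = re.findall(r"<br/><br/>\n?<a[^>]*>([^<]+)</a><br/>", html)
# Plain-text names: the first text after two consecutive <br/> tags.
matches_temp2 = re.findall(r"<br/><br/>\n?([^<]+)<br/>", html)
matches_result_total = matches_temp1 + matches_temp2

print(matches_result_total)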

This now only matches a name that follows two consecutive `<br/>` tags, whether it is inside a link tag or plain text.
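
For comparison, the same rule (a name is the first piece of text in a record, and a record ends with two consecutive `<br/>` tags) can be sketched with an HTML parser instead of regular expressions. This is only an illustration, not code from the question or the answer: it uses the standard library's `html.parser` rather than BeautifulSoup, `NameCollector` and `extract_names` are hypothetical names, and `SAMPLE` is a shortened stand-in for the real dump.

```python
from html.parser import HTMLParser

# Shortened sample in the same shape as the question's HTML (assumption:
# every record ends with two consecutive <br/> tags).
SAMPLE = (
    '<div class="col-sm-8" data-pdf-class="column8">\n'
    "<a target='_blank' href='#'>Tove Kjeldsen</a><br/>Lundevangsvej 19<br/>"
    "2900 Hellerup<br/>Ejerandel: 5-9,99%<br/><br/>\n"
    "ASTA OG HENRY KJELDSENS FAMILIEFOND<br/>c/o Henry Kjeldsen<br/>"
    "6950 Ringkøbing<br/><br/>\n"
    "    </div>"
)


class NameCollector(HTMLParser):
    """Collect the first non-blank text of each record (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = []
        self.expect_name = True  # the first record follows the opening <div>
        self.br_run = 0          # consecutive <br/> tags seen so far

    def handle_starttag(self, tag, attrs):
        # <br/> reaches this method too: the default handle_startendtag
        # forwards self-closing tags to handle_starttag.
        if tag == "br":
            self.br_run += 1
            if self.br_run >= 2:
                self.expect_name = True

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # whitespace between tags does not break a <br/> run
        self.br_run = 0
        if self.expect_name:
            self.names.append(text)
            self.expect_name = False


def extract_names(html):
    parser = NameCollector()
    parser.feed(html)
    parser.close()
    return parser.names


print(extract_names(SAMPLE))
# ['Tove Kjeldsen', 'ASTA OG HENRY KJELDSENS FAMILIEFOND']
```

Because the parser starts out expecting a name, the first record is picked up whether or not it is wrapped in an 'a' tag.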

  • First of all, this is tagged with `beautifulsoup`; also see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. – alecxe Dec 16 '15 at 20:41
  • Hi folkert, thank you for your answer. This is what we discussed with PyNEwbie. Although it's perfect for the provided HTML, it won't work in the case where the first item is not in an "a" tag. I came across a solution where I'll split by `<br/><br/>` and then check if the resulting string has an "a" tag: if it does, get its text; if not, the text is everything between ">" and the next br tag. Thank you again :) – Slavi Dec 16 '15 at 20:42
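
The split-based idea from the last comment might look roughly like the sketch below. It is an assumption about what was meant, not Slavi's actual code: the function name `names_by_splitting` and the shortened `SAMPLE` string are made up for illustration, and instead of checking each piece for an "a" tag it simply strips every tag from the first field of each record, which covers both the linked and the plain-text case.

```python
import re

# Shortened stand-in for the real dump (assumption: same record layout).
SAMPLE = (
    '<div class="col-sm-8"\n'
    '                data-pdf-class="column8">\n'
    "<a target='_blank' href='#'>Tove Kjeldsen</a><br/>Lundevangsvej 19<br/>"
    "2900 Hellerup<br/>Kapitalklasse: B<br/><br/>\n"
    "HENRY KJELDSEN. RINGKØBING TØMMERHANDEL A/S<br/>Enghavevej 17<br/>"
    "6950 Ringkøbing<br/><br/>\n"
    "    </div>"
)

DOUBLE_BREAK = re.compile(r"<br\s*/?>\s*<br\s*/?>")  # separates records
SINGLE_BREAK = re.compile(r"<br\s*/?>")              # separates fields
ANY_TAG = re.compile(r"<[^>]+>")                     # any opening or closing tag


def names_by_splitting(html):
    names = []
    for record in DOUBLE_BREAK.split(html):
        # The name is the first field of a record, i.e. everything
        # before the first single <br/>.
        first_field = SINGLE_BREAK.split(record, maxsplit=1)[0]
        # Removing tags handles the wrapping <a> as well as the
        # leading <div ...> in front of the first record.
        name = ANY_TAG.sub("", first_field).strip()
        if name:  # skips the trailing piece that only holds </div>
            names.append(name)
    return names


print(names_by_splitting(SAMPLE))
```

Stripping tags from the first field is what lets the first record work even though it is preceded by the opening `<div>` rather than by two `<br/>` tags.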