I am trying to retrieve data from a table with BeautifulSoup, but somehow my (beginner) syntax is wrong:

from bs4 import BeautifulSoup
import requests

main_url = "https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html"

req = requests.get(main_url)
soup = BeautifulSoup(req.text, "html.parser")

title = soup.find("div", id = "accordionContent5e95581b6e244")

results = {}
for row in title.findAll('tr'):
     aux = row.findAll('td')
     results[aux[0].string] = aux[1].string

print(results)

This is the relevant HTML:

<div id="accordionContent5e95581b6e244" class="panel-collapse collapse in"> 
    <div class="panel-body"> 
        <table class="table" width="100%"> 
            <tbody>
                <tr> 
                    <th width="170">PZN</th>
                    <td>00520917</td> 
                </tr> 
                <tr> 
                    <th width="170">Anbieter</th> 
                    <td>Hexal AG</td>
                </tr>

My goal is to build a dictionary from the th/td cells.

How can this be done with BeautifulSoup?

merlin
  • Does this answer your question? [BeautifulSoup, a dictionary from an HTML table](https://stackoverflow.com/questions/11901846/beautifulsoup-a-dictionary-from-an-html-table) – Josh Friedlander Apr 14 '20 at 06:52
  • Not until now. I tried to apply the example but do get the error: from bs4 import BeautifulSoup ImportError: bad magic number in 'bs4': b'\x03\xf3\r\n' – merlin Apr 14 '20 at 07:24
  • What problem arises? I might try iterating over all tr and td: for row in title.findAll('tr'): for aux in row.findAll('td'): results[aux[0].string] = aux[1].string Is there more than one 'td' in your html? If not, why use the findAll function? – CarlosSR Apr 14 '20 at 07:35
  • There is only one td in each tr. I normally use scrapy and am now trying BS4 for the first time, but I can't get it running because executing the test.py file results in the mentioned import error. Thank you for any help to get me started on this topic. – merlin Apr 14 '20 at 07:41
  • 1
    @merlin that's a COMPLETELY unrelated error. Your bs4 installation isn't working properly. See [here](https://stackoverflow.com/questions/514371/whats-the-bad-magic-number-error?rq=1) – Josh Friedlander Apr 14 '20 at 07:49

2 Answers


I would suggest using pandas to read the table into a DataFrame and then converting it to a dictionary.

import pandas as pd
from bs4 import BeautifulSoup
import requests

main_url = "https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html"

req = requests.get(main_url)
soup = BeautifulSoup(req.text, "html.parser")
# select the table inside the panel body (avoids the auto-generated id)
table = soup.select_one(".panel-body > table")
# read_html returns a list of DataFrames; take the first one
df = pd.read_html(str(table))[0]
# use column 0 (the th labels) as the index, then convert to a dict
print(df.set_index(0).to_dict('dict'))

Output:

{1: {'Rezeptpflichtig': 'nein', 'Anbieter': 'Hexal AG', 'PZN': '00520917', 'Darreichungsform': 'Brausetabletten', 'Wirksubstanz': 'Acetylcystein', 'Monopräparat': 'ja', 'Packungsgröße': '40\xa0St', 'Apothekenpflichtig': 'ja', 'Produktname': 'ACC akut 600mg Hustenlöser'}}
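
If you prefer a flat {label: value} mapping instead of the nested dict keyed by column 1, a small variant on the same df should work:

# assumption: df has the labels in column 0 and the values in column 1,
# as in the output above
flat = df.set_index(0)[1].to_dict()
print(flat)
# {'PZN': '00520917', 'Anbieter': 'Hexal AG', ...}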
KunduK
  1. First mistake: you are using an id (accordionContent5e95581b6e244) that is auto-generated, so it will change when you want to scrape other pages.
  2. Second mistake: aux = row.findAll('td') returns a list with only one item, because the label sits in a th tag rather than a td, which means aux[1].string will raise an IndexError.

Here is the corrected code:

from bs4 import BeautifulSoup
import requests

main_url = "https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html"

req = requests.get(main_url)
soup = BeautifulSoup(req.text, "html.parser")

title = soup.find("div", class_="panel-collapse collapse in")

results = {}
for row in title.findAll('tr'):
    key   = row.find('th')
    value = row.find('td')
    results[key.text] = value.text.strip()
print(results)

Output:

{'PZN': '00520917', 'Anbieter': 'Hexal AG', 'Packungsgröße': '40\xa0St', 'Produktname': 'ACC akut 600mg Hustenlöser', 'Darreichungsform': 'Brausetabletten', 'Monopräparat': 'ja', 'Wirksubstanz': 'Acetylcystein', 'Rezeptpflichtig': 'nein', 'Apothekenpflichtig': 'ja'}
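
The same result can also be written as a dict comprehension over the rows (a sketch assuming the title element from the code above and that every row contains both a th and a td):

# build the {label: value} mapping in one expression
results = {
    row.find('th').get_text(strip=True): row.find('td').get_text(strip=True)
    for row in title.find_all('tr')
}
print(results)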
Ahmed Soliman