Hello, I need some advice on a problem. I want to scrape the information from the table on the page below and add each string to a list as a separate element.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

site = "http://www.voltwo.webd.pl/1-indexy/index-5-opracowania/01-maturalne-KINEMATYKA.html"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, features="html5lib")
table = soup.find_all('td')

# Collect the text of every non-empty cell (the strings still contain \n and \t)
data = [i.get_text() for i in table if i.get_text().strip()]
print(data)
Porfinogeneta
  • What problems or errors are you encountering? – TheLazyScripter Dec 02 '20 at 20:51
  • I want to get all the strings from the table into a list of strings; for example, the first row would be [90, 2020, maj.czer, PP, zamknięte, 2/1, Po rzece płynie motorówka1], something like that. I want to get rid of all the \n and \t, but of course keep the spaces between words. Is that possible? – Porfinogeneta Dec 02 '20 at 21:06
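
The cleanup described in the comment above (dropping \n and \t while keeping single spaces between words) could be sketched roughly as follows. This is only a minimal illustration, not the accepted approach from the thread: the names `normalize` and `rows_from_html` and the `sample` markup are hypothetical, and the real page may nest its tables differently.

```python
def normalize(text):
    """Collapse every run of whitespace (spaces, newlines, tabs) into one space."""
    return " ".join(text.split())


def rows_from_html(html):
    """Return one list of cleaned cell strings per <tr> found in the markup."""
    # Imported here so normalize() stays usable without BeautifulSoup installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # recursive=False skips cells that belong to tables nested inside a cell.
        cells = [normalize(td.get_text()) for td in tr.find_all("td", recursive=False)]
        if any(cells):  # drop rows whose cells are all empty
            rows.append(cells)
    return rows


# Hypothetical sample markup, not copied from the real page.
sample = """
<table>
  <tr><td>90</td><td>2020</td><td>\n\tmaj.czer\n</td><td>Po rzece\n\t płynie motorówka</td></tr>
  <tr><td>\n</td><td>\t</td></tr>
</table>
"""
```

Calling `rows_from_html(sample)` should yield a single row, `['90', '2020', 'maj.czer', 'Po rzece płynie motorówka']`, because the second row contains only whitespace and is skipped. Splitting on whitespace and re-joining with one space is what removes the line breaks and tabs without gluing words together.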

1 Answer

How about using pandas since this is tabular data you're dealing with?

For example:

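from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "http://www.voltwo.webd.pl/1-indexy/index-5-opracowania/01-maturalne-KINEMATYKA.html"
# Keep only the first <table> element of the page.
soup = BeautifulSoup(requests.get(url).content, "html.parser").find("table")
# Skip the first six rows; StringIO avoids the literal-HTML deprecation warning in newer pandas.
df = pd.read_html(StringIO(str(soup)), skiprows=[0, 1, 2, 3, 4, 5])
# read_html returns a list of DataFrames; merge them and drop columns 7 and 8.
df = pd.concat(df).drop([7, 8], axis="columns")
columns = ["Lp", "Rok", "Forma", "Poziom", "Typ zadania", "Strona", "Zadanie"]
df.to_csv("table.csv", index=False, header=columns, encoding="utf-8")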


baduker
  • I wanted to try your solution, but I can't even get the program to start. I always get the same error, 'ImportError: lxml not found, please install it', even though I have installed lxml. – Porfinogeneta Dec 03 '20 at 13:27
  • You sure you've installed the right module in the right environment? Are you using venv? Please share the full error traceback. – baduker Dec 03 '20 at 13:31
  • That's my error: Traceback (most recent call last): File "/home/szymonm/PycharmProjects/web_scrap_usingBeautifulSoup/test.py", line 7, in df = pd.read_html(str(soup)) File "/usr/local/lib/python3.6/dist-packages/pandas/util/_decorators.py", line 296, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/pandas/io/html.py", line 1101, in read_html displayed_only=displayed_only, – Porfinogeneta Dec 03 '20 at 15:32
  • File "/usr/local/lib/python3.6/dist-packages/pandas/io/html.py", line 894, in _parse parser = _parser_dispatch(flav) File "/usr/local/lib/python3.6/dist-packages/pandas/io/html.py", line 851, in _parser_dispatch raise ImportError("lxml not found, please install it") ImportError: lxml not found, please install it – Porfinogeneta Dec 03 '20 at 15:32
  • Do this https://stackoverflow.com/questions/44954802/python-importerror-lxml-not-found-please-install-it – baduker Dec 03 '20 at 15:46
  • I did that, but it still won't install. From the terminal I get 'Defaulting to user installation because normal site-packages is not writeable Requirement already satisfied: lxml in /usr/lib/python3/dist-packages (4.5.0)'. I have Python 3.6 and the latest pip, and I have no idea what else I could do. I also tried to install it via PyCharm, and my program doesn't see lxml even though PyCharm says it's installed. – Porfinogeneta Dec 03 '20 at 16:18
  • Create a new venv in PyCharm, set up a fresh project interpreter, and then try installing both pandas and lxml. – baduker Dec 03 '20 at 16:21
  • Thank you, I did what you recommended and it finally worked. I really appreciate your help, especially since this is the second time you have helped me. Thank you! – Porfinogeneta Dec 03 '20 at 18:15
  • You're welcome. ;) – baduker Dec 03 '20 at 19:35
  • I had a feeling xD – Porfinogeneta Dec 03 '20 at 19:43
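
The "installed but not found" situation in the comments above usually comes down to the package living under a different interpreter than the one running the script. A small check along these lines may help confirm that before creating a new venv; it is only a sketch, and the helper name `module_report` is hypothetical.

```python
import importlib.util
import sys


def module_report(names):
    """Map each top-level module name to the path it would load from, or None."""
    report = {}
    for name in names:
        spec = importlib.util.find_spec(name)
        report[name] = spec.origin if spec is not None else None
    return report


# Run this with the same interpreter that runs the failing script.
print("interpreter:", sys.executable)
for name, origin in module_report(["lxml", "pandas", "bs4"]).items():
    print(f"{name}: {origin or 'NOT importable from this interpreter'}")
```

If lxml shows up as not importable here, installing it with that exact interpreter (for example by running `python -m pip install lxml` with the path printed above) targets the right environment, which is effectively what the fresh PyCharm venv achieved.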