
I am trying to understand how Beautiful Soup works in Python. I have used Beautiful Soup and lxml in the past, but now I am trying to implement a script that can read data from a given webpage without any third-party libraries. The xml module doesn't seem to have many options for this and throws many errors. Is there any other standard-library module with good documentation for reading data from a web page? I am not targeting any particular websites; I am just trying to read from public pages and news blogs.

jack
  • You can use Scrapy, but it's more complex than Beautiful Soup. – polku May 27 '16 at 14:35
  • Hi polku, thanks for the comment. But I am trying to do this without a third-party library. I mean I don't want to install any library and then scrape. Is there a way? – jack May 27 '16 at 15:33
  • I don't think you have much choice; parsing HTML is not a trivial task. If you continue to look in this direction you're probably close (maybe it's already too late) to hearing about regex and thinking it will be a good idea... spoiler alert: IT'S NOT. It's a terrible idea that a lot of people had and regretted before you (including me): http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – polku May 27 '16 at 15:55
  • Well, if it is a learning experience you certainly _can_. After all, the modules themselves are written in Python. You can read websites with [urllib](https://docs.python.org/3/library/urllib.html), and then parse them with [html.parser](https://docs.python.org/3/library/html.parser.html). Writing all of this yourself can also be done, but it is a non-trivial task. However, this is a learning experience, so go ham! I learned lots of things by doing things the unnecessarily hard way. – mzhaase May 27 '16 at 16:09
  • This course is free and actually teaches you in the first section how to make your own web scraper with no additional libraries, https://www.udacity.com/course/intro-to-computer-science--cs101. It will be a series of find() mixed with variables containing index values so it knows where to continue. It's worth going through. – John Morrison May 27 '16 at 17:03
  • Hi mzhaase, Thanks for the encouragement. I am trying in the same way. But using html.parser is too messy. – jack May 27 '16 at 17:03
  • Does this answer your question? [Making a basic web scraper in Python with only built in libraries - Python](https://stackoverflow.com/questions/18157529/making-a-basic-web-scraper-in-python-with-only-built-in-libraries-python) – ggorlen Apr 23 '22 at 21:46
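As the comments suggest, the fetching half of the task needs no third-party code at all. A minimal sketch using only the standard library's urllib.request (the URL is a placeholder, and the browser-like User-Agent header is an assumption — many sites reject Python's default agent string):

```python
from urllib.request import Request, urlopen

# Hypothetical target page; substitute the public page or news blog you want to read.
url = 'https://example.com/'

# Some servers block the default "Python-urllib" agent, so send a browser-like one.
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

# Uncomment to actually fetch; decode() turns the raw bytes into a str for parsing.
# html_doc = urlopen(req).read().decode('utf-8', errors='replace')
```

The resulting html_doc string is what you would then feed into whichever parser you settle on.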

1 Answer


Third party libraries exist to make your life easier. Yes, of course you could write a program without them (the authors of the libraries had to). However, why reinvent the wheel?

Your best options are Beautiful Soup and Scrapy. However, if you're having trouble with Beautiful Soup, I wouldn't try Scrapy.

Perhaps you can get by with just the plain text from the website?

from urllib.request import urlopen
from bs4 import BeautifulSoup

html_doc = urlopen('https://example.com/').read()  # fetch the page (URL is a placeholder)
soup = BeautifulSoup(html_doc, 'html.parser')
pagetxt = soup.get_text()  # all visible text, tags stripped

Then you can be done with all external libraries and just work with plain text. However, if you need to do something more complicated, HTML is something you really should use a library to manipulate. There is just too much that can go wrong.

Joseph