1

I don't know if this question has been asked before, but I couldn't find anything that could help solve my problem (hopefully I didn't misunderstand anything). I'm learning Python at the moment, using Python 3.5 with IPython, and I ran into some trouble using BeautifulSoup. As shown below,

import bs4
exampleFile = open('example.html')
exampleFile.read()
>>> '<html><head><title>The Website Title</title></head>\n<body>\n<p>Download my <strong>Python</strong> book from <a href=“http://inventwithpython.com”>my website</a>.</p>\n<p class=“slogan”>Learn Python the easy way!</p>\n<p>By <span id=“author”>Al Sweigart</span></p>\n</body></html>'
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
exampleFile.read()
>>> ''
elems = exampleSoup.select('#author')
print(elems)
>>> []

I'm able to open and read example.html, but after I use BeautifulSoup, when I try to read the file again, it returns an empty string. I'm unable to define elems because of this.

I'm trying to understand why this is happening, but I couldn't figure it out so I decided to post a question.

Thanks in advance!

Thomas K
mdlee6

3 Answers

2

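I believe your issue is having multiple calls to read(). The first call consumes the whole file, so the later exampleFile.read() that you pass to BeautifulSoup returns an empty string and the soup has nothing in it. You should use seek(0) to rewind to the beginning of the file before trying to read from it again. Here is a similar question.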
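To illustrate the seek(0) suggestion, here is a minimal sketch. It assumes nothing about your example.html; it writes a small throwaway file (the name and contents are made up for the demonstration) and shows what each read() returns before and after rewinding.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# The contents are made up and only stand in for example.html.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as tmp:
    tmp.write("<p>hello</p>")
    path = tmp.name

with open(path) as f:
    first = f.read()    # reads everything; the position is now at the end
    second = f.read()   # nothing left to read, so this is an empty string
    f.seek(0)           # rewind to the beginning of the file
    third = f.read()    # the full contents again

os.remove(path)
print(repr(first), repr(second), repr(third))
```

Rewinding works, but reading once into a variable and reusing that string is usually simpler.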

Daniel Underwood
  • I updated my code to look like Kerry Hatcher's code, but print(exampleSoup) still returns nothing, not even an empty list. – mdlee6 Apr 21 '16 at 22:24
0

Danielu13 is correct. Here is what you want to do:

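import bs4
exampleFile = open('example.html')
# Read the file once and keep the contents in a string
myHTML = exampleFile.read()
print(myHTML)
# Parse the saved string instead of calling read() a second time
exampleSoup = bs4.BeautifulSoup(myHTML, 'html.parser')
print(exampleSoup)
# Select the element whose id is "author"
elems = exampleSoup.select('#author')
print(elems)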

The problem is that when you call .read() on the file object, it 'empties' it: that one call returns the entire contents (which the interactive session then echoes to the screen). Every .read() call on that file object from that point on returns an empty string. In my example we save the contents in a string object named myHTML and use myHTML from then on.

Note: the file object exampleFile isn't actually empty after you call .read(); it's just that the reader is at the end of the file, so there is nothing left to read. When I learned Python, the 'empty' analogy is how someone explained it to me, and it helped me understand it.
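The note above can be checked with tell(), which reports the current read position. This is only a sketch under assumptions: it uses a made-up temporary file, and it opens the file in binary mode so that tell() is a plain byte offset (in text mode the number is less direct to interpret).

```python
import os
import tempfile

# Made-up file contents, written to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"<p>hello</p>")
    path = tmp.name

# Binary mode is an assumption made here so tell() is a byte offset.
with open(path, "rb") as f:
    start = f.tell()      # position before reading
    data = f.read()       # read the whole file
    end = f.tell()        # position after reading: the end of the file
    leftover = f.read()   # nothing remains between the position and the end

size_on_disk = os.path.getsize(path)  # the file itself still has its contents
os.remove(path)
print(start, end, size_on_disk, leftover)
```

The file on disk keeps its size; only the position inside the open file object has moved.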

Kerry Hatcher
  • I now understand that the read() function points at the end of the string, thus showing "nothing." I updated the code to look like yours, but I still get an empty list when I try print(elems). When I try print(exampleSoup), nothing shows up, not even an empty list. Here's the link to the updated code and results: https://github.com/mdlee6/bsExample/blob/master/Untitled.ipynb – mdlee6 Apr 21 '16 at 22:22
  • Check your arguments to `bs4.BeautifulSoup()`. The first argument should be HTML and not a file. – Daniel Underwood Apr 21 '16 at 22:47
  • I made the changes, but elems is still empty :\ https://github.com/mdlee6/bsExample/blob/master/Untitled.ipynb – mdlee6 Apr 22 '16 at 00:55
0

It turns out that it was because of the weird quotes in the original example.html: the attribute values were wrapped in curly (typographic) quotes, “ ”, instead of straight quotes. I replaced them with straight quotes in another text editor, and it ended up working just fine. Thanks for all your help though. Really appreciate it!
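For anyone curious why the quotes matter, here is a rough sketch. It uses the standard library's html.parser directly, not BeautifulSoup itself (that substitution is my simplification; it is the parser named in the 'html.parser' argument above). IdCollector and collect_ids are hypothetical helper names made up for this example. It suggests that curly quotes are not treated as attribute delimiters, so they end up inside the id value and a selector like '#author' has nothing to match.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Hypothetical helper: records every id attribute value it sees."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id":
                self.ids.append(value)


def collect_ids(html):
    """Parse an HTML string and return the list of id values found."""
    parser = IdCollector()
    parser.feed(html)
    parser.close()
    return parser.ids


# \u201c and \u201d are the left and right curly double quotes.
curly = "<p>By <span id=\u201cauthor\u201d>Al Sweigart</span></p>"
straight = '<p>By <span id="author">Al Sweigart</span></p>'

curly_ids = collect_ids(curly)        # the quotes become part of the value
straight_ids = collect_ids(straight)  # the value is just: author

# One possible repair: swap the curly quotes for straight ones before parsing.
fixed = curly.replace("\u201c", '"').replace("\u201d", '"')
fixed_ids = collect_ids(fixed)

print(curly_ids, straight_ids, fixed_ids)
```

Fixing the file in a plain text editor, as described above, achieves the same thing as the replace() calls.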

mdlee6