I want to read all the text from an HTML page that I have stored locally. I managed to read the whole page, but the output also contains the HTML tags and JavaScript code.

I am reading from a downloaded HTML file, not from a website URL. I want a method that extracts only the text from the HTML page and that works with my code below.

How can I make it write only the text of the HTML page to the text file?

Here is my code:

# Read the whole HTML file into one string
with open("ct.html", "r", encoding="utf-8") as f:
    data = f.read()

# Write the string out unchanged (tags and scripts included)
with open("test.txt", "w", encoding="utf-8-sig") as f:
    f.write(data)
  • Does this answer your question? [BeautifulSoup Grab Visible Webpage Text](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text) – PacketLoss Oct 12 '20 at 22:47
  • Have you tried anything, done any research? Please see [ask], [help/on-topic]. – AMC Oct 13 '20 at 00:56

1 Answer

You can also try the simplified_scrapy library, which can return just the text of the page:

from simplified_scrapy import SimplifiedDoc, utils

html = utils.getFileContent('test.html')  # read the local HTML file
doc = SimplifiedDoc(html)
utils.appendFile('test.txt', doc.text)  # all of the text in the document
# Or write the title text and the body text separately
utils.appendFile('test2.txt', doc.title.text)
utils.appendFile('test2.txt', doc.body.text)
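
If installing a third-party package is not an option, roughly the same result can probably be had with the standard library's html.parser module. The following is only a minimal sketch and is not part of simplified_scrapy: the names TextExtractor, html_to_text and convert_file are made up for illustration, and it assumes that skipping the contents of <script> and <style> is enough to drop the JavaScript and CSS.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes, skipping <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []     # non-empty pieces of text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside skipped elements and not pure whitespace
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


def convert_file(src, dst):
    """Read an HTML file and write only its text to another file."""
    with open(src, "r", encoding="utf-8") as f:
        text = html_to_text(f.read())
    with open(dst, "w", encoding="utf-8-sig") as f:
        f.write(text)
```

With the file names from the question, this would be called as convert_file("ct.html", "test.txt"). A parser-based library such as the one above, or BeautifulSoup, is likely to cope better with messy real-world pages.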
yazz
  • This is the correct answer, thank you, but I am getting all the text on the same line. How do I add line breaks after a certain number of characters? –  Oct 13 '20 at 15:29
  • @mike There may not be a ready-made method for this. If you just want to view the text, you can turn on word wrap in Notepad. In addition, you can use the following method to keep the original line breaks from the page: doc.body.getText('\n') – yazz Oct 14 '20 at 00:03
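
For breaking lines after a fixed number of characters, Python's standard textwrap module may be enough. The sketch below is an illustration only: wrap_text is a made-up name, the width of 80 is an arbitrary assumption, and each existing line is wrapped separately so that line breaks already in the text are kept.

```python
import textwrap


def wrap_text(text, width=80):
    """Wrap every line of text to at most `width` characters."""
    wrapped_lines = []
    for line in text.splitlines():
        # textwrap.fill returns "" for an empty line, so blank lines survive
        wrapped_lines.append(textwrap.fill(line, width=width))
    return "\n".join(wrapped_lines)
```

The extracted text could be passed through wrap_text before it is written to the output file. textwrap breaks at whitespace where it can, and by default splits a single word only if that word is longer than the width.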