I am parsing a saved HTML file using Beautiful Soup (sample below). At first I thought that Beautiful Soup was truncating long lines, but apparently it's the open function.

<!DOCTYPE html>
<html dir="ltr" lang="en-GB">
<head>
<meta charset="utf-8" />
<title>The Title </title>

<meta property="type" content="website" />

<meta property="description" content="This is the text I want, but if its too long it gets truncated"/>
</head>

I want to get the text in the content attribute of the tag where property=description. The code I wrote works fine, but when the text in content is too long it gets truncated. I want to save the text in a variable. Any ideas on how to avoid the truncation and save the whole text?

from bs4 import BeautifulSoup

def parse_page(file_path):
    page = open(file_path)
    html = page.read()  # read the file contents once and keep them
    for line in page:  # ----> here, when printing, long lines are truncated, hence problems when saving in the variable answer
        print(line)
    soup = BeautifulSoup(html, "html.parser")
    answer = soup.find(property="description")  # ----> truncated output saved
    print('answer--->', answer['content'], 'type', type(answer))  # ----> when printing, it's truncated

This is the code block that calls the function:

import os

path = '/content/HTMLpages'
os.chdir(path)

for file in os.listdir():
    file_path = f"{path}/{file}"
    parse_page(file_path)
IS92
  • `open()` doesn't truncate anything. It knows nothing about HTML structure and attributes, it just returns the bytes of the file. Is it truncated if you do `print(page.read())`? – Barmar Apr 06 '22 at 20:41
  • Nothing is being truncated. `page.read()` exhausts the iterator, your for loop will always be empty. You could `.seek` back to 0 – juanpa.arrivillaga Apr 06 '22 at 20:47
  • Even when I do print(page.read()), one of the tags (which is long) has the text truncated and epsilon ... added afterwards, although when I open the saved HTML file, the whole text is displayed without truncation. I don't understand where the problem is coming from – IS92 Apr 06 '22 at 20:52
  • I suspect your terminal emulator is truncating long lines. – Barmar Apr 06 '22 at 21:10
  • Yep. Try putting the extracted line back into another text file. If you still see truncation and epsilon, then we got problems :D – Manish Dash Apr 06 '22 at 21:52
  • @ManishDash the problem is if I do that I won't be able to use BeautifulSoup to parse the HTML file. What I am sure of is that the content is lost with page=open(file_path) – IS92 Apr 06 '22 at 22:38
  • @IS92 if you can verify that the data is stored in the file without truncation, then you have nothing to worry about. BeautifulSoup will parse it correctly. The content is NOT lost at all. When you try to print, it's your terminal that's truncating it. Depending on your [IDE](https://stackoverflow.com/questions/36800475/avoid-string-printed-to-console-getting-truncated-in-rstudio), there are different ways to change this behaviour for your terminal. But you can continue using BeautifulSoup to parse the data: the data is there in the variable. – Manish Dash Apr 07 '22 at 04:28
  • @ManishDash, same problem I did that : with open(file_path,'r') as firstfile, open(path+'second.txt', 'w+') as secondfile: # read content from first file for line in firstfile: # append content to second file secondfile.write(str(line)) But it is also truncated – IS92 Apr 08 '22 at 14:59
  • Oh, so when you open this text file the content is truncated? – Manish Dash Apr 08 '22 at 15:03
  • Yes, my guess is that the original open(html file) doesn't read all the content for some reason, so in 'for line in firstfile' the line is incomplete. Because when I print, it is also incomplete; when I move it to the .txt, it's incomplete. I really don't know what to do – IS92 Apr 08 '22 at 15:07
  • It would really help to reproduce the bug if you can give the exact HTML file/content that is causing this issue. I have tried the sample in the question, tried to create dummy samples but everything is working as intended. no truncation anywhere – Manish Dash Apr 08 '22 at 21:30
  • https://support.shell.com/hc/en-gb/articles/115003030052-Where-can-I-download-the-Shell-App- @ManishDash I want to store all points, it doesn't even complete point #1 – IS92 Apr 08 '22 at 21:36
  • the site works for me. BeautifulSoup is able to get the entire HTML content without issue. You want to capture the steps to download the shell app? That content is also there in the `soup` variable – Manish Dash Apr 08 '22 at 21:43
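
The point made in the comments above, that `page.read()` consumes the file object so a later `for line in page` loop sees nothing, can be checked with a minimal sketch. The temporary file and its 5000-character line are made up for illustration and are not the asker's data; only `read()`, `seek()` and iteration over the file object are standard Python.

```python
import os
import tempfile

# Hypothetical sample file containing one very long line (an assumption for this sketch).
long_line = '<meta property="description" content="' + "x" * 5000 + '"/>\n'
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(long_line)
    sample_path = tmp.name

with open(sample_path, encoding="utf-8") as page:
    text = page.read()             # consumes the whole file
    lines_after_read = list(page)  # the file position is at the end, so nothing is left
    page.seek(0)                   # rewind to the start of the file
    lines_after_seek = list(page)  # the long line is available again

os.remove(sample_path)

# Compare lengths instead of relying on what the console displays.
print(len(text), len(long_line))
print(len(lines_after_read), len(lines_after_seek))
```

Comparing `len()` values sidesteps whatever the terminal or notebook does when it displays very long strings.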

1 Answer

For the given website: https://support.shell.com/hc/en-gb/articles/115003030052-Where-can-I-download-the-Shell-App-

Here is complete working code:

from bs4 import BeautifulSoup
import cloudscraper

def parse_page(HTML):
    soup = BeautifulSoup(HTML, "html.parser")
    # print(soup)
    # print([tag.name for tag in soup.find_all()])
    answer= soup.find_all('ol')
    print(answer)

url = 'https://support.shell.com/hc/en-gb/articles/115003030052-Where-can-I-download-the-Shell-App-'

scraper = cloudscraper.create_scraper()  
page_html = scraper.get(url).text  
print("HTML fetched. Calling BS4")
parse_page(page_html)

Output

HTML fetched. Calling BS4
[<ol class="breadcrumbs">
<li title="Shell Support ">
<a href="/hc/en-gb">Shell Support </a>
</li>
<li title="Shell App">
<a href="/hc/en-gb/categories/115000345732-Shell-App">Shell App</a>
</li>
<li title="General">
<a href="/hc/en-gb/sections/115000744231-General">General</a>
</li>
</ol>, <ol>
<li>Open the <a href="https://itunes.apple.com/gb/app/shell/id484782414?mt=8">Apple iTunes</a> store for <strong>iOS</strong> devices or <a href="https://play.google.com/store/apps/details?id=com.shell.sitibv.motorist&amp;hl=en_GB">Google Play</a> store for <strong>Android</strong> devices</li>
<li>Search for <strong>Shell - </strong>in iTunes for iOS and Google Play for Android    <strong>       <br/></strong></li>
<li><strong>Install</strong> to add the app to your device</li>
<li>Find the <strong>Shell app</strong> on your device then open to register your details and get started.</li>

I had to use the cloudscraper library to bypass Cloudflare, but that's not important.

BeautifulSoup was able to parse the entire HTML flawlessly. Since you mentioned capturing the points, and they are in an ordered-list tag, I added that part as well.
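
For the saved-file case from the question, one way to confirm that nothing is lost is to compare lengths rather than look at printed output. The following is only a sketch under stated assumptions: the HTML string and the 2000-word description are invented stand-ins for a saved page, and in practice the text would come from something like `open(file_path, encoding="utf-8").read()`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of a saved page (not the real file).
description = " ".join(["word"] * 2000)
html = (
    '<!DOCTYPE html><html><head><meta charset="utf-8" />'
    f'<meta property="description" content="{description}"/>'
    "</head></html>"
)

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("meta", attrs={"property": "description"})
content = tag["content"]  # the full attribute value, as a str

# If the two lengths match, the variable holds the whole text,
# whatever the console shows when the string is printed.
print(len(content), len(description))
print(repr(content[-20:]))  # inspect the tail of the string directly
```

If the lengths agree, any truncation seen on screen comes from the display layer and not from `open()` or BeautifulSoup.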

I hope this code sample can help you understand how bs4 works, and help clarify any misunderstandings you have.

Manish Dash