
I have an HTML file from which I want to extract each line's bounding box (BBox) together with its text. After extracting a BBox and its text, I append them to a list. However, the output looks as if the first list (containing only the first line) is being appended into a second list (containing the first and second lines), and so on. I attached a screenshot of the output to illustrate the problem.

However, I want everything in one single list. I attached a second screenshot showing the output that I want.

Below is the simple code that I wrote:

import bs4

# Parse the HTML file; the with-block closes the file once parsing is done
with open("1.html", "r", encoding="utf-8") as xml_input:
    soup = bs4.BeautifulSoup(xml_input, "lxml")
ocr_lines = soup.find_all("span", {"class": "ocr_line"})
# Collect each line's bounding-box coordinates and text in lines_structure
lines_structure = []
for line in ocr_lines:
    line_text = line.text.replace("\n"," ").strip()
    title = line['title']
    #The coordinates of the bounding box
    x1,y1,x2,y2 = map(int, title[5:title.find(";")].split())
    lines_structure.append({"x1":x1,"y1":y1,"x2":x2,"y2":y2,"text": line_text})
    print(lines_structure)

I would really appreciate your help regarding this problem.

Lydia van Dyke
  • Please provide the data as text, [not an image](https://meta.stackoverflow.com/q/285551/4518341). – wjandrea Jan 21 '21 at 22:27
  • Are you trying to flatten a list? [[1, 2], [3], [4, 5, 6], [7, 8]] -> [1, 2, 3, 4, 5, 6, 7, 8] – ArjunSahlot Jan 21 '21 at 23:26
  • use extend instead of append. for more: https://stackoverflow.com/questions/252703/what-is-the-difference-between-pythons-list-methods-append-and-extend – Jim Robinson Jan 21 '21 at 23:44

1 Answer


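After some digging, I found that the print call needs to be outside the for loop. With print inside the loop, the whole list is printed again on every iteration as it grows, which only looks like lists being appended to other lists; lines_structure itself was a single flat list all along. It was a quick fix. Thanks for your time.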
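
For anyone who wants to see the fix in isolation, here is a minimal sketch. It assumes the title attributes look like "bbox x1 y1 x2 y2; ...", which is what the slicing in the question implies. The sample_lines data and the parse_bbox helper are hypothetical stand-ins so the snippet runs without the original 1.html; they are not part of the original code.

```python
# Hypothetical sample data standing in for the parsed spans:
# each pair is (title attribute, line text).
sample_lines = [
    ("bbox 36 92 580 122; baseline 0 -6", "First line of text"),
    ("bbox 36 130 612 160; baseline 0 -6", "Second line of text"),
]


def parse_bbox(title):
    """Return the four integers that follow 'bbox' in a title string."""
    # Same slicing as in the question: skip "bbox " and stop at the first ";".
    return tuple(map(int, title[5:title.find(";")].split()))


lines_structure = []
for title, line_text in sample_lines:
    x1, y1, x2, y2 = parse_bbox(title)
    lines_structure.append(
        {"x1": x1, "y1": y1, "x2": x2, "y2": y2, "text": line_text}
    )

# Printed once, after the loop has finished, so the complete
# list appears a single time instead of once per iteration.
print(lines_structure)
```

The only change from the question's logic is the indentation of the final print: dedenting it moves it out of the loop body.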