I have HTML output from an hOCR tool, and I would like to apply the following operation to the div elements with class ocr_carea.

Edited Input HTML

The div tags inside the HTML file look like this:

<div class='ocr_carea' id='block_1_8' title="bbox 95 779 341 884">
    <p class='ocr_par' id='par_1_16' lang='Latin' title="bbox 95 779 341 884">
     <span class='ocr_line' id='line_1_29' title="bbox 96 779 338 800; baseline 0 -1; x_size 25.636646; x_descenders 5.6366458; x_ascenders 5">
      <span class='ocrx_word' id='word_1_62' title='bbox 96 779 186 800; x_wconf 96'>Header</span>
      <span class='ocrx_word' id='word_1_63' title='bbox 195 779 338 799; x_wconf 96'>Information</span>
     </span>
     <span class='ocr_line' id='line_1_30' title="bbox 96 819 341 839; baseline 0 0; x_size 25.26087; x_descenders 5.2608695; x_ascenders 6">
      <span class='ocrx_word' id='word_1_64' title='bbox 96 819 212 839; x_wconf 96'>Purchase</span>
      <span class='ocrx_word' id='word_1_65' title='bbox 221 819 290 839; x_wconf 96'>Order</span>
      <span class='ocrx_word' id='word_1_66' title='bbox 300 819 341 839; x_wconf 96'>No:</span>
     </span>
     <span class='ocr_line' id='line_1_31' title="bbox 95 859 334 884; baseline -0.004 -4; x_size 26; x_descenders 5; x_ascenders 7">
      <span class='ocrx_word' id='word_1_67' title='bbox 95 859 175 880; x_wconf 96'>Terms</span>
      <span class='ocrx_word' id='word_1_68' title='bbox 185 859 210 880; x_wconf 96'>of</span>
      <span class='ocrx_word' id='word_1_69' title='bbox 218 859 334 884; x_wconf 96'>Payment:</span>
     </span>
    </p>
   </div>
   <div class='ocr_carea' id='block_1_9' title="bbox 371 819 542 840">
    <p class='ocr_par' id='par_1_17' lang='Latin' title="bbox 371 819 542 840">
     <span class='ocr_line' id='line_1_32' title="bbox 371 819 542 840; baseline 0.006 -1; x_size 27.5; x_descenders 6.875; x_ascenders 6.875">
      <span class='ocrx_word' id='word_1_70' title='bbox 371 819 542 840; x_wconf 96'>4056111455</span>
     </span>
    </p>
   </div>

I want to concatenate them and put the words in the correct order, like this:

 <div class='ocr_carea' id='block_1_8' title="bbox 95 779 341 884">
    <p class='ocr_par' id='par_1_16' lang='Latin' title="bbox 95 779 341 884">
     <span class='ocr_line' id='line_1_29' title="bbox 96 779 338 800; baseline 0 -1; x_size 25.636646; x_descenders 5.6366458; x_ascenders 5">
      <span class='ocrx_word' id='word_1_62' title='bbox 96 779 186 800; x_wconf 96'>Header</span>
      <span class='ocrx_word' id='word_1_63' title='bbox 195 779 338 799; x_wconf 96'>Information</span>
     </span>
     <span class='ocr_line' id='line_1_30' title="bbox 96 819 341 839; baseline 0 0; x_size 25.26087; x_descenders 5.2608695; x_ascenders 6">
      <span class='ocrx_word' id='word_1_64' title='bbox 96 819 212 839; x_wconf 96'>Purchase</span>
      <span class='ocrx_word' id='word_1_65' title='bbox 221 819 290 839; x_wconf 96'>Order</span>
      <span class='ocrx_word' id='word_1_66' title='bbox 300 819 341 839; x_wconf 96'>No:</span>
      <span class='ocrx_word' id='word_1_70' title='bbox 371 819 542 840; x_wconf 96'>4056111455</span>
     </span>
     <span class='ocr_line' id='line_1_31' title="bbox 95 859 334 884; baseline -0.004 -4; x_size 26; x_descenders 5; x_ascenders 7">
      <span class='ocrx_word' id='word_1_67' title='bbox 95 859 175 880; x_wconf 96'>Terms</span>
      <span class='ocrx_word' id='word_1_68' title='bbox 185 859 210 880; x_wconf 96'>of</span>
      <span class='ocrx_word' id='word_1_69' title='bbox 218 859 334 884; x_wconf 96'>Payment:</span>
     </span>
    </p>
   </div>

I think this can be done with BeautifulSoup. So far I have collected the ocr_line spans in a list. I would now like to go through those spans and check whether their bboxes are close to each other, i.e. shifted by about one point up or down on the x-axis or y-axis.

from bs4 import BeautifulSoup

# hocr_container holds the hOCR HTML as a string
soup = BeautifulSoup(hocr_container, 'html.parser')
lines = soup.find_all('span', class_='ocr_line')
for line in lines:
    # Check the bbox and concatenate spans
    pass
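The bbox check described above could be sketched roughly as follows. This is only a minimal sketch: the helper names parse_bbox and same_row and the tolerance parameter are hypothetical, and it assumes every title attribute starts with four integer bbox values, as in the sample above.

```python
import re

# hOCR titles look like "bbox x0 y0 x1 y1; baseline ...; ..."
BBOX_RE = re.compile(r'bbox (\d+) (\d+) (\d+) (\d+)')


def parse_bbox(title):
    """Return (x0, y0, x1, y1) as integers from an hOCR title string."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError('no bbox found in title: %r' % title)
    return tuple(int(value) for value in match.groups())


def same_row(box_a, box_b, tolerance=2):
    """True when both boxes start and end at nearly the same y coordinates."""
    return (abs(box_a[1] - box_b[1]) <= tolerance
            and abs(box_a[3] - box_b[3]) <= tolerance)


# Titles copied from line_1_30, line_1_32 and line_1_31 in the sample
purchase = parse_bbox("bbox 96 819 341 839; baseline 0 0; x_size 25.26087")
number = parse_bbox("bbox 371 819 542 840; baseline 0.006 -1; x_size 27.5")
terms = parse_bbox("bbox 95 859 334 884; baseline -0.004 -4; x_size 26")

print(same_row(purchase, number))  # the two boxes differ by at most 1 in y
print(same_row(purchase, terms))   # these boxes are 40 points apart in y
```

Comparing only the y coordinates treats two spans as one visual row regardless of how far apart they are horizontally; a limit on the x gap could be added if that is too permissive.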
ahmed osama
  • AFAIK the fastest way of getting an element is via its ID selector. You could write a regex for your desired IDs, fetch all ocrx_word and ocr_line elements matching that regex, and then parse them. A solution for that can be found here: https://stackoverflow.com/questions/2830530/matching-ids-in-beautifulsoup – BoboDarph Sep 04 '18 at 07:02
  • I have checked this but my problem is not parsing them, but merging spans. – ahmed osama Sep 04 '18 at 07:07
  • Are you talking about re-building the page with the BS elements you found and processed? If so, you can use BS's string representation of the element. Just call str(whatever_element_you_found) and append its result to a file you generate as you do the parsing and ordering. Example here https://stackoverflow.com/questions/25729589/how-to-get-html-from-a-beautiful-soup-object and documentation here https://www.crummy.com/software/BeautifulSoup/bs4/doc/#non-pretty-printing – BoboDarph Sep 04 '18 at 07:17
  • So there is no find-and-replace method or shifting technique? – ahmed osama Sep 04 '18 at 11:13
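On the last question in these comments: BeautifulSoup does let you modify the tree in place. Here is a small sketch with made-up ids (block_a, line_a, w1 and so on are placeholders, not values from the question). Calling append() on a tag that is already in the tree moves it rather than copying it, decompose() removes a tag, and str() serializes the result as suggested above.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document; the ids are hypothetical placeholders
html = (
    "<div id='block_a'><span class='ocr_line' id='line_a'>"
    "<span class='ocrx_word' id='w1'>No:</span></span></div>"
    "<div id='block_b'><span class='ocr_line' id='line_b'>"
    "<span class='ocrx_word' id='w2'>4056111455</span></span></div>"
)
soup = BeautifulSoup(html, 'html.parser')

target_line = soup.find(id='line_a')
word = soup.find(id='w2')

# append() detaches the word from line_b and attaches it to line_a
target_line.append(word)

# Drop the source block, which no longer contains any words
soup.find(id='block_b').decompose()

# str() gives back the modified markup, ready to be written to a file
result = str(soup)
print(result)
```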

1 Answer

This may help you

from bs4 import BeautifulSoup
html = """
      <div class='ocr_carea' id='block_1_8' title="bbox 95 779 341 884">
<p class='ocr_par' id='par_1_16' lang='Latin' title="bbox 95 779 341 884">
 <span class='ocr_line' id='line_1_29' title="bbox 96 779 338 800; baseline 0 -1; x_size 25.636646; x_descenders 5.6366458; x_ascenders 5">
  <span class='ocrx_word' id='word_1_62' title='bbox 96 779 186 800; x_wconf 96'>Header</span>
  <span class='ocrx_word' id='word_1_63' title='bbox 195 779 338 799; x_wconf 96'>Information</span>
 </span>
 <span class='ocr_line' id='line_1_30' title="bbox 96 819 341 839; baseline 0 0; x_size 25.26087; x_descenders 5.2608695; x_ascenders 6">
  <span class='ocrx_word' id='word_1_64' title='bbox 96 819 212 839; x_wconf 96'>Purchase</span>
  <span class='ocrx_word' id='word_1_65' title='bbox 221 819 290 839; x_wconf 96'>Order</span>
  <span class='ocrx_word' id='word_1_66' title='bbox 300 819 341 839; x_wconf 96'>No:</span>
  <span class='ocrx_word' id='word_1_70' title='bbox 371 819 542 840; x_wconf 96'>4056111455</span>
 </span>
 <span class='ocr_line' id='line_1_31' title="bbox 95 859 334 884; baseline -0.004 -4; x_size 26; x_descenders 5; x_ascenders 7">
  <span class='ocrx_word' id='word_1_67' title='bbox 95 859 175 880; x_wconf 96'>Terms</span>
  <span class='ocrx_word' id='word_1_68' title='bbox 185 859 210 880; x_wconf 96'>of</span>
  <span class='ocrx_word' id='word_1_69' title='bbox 218 859 334 884; x_wconf 96'>Payment:</span>
 </span>
</p>
</div>"""

soup = BeautifulSoup(html, 'html.parser')
# Collect every ocr_line span and print its words joined by single spaces
tags = soup.find_all('span', attrs={'class': 'ocr_line'})
for tag in tags:
    text = ' '.join(tag.stripped_strings)
    print(text)
Sohan Das
  • Sorry, the input HTML I posted was wrong; I have corrected it now. The idea is that I have two spans and I want to merge them based on a condition. – ahmed osama Sep 04 '18 at 06:51
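A conditional merge like the one described in this comment might look roughly like the sketch below. It is not a tested solution for arbitrary hOCR files: the function name merge_lines_on_same_row and the tolerance value are hypothetical, the sample is a trimmed version of the input from the question, and the sketch assumes every ocr_line title contains an integer bbox. Lines whose top and bottom y coordinates agree within the tolerance are treated as one row, and their words are moved into the left-most line.

```python
import re
from bs4 import BeautifulSoup

BBOX_RE = re.compile(r'bbox (\d+) (\d+) (\d+) (\d+)')


def bbox(tag):
    """Return (x0, y0, x1, y1) parsed from the tag's hOCR title attribute."""
    return tuple(int(v) for v in BBOX_RE.search(tag['title']).groups())


def merge_lines_on_same_row(soup, tolerance=2):
    """Move the words of lines sharing a row into the left-most such line."""
    # Visit lines from left to right so a merge target is always further left
    lines = sorted(soup.find_all('span', class_='ocr_line'),
                   key=lambda line: bbox(line)[0])
    kept = []
    for line in lines:
        box = bbox(line)
        target = next((k for k in kept
                       if abs(bbox(k)[1] - box[1]) <= tolerance
                       and abs(bbox(k)[3] - box[3]) <= tolerance), None)
        if target is None:
            kept.append(line)
            continue
        for word in line.find_all('span', class_='ocrx_word'):
            target.append(word)  # append() moves the word out of its old line
        # Remove the emptied line and any ancestors left without words
        node = line
        while node.name in ('span', 'p', 'div') and \
                node.find('span', class_='ocrx_word') is None:
            parent = node.parent
            node.decompose()
            node = parent
    return soup


html = """
<div class='ocr_carea' id='block_1_8' title="bbox 95 779 341 884">
 <p class='ocr_par' id='par_1_16' title="bbox 95 779 341 884">
  <span class='ocr_line' id='line_1_30' title="bbox 96 819 341 839; baseline 0 0">
   <span class='ocrx_word' id='word_1_66' title='bbox 300 819 341 839'>No:</span>
  </span>
  <span class='ocr_line' id='line_1_31' title="bbox 95 859 334 884; baseline -0.004 -4">
   <span class='ocrx_word' id='word_1_69' title='bbox 218 859 334 884'>Payment:</span>
  </span>
 </p>
</div>
<div class='ocr_carea' id='block_1_9' title="bbox 371 819 542 840">
 <p class='ocr_par' id='par_1_17' title="bbox 371 819 542 840">
  <span class='ocr_line' id='line_1_32' title="bbox 371 819 542 840; baseline 0.006 -1">
   <span class='ocrx_word' id='word_1_70' title='bbox 371 819 542 840'>4056111455</span>
  </span>
 </p>
</div>
"""

soup = merge_lines_on_same_row(BeautifulSoup(html, 'html.parser'))
texts = [' '.join(line.stripped_strings)
         for line in soup.find_all('span', class_='ocr_line')]
print(texts)
```

The sketch leaves the title attributes untouched, so the bbox of the merged line and of its parent block still describe the old extent and would need to be widened if later steps rely on them.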