I need to process many PDFs with different styles, and I am assuming that the main content will be in the most frequently occurring span style.

Is there a way to find the most common span style with BeautifulSoup in Python?

This is the command I used to find spans with one specific style, 'font-family: CBCDEE+ArialMT; font-size:12px':

spans = soup.find_all('span',
                      attrs={'style': 'font-family: CBCDEE+ArialMT; font-size:12px'})

Is there a way to find the most common one? Or, more basically, is there a way to list the span styles and count how often each one appears?

Many thanks.

RMS
Veronica Cheng

2 Answers

This may work for you:

spans = soup.find_all('span', style=True)
for span in spans:
    # print() as a function works on Python 3 (and on Python 2 for a single argument)
    print(span['style'])

This prints the style of every span tag in your file that has a style attribute.

Piyush S. Wanare
  • Many thanks. I now have a new difficulty: finding all the span styles with a font size larger than the most common one. I raised a new question: http://stackoverflow.com/questions/40843353/find-all-the-span-styles-with-font-size-larger-than-the-most-common-one-via-beau It would be much appreciated if you could give me some clues. – Veronica Cheng Nov 28 '16 at 11:39
  • @VeronicaWenqianCheng, please check, I have answered that question. – Piyush S. Wanare Nov 28 '16 at 12:29

You could use a Python Counter() to count how often each style occurs, and then use most_common() to get the most frequent one, as follows:

from bs4 import BeautifulSoup
from collections import Counter

html = """
    <span style="font-family: CBCDEE+ArialMT; font-size:12px">1</span>
    <span style="font-family: CBCDEE+ArialMT; font-size:14px">2</span>
    <span style="font-family: CBCDEE+ArialMT; font-size:14px">3</span>
    <span style="font-family: CBCDEE+Arial; font-size:12px">4</span>
    <span style="font-family: CBCDEE+ArialMT; font-size:12px">5</span>"""

soup = BeautifulSoup(html, "html.parser")    
style_counts = Counter()

for span in soup.find_all('span', style=True):
    style_counts[span['style']] += 1

print(style_counts.most_common(1)[0][0])

For this example it would display:

font-family: CBCDEE+ArialMT; font-size:12px
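
The question also asks for the full list of styles with their counts. Since the style attribute is compared as a plain string, two spans that differ only in spacing or a trailing semicolon would be counted separately. Below is a minimal sketch that normalizes each style string before counting. The helper name normalize_style and the sample strings are my own assumptions, not part of BeautifulSoup, and the list of strings stands in for what you would collect from the soup:

```python
from collections import Counter


def normalize_style(style):
    """Hypothetical helper: return a canonical form of a CSS style string."""
    declarations = {}
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue  # skip empty pieces such as the one after a trailing ';'
        name, value = declaration.split(":", 1)
        declarations[name.strip().lower()] = value.strip()
    # Sort by property name so declaration order does not matter either
    return "; ".join(
        "{}: {}".format(name, value) for name, value in sorted(declarations.items())
    )


# Assumed sample data. With a real document you would build this list as:
#   styles = [span['style'] for span in soup.find_all('span', style=True)]
styles = [
    "font-family: CBCDEE+ArialMT; font-size:12px",
    "font-family:CBCDEE+ArialMT;font-size: 12px;",
    "font-family: CBCDEE+ArialMT; font-size:14px",
]

style_counts = Counter(normalize_style(style) for style in styles)

# most_common() with no argument returns every style, most frequent first
for style, count in style_counts.most_common():
    print(count, style)
```

Whether this normalization is enough depends on how your PDF converter writes the style attribute, so treat it as a starting point.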
Martin Evans
  • You're welcome. If you are happy with the solution, don't forget to click on the grey tick next to it to mark it as the accepted answer, and get your Scholar badge. – Martin Evans Nov 23 '16 at 11:37
  • I'd like to do so, but I haven't got enough reputation to click...T T – Veronica Cheng Nov 23 '16 at 11:58
  • You should be able to click the tick, but not the up arrow. – Martin Evans Nov 23 '16 at 11:58
  • Done, Martin. Many thanks. I now have a new difficulty: finding all the span styles with a font size larger than the most common one. I raised a new question: http://stackoverflow.com/questions/40843353/find-all-the-span-styles-with-font-size-larger-than-the-most-common-one-via-beau It would be much appreciated if you could give me some clues. – Veronica Cheng Nov 28 '16 at 11:37