I understand how to get the text from a specific div or span style from this question: How to find the most common span styles

What I'm struggling with now is finding all the span styles whose font size is larger than the most common one.

I suspect I should use regular expressions, but do I first need to extract the most common font size?

Also, how do you test for "larger than" when the value being compared is a string?

Veronica Cheng
  • Maybe you should fetch all the styles into one list, iterate through it, and store the `font-size` of each style in an array (just the number, extracted with a regex). Then you can find which number is used most often, and also which ones are greater. – Piyush S. Wanare Nov 28 '16 at 12:09

2 Answers

This may help you:

    import re
    from collections import Counter

    from bs4 import BeautifulSoup

    # Assumes `soup` is an already-parsed page,
    # e.g. soup = BeautifulSoup(html, "html.parser")

    usedFontSize = []  # list of all font sizes used (as integers, in px)

    # Find all the spans that have a style attribute
    spans = soup.find_all('span', style=True)
    for span in spans:
        styleTag = span['style']
        fontSize = re.findall(r"font-size:\s*(\d+)px", styleTag)
        if fontSize:  # skip styles that do not set a px font size
            usedFontSize.append(int(fontSize[0]))

    # Find the most commonly used font size
    count = Counter(usedFontSize)
    # Print every font size with its number of occurrences
    print(count.most_common())
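
Once the sizes are integers, the "larger than" part of the question becomes an ordinary numeric comparison. The following is only a minimal sketch under some assumptions: every size is given in `px`, the helper names `font_size_px` and `larger_than_most_common` are made up for illustration, and the sample list stands in for `[span['style'] for span in spans]`.

```python
import re
from collections import Counter

# Hypothetical pattern and helpers, not part of BeautifulSoup.
FONT_SIZE_RE = re.compile(r"font-size:\s*(\d+)px")

def font_size_px(style):
    """Return the font size in px as an int, or None if the style sets none."""
    match = FONT_SIZE_RE.search(style)
    return int(match.group(1)) if match else None

def larger_than_most_common(styles):
    """Return (most common size, styles whose size is larger than it)."""
    sizes = [font_size_px(style) for style in styles]
    counts = Counter(size for size in sizes if size is not None)
    if not counts:
        return None, []
    most_common = counts.most_common(1)[0][0]
    larger = [style for style, size in zip(styles, sizes)
              if size is not None and size > most_common]
    return most_common, larger

# Made-up sample data standing in for the style strings of the spans.
styles = [
    "font-family: ArialMT; font-size:12px",
    "font-family: ArialMT; font-size:14px",
    "font-family: Arial; font-size:12px",
    "font-family: ArialMT; font-size:9px",
    "font-family: ArialMT; font-size:18px",
]
most_common, larger = larger_than_most_common(styles)
print(most_common)
for style in larger:
    print(style)
```

Comparing integers avoids the string-comparison problem entirely; the original style strings are kept so they can be used to look the spans up again.
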
Piyush S. Wanare

To find all the span styles with font sizes larger than the most common span style using BeautifulSoup, you need to parse each CSS style that has been returned.

Parsing CSS is better done using a library such as cssutils. This would then let you access the fontSize attribute directly.

This would have a value such as 12px. Because it is a string, it does not sort numerically (for example, "9px" would sort after "12px"). To get around this, you could use a library such as natsort.
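
To illustrate the sorting problem in isolation, here is a small sketch using only the standard library. The sample values and the helper `px_value` are made up, and the helper assumes every value ends in `px`; natsort should produce the same order as the key function without needing one.

```python
sizes = ["12px", "9px", "100px", "14px"]

# Plain string sorting compares character by character,
# so "100px" comes first and "9px" comes last.
print(sorted(sizes))  # ['100px', '12px', '14px', '9px']

def px_value(size):
    """Hypothetical helper: strip the 'px' suffix and return the number."""
    return int(size[:-2])

# Sorting on the numeric part gives the expected order.
numeric_order = sorted(sizes, key=px_value)
print(numeric_order)  # ['9px', '12px', '14px', '100px']
```
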

So, first parse each style into a CSS object. At the same time, keep a list that pairs each span's soup element with the parsed CSS for its style.

Now use the fontSize attribute as the key for sorting with natsort. This gives a list of styles sorted correctly by font size, largest first (by using reverse=True). takewhile() then reads entries from the sorted list up to the point where the size matches the most common one, which leaves exactly the entries larger than the most common one.
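
The reason a "not equal" test yields "larger than" can be seen on plain integers. This is just an illustrative sketch with made-up values, assuming the most common size is already known:

```python
from itertools import takewhile

sizes = [12, 14, 1, 12, 18, 15, 12]  # illustrative font sizes in px
most_common = 12                     # assumed to be known already

# Largest first: [18, 15, 14, 12, 12, 12, 1]
descending = sorted(sizes, reverse=True)

# takewhile() stops at the first entry equal to the most common size.
# Because the list is in descending order, everything before that
# point is necessarily larger.
larger = list(takewhile(lambda size: size != most_common, descending))
print(larger)  # [18, 15, 14]
```

This only works because the list is sorted first; on an unsorted list the same test would stop too early or let smaller sizes through.
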

    from bs4 import BeautifulSoup
    from collections import Counter
    from itertools import takewhile
    import cssutils
    import natsort

    html = """
        <span style="font-family: ArialMT; font-size:12px">1</span>
        <span style="font-family: ArialMT; font-size:14px">2</span>
        <span style="font-family: ArialMT; font-size:1px">3</span>
        <span style="font-family: Arial; font-size:12px">4</span>
        <span style="font-family: ArialMT; font-size:18px">5</span>
        <span style="font-family: ArialMT; font-size:15px">6</span>
        <span style="font-family: ArialMT; font-size:12px">7</span>
        """

    soup = BeautifulSoup(html, "html.parser")
    style_counts = Counter()
    parsed_css_style = []       # Holds list of tuples (css_style, span)

    # Count each style string and keep its parsed CSS alongside its span
    for span in soup.find_all('span', style=True):
        style_counts[span['style']] += 1
        parsed_css_style.append((cssutils.parseStyle(span['style']), span))

    most_common_style = style_counts.most_common(1)[0][0]
    most_common_css_style = cssutils.parseStyle(most_common_style)
    # Sort by font size, largest first
    css_styles = natsort.natsorted(parsed_css_style, key=lambda x: x[0].fontSize, reverse=True)

    print("Styles larger than most common font size of {} are:".format(most_common_css_style.fontSize))

    # Read entries until the most common font size is reached
    for css_style, span in takewhile(lambda x: x[0].fontSize != most_common_css_style.fontSize, css_styles):
        print("  Font size: {:5}  Text: {}".format(css_style.fontSize, span.text))

In the example shown, the most commonly used font size is 12px, so three entries are larger than this:

    Styles larger than most common font size of 12px are:
      Font size: 18px   Text: 5
      Font size: 15px   Text: 6
      Font size: 14px   Text: 2

To install you will probably need:

    pip install natsort
    pip install cssutils

Note that this assumes the font sizes used on your website are consistent. It compares only the numerical value and cannot compare different font metrics.
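
If the sizes might not all use the same unit, one option is to split each value into a number and a unit and refuse to compare across units. This is only a sketch of that idea, not something cssutils or natsort does for you; the pattern `SIZE_RE` and the helpers `parse_size` and `is_larger` are hypothetical names.

```python
import re

# Hypothetical pattern: a number followed by a unit such as px, em or %.
SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z%]+)\s*$", re.IGNORECASE)

def parse_size(text):
    """Split a value such as '12px' into (12.0, 'px')."""
    match = SIZE_RE.match(text)
    if match is None:
        raise ValueError("unrecognised font size: {!r}".format(text))
    return float(match.group(1)), match.group(2).lower()

def is_larger(size, reference):
    """Return True if size is larger than reference; both must share a unit."""
    value, unit = parse_size(size)
    ref_value, ref_unit = parse_size(reference)
    if unit != ref_unit:
        raise ValueError("cannot compare {} with {}".format(unit, ref_unit))
    return value > ref_value

print(is_larger("14px", "12px"))  # True
print(is_larger("9px", "12px"))   # False
```

Raising an error on mixed units is a deliberate choice here: converting between units such as `em` and `px` depends on the page context, so the sketch does not attempt it.
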

Martin Evans
  • Thanks Martin, I'm still testing it, because I need to feed the font-size information back into the soup function to capture the content. Just a tiny question first: why does != give "larger than"? Shouldn't it mean "not equal to"? By the way, I have no idea why someone marked this as not helpful. – Veronica Cheng Nov 29 '16 at 11:28
  • The list at that point is already sorted from largest to smallest, so the `takewhile` is used to read entries out of the full list until it matches the `12px` entry. – Martin Evans Nov 29 '16 at 11:30
  • Thanks Martin, I've tried to feed the font size back into the soup row.get function using a regular expression, but it gave this error: AttributeError: 'ResultSet' object has no attribute 'get' – Veronica Cheng Nov 29 '16 at 11:51
  • Sorry, not near a PC for a few days now. If you need the associated soup for a given entry, it just needs to be stored at the same time as getting the styles. I will update it when I get a chance. – Martin Evans Nov 29 '16 at 19:30
  • I have updated the script to give you access to the `soup` for each of the span entries. – Martin Evans Dec 02 '16 at 09:18