
I wrote a function that parses all headers based on their tags (h1, h2, ...). Now I want to extend it with a feature that extracts text based on font size, say 20px or 1.5em, regardless of the header tags. In other words, I want to retrieve any text written with a font-size greater than X, wherever it appears on the page. The function takes a JSON file as input, which contains arbitrary HTML (plus whatever else a website might include, e.g. CSS).
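
To make the goal concrete, here is a minimal sketch of the inline-style case only, using the standard library's `html.parser` rather than bs4. The names `font_size_px`, `LargeTextParser`, and `large_text` are hypothetical, and the sketch assumes well-formed HTML and treats 1em as a fixed 16px (the usual browser default), which ignores that em is really relative to the parent's font size.

```python
import re
from html.parser import HTMLParser

BASE_PX = 16.0  # assumption: 1em/1rem == 16px (common browser default)
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # never get an end tag
SIZE_RE = re.compile(r"font-size\s*:\s*(\d*\.?\d+)\s*(px|em|rem)", re.I)


def font_size_px(style):
    """Return the font-size declared in an inline style string, in px, or None."""
    match = SIZE_RE.search(style or "")
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2).lower()
    return value if unit == "px" else value * BASE_PX


class LargeTextParser(HTMLParser):
    """Collect text whose nearest inline font-size exceeds min_px."""

    def __init__(self, min_px):
        super().__init__()
        self.min_px = min_px
        self.stack = []  # effective font size (or None) per open element
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        own = font_size_px(dict(attrs).get("style"))
        inherited = self.stack[-1] if self.stack else None
        # Children inherit the closest ancestor's size unless they set their own.
        self.stack.append(own if own is not None else inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        size = self.stack[-1] if self.stack else None
        if text and size is not None and size > self.min_px:
            self.found.append(text)


def large_text(html, min_px):
    """Return text fragments styled inline with a font-size above min_px."""
    parser = LargeTextParser(min_px)
    parser.feed(html)
    return parser.found
```

Text that gets its size from a stylesheet or from a tag's default (such as h1) is not detected by this sketch; it only looks at `style` attributes.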

Based on the Beautiful Soup documentation on crummy.com, one possible option seems to be soup.fetch(); however, I haven't found many examples that use it for this purpose.

Since font-size may well be set in a CSS component rather than in the HTML itself, I'm not sure that bs4 is the right package for this. I assume the answer involves cssutils or tinycss, but I haven't been able to find the best way to use them for this task.
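
To illustrate the stylesheet side of the problem, here is a rough sketch that scans embedded `<style>` blocks with regular expressions and lists the selectors whose font-size exceeds a threshold. It is a stand-in for a real CSS parser such as cssutils or tinycss, not a use of either library: the function name `large_font_selectors` is hypothetical, 1em is assumed to equal 16px, and external stylesheets, the cascade, and specificity are all ignored.

```python
import re

BASE_PX = 16.0  # assumption: 1em/1rem == 16px (common browser default)

STYLE_RE = re.compile(r"<style[^>]*>(.*?)</style>", re.I | re.S)
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")  # "selectors { declarations }"
SIZE_RE = re.compile(r"font-size\s*:\s*(\d*\.?\d+)\s*(px|em|rem)", re.I)


def large_font_selectors(html, min_px):
    """Return CSS selectors from <style> blocks whose font-size exceeds min_px."""
    selectors = []
    for css in STYLE_RE.findall(html):
        css = re.sub(r"/\*.*?\*/", "", css, flags=re.S)  # drop CSS comments
        for selector_group, body in RULE_RE.findall(css):
            match = SIZE_RE.search(body)
            if not match:
                continue
            size = float(match.group(1))
            if match.group(2).lower() != "px":
                size *= BASE_PX  # naive em/rem conversion
            if size > min_px:
                # "h1, .title" declares the same rule for several selectors.
                selectors.extend(
                    s.strip() for s in selector_group.split(",") if s.strip()
                )
    return selectors
```

Each returned selector could then be handed to Beautiful Soup's `soup.select(selector)` to pull the matching elements and their text, which keeps bs4 for the HTML side and leaves only the CSS lookup to a separate step.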

For reference, my code for the header tags was posted for review: https://codereview.stackexchange.com/questions/166671/extract-html-content-based-on-tags-specifically-headers/166674?noredirect=1#comment317280_166674.

Posts I've checked: What is the pythonic way to implement a css parser/replacer ;
Find all the span styles with font size larger than the most common one via beautiful soup python ;
Search in HTML page using Regex patterns with python ;
How to parse a web page containing CSS and HTML using python ;
how to extract text within font tag using beautifulsoup ;
Extract text with bold content from css selector

Thanks much,

oba2311
  • How does it parse the text based on the font size? Do you mean that you know which header tag has what font size? – Moon Cheesez Jun 27 '17 at 11:32
  • Thanks for the comment @MoonCheesez . I mean regardless of the headers, I want a feature that brings any text written in font-size greater than X. I'll edit for clarity - thanks. – oba2311 Jun 27 '17 at 11:49

0 Answers