How can I design my regex script to scrape for a very specific attribute, like color?

Question

I was practicing writing a scraper script this weekend. My method was to adapt a scraper that I've worked with before, from scraping for a stock "price" to scraping for an attribute: the colors used in a website. I've researched libraries and tools, such as lxml and Beautiful Soup, and attempted some debugging, but I can't quite figure it out.

Goal: return a list of all of the colors used on a website

This is what I wrote:

import urllib
import re

url="https://cloud.google.com/edu"
htmlfile = urllib.urlopen(url)
htmlsource = htmlfile.read()

regex = '<color:#aaa>'
pattern = re.compile(regex)
color = re.findall(pattern, htmlsource)
print "color", color

What I keep getting in return is: color

1) Your "regex" is just a single string, so that's all it captures. 2) Don't use regex for HTML. 3) Is that a valid HTML tag, or are you trying to parse CSS? — OneCricketeer, Jun 19 '17 at 01:04
[**Don't parse HTML with regex.**](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Jonathon Reinhart, Jun 19 '17 at 02:05
Thank you, cricket! Upon stepping away from the problem I realized that I had only included a piece of the expression, not the whole thing. — , Jun 22 '17 at 03:57
Also wondering, should I specify that it should fuzzy match to color + a few characters to get the color codes? To your #3, I used the "Inspect" function when I started looking, and there are CSS packages linked that include the color references I'm looking for...so I guess I'm trying to parse CSS! Does that change my process? — , Jun 22 '17 at 04:14

score 1 · Answer 1 · answered Jun 19 '17 at 02:01

regex = '<color:#aaa>' is going to catch strings that look exactly like '<color:#aaa>'

If you look at the source code of the page you're trying to scrape (view-source:https://cloud.google.com/edu/) and search it with your browser (Ctrl+F), you'll notice that the string '<color:#aaa>' is not present anywhere.
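To see this concretely, here is a small sketch in Python 3 (the question's code is Python 2). The sample strings are made up for illustration and are not taken from the actual page. It shows that a pattern with no special characters only matches its own literal text:

```python
import re

# Hypothetical snippets of page source, invented for illustration.
with_literal = '<p style="color:#aaa">hello</p> <color:#aaa>'
without_literal = '<p style="color:#aaa">hello</p>'

pattern = re.compile('<color:#aaa>')

# Only the exact literal text '<color:#aaa>' is found.
literal_hits = pattern.findall(with_literal)
print(literal_hits)

# 'color:#aaa' inside a style attribute is not a match, so the list is empty.
no_hits = pattern.findall(without_literal)
print(no_hits)
```

An empty list from `findall` is what the script in the question ends up with, which is why nothing useful is printed after "color".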

If you wanted to grab the colors used on that page, you'd have to retrieve styling substrings that basically looked like these:

color:#xxx
color:rgba(x,x,x,x)

But those could vary slightly:

6 digits hex color, instead of 3 digits
rgb with 3 arguments instead of rgba with 4 arguments
'color' is not named just 'color', but 'something-else-color'
random spacing inside the substring
etc

That's where regular expressions come in handy. We could craft a couple of regular expressions that handle these scenarios (or a single, big, ugly regular expression for both). A couple of quick, incomplete examples, off the top of my head:

'color\:\#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})'
'color\:rgba?\(\d+\,\d+\,\d+(?:\,[\d.]+)?\)'

I imagine you could strip the substring 'color:' from the occurrences afterward.
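Putting those pieces together, a rough Python 3 sketch might look like the following. The name `extract_colors`, the combined pattern, and the sample CSS are all made up for illustration, and the pattern is still incomplete: it ignores named colors such as `red`, `hsl()` values, and 4- or 8-digit hex codes. A capture group around the value does the job of stripping the 'color:' prefix.

```python
import re

# Assumed pattern: an optional "something-" prefix, the word "color",
# optional spaces around the colon, then a hex code or an rgb()/rgba() call.
COLOR_RE = re.compile(
    r'[\w-]*color\s*:\s*'
    r'(#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b'
    r'|rgba?\(\s*[\d.]+\s*(?:,\s*[\d.%]+\s*){2,3}\))'
)

def extract_colors(text):
    """Return the distinct color values found in text, in first-seen order."""
    seen = []
    for value in COLOR_RE.findall(text):
        # Drop inner spaces and lowercase so '#AAA' and '#aaa' count once.
        normalized = value.replace(' ', '').lower()
        if normalized not in seen:
            seen.append(normalized)
    return seen

# Hypothetical stylesheet text, invented for illustration.
sample_css = """
a { color:#aaa; background-color: #FF0000; }
p { color: rgba(0, 0, 0, 0.5); border-color:rgb(10,20,30); color:#aaa }
"""
colors = extract_colors(sample_css)
print(colors)
# ['#aaa', '#ff0000', 'rgba(0,0,0,0.5)', 'rgb(10,20,30)']
```

Since the colors on the page in question appear to live in linked CSS files rather than in the HTML itself, the text passed to a function like this would probably need to be the downloaded stylesheet contents, not only the page source.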

Thank you L. Alvarez! I'm going to read about CSS and substring matching before I ask for more help. Thanks. — , Jun 22 '17 at 04:21
