I was practicing writing a scraper script this weekend. My method was to adapt a scraper that I've worked with before, changing it from scraping a stock "price" to scraping an attribute: the colors used on a website. I've researched libraries and tools, such as lxml and Beautiful Soup, and attempted some debugging, but I can't quite figure it out.

Goal: return a list of all of the colors used on a website

This is what I wrote:

import urllib
import re

url="https://cloud.google.com/edu"
htmlfile = urllib.urlopen(url)
htmlsource = htmlfile.read()

regex = '<color:#aaa>'
pattern = re.compile(regex)
color = re.findall(pattern, htmlsource)
print "color", color

What I keep getting in return is: color

  • 1) Your "regex" is just a single string, so that's all it captures. 2) Don't use regex for HTML. 3) Is that a valid HTML tag, or are you trying to parse CSS? – OneCricketeer Jun 19 '17 at 01:04
  • [**Don't parse HTML with regex.**](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Jonathon Reinhart Jun 19 '17 at 02:05
  • Thank you, cricket! Upon stepping away from the problem I realized that I had only included a piece of the expression, not the whole thing. –  Jun 22 '17 at 03:57
  • Also wondering, should I specify that it should fuzzy match to color + a few characters to get the color codes? To your #3, I used the "Inspect" function when I started looking, and there are CSS packages linked that include the color references I'm looking for...so I guess I'm trying to parse CSS! Does that change my process? –  Jun 22 '17 at 04:14

1 Answer

regex = '<color:#aaa>' will only match strings that look exactly like '<color:#aaa>'.

If you look at the source code of the page you're trying to scrape ( view-source:https://cloud.google.com/edu/ ), and you do a search with your browser (ctrl+f) you'll notice that the string '<color:#aaa>' is not present anywhere.
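To see the literal-match behaviour in isolation, here is a minimal sketch. It uses Python 3 rather than the Python 2 of the question, and the `sample` string is made up for illustration; it is not taken from the Google page.

```python
import re

# Made-up snippet of markup for illustration (not from the real page).
sample = '<p style="color:#aaa">hi</p>'

# The pattern from the question: matches only the exact text '<color:#aaa>'.
literal_matches = re.findall(r'<color:#aaa>', sample)

# Dropping the angle brackets matches the substring that is actually there.
substring_matches = re.findall(r'color:#aaa', sample)

print("literal:", literal_matches)
print("substring:", substring_matches)
```

An empty list from `re.findall` means the pattern never occurred in the text, which is consistent with not finding it via ctrl+f.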

If you wanted to grab the colors used on that page, you'd have to retrieve styling substrings that basically looked like these:

  1. color:#xxx
  2. color:rgba(x,x,x,x)

But those could vary slightly:

  • 6 digits hex color, instead of 3 digits
  • rgb with 3 arguments instead of rgba with 4 arguments
  • 'color' is not named just 'color', but 'something-else-color'
  • random spacing inside the substring
  • etc

That's where regular expressions come in handy. We could craft a couple of regular expressions that handle these scenarios (or a single, big, ugly regular expression for both). A couple of quick, incomplete examples, off the top of my head:

  1. 'color\:\#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})'
  2. 'color\:rgba?\(\d+\,\d+\,\d+(?:\,[\d.]+)?\)'

I imagine you could strip the substring 'color:' from the matches afterward.
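Putting the variations listed above together, one possible sketch looks like this. It is Python 3, and the names `COLOR_RE`, `extract_colors` and `sample_css` are hypothetical, introduced only for this example. It assumes you already have the CSS or HTML text as a string, and it is still incomplete: named colors, `hsl()` and 4- or 8-digit hex values are not handled.

```python
import re

# Hypothetical combined pattern (an illustration, not a complete CSS parser).
COLOR_RE = re.compile(
    r"""
    [\w-]*color\s*:\s*      # 'color', 'background-color', ... then ':'
    (                       # capture only the value, so 'color:' is dropped
        \#(?:[0-9a-f]{6}|[0-9a-f]{3})\b                           # #rrggbb or #rgb
        |
        rgba?\(\s*\d+\s*,\s*\d+\s*,\s*\d+\s*(?:,\s*[\d.]+\s*)?\)  # rgb()/rgba()
    )
    """,
    re.VERBOSE | re.IGNORECASE,
)


def extract_colors(text):
    """Return the sorted, de-duplicated color values found in text."""
    found = set()
    for match in COLOR_RE.finditer(text):
        # Normalise case and remove inner spacing so duplicates collapse.
        found.add(re.sub(r"\s+", "", match.group(1)).lower())
    return sorted(found)


# Made-up stylesheet text for illustration.
sample_css = """
body { color: #AAA; background-color:#1a2b3c; }
a { color: rgba(0, 0, 0, 0.5); border-color: rgb(255,255,255) }
p { color:#aaa }
"""

print(extract_colors(sample_css))
```

Because the capture group wraps only the value, there is no need to strip 'color:' in a separate step. For the colors on a real page you would probably run this over the linked stylesheets as well as the HTML itself.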

Luis Alvarez
  • Thank you L. Alvarez! I'm going to read about CSS and substring matching before I ask for more help. Thanks. –  Jun 22 '17 at 04:21