How can I find a file name in a block of text using python?

Question

I have fetched the HTML of a webpage using Python, and I now want to find all of the .css files that are linked to in the header and save each as its own variable (I know how to do this part). I tried partitioning, as shown below, but I got the error "IndexError: string index out of range" upon running it.

style = src.partition(".css")
style = style[0].partition('<link href=')
print style[2]
c =1

I do not think that this is the right way to approach this, so I would love some advice. Many thanks in advance. Here is a section of the kind of text I need to extract .css file(s) from.

    <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />

<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />

<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />

You've accepted an answer that seems to have upvotes for some strange reason. Using a regex to parse HTML is just ugly, error prone, liable to break and inflexible. You should be using a proper HTML parser to process HTML data (lxml.html, BeautifulSoup, etc.). HTML is structured data, it's not just "text". – Jon Clements, Jul 26 '12 at 22:14

Ben Ruijl · Accepted Answer · 2012-07-26T22:09:53.120

4

You should use a regular expression for this. Try the following:

/href="(.*\.css[^"]*)/g

EDIT

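import re
# Raw string so "\." reaches the regex engine as a literal dot;
# [^"]* stops each match at the closing quote of the href value
matches = re.findall(r'href="([^"]*\.css[^"]*)"', html)
print(matches)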
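
As a rough check of what a pattern like this captures, here is a minimal sketch run against three of the link tags from the question. The `sample` string and the `css_href` name are illustrative assumptions, and the pattern uses `[^"]*` rather than `.*` so that a match cannot run past the closing quote.

```python
import re

# Hypothetical sample: three link tags taken from the question's HTML.
sample = '''
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
'''

# [^"]* keeps each match inside a single double-quoted attribute value.
css_href = re.compile(r'href="([^"]*\.css[^"]*)"')

matches = css_href.findall(sample)
print(matches)
```

Note that this still only sees lowercase, double-quoted `href` attributes, and it would also match links that sit inside HTML comments, which is the kind of fragility the comment above warns about.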

edited Jul 26 '12 at 22:09

answered Jul 26 '12 at 21:55


I've expanded my answer. Does this help? – Ben Ruijl Jul 26 '12 at 22:10

score 2 · Answer 2 · edited May 23 '17 at 12:13

My answer is along the same lines as Jon Clements' answer, but I tested mine and added a drop of explanation.

You should not use a regex. You can't parse HTML with a regex. The regex answer might work, but writing a robust solution is very easy with lxml. This approach returns the full href attribute of every <link rel="stylesheet"> tag in the document head and nothing else.

from lxml import html

def extract_stylesheets(page_content):
    doc = html.fromstring(page_content)                        # Parse
    return doc.xpath('//head/link[@rel="stylesheet"]/@href')   # Search

There is no need to check the filenames, since the results of the xpath search are already known to be stylesheet links, and there's no guarantee that the filenames will have a .css extension anyway. The simple regex will catch only a very specific form, but the general html parser solution will also do the right thing in cases such as this, where the regex would fail miserably:

<link REL="stylesheet" hREf = 

     '/stylesheets/print?1342791421'
  media="print"
><!-- link href="/css/stylesheet.css" -->

It could also be easily extended to select only stylesheets for a particular media type.
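
If lxml is not installed, the same parser-based idea can be sketched with the standard library's `html.parser`, including an optional filter on the media type. This is only a sketch under stated assumptions: the `StylesheetCollector` class, the `media` parameter, and the `tricky` sample are hypothetical names introduced here, the collector looks at every `<link>` tag rather than only those inside `<head>`, and the media filter handles simple comma-separated lists, not full media queries.

```python
from html.parser import HTMLParser


class StylesheetCollector(HTMLParser):
    """Collect (href, media) pairs from <link rel="stylesheet"> tags."""

    def __init__(self):
        super().__init__()
        self.stylesheets = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; comments are
        # routed to handle_comment, so commented-out links never get here.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "stylesheet" in rel_tokens and attributes.get("href"):
            self.stylesheets.append(
                (attributes["href"], attributes.get("media") or "")
            )


def extract_stylesheets(page_content, media=None):
    """Return stylesheet hrefs, optionally only those listing `media`."""
    collector = StylesheetCollector()
    collector.feed(page_content)
    collector.close()
    hrefs = []
    for href, media_attr in collector.stylesheets:
        media_types = [m.strip().lower() for m in media_attr.split(",")]
        if media is None or media.lower() in media_types:
            hrefs.append(href)
    return hrefs


# The awkward tag from above, plus one ordinary link from the question.
tricky = """<link REL="stylesheet" hREf =

     '/stylesheets/print?1342791421'
  media="print"
><!-- link href="/css/stylesheet.css" -->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />"""

print(extract_stylesheets(tricky))
print(extract_stylesheets(tricky, media="print"))
```

The commented-out link is ignored and the extensionless print stylesheet is still found, because the decision is based on the `rel` attribute rather than on the file name.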

Jon Clements · Answer 3 · 2012-07-26T22:35:13.110

For what it's worth, here is a version using lxml.html as the parsing library.

untested

import lxml.html
from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse

sample_html = """<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />

<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />

<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
"""

page = lxml.html.fromstring(sample_html)
# Keep only the path of each URL, dropping any ?query string
link_paths = (p.path for p in map(urlparse, page.xpath('//head/link/@href')))
for path in link_paths:
    if path.rsplit('.', 1)[-1].lower() == 'css': # implement smarter error handling here
        pass # do whatever
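
The extension check in the loop can be tried on its own, without lxml, using only the standard library. This is a minimal sketch: `is_css_href` is a hypothetical helper name, and the `hrefs` list is copied by hand from the sample HTML rather than produced by a parser.

```python
import posixpath
from urllib.parse import urlparse


def is_css_href(href):
    """Return True if the path part of href ends in .css (case-insensitive)."""
    path = urlparse(href).path  # drops the ?query and #fragment parts
    extension = posixpath.splitext(path)[1]
    return extension.lower() == ".css"


# hrefs copied from the sample HTML in the question
hrefs = [
    "/stylesheets/master.css?1342791430",
    "/stylesheets/print.css?1342791421",
    "/apple-touch-icon-precomposed.png",
    "http://dribbble.com/shots/popular.rss",
]
css_hrefs = [h for h in hrefs if is_css_href(h)]
print(css_hrefs)
```

Splitting the URL first matters because the cache-busting query string (`?1342791430`) would otherwise end up as part of the "extension".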
