I'm working on building a Java program that will download a copy of a website to a local machine while maintaining the original file hierarchy.

I'm using the following patterns. To find CSS links on pages such as http://www.w3schools.com/css/css_howto.asp (not working):

private static final String HTML_CSS_TAG_PATTERN = "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
private static final String CSS_TAG_PATTERN = "(?i)<link([^>]+)>(.+?)>";

To find images (working fine):

private static final String HTML_IMG_TAG_PATTERN = "\\s*(?i)src\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
private static final String IMG_TAG_PATTERN = "(?i)<img([^>]+)>(.+?)>";

To find links on pages such as http://www.w3schools.com/html/html_links.asp (working fine):

private static final String HTML_A_HREF_TAG_PATTERN = "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
private static final String HTML_A_TAG_PATTERN = "(?i)<a([^>]+)>(.+?)</a>";

The link and image patterns work, but the CSS pattern doesn't. I would like it to extract the link to the CSS file so that I can save it. Could anyone point out what I missed?

3 Answers

Try this for CSS_TAG_PATTERN:

<link[^>]+?text/css[^>]*?>

It will match:

<link rel="stylesheet" type="text/css" href="//cdn.sstatic.net/stackoverflow/all.css?v=0eb8b68aff29">
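A minimal sketch of how this pattern might be used to pull out the stylesheet URL in two steps: first find each `<link>` tag, then search that tag for its `href`. The class name `CssLinkExtractor`, the method `extractCssLinks`, and the simplified `HREF` pattern are assumptions for illustration, not the asker's actual code. Note that the tag pattern has no capturing group, so the whole match is read with `group()` rather than `group(1)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssLinkExtractor {
    // Pattern from this answer: matches a whole <link ...> tag that mentions text/css.
    private static final Pattern CSS_TAG =
            Pattern.compile("<link[^>]+?text/css[^>]*?>", Pattern.CASE_INSENSITIVE);

    // Simplified href pattern (assumption): group 1 = double-quoted value,
    // group 2 = single-quoted value, group 3 = unquoted value.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^'\">\\s]+))",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractCssLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher tag = CSS_TAG.matcher(html);
        while (tag.find()) {
            // CSS_TAG defines no capturing group, so group(1) would throw
            // IndexOutOfBoundsException; group() returns the whole matched tag.
            Matcher href = HREF.matcher(tag.group());
            if (href.find()) {
                // Exactly one of the three alternatives participated in the match.
                for (int i = 1; i <= 3; i++) {
                    if (href.group(i) != null) {
                        links.add(href.group(i));
                        break;
                    }
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" type=\"text/css\" "
                + "href=\"//cdn.sstatic.net/stackoverflow/all.css?v=0eb8b68aff29\">";
        System.out.println(extractCssLinks(html));
    }
}
```

Keeping the tag pattern and the attribute pattern separate means the `href` search only ever sees the text of one `<link>` tag, so it cannot pick up an `href` from a neighbouring `<a>` element.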
andyf
  • It's throwing the following error: Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1 at java.util.regex.Matcher.group(Unknown Source) at CSSRegEx.grabHTMLLinks(CSSRegEx.java:42) at HTML.main(HTML.java:63) I think that means that it's finding the reference to the CSS, but there is still a mistake in CSS_TAG_PATTERN? – user2680842 Dec 03 '13 at 01:51
  • what is expected in group 1? – andyf Dec 04 '13 at 06:08
  • https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – paisley.london Sep 17 '18 at 14:31

To make sure you only get CSS stylesheets, try the following CSS_TAG_PATTERN:

<link.*\s+rel="stylesheet"([^>]+)>

This pattern will match the following two tags:

    <link rel="stylesheet" type="text/css" href="theme.css">
    <link type="text/css" rel="stylesheet"  href="theme.css">

but not

    <link type="text/css" rel="license"  href="someStuff">
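The three cases above can be checked with a short sketch. The class name `StylesheetLinkCheck` and the helper `attributesAfterRel` are hypothetical; the pattern is the one from this answer, escaped for a Java string literal. Group 1 holds whatever attributes follow `rel="stylesheet"`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StylesheetLinkCheck {
    // Pattern from this answer; backslashes and quotes are escaped for Java.
    private static final Pattern STYLESHEET_TAG = Pattern.compile(
            "<link.*\\s+rel=\"stylesheet\"([^>]+)>", Pattern.CASE_INSENSITIVE);

    // Returns the attribute text after rel="stylesheet" (group 1),
    // or null when the tag does not match the pattern.
    static String attributesAfterRel(String tag) {
        Matcher m = STYLESHEET_TAG.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] tags = {
            "<link rel=\"stylesheet\" type=\"text/css\" href=\"theme.css\">",
            "<link type=\"text/css\" rel=\"stylesheet\"  href=\"theme.css\">",
            "<link type=\"text/css\" rel=\"license\"  href=\"someStuff\">"
        };
        for (String tag : tags) {
            System.out.println(tag + " -> " + attributesAfterRel(tag));
        }
    }
}
```

One limitation worth knowing: because `([^>]+)` needs at least one character after `rel="stylesheet"`, a tag that ends with the `rel` attribute, such as `<link href="theme.css" rel="stylesheet">`, is not matched, and an `href` placed before `rel` never lands in group 1.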
chili_h

Try this pattern

<link.+?text/css.*?>

It will match

<link rel="stylesheet" type="text/css" href="theme.css">
<link type="text/css" rel="stylesheet"  href="theme.css">
<link type="text/css" rel="license"  href="someStuff">
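A quick way to see which tags a `text/css`-based pattern accepts is sketched below, assuming the dot is written outside a character class. In Java regular expressions a bracketed `[.]` matches only a literal period, so the sketch includes that form for comparison. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class TextCssPatternCheck {
    // Dot outside brackets: any character except a line terminator.
    private static final Pattern ANY_CHAR =
            Pattern.compile("<link.+?text/css.*?>");

    // Dot inside brackets: only a literal '.' character.
    private static final Pattern LITERAL_DOT =
            Pattern.compile("<link[.]+?text/css[.]*?>");

    static boolean matchesAnyChar(String tag) {
        return ANY_CHAR.matcher(tag).find();
    }

    static boolean matchesLiteralDot(String tag) {
        return LITERAL_DOT.matcher(tag).find();
    }

    public static void main(String[] args) {
        String[] tags = {
            "<link rel=\"stylesheet\" type=\"text/css\" href=\"theme.css\">",
            "<link type=\"text/css\" rel=\"stylesheet\"  href=\"theme.css\">",
            "<link type=\"text/css\" rel=\"license\"  href=\"someStuff\">"
        };
        for (String tag : tags) {
            System.out.println(tag
                    + " anyChar=" + matchesAnyChar(tag)
                    + " literalDot=" + matchesLiteralDot(tag));
        }
    }
}
```

Because this approach keys on `type="text/css"` rather than `rel`, it also accepts the `rel="license"` tag, so the result may need a further check if only stylesheets are wanted.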
Abhishek Anand