9

I need to get all URLs (url() expressions) from CSS files. For example:

b { background: url(img0) }
b { background: url("img1") }
b { background: url('img2') }
b { background: url( img3 ) }
b { background: url( "img4" ) }
b { background: url( 'img5' ) }
b { background: url (img6) }
b { background: url ("img7") }
b { background: url ('img8') }
{ background: url('noimg0) }
{ background: url(noimg1') }
/*b { background: url(noimg2) }*/
b { color: url(noimg3) }
b { content: 'url(noimg4)' }
@media screen and (max-width: 1280px) { b { background: url(img9) } }
b { background: url(img10) }

I need to get all img* URLs, but not noimg* URLs (invalid syntax or invalid property or inside comments).

I've tried using good old regular expressions. After some trial and error I got this:

private static IEnumerable<string> ParseUrlsRegex (string source)
{
    var reUrls = new Regex(@"(?nx)
        url \s* \( \s*
            (
                (?! ['""] )
                (?<Url> [^\)]+ )
                (?<! ['""] )
                |
                (?<Quote> ['""] )
                (?<Url> .+? )
                \k<Quote>
            )
        \s* \)");
    return reUrls.Matches(source)
        .Cast<Match>()
        .Select(match => match.Groups["Url"].Value);
}

That's one crazy regex, but it still doesn't work: it matches three invalid URLs (noimg2, noimg3 and noimg4). Furthermore, everyone will say that using regex to parse a complex grammar is wrong.

Let's try another approach. According to this question, the only viable option is ExCSS (others are either too simple or outdated). With ExCSS I got this:

    private static IEnumerable<string> ParseUrlsExCss (string source)
    {
        var parser = new StylesheetParser();
        parser.Parse(source);
        return parser.Stylesheet.RuleSets
            .SelectMany(i => i.Declarations)
            .SelectMany(i => i.Expression.Terms)
            .Where(i => i.Type == TermType.Url)
            .Select(i => i.Value);
    }

Unlike the regex solution, this one doesn't list invalid URLs. But it misses some valid ones, namely img9 and img10. This looks like a known issue with some CSS syntax, and it can't be fixed without rewriting the whole library from scratch. The ANTLR rewrite seems to be abandoned.

Question: How to extract all URLs from CSS files? (I need to parse any CSS files, not only the one provided as an example above. Please don't check for "noimg" or assume one-line declarations.)

N.B. This is not a "tool recommendation" question, as any solution will be fine, be it a piece of code, a fix to one of the above solutions, a library or anything else; and I've clearly defined the function I need.

Athari
  • 1
    I tried to write a parser for this answer. Alas the css specification wasn't helpful _(See http://www.nczonline.net/blog/2011/01/11/the-sorry-state-of-the-css3-specifications/ and http://stackoverflow.com/questions/6977177/w3c-css-grammar-syntax-oddities)_. I think for this reason ExCSS missed some valid items. – Daniel Gimenez Aug 24 '13 at 01:45
  • It's even harder than you think. There is an additional case that should not match: URLs _within quoted strings_: e.g. `p[example="...url(link)..."] { color: red }`. (See: [the CSS spec](http://www.w3.org/TR/CSS2/syndata.html#rule-sets).) Thus, you cannot simply pluck out the URLs; you must parse the CSS file from start to end and correctly handle all quoted strings, comments and CSS tokens. That said, I'm pretty sure a single (non-trivial) regex solution _can_ neatly do the trick, but will require using a callback function. Stand by... – ridgerunner Aug 26 '13 at 16:46
  • Do you have choice of language? I would solve the problem in Perl.. – Owen Beresford Aug 26 '13 at 21:43

9 Answers

6

Finally got Alba.CsCss, my port of CSS parser from Mozilla Firefox, working.

First and foremost, the question contains two errors:

  1. The url (img) syntax is incorrect, because whitespace is not allowed between url and ( in the CSS grammar. Therefore, "img6", "img7" and "img8" should not be returned as URLs.

  2. An unclosed quote in url function (url('img)) is a serious syntax error; web browsers, including Firefox, do not seem to recover from it and simply skip the rest of the CSS file. Therefore, requiring the parser to return "img9" and "img10" is unnecessary (but necessary if the two problematic lines are removed).

With CsCss, there are two solutions.

The first solution is to rely just on the tokenizer CssScanner.

List<string> uris = new CssLoader().GetUris(source).ToList();

This will return all "img" URLs (except those mentioned in error #1 above), but will also include "noimg3", as property names are not checked.

The second solution is to properly parse the CSS file. This will most closely mimic the behavior of browsers (including stopping parsing after an unclosed quote).

var css = new CssLoader().ParseSheet(source, SheetUri, BaseUri);
List<string> uris = css.AllStyleRules
    .SelectMany(styleRule => styleRule.Declaration.AllData)
    .SelectMany(prop => prop.Value.Unit == CssUnit.List
        ? prop.Value.List : new[] { prop.Value })
    .Where(value => value.Unit == CssUnit.Url)
    .Select(value => value.OriginalUri)
    .ToList();

If the two problematic lines are removed, this will return all correct "img" URLs.

(The LINQ query is complex, because background-image property in CSS3 can contain a list of URLs.)

Athari
  • You're correct with point 1, the grammar in the CSS spec is `"url(" whitespace (string or urlchar* ) whitespace ")"`. However it is reasonable that a User Agent wouldn't be as strict and allow the whitespace. – Daniel Gimenez Aug 26 '13 at 05:58
  • While regex can be used to parse CSS, it is a crime just like [p̟͕̝͞a̪̺ŗ̹̥͕s̹̯̺̗͕̼i̶̠̤̭̳̤̩n̢̞͇̰͖̭g̵̣̹͙̖̥̖͕ ̶͍̼̱͈͎͈̜H̴̰̻̗̭̭T̶͈̫̗̳͇̙̮M͓̗L̘̻̫͙ ̢͉ẉ͖̘̻͟i̧̳̼̥̪̹̟̜t̖h͍͖̰̭ ̷r͜e̜͎̣̦͞g̶̱̯̱̩ͅex̵̙̝̙͈](http://stackoverflow.com/a/1732454/293099). So only a solution with a parser should be accepted as correct. – Athari Aug 27 '13 at 06:28
  • Congratulations for the nice library! – Click Ok Jun 10 '18 at 18:44
5

RegEx is a very powerful tool. But when a bit more flexibility is needed, I prefer to just write a little code.

So for a non-RegEx solution, I came up with the following. Note that a bit more work would be needed to make this code more generic to handle any CSS file. For that, I would also use my text parsing helper class.

IEnumerable<string> GetUrls(string css)
{
    char[] trimChars = new char[] { '\'', '"', ' ', '\t', };

    foreach (var line in css.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries))
    {
        // Extract portion within curly braces (this version assumes all on one line)
        int start = line.IndexOf('{');
        int end = line.IndexOf('}', start + 1);
        if (start < 0 || end < 0)
            continue;
        start++; end--; // Remove braces

        // Get value portion
        start = line.IndexOf(':', start);
        if (start < 0)
            continue;

        // Extract value and trim whitespace and quotes
        string content = line.Substring(start + 1, end - start).Trim(trimChars);

        // Extract URL from url() value
        if (!content.StartsWith("url", StringComparison.InvariantCultureIgnoreCase))
            continue;
        start = content.IndexOf('(');
        end = content.IndexOf(')', start + 1);
        if (start < 0 || end < 0)
            continue;
        start++;
        content = content.Substring(start, end - start).Trim(trimChars);

        if (!content.StartsWith("noimg", StringComparison.InvariantCultureIgnoreCase))
            yield return content;
    }
}

UPDATE:

What you appear to be asking seems beyond the scope of a simple how-to question for Stack Overflow. I do not believe you will get satisfactory results using regular expressions. You will need some code to parse your CSS and handle all the special cases that come with it.

Since I've written a lot of parsing code and had a bit of time, I decided to play with this a bit. I wrote a simple CSS parser and wrote an article about it. You can read the article and download the code (for free) at A Simple CSS Parser.

My code parses a block of CSS and stores the information in data structures. My code separates and stores each property/value pair for each rule. However, a bit more work is still needed to get the URL from the property values. You will need to parse them from the property value.

The code I originally posted will give you a start on how you might approach this. But if you want a truly robust solution, then some more sophisticated code will be needed. You might want to take a look at my code to parse the CSS. I use techniques in that code, such as parsing a quoted value, that could be used to easily handle values such as url('img(1)').
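To illustrate the remaining step (getting URLs out of a single property value while respecting quotes), here is a minimal sketch. It is not taken from the linked article; the class `UrlValueScanner` and its helpers are hypothetical names, and the code assumes it receives only the value part of one declaration, with comments already removed.

```csharp
using System;
using System.Collections.Generic;

public static class UrlValueScanner
{
    // Hypothetical helper: yields the argument of every url(...) found in one
    // property value, ignoring anything that sits inside a quoted string.
    public static IEnumerable<string> ExtractUrls(string value)
    {
        int i = 0;
        while (i < value.Length)
        {
            char c = value[i];
            if (c == '"' || c == '\'')
            {
                // Skip a whole quoted string so that 'url(x)' inside it is not reported.
                int close = FindClosingQuote(value, i);
                if (close < 0) yield break;          // unterminated string: stop scanning
                i = close + 1;
            }
            else if (StartsUrlFunction(value, i))
            {
                int p = SkipWhitespace(value, i + 4);
                if (p < value.Length && (value[p] == '"' || value[p] == '\''))
                {
                    // Quoted form: everything up to the matching quote is the URL,
                    // so parentheses inside it are harmless.
                    int close = FindClosingQuote(value, p);
                    if (close < 0) yield break;
                    int end = SkipWhitespace(value, close + 1);
                    if (end < value.Length && value[end] == ')')
                        yield return value.Substring(p + 1, close - p - 1);
                    i = close + 1;
                }
                else
                {
                    // Unquoted form: read up to ')' and reject stray quotes,
                    // whitespace or parentheses inside the URL.
                    int end = value.IndexOf(')', p);
                    if (end < 0) yield break;
                    string raw = value.Substring(p, end - p).TrimEnd();
                    if (raw.Length > 0 && raw.IndexOfAny(new[] { '"', '\'', '(', ' ', '\t' }) < 0)
                        yield return raw;
                    i = end + 1;
                }
            }
            else
            {
                i++;
            }
        }
    }

    // True when "url(" starts at position i and is not the tail of a longer identifier.
    private static bool StartsUrlFunction(string s, int i)
    {
        bool insideIdentifier = i > 0 &&
            (char.IsLetterOrDigit(s[i - 1]) || s[i - 1] == '-' || s[i - 1] == '_');
        return !insideIdentifier &&
            string.Compare(s, i, "url(", 0, 4, StringComparison.OrdinalIgnoreCase) == 0;
    }

    // Returns the index of the quote that closes the string opened at 'open', or -1.
    private static int FindClosingQuote(string s, int open)
    {
        char quote = s[open];
        for (int i = open + 1; i < s.Length; i++)
        {
            if (s[i] == '\\') i++;                   // skip the escaped character
            else if (s[i] == quote) return i;
        }
        return -1;
    }

    private static int SkipWhitespace(string s, int i)
    {
        while (i < s.Length && char.IsWhiteSpace(s[i])) i++;
        return i;
    }
}
```

For example, `UrlValueScanner.ExtractUrls("url('img(1)')")` yields `img(1)`, while `'url(noimg4)'` and `url(noimg1')` yield nothing. Deciding which properties may carry a URL is left to the caller.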

I think this is a pretty good start. I could write the remaining code for you as well. But what's the fun in that. :)

Jonathan Wood
  • Again a version optimized for the sample I've provided in the question. I need to handle *any* CSS, not just the one provided above. The solution you've given is no better than regex. It'll also fail to parse `url('img(1)')`. – Athari Aug 21 '13 at 05:24
  • @Athari: I've done a lot of parsing code and could easily extend this to write code to parse CSS more generally, as I suggested in my answer. But then I'd need to know more about how you wanted it structured, etc, as there could potentially be a lot of information. Your question seemed more focused on how you could specifically get the URL value. – Jonathan Wood Aug 21 '13 at 15:46
  • 1
    Your code should also be comment-aware so that it doesn't parse comments. – Karl-Johan Sjögren Aug 25 '13 at 12:45
  • @Karl-JohanSjögren: Please see the update to my answer. The issue of comments has been thoroughly addressed. – Jonathan Wood Aug 25 '13 at 19:03
2

In my opinion, your regex is overly complicated. This one works: url\s*[(][\s'""]*(?<Url>img[\w]*)[\s'""]*[)]. Here is what it searches for:

  1. Start with url
  2. Then any whitespace after it (\s*)
  3. Next, exactly one opening parenthesis ([(])
  4. Then zero or more characters from the set: whitespace, ", ' ([\s'""]*)
  5. Next, the "URL": something starting with img followed by zero or more word characters ((?<Url>img[\w]*))
  6. Again, zero or more characters from the set: whitespace, ", ' ([\s'""]*)
  7. Finally, a closing parenthesis ([)])

The full working code:

        var source =
            "b { background: url(img0) }\n" +
            "b { background: url(\"img1\") }\n" +
            "b { background: url(\'img2\') }\n" +
            "b { background: url( img3 ) }\n" +
            "b { background: url( \"img4\" ) }\n" +
            "b { background: url( \'img5\' ) }\n" +
            "b { background: url (img6) }\n" +
            "b { background: url (\"img7\") }\n" +
            "b { background: url (\'img8\') }\n" +
            "{ background: url(\'noimg0) }\n" +
            "{ background: url(noimg1\') }\n" +
            "/*b { background: url(noimg2) }*/\n" +
            "b { color: url(noimg3) }\n" +
            "b { content: \'url(noimg4)\' }\n" +
            "@media screen and (max-width: 1280px) { b { background: url(img9) } }\n" +
            "b { background: url(img10) }";


        string strRegex = @"url\s*[(][\s'""]*(?<Url>img[\w]*)[\s'""]*[)]";
        var reUrls = new Regex(strRegex);

        var result = reUrls.Matches(source)
                           .Cast<Match>()
                           .Select(match => match.Groups["Url"].Value).ToArray();
        bool isOk = true;
        for (var i = 0; i <= 10; i++)
        {
            if (!result.Contains("img" + i))
            {
                Console.WriteLine("Missing img"+i);
                isOk = false;
            }
        }
        for (var i = 0; i <= 4; i++)
        {
            if (result.Contains("noimg" + i))
            {
                Console.WriteLine("Redundant noimg" + i);
                isOk = false;
            }
        }
        if (isOk)
        {
            Console.WriteLine("Yes. It is ok :). The result is:");
            foreach (var s in result)
            {
                Console.WriteLine(s);
            }

        }
        Console.ReadLine();
Piotr Stapp
  • 2
    `img` is just an example. This code needs to parse *any* CSS file. – Athari Aug 20 '13 at 08:43
  • so what is a difference between `img` and `noimg`? Syntax errors? – Piotr Stapp Aug 20 '13 at 09:06
  • Syntax errors, comments, invalid properties etc.: browsers will load "img" files, but will not load "noimg" files. – Athari Aug 20 '13 at 11:03
  • Maybe this is a stupid question: what are you trying to achieve? I have an idea how to handle your problem, but it won't be easy to achieve – Piotr Stapp Aug 20 '13 at 12:17
  • I want to download all files on which an HTML page depends. That requires getting all images used in CSS files which HTML page links to. – Athari Aug 20 '13 at 13:56
  • So downloading more files than needed is not a problem for you. You will just have some redundant files (noimg*), but everything will work just fine. The following regexp `url\s*[(][\s]*(?([^"')]+|["][^"')]+["]|['][^"')]+[']))\s*[)]` will extract every img* + noimg[2,3,4]. You can optimize the solution if you remove all comments first: http://stackoverflow.com/questions/5272167/using-regex-to-remove-css-comments – Piotr Stapp Aug 20 '13 at 14:58
1

Probably not the most elegant possible solution, but seems to do the job you need done.

public static List<string> GetValidUrlsFromCSS(string cssStr)
{
    //Enter properties that can validly contain a URL here (in lowercase):
    List<string> validProperties = new List<string>(new string[] { "background", "background-image" });

    List<string> validUrls = new List<string>();
    //We'll use your regex for extracting the valid URLs
    var reUrls = new Regex(@"(?nx)
        url \s* \( \s*
            (
                (?! ['""] )
                (?<Url> [^\)]+ )
                (?<! ['""] )
                |
                (?<Quote> ['""] )
                (?<Url> .+? )
                \k<Quote>
            )
        \s* \)");
    //First, remove all the comments (Singleline lets .*? span line breaks in multi-line comments)
    cssStr = Regex.Replace(cssStr, "\\/\\*.*?\\*\\/", String.Empty, RegexOptions.Singleline);
    //Next remove all the property groups with no selector
    string oldStr;
    do
    {
        oldStr = cssStr;
        cssStr = Regex.Replace(cssStr, "(^|{|})(\\s*{[^}]*})", "$1");
    } while (cssStr != oldStr);
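    //Get properties (the trailing delimiter is only looked ahead at, so that it
    //can also serve as the leading delimiter of the next declaration)
    var matches = Regex.Matches(cssStr, "({|;)([^:{;]+:[^;}]+)(?=;|})");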
    foreach (Match match in matches)
    {
        string matchVal = match.Groups[2].Value;
        string[] matchArr = matchVal.Split(':');
        if (validProperties.Contains(matchArr[0].Trim().ToLower()))
        {
            //Since this is a valid property, extract the URL (if there is one)
            MatchCollection validUrlCollection = reUrls.Matches(matchVal);
            if (validUrlCollection.Count > 0)
            {
                validUrls.Add(validUrlCollection[0].Groups["Url"].Value);
            }
        }
    }
    return validUrls;
}
AlliterativeAlice
  • The choice was simple as this is the only complete regex solution without "cheating", that is, without assumption that CSS would look exactly like in the provided example. It is also the most maintainable regex solution as it does not try to fit all logic into one huge "clever" regex. – Athari Aug 27 '13 at 06:08
  • Some notes on the code quality: 1) Unless you're forced to use an old version of .NET, `new List<string>{ new string[] { a, b } }` can be rewritten as `new List<string>{ a, b }`. 2) `validProperties` can be an array (declared outside the function), as LINQ has a `Contains` method which works on arrays too. 3) The function can return `IEnumerable<string>` and use `yield return` to return items. 4) I haven't checked yet, but the `do while` loop seems unnecessary as `Regex.Replace` should replace all occurrences. 5) Calls to `ToLower` should be replaced with `string.Equals` with ... – Athari Aug 27 '13 at 06:18
  • ... `StringComparison.OrdinalIgnoreCase` argument. – Athari Aug 27 '13 at 06:19
  • 1
    Although this'll probably work for 99.9% of the cases, to demonstrate why indeed a CSS-parser would be better (as noted by OP), this would fail: `content:'/*'; background:url(img1); content:'*/';` Just adding for future readers. – asontu Aug 28 '13 at 12:46
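
The failure case in the last comment comes from stripping comments without looking at quoted strings. A hedged sketch of one way around it, staying with regex: match quoted strings and comments in a single alternation and use a callback to delete only the comments. The class name `CssComments` is hypothetical, and the sketch assumes strings do not contain raw line breaks or other error cases a real CSS tokenizer would handle.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CssComments
{
    // Alternation order matters: at any position a quoted string is consumed
    // whole before the comment branch can see a "/*" inside it.
    // The last branch also accepts a comment left open until the end of input.
    private static readonly Regex StringOrComment = new Regex(
        @"""(?:\\.|[^""\\])*""|'(?:\\.|[^'\\])*'|/\*.*?(?:\*/|\z)",
        RegexOptions.Singleline);

    // Removes comments and leaves quoted strings untouched.
    public static string Strip(string css)
    {
        return StringOrComment.Replace(css, m =>
            m.Value.StartsWith("/*", StringComparison.Ordinal) ? string.Empty : m.Value);
    }
}
```

With this, `content:'/*'; background:url(img1); content:'*/';` passes through unchanged, so the `url(img1)` in the middle is still visible to the later steps.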
1

You can try a pattern like this, which may be more helpful:

@import ([""'])(?<url>[^""']+)\1|url\(([""']?)(?<url>[^""')]+)\2\)

Or

http://www.c-sharpcorner.com/uploadfile/rahul4_saxena/reading-and-parsing-a-css-file-in-Asp-Net/

Athari
Sajith
1

You need negative lookbehind to see if there is no /* without a following */ like this:

(?<!\/\*([^*]|\*[^\/])*)

This seems unreadable, it means:

(?<! -> preceding this match may not be:

\/\* -> /* (with escape slashes) followed by

([^*] -> any character that isn't *

|\*[^\/]) -> or a character that is *, but is itself followed by anything that isn't /

*) -> zero or more repetitions of the above (a character that isn't *, or a * not followed by /), and finally close the negative lookbehind

And you need positive lookbehind to see whether the property being set is a css property that accepts url() values. If you only are interested in background: and background-image: for instance, this would be the entire regex:

(?<!\/\*([^*]|\*[^\/])*)
(?<=background(?:-image)?:\s*)
url\s*\(\s*(('|")?)[^\n'"]+\1\s*\)

Since this version requires the css property background: or background-image: to precede the url(), it will not detect the 'url(noimg4)'. You could use simple pipes to add more accepted css properties: (?<=(?:border-image|background(?:-image)?):\s*)

I've used \1 rather than \k<Quote> because I'm not familiar with that syntax, which means you need the ?: to not capture unwanted subgroups. As far as I can test this works.

Finally I used [^\n'"] for the actual url because I understand from your comments that url('img(1)') should work and [^\)] from your OP won't parse that.

asontu
  • 1) CSS allows comments inside declarations, AFAIK, so checking for comments only on the declaration's boundaries is incorrect. 2) If you want to make your regex more readable without explaining every symbol, you can use [`(?n)`](http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#Explicit) and [`(?x)`](http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#Whitespace) options. 3) See [backreference constructs](http://msdn.microsoft.com/en-us/library/thwdfzxy.aspx) to learn about `\k` syntax. – Athari Aug 25 '13 at 12:53
  • Ah, yeah I suppose this wouldn't accept `background:/* something */ url(img3)` which is valid of course. – asontu Aug 27 '13 at 11:42
1

This solution skips comments and handles background-image. It also handles background, which, unlike background-image, can contain other values such as a color, a position or a repeat keyword. This is why I have added these cases: noimg5, img11, img12.

The data:

string subject =
    @"b { background: url(img0) }
      b { background: url(""img1"") }
      b { background: url('img2') }
      b { background: url( img3 ) }
      b { background: url( ""img4"" ) }
      b { background: url( 'img5' ) }
      b { background: url (img6) }
      b { background: url (""img7"") }
      b { background: url ('img8') }
      { background: url('noimg0) }
      { background: url(noimg1') }
      /*b { background: url(noimg2) }*/
      b { color: url(noimg3) }
      b { content: 'url(noimg4)' }
      @media screen and (max-width: 1280px) { b { background: url(img9) } }
      b { background: url(img10) }
      b { background: #FFCC66 url('img11') no-repeat }
      b { background-image: url('img12'); }
      b { background-image: #FFCC66 url('noimg5') }";

The pattern:

Comments are skipped because they are matched first. If a comment is left open (without a closing */), all the content after it is treated as a comment: (?>\*/|$).

The result is stored in the named capture url.

string pattern = @"
        /\*  (?> [^*] | \*(?!/) )*  (?>\*/|$)  # comments
      |
        (?<=
            background
            (?>
                -image \s* :     # optional '-image'
              |
                \s* :
                (?>              # allowed content before url 
                    \s*
                    [^;{}u\s]+   # all that is not a ; { } u
                    \s           # must be followed by one space at least
                )?
            )

            \s* url \s* \( \s*
            ([""']?)             # optional quote (single or double) in group 1
        )
        (?<url> [^""')\s]+ )     # named capture 'url' with an url inside
        (?=\1\s*\))              # must be followed by group 1 content (optional quote)
              ";
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace;
Match m = Regex.Match(subject, pattern, options);
List<string> urls = new List<string>();
while (m.Success)
{
    string url = m.Groups["url"].ToString();
    if (url!="") {
        urls.Add(url);
        Console.WriteLine(url);
    }
    m = m.NextMatch();
}
Casimir et Hippolyte
1

For such a problem, a simpler approach could do the trick.

  1. Break all the CSS commands into lines (suppose the CSS is simplified); in this case I would break at each ";" or "}".

  2. Read all the occurrences inside url(*), even the wrong ones.

  3. Create a pipeline using the command pattern that detects which lines are really eligible:

    • 3.1 Command1 (detect comment)
    • 3.2 Command2 (detect syntax error in URL)
    • 3.3 ...
  4. With the OK lines flagged, extract the OK URLs.

This is a simple approach that solves the problem efficiently, without an ultra-complex, unmanageable, magical regex.
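
The steps above could be sketched roughly as follows. This is only an illustration under the stated simplification: the names `UrlPipeline`, `Candidate`, `WellFormed` and `Checks` are hypothetical, a comment is assumed not to span several chunks, and splitting at ";" will break values that contain one (for example data: URIs).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UrlPipeline
{
    // Step 2: anything that looks like the start of url(...), even a wrong one.
    private static readonly Regex Candidate =
        new Regex(@"url\s*\(", RegexOptions.IgnoreCase);

    // A syntactically valid url(...): no space before "(", quotes balanced.
    private static readonly Regex WellFormed = new Regex(
        @"url\(\s*(?:""(?<u>[^""]*)""|'(?<u>[^']*)'|(?<u>[^\s""'()]+))\s*\)",
        RegexOptions.IgnoreCase);

    // Step 3: each "command" returns true when the chunk is still eligible.
    private static readonly Func<string, bool>[] Checks =
    {
        // 3.1 the candidate must not sit inside a comment opened earlier in the chunk
        chunk =>
        {
            int u = Candidate.Match(chunk).Index;
            return chunk.LastIndexOf("/*", u, StringComparison.Ordinal)
                <= chunk.LastIndexOf("*/", u, StringComparison.Ordinal);
        },
        // 3.2 the url(...) expression must be well formed
        chunk => WellFormed.IsMatch(chunk),
    };

    public static IEnumerable<string> GetUrls(string css)
    {
        return css.Split(';', '}')                                        // step 1
            .Where(chunk => Candidate.IsMatch(chunk))                     // step 2
            .Where(chunk => Checks.All(check => check(chunk)))            // step 3
            .Select(chunk => WellFormed.Match(chunk).Groups["u"].Value);  // step 4
    }
}
```

Further checks, such as a whitelist of property names, can be added as extra entries in `Checks` without touching the rest of the pipeline.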

Roger Barreto
1

This RegEx seems to solve the example provided:

background: url\s*\(\s*(["'])?\K\w+(?(1)(?=\1)|(?=\s*\)))(?!.*\*/)
animuson
alpha bravo