3

I have a web service that rewrites urls in css files so that they can be served via a CDN.

The css files can contain urls to images or fonts.

I currently have the following regex to match ALL urls within the css file:

(url\(\s*([\'\"]?+))((?!(https?\:|data\:|\.\.\/|\/))\S+)((\2)\s*\))

However, I now want to introduce support for custom fonts and need to target the urls within @font-face:

@font-face {
  font-family: 'FontAwesome';
  src: url("fonts/fontawesome-webfont.eot?v=4.0.3");
  src: url("fonts/fontawesome-webfont.eot?#iefix&v=4.0.3") format("embedded-opentype"), url("fonts/fontawesome-webfont.woff?v=4.0.3") format("woff"), url("fonts/fontawesome-webfont.ttf?v=4.0.3") format("truetype"), url("fonts/fontawesome-webfont.svg?v=4.0.3#fontawesomeregular") format("svg");
  font-weight: normal;
  font-style: normal;
}

I then came up with the following:

@font-face\s*\{.*(url\(\s*([\'\"]?+))((?!(https?\:|data\:|\.\.\/|\/))\S+)((\2)\s*\))\s*\}

The problem is that this matches the entire @font-face block and not just the urls inside. I thought I could use a lookbehind like so:

(?<=@font-face\s*\{.*)(url\(\s*([\'\"]?+))((?!(https?\:|data\:|\.\.\/|\/))\S+)((\2)\s*\))(?<=-\s*\})

Unfortunately, PCRE (which PHP uses) does not support variable-length repetition within a lookbehind, so I am stuck.

I do not wish to check for fonts by their extension as some fonts have the .svg extension which can conflict with images with the .svg extension.

In addition, I would also like to modify my original regex to match all other urls that are NOT within an @font-face:

.someclass {
  background: url('images/someimage.png') no-repeat;
}

Since I am unable to use lookbehinds, how can I extract the urls from those within a @font-face and those that are not within a @font-face?

F21
  • Do you need only to extract or do you want to be able to make a replace after? – Casimir et Hippolyte Jan 27 '14 at 22:15
  • I want to do a `preg_replace()`. Sorry for the confusion. I will edit my question :) – F21 Jan 27 '14 at 22:17
  • Why do you want to exclude urls that begin with "http"? Can you give examples of the kind of replacements you want to do? – Casimir et Hippolyte Jan 27 '14 at 22:27
  • Because those are fully defined urls. In those cases, the author of the css file wants to point to some specific location, so we should not modify them. I only want to rewrite urls that are relative or only contain folders and filenames. – F21 Jan 27 '14 at 22:29
  • Consider using a PHP CSS parser like: https://github.com/sabberworm/PHP-CSS-Parser – Dean Taylor Jan 27 '14 at 22:41
  • It seems to be a bit overkill to bring in a full library to do that. I am just rewriting the urls and sending the files out, so don't need the ability to do lots of fancy css transformations. :) – F21 Jan 27 '14 at 22:43
  • Your regular expressions are unlikely to deal with "real world" CSS files which contain comments, a parser of some kind would be a requirement unless you control the content of original CSS files. – Dean Taylor Jan 27 '14 at 22:49

2 Answers

14

Disclaimer: You may be better off using a library, because this is tougher than you think. I'll start this answer with how to match URLs that are not within @font-face {}. I also assume that the braces {} are balanced within @font-face {}.
Note: I'm going to use "~" as delimiters instead of "/", which will relieve me from escaping slashes later on in my expressions. Also note that I will be posting online demos from regex101.com, where I'll be using the g modifier. In PHP you should remove the g modifier and just use preg_match_all().
Let's use some regex Fu !!!
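As a small illustration of the delimiter note above, the two calls below run the same test. The sample URL is made up for the example; the point is only that the "~" version needs no escaped slashes.

```php
<?php
$url = 'http://example.com/fonts/a.woff'; // hypothetical sample value

// With "/" as the delimiter, every literal slash has to be escaped.
$withSlashes = preg_match('/^https?:\/\//', $url);

// With "~" as the delimiter, slashes can be written as-is.
$withTildes = preg_match('~^https?://~', $url);

var_dump($withSlashes === 1, $withTildes === 1);
```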

Part 1 : matching url's that are not within @font-face {}

1.1 Matching @font-face {}

Oh yes, this might sound "weird" but you will notice later on why :)
We'll need some recursive regex here:

@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1

demo
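To see the recursion on its own, here is a minimal sketch that uses only the brace-matching group, without the @font-face prefix. The input string is made up; it just contains one nested and one flat pair of braces.

```php
<?php
// Group 1 matches "{", then any mix of non-brace text and nested
// brace pairs (via the recursive call (?1)), then the closing "}".
$balanced = '~(\{(?:[^{}]+|(?1))*\})~';

$css = 'a { b { c } d } e { f }'; // made-up input with nested braces
preg_match_all($balanced, $css, $m);

print_r($m[0]); // the two top-level brace blocks
```

The nested "{ c }" is consumed by the recursive call, so it is part of the first match rather than a match of its own.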

1.2 Escaping @font-face {}

We'll append (*SKIP)(*FAIL) right after the previous regex, so that every @font-face {} block is matched and then discarded. See this answer to get an idea how it works.

demo
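A stripped-down sketch of the same idea, assuming square brackets instead of @font-face blocks to keep the pattern short (the sample string is invented):

```php
<?php
// Anything in [...] is matched by the first branch and then thrown away
// by (*SKIP)(*FAIL); only words outside the brackets reach the second branch.
$pattern = '~\[[^\]]*\](*SKIP)(*FAIL)|\w+~';

preg_match_all($pattern, 'foo [bar baz] qux', $m);
print_r($m[0]);
```

Without (*SKIP), the engine would retry inside the brackets after the forced failure and would report "bar" and "baz" as well.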

1.3 Matching url()

We'll use something like this:

url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|'|)           # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
(?!["']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url's with http, https or ftp)
(?:[^\\]|\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
\2               # Match what was matched in group 2
\s*              # Match optionally some whitespaces
\)               # Match )

Note that I'm using \2 because I've appended this to the previous regex which has group 1.
Here's another use of ("|')(?:[^\\]|\\.)*?\1.

demo

1.4 Matching the value inside url()

You might have guessed that we need some lookaround-fu. The problem is the lookbehind, since it needs to be fixed length. My workaround is the \K escape sequence: it resets the start of the reported match to the current position, so everything matched before it is dropped from the result. more-info
Well, let's drop \K somewhere in our expression and use a lookahead. Our final regex will be:

@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1
(*SKIP)(*FAIL)   # Skip it
|                # Or
url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|'|)           # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K               # Reset the match
(?!["']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url's with http, https or ftp)
(?:[^\\]|\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?=              # Lookahead
   \2            # Match what was matched in group 2
   \s*           # Match optionally some whitespaces
   \)            # Match )
)

demo
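To see \K in isolation, here is a minimal sketch with a much simpler pattern than the one above. The CSS string is made up for the example.

```php
<?php
$css = '.a { background: url( img/a.png ) } .b { src: url(f.woff) }'; // made-up CSS

// "url(" and any whitespace must be present, but \K drops them from the
// reported match, so only the part after \K ends up in $m[0].
preg_match_all('~url\(\s*\K[^)\s]+~', $css, $m);

print_r($m[0]);
```

This behaves like a variable-length lookbehind for `url(` plus optional whitespace, which PCRE would not accept as an actual lookbehind.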

1.5 Using the pattern in PHP

We'll need to escape a few things: quotes, and backslashes (\\\\ in the PHP string is one literal \ in the regex). We also need the right function and the right modifiers:

$regex = '~
@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1
(*SKIP)(*FAIL)   # Skip it
|                # Or
url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|\'|)          # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K               # Reset the match
(?!["\']?(?:https?://|ftp://))  # Put your negative-rules here (do not match URLs with http, https or ftp)
(?:[^\\\\]|\\\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?=              # Lookahead
   \2            # Match what was matched in group 2
   \s*           # Match optionally some whitespaces
   \)            # Match )
)
~xs';

$input = file_get_contents($css_file);
preg_match_all($regex, $input, $m);
echo '<pre>'. print_r($m[0], true) . '</pre>';

demo
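Since the goal in the question is a replacement rather than extraction, the same pattern can be handed to preg_replace(). The following is only a sketch: the CDN host and the sample stylesheet are invented, and the pattern is the one above written without comments (so the x modifier is dropped) inside a nowdoc to avoid the extra PHP escaping.

```php
<?php
// Compact form of the pattern above.
$regex = <<<'REGEX'
~@font-face\s*(\{(?:[^{}]+|(?1))*\})(*SKIP)(*FAIL)|url\s*\(\s*("|'|)\K(?!["']?(?:https?://|ftp://))(?:[^\\]|\\.)*?(?=\2\s*\))~s
REGEX;

// Made-up stylesheet: one @font-face block, one relative URL, one absolute URL.
$css = <<<'CSS'
@font-face { src: url("fonts/a.eot"); }
.x { background: url('images/b.png') no-repeat; }
.y { background: url(http://example.com/c.png); }
CSS;

// $0 is only the URL text, because \K discarded "url(" and the opening quote.
$out = preg_replace($regex, 'http://cdn.example.com/$0', $css);
echo $out, "\n";
```

Only the relative URL outside @font-face is prefixed; the font URL and the absolute URL are left untouched.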

Part 2 : matching url's that are within @font-face {}

2.1 Different approach

I want to do this part in 2 regexes because it will be a pain to match URL's that are within @font-face {} while taking care of the state of braces {} in a recursive regex.

And since we already have the pieces we need, we'll only need to apply them in some code:

  1. Match all @font-face {} instances
  2. Loop through these and match all url()'s

2.2 Putting it into code

$results = array(); // Just an empty array;
$fontface_regex = '~
@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1
~xs';

$url_regex = '~
url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|\'|)          # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K               # Reset the match
(?!["\']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url\'s with http, https or ftp)
(?:[^\\\\]|\\\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?=              # Lookahead
   \1            # Match what was matched in group 1
   \s*           # Match optionally some whitespaces
   \)            # Match )
)
~xs';

$input = file_get_contents($css_file);

preg_match_all($fontface_regex, $input, $fontfaces); // Get all font-face instances
if(isset($fontfaces[0])){ // If there is a match then
    foreach($fontfaces[0] as $fontface){ // Foreach instance
        preg_match_all($url_regex, $fontface, $r); // Let's match the url's
        if(!empty($r[0])){ // If there is a hit
            $results[] = $r[0]; // Then add it to the results array
        }
    }
}
echo '<pre>'. print_r($results, true) . '</pre>'; // Show the results

demo
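If the URLs should be rewritten in place rather than collected, the same two regexes can be nested with preg_replace_callback(). This is a sketch under assumptions: the function name rewriteFontFaceUrls, the CDN base and the sample CSS are all hypothetical, and the patterns are the ones above in compact form.

```php
<?php
// Hypothetical helper: prefix relative URLs inside @font-face blocks only.
function rewriteFontFaceUrls($css, $cdnBase)
{
    $fontfaceRegex = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})~s';
    $urlRegex = <<<'REGEX'
~url\s*\(\s*("|'|)\K(?!["']?(?:https?://|ftp://))(?:[^\\]|\\.)*?(?=\1\s*\))~s
REGEX;

    return preg_replace_callback($fontfaceRegex, function ($block) use ($urlRegex, $cdnBase) {
        // $block[0] is one whole @font-face { ... } rule.
        return preg_replace_callback($urlRegex, function ($url) use ($cdnBase) {
            return $cdnBase . $url[0]; // $url[0] is the bare URL text
        }, $block[0]);
    }, $css);
}

$css = '@font-face { src: url("fonts/a.eot"), url(fonts/a.woff); } .x { background: url(images/b.png); }';
$out = rewriteFontFaceUrls($css, 'http://cdn.example.com/');
echo $out, "\n";
```

Using a callback for the inner replacement, instead of a replacement string, means a `$` or `\` in the CDN base is never interpreted as a backreference.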

Join the regex chatroom!

HamZa
  • 1
    Your explanation really helped me a lot in understanding the more advanced things in regex! :) Unfortunately, I am unable to accept multiple answers as I ended up using Casimir et Hippolyte's solution with some modifications. Nevertheless, I have given you an upvote! – F21 Jan 28 '14 at 02:33
  • 1
    Great job +1 <°)))))))> – Casimir et Hippolyte Jan 28 '14 at 02:40
4

You can use this:

$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<quoted_content>
        (["']) (?>[^"'\\]++ | \\{2} | \\. | (?!\g{-1})["'] )*+ \g{-1}
    )
    (?<comment> /\* .*? \*/ )
    (?<url_skip> (?: https?: | data: ) [^"'\s)}]*+ )
    (?<other_content>
        (?> [^u}/"']++ | \g<quoted_content> | \g<comment>
          | \Bu | u(?!rl\s*+\() | /(?!\*) 
          | \g<url_start> \g<url_skip> ["']?+
        )++
    )
    (?<anchor> \G(?<!^) ["']?+ | @font-face \s*+ { )
    (?<url_start> url\( \s*+ ["']?+ )
)

\g<comment> (*SKIP)(*FAIL) |

\g<anchor> \g<other_content>?+ \g<url_start> \K [./]*+ 

( [^"'\s)}]*+ )    # url
~xs
LOD;

$result = preg_replace($pattern, 'http://cdn.test.com/fonts/$8', $data);
print_r($result);

test string

$data = <<<'LOD'
@font-face {
  font-family: 'FontAwesome';
  src: url("fonts/fontawesome-webfont.eot?v=4.0.3");
  src: url(fonts/fontawesome-webfont.eot?#iefix&v=4.0.3) format("embedded-opentype"),
     /*url("fonts/fontawesome-webfont.woff?v=4.0.3") format("woff"),*/
       url("http://domain.com/fonts/fontawesome-webfont.ttf?v=4.0.3") format("truetype"),
       url('fonts/fontawesome-webfont.svg?v=4.0.3#fontawesomeregular') format("svg");
  font-weight: normal;
  font-style: normal;
}
/*
@font-face {
  font-family: 'Font1';
  src: url("fonts/font1.eot");
} */
@font-face {
  font-family: 'Fon\'t2';
  src: url("fonts/font2.eot");
}
@font-face {
  font-family: 'Font3';
  src: url("../fonts/font3.eot");
}
LOD;

Main idea:

For better readability, the pattern is divided into named subpatterns. The (?(DEFINE)...) block doesn't match anything; it is only a definition section.

The main trick of this pattern is the use of the \G anchor, which means: start of the string or contiguous to the previous match. I added a negative lookbehind (?<!^) to rule out the first of these two cases (the start of the string).
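A minimal sketch of the \G(?<!^) idiom outside of CSS, assuming a made-up "ids:" list as the entry point instead of @font-face:

```php
<?php
// First branch: continue right where the previous match ended (\G), but
// not at the very start of the string ((?<!^)), and step over ", ".
// Second branch: the entry point "ids:".
$pattern = '~(?:\G(?<!^),\s*|ids:\s*)\K\d+~';

preg_match_all($pattern, 'ids: 1, 2, 3; other 4, 5', $m);
print_r($m[0]);
```

The numbers after "other" are not reported: once the chain is broken at the ";", \G no longer holds and no new "ids:" entry point follows.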

The <anchor> named subpattern is the most important because it allows a match only if @font-face { is found or immediately after the end of a url (this is why you can see a ["']?+ there).

<other_content> represents everything that is not a url section, but it also matches the url sections that must be skipped (urls that begin with "http:" or "data:"). The important detail of this subpattern is that it can't match the closing curly bracket of @font-face.

The mission of <url_start> is only to match url(".

\K removes everything that has been matched so far from the match result.

([^"'\s)}]*+) matches the url (the only thing that stays in the match result, together with any leading ./ or ../)

Since <other_content> and the url subpattern can't match a } (one that is outside quoted or comment parts), you are sure never to match anything outside of the @font-face definition. The second consequence is that the pattern always fails after the last url. Thus, at the next attempt the "contiguous branch" will fail until the next @font-face.

another trick:

The main pattern begins with \g<comment> (*SKIP)(*FAIL) | to skip all content inside comments /*....*/. \g<comment> refers to the basic subpattern that describes what a comment looks like. (*SKIP) forbids retrying the substring that has been matched before it (on its left, by \g<comment>) if the pattern fails on its right. (*FAIL) forces the pattern to fail. With this trick, comments are skipped and never become a match result (since the pattern fails).

subpatterns details:

quoted_content: It's used in <other_content> to avoid matching a url( or /* that is inside quotes.

(["'])              # capture group: the opening quote
(?>                 # atomic group: all possible content between quotes
    [^"'\\]++       # all that is not a quote or a backslash
  |                 # OR
    \\{2}           # two backslashes (an escaped backslash, so it doesn't escape what follows)
  |                 # OR
    \\.             # any escaped character
  |                 # OR
    (?!\g{-1})["']  # the other quote (the one that is not in the capture group)
)*+                 # repeat the atomic group zero or more times
\g{-1}              # backreference to the last capturing group
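The quoted_content subpattern can also be tried on its own. The sketch below uses it as a standalone pattern on a made-up line containing an escaped quote and a quote of the other kind:

```php
<?php
// Same pieces as quoted_content, written without the x modifier.
$quoted = <<<'REGEX'
~(["'])(?>[^"'\\]++|\\{2}|\\.|(?!\g{-1})["'])*+\g{-1}~s
REGEX;

// Made-up input: \' inside single quotes, and ' inside double quotes.
$css = <<<'CSS'
font-family: 'Fon\'t2'; src: url("a'b.eot")
CSS;

preg_match_all($quoted, $css, $m);
print_r($m[0]);
```

Both strings are matched whole: the escaped quote is consumed by `\\.`, and the single quote inside the double-quoted string is consumed by the "other quote" branch.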

other_content: all that is not the closing curly bracket, or an url without http: or data:

(?>                     # open an atomic group
    [^u}/"']++          # all characters that are not problematic!
  |
    \g<quoted_content>  # string inside quotes
  |
    \g<comment>         # string inside comments
  |
    \Bu                 # "u" not preceded by a word boundary
  |
    u(?!rl\s*+\()       # "u" not followed by "rl("  (not the start of an url definition)
  |                   
    /(?!\*)             # "/" not followed by "*" (not the start of a comment)
  |
    \g<url_start>       # match the url that begins with "http:"
    \g<url_skip> ["']?+ # until the possible quote
)++                     # repeat the atomic group one or more times

anchor

\G(?<!^) ["']?+    # contiguous to a precedent match with a possible closing quote
|                  # OR
@font-face \s*+ {  # start of the @font-face definition

Notice:

You can improve the main pattern:

After the last url of @font-face, the regex engine attempts to match with the "contiguous branch" of <anchor> and matches all characters until the }, which makes the pattern fail. Then, at each of those same characters, the regex engine must try the two branches of <anchor> (which will always fail until the }).

To avoid these useless tries, you can change the main pattern to:

\g<comment> (*SKIP)(*FAIL) |

\g<anchor> \g<other_content>?+
(?>
    \g<url_start> \K [./]*+  ([^"'\s)}]*+)
  | 
    } (*SKIP)(*FAIL)
)

With this new scenario, the first character after the last url is matched by the "contiguous branch", \g<other_content> matches everything up to the }, \g<url_start> fails immediately, the } is matched, and (*SKIP)(*FAIL) makes the pattern fail and forbids retrying these characters.

Casimir et Hippolyte
  • That is awesome! Do you mind adding some comments for the regex? I am a total regex noob and would love to find out how it works :) – F21 Jan 27 '14 at 23:56
  • Also, is there any way to invert this so that I can match the urls that are not within a `@font-face`? – F21 Jan 27 '14 at 23:57
  • I also noticed if I add an `http://` to the any url within the first `font-face`, the rest of the urls without `http://` are not captured. – F21 Jan 28 '14 at 00:01
  • I just discovered that it also breaks if the font url looks like: `.../fonts/blah.eot` or `/fonts/blah.eot` – F21 Jan 28 '14 at 00:25
  • Wow! Great explanation! I have never used any subpatterns before. Quick question about the resulting url matches: if they start with `/` or include anything that looks like `/../` or `../../` is it possible to drop them? – F21 Jan 28 '14 at 00:49
  • Sorry, I realized I made a mistake in my comment. I meant to say:if they start with `/` or include anything that looks like `/../` or `../../` is it possible to drop the `/` `/../` `../../` etc? Say if `../font/font.eot` was matched, just return `font/font.eot`? – F21 Jan 28 '14 at 00:56
  • Argh, I forgot the comments part. +1 – HamZa Jan 28 '14 at 01:25
  • Is it possible to have capture groups with this solution? I have been trying to add `(` and `)` around ``but it says `a numbered reference must not be zero`. – F21 Jan 28 '14 at 01:50
  • @F21: of course, like this: `(\g)` in `$matches[8]` – Casimir et Hippolyte Jan 28 '14 at 01:55
  • Thanks! That did it. But I am now stuck again :( I now have the urls like `fonts/fontawesome-webfont.eot?v=4.0.3`. And in another capture group, I have the front part of the urls: `url(`. What I want to do now is to include that front part and the modified url and replace it: `preg_replace( $pattern, '${8}' . 'http://cdn.test.com/fonts/' . '${0}', $data);` The problem is that the result is now: `url("../url("http://test.com/fonts/fonts/fontawesome-webfont.eot?v=4.0.3")` – F21 Jan 28 '14 at 02:05
  • @F21: You don't have to capture `url(`, you only need to add your path in the replacement string: `'http://cdn.test.com/fonts/$0'` – Casimir et Hippolyte Jan 28 '14 at 02:18
  • But that results in `url("../http://test.com/fonts/fonts/fontawesome-webfont.eot?v=4.0.3")` because the regex strips out the `../` `/` and other relative parts in front. – F21 Jan 28 '14 at 02:20
  • Maybe I need a capture group for the whole original untouched url, and a second one for the url with the relative parts stripped out. – F21 Jan 28 '14 at 02:20
  • @F21: You can simply put the url in a capturing group and move the `\K` before the `[./]*+`. I let you find the number of the capturing group as exercise. – Casimir et Hippolyte Jan 28 '14 at 02:28
  • ah that's it! Moving the `\K` did the job! – F21 Jan 28 '14 at 02:31
  • Just one more question! How can I allow my capture group to include the closing parenthesis? Currently, the captured result looks like this `url("../fonts/fontawesome-webfont.eot?v=4.0.3'`, but I would like to include the closing parenthesis: `url("../fonts/fontawesome-webfont.eot?v=4.0.3')`. It seems that the closing parenthesis is not matched together and is a separate match, so I am a bit lost as to where I should add the match for `\)`. – F21 Jan 28 '14 at 02:41
  • DOH! Your new way is even simpler than what I am trying to do! :) – F21 Jan 28 '14 at 02:50
  • Ah, I have found a flaw with that approach. If the url looks like: `url("/fontawesome-webfont.eot?v=4.0.3")`. The result turns into `url("/http://cdn.test.com/fonts/fonts/fontawesome-webfont.eot?v=4.0.3");` – F21 Jan 28 '14 at 02:56
  • @F21: No, because the slash is removed too with `[./]*+` – Casimir et Hippolyte Jan 28 '14 at 03:00
  • Ah, I just saw your new edit. That solves the problem. Thanks for your help today :) It's been a rough ride and I still have a long way to go with regexes :) – F21 Jan 28 '14 at 03:03