1

I'm trying to write a large python/bash script which converts my html/css mockups to Shopify themes. One step in this process is changing out all the script sources. For instance:

<script type="text/javascript" src="./js/jquery.bxslider.min.js"></script>

becomes

<script type="text/javascript" src="{{ 'jquery.bxslider.min.js' | asset_url }}"></script>

Here is what I have so far:

import re
test = """
  <script type="text/javascript" src="./js/jquery-1.8.3.min.js"></script>
  <!--<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js" type="text/javascript"></script>-->
  <script type="text/javascript" src="./js/ie-amendments.js"></script>
  <script type="text/javascript" src="./js/jquery.bxslider.min.js"></script>
  <script type="text/javascript" src="./js/jquery.colorbox-min.js"></script>
  <script type="text/javascript" src="./js/main.js"></script>
"""
out = re.sub( 'src=\"(.+)\"', 'src="{{ \'\\1\' | asset_url }}"', test, flags=re.MULTILINE )
out

prints out

'\n  <script type="text/javascript" src="{{ \'./js/jquery-1.8.3.min.js\' | asset_url }}"></script>\n  <!--<script src="{{ \'http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js" type="text/javascript\' | asset_url }}"></script>-->\n  <script type="text/javascript" src="{{ \'./js/ie-amendments.js\' | asset_url }}"></script>\n  <script type="text/javascript" src="{{ \'./js/jquery.bxslider.min.js\' | asset_url }}"></script>\n  <script type="text/javascript" src="{{ \'./js/jquery.colorbox-min.js\' | asset_url }}"></script>\n  <script type="text/javascript" src="{{ \'./js/main.js\' | asset_url }}"></script>\n'

I have two problems so far:

  1. Some of the backslash characters I'm using to escape the single quotes within my regexes are showing up in the output.

  2. My capture group is capturing the entire original source string, but I just need what comes after the last "/"

ANSWER: Per Martijn Pieters helpful suggestion, I checked out the ? regular expression operator, and came up with this solution, which perfectly solved my problem. Also, for the replacement expression, I encapsulated it in double-quotes as opposed to singles, and escaped the doubles, which ended up removing the unnecessary backslashes. Thanks guys!

re.sub( r'src=".+?([^/]+?\.js)"', "src=\"{{ '\\1' | asset_url }}\"", test, flags=re.MULTILINE )
James Heston
  • 923
  • 1
  • 8
  • 17
  • 1
    *Sigh* http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Cfreak Feb 20 '13 at 15:04
  • 1
    @Cfreak: This may be a limited enough case; script tags don't contain other tags and `src` attributes matching is constrained and regular enough. No need to knee-jerk here. – Martijn Pieters Feb 20 '13 at 15:06
  • @Cfreak: Actually didn't know that regexes were bad for parsing HTML. Thanks for the link - it was very informative! – James Heston Feb 20 '13 at 15:18
  • No problem. In Python a very good module for parsing HTML is BeautifulSoup http://www.crummy.com/software/BeautifulSoup/ – Cfreak Feb 20 '13 at 15:19
  • 1
    **Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/python.html for examples of how to properly parse HTML with Python modules. – Andy Lester Feb 20 '13 at 15:38

2 Answers2

1

Your expression is working just fine; Python is just showing you a string literal instead, where you'd have to escape quotes to be able to re-use it as a python string.

If you print the value, no such escaping takes place:

>>> re.sub( 'src=\"(.+)\"', 'src="{{ \'\\1\' | asset_url }}"', test, flags=re.MULTILINE )
'\n  <script type="text/javascript" src="{{ \'./js/jquery-1.8.3.min.js\' | asset_url }}"></script>\n  <!--<script src="{{ \'http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js" type="text/javascript\' | asset_url }}"></script>-->\n  <script type="text/javascript" src="{{ \'./js/ie-amendments.js\' | asset_url }}"></script>\n  <script type="text/javascript" src="{{ \'./js/jquery.bxslider.min.js\' | asset_url }}"></script>\n  <script type="text/javascript" src="{{ \'./js/jquery.colorbox-min.js\' | asset_url }}"></script>\n  <script type="text/javascript" src="{{ \'./js/main.js\' | asset_url }}"></script>\n'
>>> print(re.sub( 'src=\"(.+)\"', 'src="{{ \'\\1\' | asset_url }}"', test, flags=re.MULTILINE ))

  <script type="text/javascript" src="{{ './js/jquery-1.8.3.min.js' | asset_url }}"></script>
  <!--<script src="{{ 'http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js" type="text/javascript' | asset_url }}"></script>-->
  <script type="text/javascript" src="{{ './js/ie-amendments.js' | asset_url }}"></script>
  <script type="text/javascript" src="{{ './js/jquery.bxslider.min.js' | asset_url }}"></script>
  <script type="text/javascript" src="{{ './js/jquery.colorbox-min.js' | asset_url }}"></script>
  <script type="text/javascript" src="{{ './js/main.js' | asset_url }}"></script>

You can use ? to make +, * and ? qualifiers non-greedy; matching the minimum instead of the maximum. You can also match anything that is not a quote instead:

r'src="([^"]+)"'

which constrains that part of the regular expression much better; [^"] matches any character that is not a double quote.

When specifying a regular expression pattern, it is generally a good idea to use a python raw string literal (r'') instead, saves a lot of headaches as to what needs escaping and what does not. Using a raw string literal, your substitution pattern can be simplified to:

r'src="{{ \'\1\' | asset_url }}"' 

for a final line of:

re.sub(r'src="([^"]+)"', r'src="{{ \'\1\' | asset_url }}"', test, flags=re.MULTILINE)
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
1

Without incurring the wrath of some you could treat it as xml.

txt = """<html>
<script type="text/javascript" src="./js/jquery.bxslider.min.js"></script>
<script type="text/javascript" src="./js/jquery.another.min.js"></script>
</html>
"""

import xml.etree.ElementTree as ET
import os
root = ET.fromstring(txt)

for e in root.findall('script'):
    e.attrib['src'] =  "{{ '%s' | assert_url }}" % os.path.basename(e.attrib['src'])

print ET.tostring(root)

gives:

<html>
<script src="{{ 'jquery.bxslider.min.js' | assert_url }}" type="text/javascript" />
<script src="{{ 'jquery.another.min.js' | assert_url }}" type="text/javascript" />
</html>

The xml doc may be useful thereafter; it all depends how nice and well formed you HTML is.

sotapme
  • 4,695
  • 2
  • 19
  • 20