
Let's say we have the following text:

<a href="link">some link</a> How to transform "ordinary quotes" to «Guillemets»

What is needed is to transform it to

<a href="link">some link</a> How to transform «ordinary quotes» to «Guillemets»

using regex and Python.

I've tried

import re

content = '<a href="link">some link</a> How to transform "ordinary quotes" to «Guillemets»'

res = re.sub(r'(?:"([^>]*)")(?!>)', r'«\g<1>»', content)

print(res)

but, as @Wiktor Stribiżew noticed, this breaks when a tag has more than one attribute, so

<a href="link" target="_blank">some link</a> How to transform "ordinary quotes" to «Guillemets»

will be transformed to

<a href=«link" target=»_blank">some link</a> How to transform «ordinary quotes» to «Guillemets»

Update

Please note that text

  • can be HTML, e.g.:

<div><a href="link" target="_blank">some link</a> How to transform "ordinary quotes" to «Guillemets»</div>

  • can be plain text, e.g.:

How to transform "ordinary quotes" to «Guillemets»

  • can be plain text that includes some HTML tags, e.g.:

<a href="link" target="_blank">some link</a> How to transform "ordinary quotes" to «Guillemets»

mr_bulrathi
  • Your PHP preg_replace can be written as `re.sub(r'"([^>]*)"(?!>)', r'«\1»', content)` but I doubt it will do what you need. – Wiktor Stribiżew Apr 07 '19 at 08:05
  • Use an HTML parser. – Toto Apr 07 '19 at 08:47
  • As Toto commented above, you [really really shouldn't write your own HTML parser](https://stackoverflow.com/a/1732454/519360). Use one that already exists to exclude the HTML parts and then make your replacements within just the text nodes. – Adam Katz Apr 09 '19 at 17:19
  • @AdamKatz, I've followed the link you supplied and achieved enlightenment =) – mr_bulrathi Apr 09 '19 at 17:28

3 Answers

0

When you have a hammer, everything looks like a nail. You don't have to use regex. A simple state machine will do (assuming anything inside <> is an HTML tag).

# pos: current scan position in the string
# q1, q2: positions of the opening and closing quotation marks
s = ' How to transform "ordinary quotes" to «Guillemets» and " more <div><a href="link" target="_blank">some "bad" link</a>'
sl = list(s)
pos = 0
while True:
    tag_open = s.find('<', pos)
    q1 = s.find('"', pos)
    if q1 < 0:
        break   # no more quotation marks
    if 0 <= tag_open < q1:
        # the next quote comes after a tag opening: skip past that tag
        tag_close = s.find('>', tag_open)
        if tag_close < 0:
            break   # unterminated tag: leave the rest untouched
        pos = tag_close + 1
    else:
        q2 = s.find('"', q1 + 1)
        if q2 > 0 and (tag_open < 0 or q2 < tag_open):
            # both quotes lie before the next tag: replace the pair
            sl[q1] = '«'
            sl[q2] = '»'
            pos = q2 + 1
        else:
            pos = q1 + 1    # unpaired quote: leave it and move on
s = ''.join(sl)
print(s)

explanation:

 Scan your string, 
   If not inside tag, 
       find first and second quotation marks,
       replace accordingly, 
       continue scanning from the second quotation mark
   Else
       continue to end of tag
igrinis
-1

This works for me:

res = re.sub(r'(?:"([^>]*)")(?!>)', r'«\g<1>»', content)

From the docs:

In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
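
To make the quoted passage concrete, here is a small sketch (the sample string and variable names are made up for illustration). It shows that `\1` and `\g<1>` are interchangeable in a raw-string replacement, and that `\g<1>` only becomes necessary when a literal digit follows the group reference. Note that without the `r` prefix, `'\g<1>'` still works but Python warns about the invalid `\g` escape.

```python
import re

text = 'say "hi"'
pattern = r'"([^"]*)"'

# \1 and \g<1> refer to the same group when nothing ambiguous follows.
with_number = re.sub(pattern, r'«\1»', text)
with_g = re.sub(pattern, r'«\g<1>»', text)

# A literal digit right after the reference is where \g<1> matters:
# r'\g<1>0' means "group 1, then the character 0".
group_then_zero = re.sub(pattern, r'\g<1>0', text)

# r'\10' is read as "group 10", which this pattern does not have.
err = None
try:
    re.sub(pattern, r'\10', text)
except re.error as exc:
    err = exc

print(with_number, with_g, group_then_zero, err, sep='\n')
```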

Merig
  • [It won't work for tags with two or more attributes](https://regex101.com/r/XDgH4W/2). No need to use `\g<1>` as `\1` is enough with a raw string literal. – Wiktor Stribiżew Apr 07 '19 at 08:07
-1

Are you willing to do this in three passes: [a] swap out the quotes inside HTML; [b] swap remaining quotes for guillemets; [c] restore the quotes inside HTML?

Remember that lookaheads are costly before complaining about the speed of this.

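PLACEHOLDER = '\ue000'  # stand-in for the emoticon used originally; any string that cannot occur in the text works
first = re.sub(r'<.*?>', lambda x: x.group(0).replace('"', PLACEHOLDER), content)  # [a]
second = re.sub(r'"(.*?)"', r'«\1»', first)  # [b]
third = second.replace(PLACEHOLDER, '"')  # [c]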

Re Louis's comment:

first = re.sub(r'<.*?>', lambda x: re.sub(r'"', 'WILLSWAPSOON', x.group(0)), content)

There are scenarios where the above strategy will work. Maybe OP is working within one of them. Otherwise, if all of this fussing is too much, OP can head over to BeautifulSoup and start playing with it...
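
For comparison, a minimal sketch of the parser route suggested in the comments on the question. It uses the standard library's `html.parser` rather than BeautifulSoup (my substitution, to avoid a third-party dependency), and the names `GuillemetRewriter` and `to_guillemets` are hypothetical. It assumes that each pair of quotes sits inside a single text node, that it is acceptable for character references in text to be decoded and re-escaped (so `&nbsp;` comes back as a literal character), and that end tags may be re-emitted in lowercase.

```python
import re
from html import escape
from html.parser import HTMLParser

RAW_TEXT_TAGS = ('script', 'style')  # their content must not be rewritten


class GuillemetRewriter(HTMLParser):
    """Re-emit the input, turning "..." into «...» in text nodes only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.in_raw = False

    def handle_starttag(self, tag, attrs):
        # Emit the tag exactly as written, so attribute quotes stay untouched.
        self.out.append(self.get_starttag_text())
        if tag in RAW_TEXT_TAGS:
            self.in_raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)
        if tag in RAW_TEXT_TAGS:
            self.in_raw = False

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)

    def handle_pi(self, data):
        self.out.append('<?%s>' % data)

    def handle_data(self, data):
        if self.in_raw:
            self.out.append(data)  # script/style content passes through as-is
            return
        text = re.sub(r'"([^"]*)"', r'«\1»', data)
        # The parser decoded character references; re-escape &, < and >.
        self.out.append(escape(text, quote=False))


def to_guillemets(content):
    parser = GuillemetRewriter()
    parser.feed(content)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.out)


print(to_guillemets('<a href="link" target="_blank">some link</a> '
                    'How to transform "ordinary quotes" to «Guillemets»'))
```

Because the parser decides where tags start and end, the same function can be fed a full HTML fragment, plain text, or plain text with a few tags mixed in.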

  • Multiple issues in this answer. First, it is possible to use a better placeholder than an emoticon. For instance, `` are necessarily attribute delimiters but that's generally not true. For instance, ` – Louis Apr 10 '19 at 16:40
  • Third, it assumes that `>` necessarily marks the end of a tag, but that's also not generally true. Open a browser console and run `foo = document.createElement("a");` and `foo.setAttribute("q", ">")` and then check the value of `foo.outerHTML`. You'll get this in Chrome: `"<a q=">"></a>"`. You can also pass that string to `DOMParser` and it will parse it just fine. Ultimately, this answer requires that the input be from a subset of possible HTML inputs. – Louis Apr 10 '19 at 16:49