Design Considerations
- I've only considered the regular "double-quotes" and 'single-quotes'. There may be other quotation marks (see this question).
- LaTeX end-quotes are also single quotes. We don't want to capture a LaTeX double end-quote (e.g. ``LaTeX double-quote'') and mistake it for a single-quoted phrase (around nothing).
- Word contractions and the possessive 's contain single quotes (e.g. don't, John's). These are characterised by alphabetic characters on both sides of the quote.
- Possessives of regular plural nouns have the single quote after the word (e.g. the actresses' roles).
Solution
import re

def texify_single_quote(in_string):
    in_string = ' ' + in_string  # Hack (see explanations)
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]

def texify_double_quote(in_string):
    return re.sub(r'"(.*?)"', r"``\1''", in_string)
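As a quick illustration of how the two functions behave on the cases listed under Design Considerations, here is a minimal sketch. The wrapper name texify_quotes is hypothetical (it is not part of the answer), and the two functions are repeated only so the snippet runs on its own.

```python
import re

# Repeated from the answer so that this snippet is self-contained.
def texify_single_quote(in_string):
    in_string = ' ' + in_string  # Hack: lets the look-behind match at line start
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]

def texify_double_quote(in_string):
    return re.sub(r'"(.*?)"', r"``\1''", in_string)

def texify_quotes(text):
    """Hypothetical convenience wrapper: convert single, then double quotes."""
    return texify_double_quote(texify_single_quote(text))

# Contraction is left alone, the quoted word is converted.
print(texify_quotes("I'm a 'single' person"))       # I'm a `single' person
# Possessive 's is left alone, double quotes are converted.
print(texify_quotes('John\'s dog said "Woof!"'))    # John's dog said ``Woof!''
# An existing LaTeX double end-quote is not mistaken for a single quote.
print(texify_quotes("``LaTeX double-quote''"))      # ``LaTeX double-quote''
```

Only the opening single quote is ever rewritten (to a backtick); the closing one is already the character LaTeX expects.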
Testing
with open("test.txt", 'r') as fd_in, open("output.txt", 'w') as fd_out:
    for line in fd_in:
        # Test for commutativity: the order of the two substitutions must not matter
        assert texify_single_quote(texify_double_quote(line)) == texify_double_quote(texify_single_quote(line))
        line = texify_single_quote(line)
        line = texify_double_quote(line)
        fd_out.write(line)
Input file (test.txt):
# 'single', 'single', "double"
# 'single', "double", 'single'
# "double", 'single', 'single'
# "double", "double", 'single'
# "double", 'single', "double"
# I'm a 'single' person
# I'm a "double" person?
# Ownership for plural words; the peoples' 'rights'
# John's dog barked 'Woof!', and Fred's parents' 'loving' cat ran away.
# "A double-quoted phrase, with a 'single' quote inside"
# 'A single-quoted phrase with a "double quote" inside, with contracted words such as "don't"'
# 'A single-quoted phrase with a regular noun such as actresses' roles'
Output (output.txt):
# `single', `single', ``double''
# `single', ``double'', `single'
# ``double'', `single', `single'
# ``double'', ``double'', `single'
# ``double'', `single', ``double''
# I'm a `single' person
# I'm a ``double'' person?
# Ownership for plural words; the peoples' `rights'
# John's dog barked `Woof!', and Fred's parents' `loving' cat ran away.
# ``A double-quoted phrase, with a `single' quote inside''
# `A single-quoted phrase with a ``double quote'' inside, with contracted words such as ``don't'''
# `A single-quoted phrase with a regular noun such as actresses' roles'
(Note: the leading # comment markers were prepended only to stop the post's formatter from altering the quotes in the listings above.)
Explanations
We will break down this regex pattern, (?<=\s)'(?!')(.*?)':
- Summary: (?<=\s)'(?!') deals with the opening single-quote, whilst (.*?) deals with what's in the quotes.
  - (?<=\s)' uses a positive look-behind and only matches single-quotes that have a whitespace character (\s) preceding them. This is important to prevent matching contracted and possessive words such as can't (considerations 3 and 4).
  - '(?!') uses a negative look-ahead and only matches single-quotes that are not followed by another single-quote (consideration 2).
- As mentioned in this answer, the pattern (.*?) lazily captures what's in between the quotation marks, whilst \1 in the replacement string refers to that capture.
- The "Hack" in_string = ' ' + in_string is there because the positive look-behind cannot match a single quote at the very beginning of the line, where no character precedes it. Adding a space to every line (then removing it on return with slicing, return re.sub(...)[1:]) solves this problem!
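If the padding hack feels awkward, one possible alternative (not the approach used in this answer) is to replace the positive look-behind (?<=\s) with the negative look-behind (?<!\S), which reads "not preceded by a non-whitespace character" and therefore also succeeds at the very start of the string. The sketch below assumes that substitution; the function name texify_single_quote_no_pad is made up for illustration.

```python
import re

# (?<!\S): true after whitespace AND at the start of the string,
# so no leading space has to be added and sliced off again.
SINGLE_QUOTE = re.compile(r"(?<!\S)'(?!')(.*?)'")

def texify_single_quote_no_pad(in_string):
    """Hypothetical variant of texify_single_quote without the padding hack."""
    return SINGLE_QUOTE.sub(r"`\1'", in_string)

samples = [
    "'single' at the start",      # quote at the beginning of the line
    "I'm a 'single' person",      # contraction must stay untouched
    "the peoples' 'rights'",      # plural possessive must stay untouched
    "``LaTeX double-quote''",     # LaTeX end-quote must stay untouched
]
for sample in samples:
    print(texify_single_quote_no_pad(sample))
```

The rest of the pattern is unchanged, so the look-ahead and the lazy capture behave exactly as explained above.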