1

tl;dr version

I have a paragraph that may contain quotations (e.g. "blah blah", 'this one also', etc.). I need to replace these with LaTeX-style quotations (e.g. ``blah blah", `this also', etc.) using Python 3.0.

Background

I have a lot of plain-text files (more than 100). I need to build a single LaTeX document from their contents after some light text processing, and I am using Python 3.0 for this. Everything else (escape characters, sections, etc.) works, but I cannot get the quotation marks right.

I can find the pattern with a regex (as described here), but how do I replace it with the given pattern? I don't know how to use the re.sub() function in this case, because there may be multiple quoted passages in my string. There is this related question, but how do I implement it in Python?

Dexter

3 Answers

1

Design Considerations

  1. I've only considered regular "double quotes" and 'single quotes'. There may be other quotation marks (see this question).
  2. LaTeX closing quotes are also single quotes. We don't want to match a LaTeX double closing quote (e.g. ``LaTeX double-quote'') and mistake it for a single-quote pair around nothing.
  3. Contractions and possessive 's contain single quotes (e.g. don't, John's). These are characterised by alphabetic characters on both sides of the quote.
  4. Plural possessives have a single quote after the word (e.g. the actresses' roles).
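
To illustrate why considerations 3 and 4 matter, here is a minimal sketch of what happens without any context check. The helper name naive_single_quote is made up for this example and is not part of the solution; it only shows how the apostrophe in I'm gets paired with the opening quote of 'single'.

```python
import re

def naive_single_quote(text):
    # Hypothetical helper: pairs up every two single quotes with no context
    # check, so apostrophes are treated as quotation marks too.
    return re.sub(r"'(.*?)'", r"`\1'", text)

print(naive_single_quote("I'm a 'single' person"))
# prints: I`m a 'single' person
```

The apostrophe is consumed as an opening quote, and the real quotation is left untouched.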

Solution

import re

def texify_single_quote(in_string):
    in_string = ' ' + in_string #Hack (see explanations)
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]

def texify_double_quote(in_string):
    return re.sub(r'"(.*?)"', r"``\1''", in_string)

Testing

with open("test.txt", 'r') as fd_in, open("output.txt", 'w') as fd_out:
    for line in fd_in:

        # Check that the order of the two filters does not matter for this line
        assert texify_single_quote(texify_double_quote(line)) == texify_double_quote(texify_single_quote(line))

        line = texify_single_quote(line)
        line = texify_double_quote(line)
        fd_out.write(line)

Input file (test.txt):

# 'single', 'single', "double"
# 'single', "double", 'single'
# "double", 'single', 'single'
# "double", "double", 'single'
# "double", 'single', "double"
# I'm a 'single' person
# I'm a "double" person?
# Ownership for plural words; the peoples' 'rights'
# John's dog barked 'Woof!', and Fred's parents' 'loving' cat ran away.
# "A double-quoted phrase, with a 'single' quote inside"
# 'A single-quoted phrase with a "double quote" inside, with contracted words such as "don't"'
# 'A single-quoted phrase with a regular noun such as actresses' roles'

Output (output.txt):

# `single', `single', ``double''
# `single', ``double'', `single'
# ``double'', `single', `single'
# ``double'', ``double'', `single'
# ``double'', `single', ``double''
# I'm a `single' person
# I'm a ``double'' person?
# Ownership for plural words; the peoples' `rights'
# John's dog barked `Woof!', and Fred's parents' `loving' cat ran away.
# ``A double-quoted phrase, with a `single' quote inside''
# `A single-quoted phrase with a ``double quote'' inside, with contracted words such as ``don't'''
# `A single-quoted phrase with a regular noun such as actresses' roles'

(Note: the leading # characters were added only to stop the site from reformatting the output!)

Explanations

We will break down this Regex pattern, (?<=\s)'(?!')(.*?)':

  • Summary: (?<=\s)'(?!') deals with the opening single quote, whilst (.*?) deals with what's in the quotes.
  • (?<=\s)' is a positive look-behind and only matches single quotes that are preceded by whitespace (\s). This is important to prevent matching contracted words such as can't (considerations 3 and 4).
  • '(?!') is a negative look-ahead and only matches single quotes that are not followed by another single quote (consideration 2).
  • As mentioned in this answer, the pattern (.*?) captures what is between the quotation marks, whilst \1 in the replacement refers to that capture.
  • The "Hack" in_string = ' ' + in_string is there because the positive look-behind cannot match a single quote at the very start of the string (nothing precedes it). Adding a space to every input, then removing it on return by slicing (return re.sub(...)[1:]), solves this problem!
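
As a possible alternative to the padding hack, here is a sketch that swaps the positive look-behind for a negative one. This is a variation on the pattern above, not part of the original solution, and the function name texify_single_quote_no_pad is only illustrative.

```python
import re

def texify_single_quote_no_pad(in_string):
    # (?<!\S) succeeds after whitespace and also at the very start of the
    # string, so no leading space has to be added and sliced off again.
    return re.sub(r"(?<!\S)'(?!')(.*?)'", r"`\1'", in_string)

print(texify_single_quote_no_pad("'single' at the start, and don't touch John's dog"))
# prints: `single' at the start, and don't touch John's dog
```

Everything after the look-behind is unchanged, so contractions and LaTeX double closing quotes should be treated the same way as before.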
Jamie Phan
  • Thank you for such a nice explanation, however single quote function (texify_single_quote) is not working. :/ – Dexter Jan 24 '17 at 08:53
  • No worries! Can you tell me how it isn't working? Seems to work on my system just fine. – Jamie Phan Jan 24 '17 at 09:35
  • Ahh I see, I think it may be when we have a string like this: (This is my 'test' string, this is a ``double''). It could be that the single-quote ' after "test" and the single quotes after "double" are getting matched. Sorry, I got to read up more on Regex to find out an answer - will get back to you when I can. P.s. using brackets because I can't format the code block with all these quote marks everywhere! – Jamie Phan Jan 24 '17 at 10:38
  • Thanks anyway. +1 for your efforts :) – Dexter Jan 24 '17 at 11:03
  • Hmm, so I tried it on a text file (see Edits), and things seem to work fine if you just use the single filter before the double filter - not sure if that's O.K (but will continue investigating a more 'robust' solution) – Jamie Phan Jan 26 '17 at 05:17
  • Nvm! Found my bug in `texify_single_quotes()`, in the `for match in re.finditer(r"'(.*?)'", in_string):` I was still performing the `re.sub` on the **entire** string and not the single `match.group(1)`. Fixed! Now it should work regardless of order – Jamie Phan Jan 26 '17 at 05:28
  • @Dexter, Okay, I'm quite certain this should work now (extra minor bugs fixed - will write up details/explanations) – Jamie Phan Jan 26 '17 at 05:40
1

Regexes are great for some tasks, but they are still limited (read this for more info). Writing a parser for this task seems more prone to errors.

I created a simple function for this task and added comments. If there are still questions about the implementation, please ask.

The code (online version here):

the_text = '''
This is my \"test\" String
This is my \'test\' String
This is my 'test' String
This is my \"test\" String which has \"two\" quotes
This is my \'test\' String which has \'two\' quotes
This is my \'test\' String which has \"two\" quotes
This is my \"test\" String which has \'two\' quotes
'''


def convert_quotes(txt, quote_type):
    # LaTeX opens a double quote with `` and a single quote with `
    opening = '``' if quote_type == '"' else '`'

    # find the positions of all quotes of this type
    quotes_pos = []
    idx = -1

    while True:
        idx = txt.find(quote_type, idx+1)
        if idx == -1:
            break
        quotes_pos.append(idx)

    # quotes are assumed to come in pairs
    if len(quotes_pos) % 2 == 1:
        raise ValueError('bad number of quotes of type %s' % quote_type)

    # replace each opening quote with its LaTeX equivalent
    new_txt = []
    last_pos = -1

    for i, pos in enumerate(quotes_pos):
        # ignore the odd (closing) quotes - we don't replace them
        if i % 2 == 1:
            continue
        new_txt.append(txt[last_pos+1:pos])
        new_txt.append(opening)
        last_pos = pos

    # append the last part of the string
    new_txt.append(txt[last_pos+1:])

    return ''.join(new_txt)

print(convert_quotes(convert_quotes(the_text, '\''), '"'))

prints out:

This is my ``test" String
This is my `test' String
This is my `test' String
This is my ``test" String which has ``two" quotes
This is my `test' String which has `two' quotes
This is my `test' String which has ``two" quotes
This is my ``test" String which has `two' quotes

Note: parsing nested quotes is ambiguous.

For example, the string "bob said: "alice said: hello"" reads as a nested quotation in natural language,

BUT:

the string "bob said: hi" and "alice said: hello" is not nested.

If this is your case, you may want to first convert the nested quotes into a different quote character, or use parentheses () to disambiguate them.
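
For flat (non-nested), balanced quotes, the same open/close alternation can also be sketched with re.sub and a replacement function. This is a different technique from the index-based loop above, and the name alternate_quotes and its parameters are made up for this example. Note that it also rewrites the closing quote, which the function above leaves as it is.

```python
import re
from itertools import cycle

def alternate_quotes(txt, quote_char, opening, closing):
    # Hypothetical helper: replaces successive occurrences of quote_char
    # with opening, closing, opening, closing, ...
    if txt.count(quote_char) % 2 == 1:
        raise ValueError('bad number of quotes of type %s' % quote_char)
    tokens = cycle((opening, closing))
    # re.sub calls the function once per match, in left-to-right order
    return re.sub(re.escape(quote_char), lambda match: next(tokens), txt)

print(alternate_quotes('This is my "test" String which has "two" quotes', '"', '``', "''"))
# prints: This is my ``test'' String which has ``two'' quotes
```

Like the original function, this assumes every quote character is a quotation mark, so text with apostrophes would need to be handled separately before using it for single quotes.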

ShmulikA
0

I've searched countless webpages trying to find a simple answer to this. Almost all the solutions I've seen assume that quotes come in pairs. That can be a problem when writing lengthy prose: an extended quotation may have an opening quote but no closing one. Single quotes have the same problem because of apostrophes. Beyond that, my quotes can span multiple lines. Maybe I'm naïve, but here's how I broke it down. This only applies to English.

  1. Replace each double quote at the start of a word with a double backtick.
  2. Once those are replaced, change all other double quotes to two single quotes.
  3. Replace each single quote at the start of a word with a single backtick. All other single quotes, including apostrophes within words, can stay unchanged.

I think this is simple and elegant, but am I missing any edge cases?

Here is my fix_quotes function; it seems to work on all the cases presented above.

import re

def fix_quotes(s):
    """
    Replace single and double quotes with their corresponding LaTeX-style
    equivalents, except for apostrophes which are left unchanged.
    """
    # Replace opening and closing double quotes with LaTeX-style equivalents
    s = re.sub(r'\B"\b', '``', s).replace('"',"''")
    # Replace opening single quote with LaTeX-style equivalents
    s = re.sub(r"\B'\b", '`', s)
    
    return s
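
On the edge-case question, here is a small check of two inputs. The function is repeated in compact form only so the snippet runs on its own. One case that looks like it slips through is a word that starts with an apostrophe ('tis, '90s), because the regex cannot tell it apart from an opening quote.

```python
import re

def fix_quotes(s):
    # Same two substitutions as in the function above, repeated so this runs alone
    s = re.sub(r'\B"\b', '``', s).replace('"', "''")
    return re.sub(r"\B'\b", '`', s)

# An opening double quote with no closing partner is still converted
print(fix_quotes('"An extended quote that is never closed'))
# prints: ``An extended quote that is never closed

# Word-initial apostrophes are treated as opening single quotes
print(fix_quotes("'tis the '90s"))
# prints: `tis the `90s
```

If such words appear in the text, they would probably need to be protected or fixed by hand after the conversion.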