
I'd like to use regex to remove the apostrophes in common contractions. For example, I'd like to map

test1 test2 can't test3 test4 won't

to

test1 test2 cant test3 test4 wont

My current naive approach is just to manually sub all the contractions I want to use.

import re

def remove_contraction_apostrophes(input):
    text = re.sub(r'can.t', 'cant', input)
    text = re.sub(r'isn.t', 'isnt', text)
    text = re.sub(r'won.t', 'wont', text)
    text = re.sub(r'aren.t', 'arent', text)
    return text

(I'm using can.t because in the text I am parsing, it can use multiple characters for the apostrophe, like can't and can`t).

This is pretty unwieldy as I want to add all the common contractions. Is there a better way of doing this with regex, where I could construct a regex of this type by inputting a list of contractions? Or am I better off just listing them all like this?

It may also be possible to just work with the endings, like 'll, n't etc., but I'm afraid of catching other things besides contractions with this.

Wiktor Stribiżew
Christian Doucette
  • I actually like your current idea, but I would prefer using a library that can already do this, so that you don't miss edge cases. – Tim Biegeleisen Dec 14 '20 at 05:06

4 Answers


I would do something like this.

import re

def remove_contraction_apostrophes(input):
    # Drop a ' or ` that sits between two runs of letters.
    text = re.sub(r"([A-Za-z]+)['`]([A-Za-z]+)", r'\1\2', input)
    return text

print(remove_contraction_apostrophes("can't"))

  1. It matches one or more letters, [A-Za-z]+
  • square brackets mean "any one of these characters"; the plus means "one or more of the preceding item"
  2. followed by one of ' or `
  3. followed by one or more letters

and replaces it with

  1. what was found in the first set of parentheses, \1
  • \1 returns the text that was matched by the first ([A-Za-z]+)
  2. followed by what was found in the second set of parentheses, \2

If you have other characters, such as �, and you know what they all are, you can place them within the square brackets. This line will match any of those characters, and also allows for an optional whitespace character on either side of the apostrophe:

text = re.sub(r"([A-Za-z]+)\s?['`�]\s?([A-Za-z]+)", r'\1\2', input)
  • \s : Any white space
  • ? : 0 or 1 of the previous

You could also use [^A-Za-z0-9]

text = re.sub(r'([A-Za-z]+)[^A-Za-z0-9]([A-Za-z]+)', r'\1\2', input)

This matches one or more letters, followed by any single character that isn't an ASCII letter or digit, followed by one or more letters. Be aware that this class also matches spaces and sentence punctuation, so it will join ordinary adjacent words. If you use it, and especially if you add the \s? in there, I would recommend excluding whitespace and ., ?, !, : ... from the class, making your regex '([A-Za-z]+)\s?[^A-Za-z0-9\s.!?:]\s?([A-Za-z]+)', because otherwise it will match things like the ends of sentences, which are not contractions.


This will match any contraction, no matter how many letters there are before or after the apostrophe. You will need to put all the different apostrophes that you have within the ['`] block.
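To make the difference between the two character classes concrete, here is a minimal sketch that runs both patterns from this answer on the same sentence. The names EXPLICIT, NEGATED and the sample sentence are made up for illustration.

```python
import re

# Explicit apostrophe class from the answer; extend it with other characters as needed.
EXPLICIT = re.compile(r"([A-Za-z]+)['`]([A-Za-z]+)")
# Negated class: any single character that is not an ASCII letter or digit.
NEGATED = re.compile(r"([A-Za-z]+)[^A-Za-z0-9]([A-Za-z]+)")

sample = "I can't go, it won`t work"

explicit_result = EXPLICIT.sub(r"\1\2", sample)
negated_result = NEGATED.sub(r"\1\2", sample)

print(explicit_result)  # I cant go, it wont work
print(negated_result)   # Ican'tgo, itwon`twork
```

In this sample the unrestricted negated class matches the spaces first, so it glues neighbouring words together and leaves the apostrophes in place. Listing the apostrophe characters explicitly, or excluding whitespace and punctuation from the negated class, avoids that.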

Sam
  • Thanks for the response. One thing: other characters than ' and ` are used for apostrophes. I'm parsing from a website with BeautifulSoup and some of the characters used as apostrophes are showing up as � characters. I had some trouble matching these directly with regex, which was what I was trying to avoid by using 'can.t' type matches. If I figure out how to fix that I'll definitely use this solution. – Christian Doucette Dec 14 '20 at 17:18
  • @ChristianDoucette You could try replacing ['`] with something like `[^A-Za-z0-9]`. The `^` inside of `[]` means anything but, so it's saying anything but the characters A through Z, a through z, or 0 through 9. You can place any other characters you want excluded as well in there – Sam Dec 14 '20 at 17:21
  • @ChristianDoucette or just place � inside the original `[]` in the answer – Sam Dec 14 '20 at 17:23
  • I was getting some weird behavior with the �, but I changed my encoding type so I'm no longer getting it. So, your solution works really well now. Just one more thing: how can I get it to match accented characters as well? I think re.UNICODE can be used to match these, but not sure how to fit that into the expression. – Christian Doucette Dec 14 '20 at 20:35
  • What's an accented character – Sam Dec 14 '20 at 20:47
  • Characters like á, é, ñ, Ö - any letters with an accent symbol on top. – Christian Doucette Dec 15 '20 at 01:00
  • @ChristianDoucette Apparently it's `[A-Za-zÀ-ÖØ-öø-ÿ]` [source](https://stackoverflow.com/a/26900132/6331353) – Sam Dec 15 '20 at 03:11
  • � is usually a placeholder for an invalid character; it's not uncommon for BS4 to get the character set wrong (usually because the web site incorrectly declares a different character set than the one it's actually using). You can try a library like https://github.com/LuminosoInsight/python-ftfy to maybe fix it up for you. – tripleee Dec 15 '20 at 04:39

Use look arounds to check for letters either side of an "apostrophe":

text = re.sub(r"(?<=\w)[‘`’'](?=\w)", '', input)

Look arounds assert, without consuming, preceding/following input.

---

import re
input = "I can’t understand what's wrong"
text = re.sub(r"(?<=\w)[‘`’'](?=\w)", '', input)
print(text)

Produces

I cant understand whats wrong
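As a further sketch of what the lookarounds do and do not protect, the snippet below runs the same pattern on a made-up sentence containing a trailing possessive, quotation marks and an accented letter. It assumes Python 3, where \w on a str pattern matches Unicode letters by default; the variable names are illustrative only.

```python
import re

# Same pattern as above: an apostrophe-like character with a word character on both sides.
APOSTROPHE_BETWEEN_WORD_CHARS = re.compile(r"(?<=\w)[‘`’'](?=\w)")

sample = "the dogs' bowl isn’t 'here' at the café's door"
cleaned = APOSTROPHE_BETWEEN_WORD_CHARS.sub("", sample)

print(cleaned)  # the dogs' bowl isnt 'here' at the cafés door
```

Apostrophes at the edge of a word (dogs', 'here') are kept, because one side is not a word character. A possessive such as café's is still stripped, so this pattern alone cannot tell contractions apart from other word-internal apostrophes.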


Bohemian

Regular expressions let you easily list a set of alternatives.

import re

def remove_contraction_apostrophes(input):
    text = re.sub(r'\b(are|ca|is|wo)n.t\b', r'\1nt', input)
    text = re.sub(r'\b(I|[hw]e|it|she|they|you).ll\b', r'\1ll', text)
    return text

In re.sub, the back reference \1 recalls the text which matched the first parenthesized subexpression, in the replacement too. (\2 gets the second, etc.)

Notice also the addition of word-boundary anchors \b to prevent the regex from matching in the middle of a longer word, like volcanity.
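If you would rather generate the alternation from a plain list of contractions, as the question asks, one possible sketch follows. The word list, the helper name build_contraction_regex and the APOSTROPHE class are all assumptions made for illustration. In particular, it swaps the `.` used above for `[^\w\s]`, so that a letter in the apostrophe position (as in "canst") does not match.

```python
import re

# Hypothetical word list for illustration; each entry must contain exactly one
# ASCII apostrophe marking the position to strip.
CONTRACTIONS = ["can't", "isn't", "won't", "aren't", "I'll", "they'll"]

# Assumption: the apostrophe position may hold any single character that is
# neither a word character nor whitespace (', `, ’, and so on).
APOSTROPHE = r"[^\w\s]"

def build_contraction_regex(contractions):
    """Compile one regex that matches any listed contraction as a whole word."""
    alternatives = []
    for word in contractions:
        head, tail = word.split("'")
        alternatives.append(re.escape(head) + APOSTROPHE + re.escape(tail))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)

CONTRACTION_RE = build_contraction_regex(CONTRACTIONS)

def remove_contraction_apostrophes(text):
    # Strip the apostrophe-like character from each matched contraction only.
    return CONTRACTION_RE.sub(lambda m: re.sub(APOSTROPHE, "", m.group(0)), text)

result = remove_contraction_apostrophes(
    "test1 can't test2 won`t, but Sam's 'quoted' text stays"
)
print(result)  # test1 cant test2 wont, but Sam's 'quoted' text stays
```

Because only listed words are matched, other apostrophes (possessives, quotation marks) are left alone, at the cost of having to maintain the list.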

tripleee

You can simply remove every apostrophe like this:

import re

t = "test1 test2 can't test3 test4 won't"
print(re.sub("'", "", t))
  • Thanks for the response. But one thing I should have said in the question is that there are other apostrophes in the text that I want to keep. – Christian Doucette Dec 14 '20 at 17:11