I'm trying to clean a text so that it keeps only letters, numbers and the most common punctuation marks. For example, I sometimes have '''words''' or ''words'', so I want to strip those repeated single quotes. So far I've chosen to use two regexes:
import re
tqre = re.compile("'''[^']*'''")  # for triple quotes
dqre = re.compile("''[^']*''")  # for "double" quotes (two single quotes in a row)
Then I strip the quotes from each match:
res1 = tqre.sub(self.quoteExtract, text)
res2 = dqre.sub(self.quoteExtract, res1)
where:
def quoteExtract(self, match):
    # Return the matched text without its leading and trailing single quotes
    return match.group().strip("'")
It looks like it works well for triple quotes, but many doubled single quotes still get through; it seems they are not matched. Is it because they are not really single quotes but some lookalike character? Is there another way to handle them?
Example: in the line
* ''Esquisse d'une grammaire comparée de l'arménien classique'', 1903.
the regex finds no match.
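Here is a minimal reproduction of the failing case, in case it helps. It suggests the quotes are ordinary apostrophes and that the problem may instead be the `[^']*` part, which cannot cross the apostrophe in "d'une". The second pattern, `multi_quote`, is only a hypothetical alternative I could imagine (the name and the approach are my own assumption): it removes any run of two or more apostrophes, paired or not, and leaves single apostrophes alone.

```python
import re

sample = "* ''Esquisse d'une grammaire comparée de l'arménien classique'', 1903."

# Original pattern: [^']* stops at the apostrophe in "d'une",
# so the opening '' is never followed by a closing '' and nothing matches.
dqre = re.compile("''[^']*''")
print(dqre.search(sample))  # None

# Hypothetical alternative (an assumption, not the original approach):
# delete every run of two or more consecutive apostrophes.
multi_quote = re.compile("'{2,}")
cleaned = multi_quote.sub("", sample)
print(cleaned)  # * Esquisse d'une grammaire comparée de l'arménien classique, 1903.
```

Note that this sketch also removes unpaired runs such as a stray '' with no closing partner, which may or may not be what is wanted.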