
I am currently running an STM (structural topic model) on a series of articles from the French newspaper Le Monde. The model works just fine, but I have a problem with the pre-processing of the text. I'm currently using the quanteda package and the tm package for things like removing words, removing numbers, etc. There's only one thing, though, that doesn't seem to work. As some of you might know, in French the masculine definite article le contracts to l' before vowels. I've tried to remove l' (and similar forms like d') as words with removeWords:

lmt67 <- removeWords(lmt67, c( "l'","d'","qu'il", "n'", "a", "dans"))

but it only works with words that stand on their own, not with articles attached to the following word, as in l'arbre (the tree). Frustrated, I've tried a simple gsub:

lmt67 <- gsub("l'","",lmt67)

but that doesn't seem to work either. What's a better way to do this, ideally via a c(...) vector so that I can pass a series of expressions all at once?
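For reference, this is a minimal sketch of how I imagined collapsing a c(...) vector of patterns into a single alternation regex for gsub (the prefixes here are illustrative, and this assumes the apostrophe in the pattern is the same character as in the text):

```r
# collapse several elision prefixes into one regex: "l'|d'|qu'|n'|s'"
prefixes <- c("l'", "d'", "qu'", "n'", "s'")
pattern <- paste(prefixes, collapse = "|")
gsub(pattern, "", "l'arbre et d'autres qu'on voit")
# [1] "arbre et autres on voit"
```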

Just for context, lmt67 is a "large character" vector with 30,000 elements/articles, obtained by applying the texts() function to data imported from txt files.

Thanks to anyone that will want to help me.

Ken Benoit
kouta
  • you probably need to escape the `'` character with a backslash \ like so: `gsub("\'","", c( "l'","d'","qu'il", "n'", "a", "dans"))` – Joseph Wood Mar 01 '18 at 23:30
  • I'm not sure I understand. The ' character is part of what I want to remove. Also, your gsub removes stuff from those expressions, instead of removing those expressions from the file. Am I right? Sorry if what I said doesn't make sense, I'm a beginner in this field. – kouta Mar 01 '18 at 23:35
  • ok, I see what you're saying. So, something like gsub ("l\'","",lm67) should work... – kouta Mar 01 '18 at 23:39
  • in case what I wrote above wasn't clear, I'll rewrite it with spaces between the characters gsub( " l \ ' " , " ", lm67) – kouta Mar 01 '18 at 23:43
  • I think that will work, however I'm not sure why you need the or operator `|`. – Joseph Wood Mar 01 '18 at 23:44
  • that's the letter L :o) – kouta Mar 01 '18 at 23:45
  • The apostrophe is NOT a regex meta-character, so should NOT require escaping. See `?regex`. This should work: `gsub("l'", "", lmt67)` if and only if that is really the same character as an ASCII single quote. – IRTFM Mar 01 '18 at 23:47
  • to 42 note that what you're suggesting is precisely the command that didn't work. – kouta Mar 01 '18 at 23:48
  • Yes. I think the problem is not what Joseph Wood thinks it is. I suspect it's a different character. – IRTFM Mar 01 '18 at 23:49
  • @42, thanks for the save... I always get confused with these regex manipulations. I normally fall back on the adage coined by joran [here](https://stackoverflow.com/q/14879204/4408538). "_When in doubt, keep adding backslashes until it works_" – Joseph Wood Mar 01 '18 at 23:50
  • @42 interesting....so what's your suggestion? are there any other functions that work like gsub? – kouta Mar 01 '18 at 23:53
  • `gsub` is perfectly fine. You need to search for the correct character. – IRTFM Mar 01 '18 at 23:57

2 Answers


Here's a scrape from the current page at Le Monde's website. Notice that the apostrophe they use is not the same character as the ASCII single quote "'":

text <- "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."

It has a slight slant and is not actually "straight down" when viewed. You need to copy that exact character into your gsub command:

sub("l’", "", text)
# [1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
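As a side note (a sketch, not part of this answer's tested code): if you want to handle both the curly and the straight apostrophe, and several elision prefixes, in one pass, a character class can do it:

```r
# match l, d, s, n, or qu at a word boundary, followed by a straight (') or curly (’) apostrophe
txt2 <- "l’arbre et l'autre, d’urgence qu'on voit"
gsub("\\b(qu|[ldsn])['’]", "", txt2)
# [1] "arbre et autre, urgence on voit"
```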
IRTFM
  • you're totally right! I was probably using the wrong character!!! Now I'll try again with quanteda and see, but I suspect that this will solve it once and for all... – kouta Mar 01 '18 at 23:58
  • nope, just checked my txt files...the conversion I get makes them all into a common ' bummer...I was really hoping that was the solution – kouta Mar 02 '18 at 00:00
  • You need to post an example where the suggestions fail. At the moment this is an "it does work"-question where "it" is not defined. – IRTFM Mar 02 '18 at 00:12
  • I'm trying with the backslash first. I'm running my model and seeing what happens. It'll be a couple more minutes. – kouta Mar 02 '18 at 00:13

I'll outline two ways to do this using quanteda and quanteda-related tools. First, let's define a slightly longer text, with more prefix cases for French. Notice that it includes the curly apostrophe (’) as well as the ASCII 39 straight apostrophe (').

txt <- c(doc1 = "M. Trump, lors d’une réunion convoquée d’urgence à la Maison Blanche, 
                 n’en a pas dit mot devant la presse. En réalité, il s’agit d’une 
                 mesure essentiellement commerciale de ce pays qui l'importe.", 
         doc2 = "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme 
                 successeur Jordi Sanchez, partisan de l’indépendance catalane, 
                 actuellement en prison pour sédition.")

The first method uses pattern matches for the ASCII 39 apostrophe plus its Unicode variants, matched through the Unicode category "Pf" ("Punctuation, Final quote"). Note that quanteda does its best to normalize quotes at the tokenization stage: see, for instance, "l'indépendance" in the second document, where the curly apostrophe has become the ASCII one.
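As a quick illustration of what that "Pf" property class matches, base R's gsub understands the same \p{Pf} syntax when perl = TRUE (a small demo, separate from the stringi-based code used in this answer):

```r
x <- c("l'arbre", "l’arbre", "d’urgence")   # straight and curly apostrophes
gsub("[lsd]['\\p{Pf}]", "", x, perl = TRUE)
# [1] "arbre"   "arbre"   "urgence"
```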

The second way below uses a French part-of-speech tagger integrated with quanteda that allows similar selection after recognizing and separating the prefixes, and then removing determinants (among other POS).

1. quanteda tokens

toks <- tokens(txt, remove_punct = TRUE)
# remove stopwords
toks <- tokens_remove(toks, stopwords("french"))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M"               "Trump"           "lors"            "d'une"           "réunion"        
# [6] "convoquée"       "d'urgence"       "à"               "la"              "Maison"         
# [11] "Blanche"         "n'en"            "a"               "pas"             "dit"            
# [16] "mot"             "devant"          "la"              "presse"          "En"             
# [21] "réalité"         "il"              "s'agit"          "d'une"           "mesure"         
# [26] "essentiellement" "commerciale"     "de"              "ce"              "pays"           
# [31] "qui"             "l'importe"      
# 
# doc2 :
# [1] "Réfugié"           "à"                 "Bruxelles"         "l'indépendantiste"
# [5] "catalan"           "a"                 "désigné"           "comme"            
# [9] "successeur"        "Jordi"             "Sanchez"           "partisan"         
# [13] "de"                "l'indépendance"    "catalane"          "actuellement"     
# [17] "en"                "prison"            "pour"              "sédition"   

Then, we apply a pattern to match l', s', or d' (with either apostrophe variant), using a regular expression replacement on the types (the unique tokens):

toks <- tokens_replace(
    toks, 
    types(toks), 
    stringi::stri_replace_all_regex(types(toks), "[lsd]['\\p{Pf}]", "")
)
# tokens from 2 documents.
# doc1 :
# [1] "M"               "Trump"           "lors"            "une"             "réunion"        
# [6] "convoquée"       "urgence"         "à"               "la"              "Maison"         
# [11] "Blanche"         "n'en"            "a"               "pas"             "dit"            
# [16] "mot"             "devant"          "la"              "presse"          "En"             
# [21] "réalité"         "il"              "agit"            "une"             "mesure"         
# [26] "essentiellement" "commerciale"     "de"              "ce"              "pays"           
# [31] "qui"             "importe"        
# 
# doc2 :
# [1] "Réfugié"         "à"               "Bruxelles"       "indépendantiste" "catalan"        
# [6] "a"               "désigné"         "comme"           "successeur"      "Jordi"          
# [11] "Sanchez"         "partisan"        "de"              "indépendance"    "catalane"       
# [16] "actuellement"    "En"              "prison"          "pour"            "sédition" 

From the resulting toks object you can form a dfm and then proceed to fit the STM.
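As a sketch of that last step (assuming the stm package is installed; K = 20 is an arbitrary choice here), quanteda's convert() reshapes a dfm into the list format that stm() expects:

```r
library(quanteda)
library(stm)

dfmat <- dfm(toks)                       # document-feature matrix from the tokens
stm_input <- convert(dfmat, to = "stm")  # list with $documents, $vocab, $meta
mod <- stm(documents = stm_input$documents,
           vocab = stm_input$vocab,
           K = 20)
```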

2. using spacyr

This will involve more sophisticated part-of-speech tagging and then converting the tagged object into quanteda tokens. It requires that you first install Python, spaCy, and the French language model. (See https://spacy.io/usage/models.)

library(spacyr)
spacy_initialize(model = "fr", python_executable = "/anaconda/bin/python")
# successfully initialized (spaCy Version: 2.0.1, language model: fr)

toks <- spacy_parse(txt, lemma = FALSE) %>%
    as.tokens(include_pos = "pos") 
toks
# tokens from 2 documents.
# doc1 :
# [1] "M./NOUN"                   "Trump/PROPN"               ",/PUNCT"                  
# [4] "lors/ADV"                  "d’/PUNCT"                  "une/DET"                  
# [7] "réunion/NOUN"              "convoquée/VERB"            "d’/ADP"                   
# [10] "urgence/NOUN"              "à/ADP"                     "la/DET"                   
# [13] "Maison/PROPN"              "Blanche/PROPN"             ",/PUNCT"                  
# [16] "\n                 /SPACE" "n’/VERB"                   "en/PRON"                  
# [19] "a/AUX"                     "pas/ADV"                   "dit/VERB"                 
# [22] "mot/ADV"                   "devant/ADP"                "la/DET"                   
# [25] "presse/NOUN"               "./PUNCT"                   "En/ADP"                   
# [28] "réalité/NOUN"              ",/PUNCT"                   "il/PRON"                  
# [31] "s’/AUX"                    "agit/VERB"                 "d’/ADP"                   
# [34] "une/DET"                   "\n                 /SPACE" "mesure/NOUN"              
# [37] "essentiellement/ADV"       "commerciale/ADJ"           "de/ADP"                   
# [40] "ce/DET"                    "pays/NOUN"                 "qui/PRON"                 
# [43] "l'/DET"                    "importe/NOUN"              "./PUNCT"                  
# 
# doc2 :
# [1] "Réfugié/VERB"              "à/ADP"                     "Bruxelles/PROPN"          
# [4] ",/PUNCT"                   "l’/PRON"                   "indépendantiste/ADJ"      
# [7] "catalan/VERB"              "a/AUX"                     "désigné/VERB"             
# [10] "comme/ADP"                 "\n                 /SPACE" "successeur/NOUN"          
# [13] "Jordi/PROPN"               "Sanchez/PROPN"             ",/PUNCT"                  
# [16] "partisan/VERB"             "de/ADP"                    "l’/DET"                   
# [19] "indépendance/ADJ"          "catalane/ADJ"              ",/PUNCT"                  
# [22] "\n                 /SPACE" "actuellement/ADV"          "en/ADP"                   
# [25] "prison/NOUN"               "pour/ADP"                  "sédition/NOUN"            
# [28] "./PUNCT"  

Then we can use the default glob-matching to remove the parts of speech in which we are probably not interested, including the newline:

toks <- tokens_remove(toks, c("*/DET", "*/PUNCT", "\n*", "*/ADP", "*/AUX", "*/PRON"))
toks
# doc1 :
# [1] "M./NOUN"             "Trump/PROPN"         "lors/ADV"            "réunion/NOUN"        "convoquée/VERB"     
# [6] "urgence/NOUN"        "Maison/PROPN"        "Blanche/PROPN"       "n’/VERB"             "pas/ADV"            
# [11] "dit/VERB"            "mot/ADV"             "presse/NOUN"         "réalité/NOUN"        "agit/VERB"          
# [16] "mesure/NOUN"         "essentiellement/ADV" "commerciale/ADJ"     "pays/NOUN"           "importe/NOUN"       
# 
# doc2 :
# [1] "Réfugié/VERB"        "Bruxelles/PROPN"     "indépendantiste/ADJ" "catalan/VERB"        "désigné/VERB"       
# [6] "successeur/NOUN"     "Jordi/PROPN"         "Sanchez/PROPN"       "partisan/VERB"       "indépendance/ADJ"   
# [11] "catalane/ADJ"        "actuellement/ADV"    "prison/NOUN"         "sédition/NOUN" 

Then we can remove the tags, which you probably don't want in your STM (though you could keep them if you prefer).

## remove the tags
toks <- tokens_replace(toks, types(toks), 
                       stringi::stri_replace_all_regex(types(toks), "/[A-Z]+$", ""))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M."              "Trump"           "lors"            "réunion"         "convoquée"      
# [6] "urgence"         "Maison"          "Blanche"         "n’"              "pas"            
# [11] "dit"             "mot"             "presse"          "réalité"         "agit"           
# [16] "mesure"          "essentiellement" "commerciale"     "pays"            "importe"        
# 
# doc2 :
# [1] "Réfugié"         "Bruxelles"       "indépendantiste" "catalan"         "désigné"        
# [6] "successeur"      "Jordi"           "Sanchez"         "partisan"        "indépendance"   
# [11] "catalane"        "actuellement"    "prison"          "sédition"  

From there, you can use the toks object to form your dfm and fit the model.

Ken Benoit
  • This is great, thank you so much! Also, I think that it has to do with removing the capital L' as well, since I had forgotten to do that, and I believe that the stemming got in the way. – kouta Mar 02 '18 at 16:31