I'm trying to pull a book using the Gutenberg library and then remove French stopwords. I've been able to do this accurately in English by doing this:
twistEN <- gutenberg_download(730)
twistEN <- twistEN[118:nrow(twistEN),]
twistEN <- twistEN %>%
unnest_tokens(word, text)
data(stop_words)
twistEN <- twistEN %>%
anti_join(stop_words)
countsEN <- twistEN %>%
count(word, sort=TRUE)
top.en <- countsEN[1:20,]
I can see here that the top 20 words (by frequency) in the English version of Oliver Twist are these:
word n
<chr> <int>
1 oliver 746
2 replied 464
3 bumble 364
4 sikes 344
5 time 329
6 gentleman 309
7 jew 294
8 boy 291
9 fagin 291
10 dear 277
11 door 238
12 head 226
13 girl 223
14 night 218
15 sir 210
16 lady 209
17 hand 205
18 eyes 204
19 rose 201
20 cried 182
I'm trying to accomplish the same thing with the French version of the same novel:
twistFR <- gutenberg_download(16023)
twistFR <- twistFR[123:nrow(twistFR),]
twistFR <- twistFR %>%
unnest_tokens(word, text)
stop_french <- data.frame(word = stopwords::stopwords("fr"), stringsAsFactors = FALSE)
stop_french <- get_stopwords("fr","snowball")
as.data.frame(stop_french)
twistFR <- twistFR %>%
anti_join(stop_words, by = c('word')) %>%
anti_join(stop_french, by = c("word"))
countsFR <- twistFR %>%
count(word, sort=TRUE)
top.fr <- countsFR[1:20,]
I did alter the code for the French stopwords based on info I found online, and it is removing some stopwords. But this is the list I'm getting:
word n
<chr> <int>
1 dit 1375
2 r 1311
3 tait 1069
4 re 898
5 e 860
6 qu'il 810
7 plus 780
8 a 735
9 olivier 689
10 si 673
11 bien 656
12 tout 635
13 tre 544
14 d'un 533
15 comme 519
16 c'est 494
17 pr 481
18 pondit 472
19 juif 450
20 monsieur 424
At least half of these words should be getting captured by a stopwords list and they're not. Is there something I'm doing wrong in my code? I'm new to tidy text, so I'm sure there are better ways to get at this.