This is my sample dataset:
library(dplyr)

text <- c("I went to Helsinki", "I went to Hélsinki", "I went allé Helsinki",
          "je vais a Helsinli", "I met Mr Smith", "I met Monsiéur Smith",
          "J'ai rencontré Monsieur Smith")
rank <- c(1, 2, 3, 4, 5, 6, 7)
df <- data.frame(text, rank)
df %>% top_n(10)
                           text rank
1            I went to Helsinki    1
2            I went to Hélsinki    2
3          I went allé Helsinki    3
4            je vais a Helsinli    4
5                I met Mr Smith    5
6          I met Monsiéur Smith    6
7 J'ai rencontré Monsieur Smith    7
Every text is in either English or French. I'd like to remove the texts that are mainly in French, while keeping texts that are mainly in English and merely contain one or a few French characters.
I am using the solution offered here as follows:
df %>%
  mutate(text_selected = iconv(text, from = "latin1", to = "ASCII")) %>%
  select(text, text_selected)
                           text                 text_selected
1            I went to Helsinki            I went to Helsinki
2            I went to Hélsinki                          <NA>
3          I went allé Helsinki                          <NA>
4            je vais a Helsinli            je vais a Helsinli
5                I met Mr Smith                I met Mr Smith
6          I met Monsiéur Smith                          <NA>
7 J'ai rencontré Monsieur Smith                          <NA>
With this solution, rows 1, 5, and 7 are classified correctly, but the other rows are classified incorrectly. I believe this is because the solution works at the character level. For instance, row 2 contains the character é in Hélsinki, so the text is classified as NA (non-English) even though it is mainly written in English. Conversely, the main language of row 4 is French, but because the text contains no French characters (due to punctuation and spelling issues), it is classified as English.
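To illustrate why I think the check is purely character-based, here is a minimal sketch in base R on three of the sample texts. The names `fails_iconv` and `has_non_ascii` are made up for this illustration. Assuming `iconv()` returns NA for input it cannot convert (as in the output above), I would expect the two columns to agree, which would mean the approach only detects whether a non-ASCII character is present and says nothing about the main language of the text.

```r
# Three of the sample texts from above
text <- c("I went to Helsinki", "I went to Hélsinki", "je vais a Helsinli")

# The check used above: NA whenever the string cannot be written in ASCII
fails_iconv <- is.na(iconv(text, from = "latin1", to = "ASCII"))

# A byte-level check: TRUE if any byte of the string is outside 0-127
has_non_ascii <- vapply(
  text,
  function(s) any(as.integer(charToRaw(s)) > 127L),
  logical(1),
  USE.NAMES = FALSE
)

# Compare the two checks side by side
print(data.frame(text, fails_iconv, has_non_ascii))
```

Note that the third text is French but contains only ASCII bytes, so a check of this kind cannot flag it.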
I wonder if there is a solution for classifying a text by its main language that is more sophisticated than checking for a single character. In this case, my ideal output would be:
                           text                 text_selected
1            I went to Helsinki            I went to Helsinki
2            I went to Hélsinki            I went to Hélsinki
3          I went allé Helsinki          I went allé Helsinki
4            je vais a Helsinli                          <NA>
5                I met Mr Smith                I met Mr Smith
6          I met Monsiéur Smith          I met Monsiéur Smith
7 J'ai rencontré Monsieur Smith                          <NA>