Removing text based on the Non-English language used in the text

Question

This is my sample dataset:

library(dplyr)

text <- c("I went to Helsinki", "I went to  Hélsinki", "I went allé Helsinki",
          "je vais a Helsinli", "I met Mr Smith", "I met Monsiéur Smith",
          "J'ai rencontré Monsieur Smith")

rank <- c(1, 2, 3, 4, 5, 6, 7)

df <- data.frame(text, rank)
df %>% top_n(10)

                           text rank
1            I went to Helsinki    1
2           I went to  Hélsinki    2
3          I went allé Helsinki    3
4            je vais a Helsinli    4
5                I met Mr Smith    5
6          I met Monsiéur Smith    6
7 J'ai rencontré Monsieur Smith    7

All texts are in either English or French. I'd like to remove the texts that are mainly in French, not those that merely contain one or a few French characters.

I am using the solution offered here as follows:

df %>%
    mutate(text_selected= iconv(text, from = "latin1", to = "ASCII")) %>% 
    select(text,text_selected)

                           text        text_selected
1            I went to Helsinki   I went to Helsinki
2           I went to  Hélsinki                 <NA>
3          I went allé Helsinki                 <NA>
4            je vais a Helsinli   je vais a Helsinli
5                I met Mr Smith       I met Mr Smith
6          I met Monsiéur Smith                 <NA>
7 J'ai rencontré Monsieur Smith                 <NA>

Using this solution, rows 1, 5, and 7 are classified correctly, but the other rows are classified wrongly. I believe this is because the solution works on individual characters. For instance, in row 2 the character é in Hélsinki causes the text to be classified as NA (non-English), even though the text is mainly written in English. Conversely, in row 4 the main language is French, but because the text happens to contain no accented characters (due to punctuation and spelling issues), it is classified as English.
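
To make the character-level behaviour concrete, here is a minimal base R sketch (an illustration only; the variable names `ascii_only` and `flagged` are made up for this example). It repeats the `iconv()` call from above and shows that a row is flagged as soon as it contains a single non-ASCII character, regardless of the language of the remaining words:

```r
text <- c("I went to Helsinki", "I went to  Hélsinki", "I went allé Helsinki",
          "je vais a Helsinli", "I met Mr Smith", "I met Monsiéur Smith",
          "J'ai rencontré Monsieur Smith")

# Same conversion as in the question: a string that contains any character
# that cannot be represented in ASCII comes back as NA
ascii_only <- iconv(text, from = "latin1", to = "ASCII")

# TRUE means "treated as non-English" by this approach
flagged <- is.na(ascii_only)

# Side-by-side view: the flag tracks accented characters, not the language
data.frame(text, flagged)
```

Row 4 is never flagged because it contains only ASCII characters, which is why a character-based test cannot separate it from the English rows.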

I wonder if there is a solution for classifying a text by its main language that is more sophisticated than detecting a single character. In this case, my ideal output would be:

                           text        text_selected
1            I went to Helsinki   I went to Helsinki
2           I went to  Hélsinki  I went to  Hélsinki
3          I went allé Helsinki I went allé Helsinki
4            je vais a Helsinli                 <NA>
5                I met Mr Smith       I met Mr Smith
6          I met Monsiéur Smith I met Monsiéur Smith
7 J'ai rencontré Monsieur Smith                 <NA>

The `cld3` package has a function, `detect_language()`, which should help: https://docs.ropensci.org/cld3/ — Phil, May 24 '21 at 00:37
@Phil, it is not 100% accurate, but it produces fewer false negatives than my approach. The text is one column of my dataset, alongside other columns. I had to create a tidy text tibble with `text_df <- tibble(line = 1:7, text = df$text)` and then `df_language <- text_df %>% mutate(text_language = detect_language(text))`. So I can filter `df_language` by `text_language`, but I have no clue how to link it back to the original data (`df`). Would you please advise? — Alex, May 24 '21 at 03:09
This is a different question, but if I understand you correctly, you'll need to set up a unique id number per row (you can use `row_number()` from dplyr) and then use `left_join()` to merge the data frames back. — Phil, May 24 '21 at 03:16
Don't create a separate dataframe. `df %>% mutate(text_language = detect_language(text))` will give you a new column in your original `df`. You can filter the dataframe from this. — Ronak Shah, May 24 '21 at 03:49
@RonakShah, it creates this `Error: Problem with `mutate()` input `text_language`. x Parameter 'text' must be a connection or character vector i Input `text_language` is `detect_language(text)`.` — Alex, May 24 '21 at 04:15
@Alex I don't get any error on the data that you have shared. — Ronak Shah, May 24 '21 at 04:16
@RonakShah, well, that is strange. I get the error I shared. Could it be because of a package version? — Alex, May 24 '21 at 04:23
@Phil, as you proposed I will post a separate question along with your answer. — Alex, May 24 '21 at 04:31
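
Putting the comments above together, a minimal sketch might look like the following. It assumes the `cld3` and `dplyr` packages are installed. The `text_selected` column name follows the question, and `df_lang` is a name made up for this example. The idea that the reported error comes from `text` being a factor (the `data.frame()` default before R 4.0.0) is an assumption, not something confirmed in the thread. `detect_language()` can misclassify very short strings, so no output is shown here.

```r
library(dplyr)
library(cld3)

text <- c("I went to Helsinki", "I went to  Hélsinki", "I went allé Helsinki",
          "je vais a Helsinli", "I met Mr Smith", "I met Monsiéur Smith",
          "J'ai rencontré Monsieur Smith")
rank <- 1:7

# stringsAsFactors = FALSE keeps `text` as a character vector; before
# R 4.0.0 the default was TRUE, which would store it as a factor
df <- data.frame(text, rank, stringsAsFactors = FALSE)

df_lang <- df %>%
  mutate(
    # as.character() is a guard in case `text` arrives as a factor (assumption)
    text = as.character(text),
    # One language code per row; NA when cld3 has no reliable guess
    text_language = detect_language(text),
    # Blank out rows detected as French; `%in%` treats NA as "not French"
    text_selected = if_else(text_language %in% "fr", NA_character_, text)
  )

df_lang
```

Because the new columns are added to `df` directly, every other column stays attached to its row and no join back to the original data is needed.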
