42

I have a list of tweets and I would like to keep only those that are in English.

How can I do this?

zx8754
zoltanctoth

7 Answers

49

The textcat package does this. It can detect 74 'languages' (more properly, language/encoding combinations), more with other extensions. Details and examples are in this freely available article:

Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. (2013). The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52(6), 1-17.

Here's the abstract:

Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.

And here's one of their examples:

library("textcat")
textcat(c(
  "This is an English sentence.",
  "Das ist ein deutscher Satz.",
  "Esta es una frase en espa~nol."))
[1] "english" "german" "spanish" 
Ben
  • 4
    `textcat()` returns many misclassifications: I just ran it on 800 abstracts of academic articles which I know are either in German or English. Nevertheless, `textcat` classified 3 as Latin, 3 as French (?!) and 2 as Catalan (?!?!). The `cldr`-package proposed by @aykutfirat, however, exactly hit the spot on all texts and even proposes 2nd and 3rd alternatives. – KenHBS Nov 18 '16 at 17:21
  • 8
    Not a bad error rate I'd say for an approach that looks intended to be quick and dirty (ngram matching) – geotheory Mar 08 '18 at 16:20
30

The cldr package suggested in a previous answer is no longer available on CRAN and may be difficult to install. However, Google's (Chromium's) cld libraries are now available in R through other dedicated packages, cld2 and cld3.

After testing on a few thousand tweets in multiple European languages, I can say that among the available options textcat is by far the least reliable. With textcat I also quite frequently get tweets wrongly detected as "middle_frisian", "rumantsch", "sanskrit", or other unusual languages. It may be relatively good with other types of texts, but I think textcat is pretty bad for tweets.

cld2 still seems, in general, to be better than cld3. If you want a safe way to include only tweets in English, you can run both cld2 and cld3 and keep only the tweets recognised as English by both.

Here's an example based on a Twitter search, which usually returns results in many different languages but always includes some tweets in English.

if (!require("pacman")) install.packages("pacman") # for package management
pacman::p_load("tidyverse") 
pacman::p_load("textcat")
pacman::p_load("cld2")
pacman::p_load("cld3")
pacman::p_load("rtweet")

punk <- rtweet::search_tweets(q = "punk") %>%
  mutate(textcat = textcat(x = text),
         cld2 = cld2::detect_language(text = text, plain_text = FALSE),
         cld3 = cld3::detect_language(text = text)) %>%
  select(text, textcat, cld2, cld3)
View(punk)

# Only English tweets
punk %>% filter(cld2 == "en" & cld3 == "en")

Finally, I should perhaps add the obvious, given that this question is specifically about tweets: Twitter provides its own language detection for tweets via its API, and it seems to be pretty accurate (understandably less so with very short tweets). So if you run rtweet::search_tweets(q = "punk"), you will see that the resulting data.frame includes a "lang" column. If you get your tweets via the API, you can probably trust Twitter's own detection system more than the alternative solutions suggested above (which remain valid for other texts).
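
For instance, a minimal sketch of filtering on that column (this assumes the tweets were retrieved with rtweet::search_tweets() as above, so the returned data frame already carries Twitter's "lang" field):

# keep only the tweets that Twitter itself tagged as English
punk_api <- rtweet::search_tweets(q = "punk") %>%
  filter(lang == "en")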

giocomai
27

Try http://cran.r-project.org/web/packages/cldr/ which brings Google Chrome's language detection to R.

# install from archive
url <- "http://cran.us.r-project.org/src/contrib/Archive/cldr/cldr_1.1.0.tar.gz"
pkgFile <- "cldr_1.1.0.tar.gz"
download.file(url = url, destfile = pkgFile)
install.packages(pkgs = pkgFile, type = "source", repos = NULL)
unlink(pkgFile)
# or: devtools::install_version("cldr", version = "1.1.0")

#usage
library(cldr)
demo(cldr)
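
Beyond demo(cldr), a minimal usage sketch; I'm not asserting the exact column names of the object detectLanguage() returns, so inspect it with str():

library(cldr)
res <- detectLanguage("This is a tweet in English.")  # detect the language of one text
str(res)  # inspect the result to see which column holds the detected language

For a vector of tweets, apply detectLanguage() over the elements (e.g. with lapply()) if it turns out not to be vectorised, then keep only the tweets detected as English.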
aykutfirat
  • 1
    I see this package is removed from CRAN. – user131476 Mar 31 '15 at 05:14
  • You can still download it from http://cran.us.r-project.org/src/contrib/Archive/cldr/ (I did not have time to make modifications to make it compatible with the new C language requirements of CRAN) – aykutfirat Mar 31 '15 at 17:42
  • 1
    if you have the compilation tools you may also be able to use devtools::install_version("cldr",version="1.1.0") to install – aykutfirat Apr 01 '15 at 18:32
  • @aykutfirat could you please share a list of the libraries needed to compile this in either Ubuntu or Fedora? or at least if there is some required package that is unusual in this respect? The error messages I receive when trying to install do not provide clear hints (at least, nothing I can really make sense of) – giocomai Oct 10 '17 at 07:43
  • Hmmm.. `cldr::detectLanguage("Their debut album")` == 100% Indonesian – geotheory Mar 09 '18 at 09:29
16

An approach in R would be to keep a text file of English words. I have several of these, including one from http://www.sil.org/linguistics/wordlists/english/. After reading in the .txt file you can use this word list to match against each tweet. Something like:

lapply(tweets, function(x) strsplit(x, "\\s+")[[1]] %in% EnglishWordComparisonList)

You'd want some threshold proportion as a cut-off to decide whether a tweet counts as English (I arbitrarily chose .06):

# read the downloaded word list (one word per line) into a character vector
EnglishWordComparisonList <- readLines("path to the list you downloaded above")

Englishinator <- function(tweet, threshold = .06) {
    words <- unlist(strsplit(tolower(tweet), "[^[:alpha:]']+"))
    # proportion of the tweet's words that appear in the English word list
    prop <- mean(words %in% EnglishWordComparisonList)
    if (isTRUE(prop > threshold)) tweet else NULL  # keep tweets above the threshold
}

lapply(tweets, Englishinator)

I haven't actually used this myself, because I use the English word list quite differently in my research, but I think this would work.

Jacob
Tyler Rinker
15

tl;dr: cld2 is by far the fastest (cld3 is about 22× slower, textcat about 118×, and my handmade solution about 252×).

There's been a lot of discussion about accuracy here, which is understandable for tweets. But what about speed?

Here's a benchmark of cld2, cld3 and textcat.

I also threw in a naïve function of my own, which counts occurrences of stop words in the text (using tm::stopwords()).

I thought that for long texts I might not need a sophisticated algorithm, and that testing for many languages might be detrimental. In the end my approach turns out to be the slowest (most likely because the packaged approaches loop in C).

I leave it here to spare the time of anyone who has the same idea. I expect Tyler Rinker's Englishinator solution would be slow as well (it tests only one language, but checks many more words and uses similar code).

# naïve detector: split each text into words, count how many stop words of each
# candidate language it contains, and return the candidate with the most hits
detect_from_sw <- function(text, candidates){
  sapply(strsplit(text, '[ [:punct:]]'), function(y)
    names(which.max(sapply(candidates, function(x) sum(tm::stopwords(x) %in% y))))
  )
}

The benchmark

data(reuters,package = "kernlab") # a corpus of articles in english
length(reuters)
# [1] 40
sapply(reuters,nchar)
# [1] 1311  800  511 2350  343  388 3705  604  254  239  632  607  867  240
# [15]  234  172  538  887 2500 1030  538 2681  338  402  563 2825 2800  947
# [29] 2156 2103 2283  604  632  602  642  892 1187  472 1829  367
text <- unlist(reuters)

microbenchmark::microbenchmark(
  textcat = textcat::textcat(text),
  cld2 = cld2::detect_language(text),
  cld3 = cld3::detect_language(text),
  detect_from_sw = detect_from_sw(text,c("english","french","german")),
  times=100)

# Unit: milliseconds
# expr                 min         lq      mean     median         uq         max neval
# textcat        212.37624 222.428824 230.73971 227.248649 232.488500  410.576901   100
# cld2             1.67860   1.824697   1.96115   1.955098   2.034787    2.715161   100
# cld3            42.76642  43.505048  44.07407  43.967939  44.579490   46.604164   100
# detect_from_sw 439.76812 444.873041 494.47524 450.551485 470.322047 2414.874973   100

Note on textcat's inaccuracy

I can't comment on the accuracy of cld2 vs cld3 (@giocomai claimed cld2 was better in his answer), but I confirm that textcat seems very unreliable (as mentioned in several places on this page). All texts were classified correctly by all methods above, except this one, which textcat classified as Spanish:

"Argentine crude oil production was \ndown 10.8 pct in January 1987 to 12.32 mln barrels, from 13.81 \nmln barrels in January 1986, Yacimientos Petroliferos Fiscales \nsaid. \n January 1987 natural gas output totalled 1.15 billion cubic \nmetrers, 3.6 pct higher than 1.11 billion cubic metres produced \nin January 1986, Yacimientos Petroliferos Fiscales added. \n Reuter"

moodymudskipper
6

There is also a pretty well-working R package called "franc". Though it is slower than the others, I had a better experience with it than with cld2 and especially cld3.
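
A minimal sketch of using it for this task, assuming (as in the package documentation) that franc() returns an ISO 639-3 code such as "eng" for English:

library(franc)

tweets <- c("This is a tweet in English.",
            "Dies ist ein Tweet auf Deutsch.")

# detect an ISO 639-3 language code for each tweet, then keep the English ones
codes <- vapply(tweets, franc, character(1))
tweets[codes == "eng"]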

Daniel Bendel
  • A quick comparison of tweets showed that franc is performing as well as cld2 in detecting German language in 2000 tweets. cld3 detected another tweet correctly that franc and cld2 missed, but cld3 instead missed a German tweet that franc and cld2 accurately detected. – Simone Apr 12 '22 at 08:16
3

I'm not sure about R, but there are several libraries for other languages. You can find some of them collected here:

http://www.detectlanguage.com/

Also one recent interesting project:

http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html

Using this library, a map of languages used on Twitter was produced:

http://www.flickr.com/photos/walkingsf/6277163176/in/photostream

If you do not find a library for R, I suggest considering the use of a remote language detector through a web service.
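
As a rough illustration, here is a sketch of calling such a web service from R with httr; the endpoint, parameter names, and response format below are assumptions for illustration only, so check the service's documentation and substitute your own API key:

library(httr)

# hypothetical endpoint and parameters -- adapt to the service you actually use
resp <- POST("https://ws.detectlanguage.com/0.2/detect",
             body = list(q = "This is a tweet in English.", key = "YOUR_API_KEY"),
             encode = "form")
content(resp)  # inspect the returned detection result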

Laurynas
  • 1
    Thanks @Laurynas! I keep waiting for an R specific answer but your answer is great to start off with. However, Google Translate API (thus www.detectlanguage.com) will be disabled on 1 Dec 2011 (Google turns it into a paying service) – zoltanctoth Nov 10 '11 at 11:35
  • No prob :) If Google Translate will be disabled you can use Web Service at detectlanguage.com. I published it today. – Laurynas Nov 10 '11 at 16:29
  • Yay, that works pretty well! Is it possible that I have just checked this site around 10 hours ago and it was based on Google Translate that time? :) – zoltanctoth Nov 10 '11 at 21:45
  • Yes, it was using Google Translate for translation example (I moved it here: http://detectlanguage.com/translate). After your comment I created webservice which is based on C language detector (not on Google Translate). – Laurynas Nov 12 '11 at 13:39
  • @Laurynas What is the maximum number of requests allowed by the detecklanguage.com web service in a 24 hour period? – Tony Breyal Dec 05 '11 at 22:45