I am practicing text mining on tweets (Twitter) for a PhD project in Human Science.

I am having difficulty stripping the special characters (“”) that the Twitter API uses to enclose the original tweet in the new "retweet with comment" function.

I have tried using (\“) and (\'') without success.

Each time, I get the following error:

(...error text...)'\�'(...error text...)

which suggests that R does not recognize these two special characters, (“) and (”).

For now, I have replaced the first character before each (@) with a ("), which lets me compute basic statistics on the "retweet with comment" variable, but I cannot go further and apply text mining functions to the characters inside (“@...”).

Has anyone encountered this kind of problem?

R.Version()
$platform
[1] "x86_64-apple-darwin10.8.0"

$arch
[1] "x86_64"

$os
[1] "darwin10.8.0"

$system
[1] "x86_64, darwin10.8.0"

$status
[1] ""

$major
[1] "3"

$minor
[1] "1.0"

$year
[1] "2014"

$month
[1] "04"

$day
[1] "10"

$`svn rev`
[1] "65387"

$language
[1] "R"

$version.string
[1] "R version 3.1.0 (2014-04-10)"

$nickname
[1] "Spring Dance"
Cyrille
  • You should provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) with sample input and desired output. Exactly what functions are you calling? – MrFlick Jul 03 '15 at 17:47
  • @MrFlick: Yes. For example, from the Twitter API (stream function), here is the text variable of a tweet in my data frame in the "retweet with comment" format: _“@username: text of the tweet” text of the comment_. I would like to create new quantitative and qualitative variables attached to the "retweet with comment" variable, but I cannot get sub, gsub or stringr functions to match the special characters “ or ”; R returns an error (in French): Erreur : `'\�' est un code escape non reconnu dans une chaîne de caractères débutant "\�"` (in English: '\�' is an unrecognized escape in a character string starting "\�"). – Cyrille Jul 03 '15 at 18:28
  • @MrFlick: I think that R does not recognize the special characters `“` and `”`, and that is why it returns an error with `?`, even if I put a \ before the special character. What I have done so far is replace the leading `“` at the beginning of every "retweet with comment" tweet in the data frame with a " (which text mining functions do recognize as a special character), but that only gives me a quantitative variable for counting. I would like to do more with qualitative variables and text mining functions. – Cyrille Jul 03 '15 at 18:32
  • @Cyrille thank you for all the details, but I think what MrFlick meant is: just post a piece of your code and the desired output. It is very hard for us to guess the problem without a concrete example, even with the information you have given. – SabDeM Jul 03 '15 at 18:39
  • @MrFlick: I will extract an anonymized part of my data and write up a piece of my code so that you can reproduce the case on your system. – Cyrille Jul 03 '15 at 18:41
  • @SabDeM: OK. Just give me a few moments; I need to anonymize part of the data, which is really important in this type of research project. – Cyrille Jul 03 '15 at 18:43
  • @MrFlick: OK, I am beginning to understand. When I try to create a test object as a reproducible example for you, `test <- “@user_name: What can I get you to drink?” Sweet tea, please.`, I get this error in the R console: `Erreur : entrée inattendu(e) in "test <- �"` (in English: unexpected input in "test <- �"). So R does not recognize this type of symbol from the Twitter API (stream). The original data come from JSON that was mined from the Twitter API. – Cyrille Jul 03 '15 at 18:54
  • @SabDeM: OK, I am beginning to understand. When I try to create a test object as a reproducible example for you, `test <- “@user_name: What can I get you to drink?” Sweet tea, please.`, I get this error in the R console: `Erreur : entrée inattendu(e) in "test <- �"` (in English: unexpected input in "test <- �"). So R does not recognize this type of symbol from the Twitter API (stream). The original data come from JSON that was mined from the Twitter API. – Cyrille Jul 03 '15 at 18:54
  • what happens when you enter `"“"` in the `R` console? – MichaelChirico Jul 03 '15 at 18:56
  • Strange. `gsub("“”","","“weird quotes”")` returns `"“weird quotes”"` (i.e. unchanged); `gsub("\“\”","","“weird quotes”")` returns the error described; but `gsub("[[:punct:]]","","“weird quotes”")` returns `"weird quotes"` (i.e. your "weird quotes" are counted in `[[:punct:]]` somewhere). – MichaelChirico Jul 03 '15 at 19:00
  • And `?regex` suggests the following are included in `[[:punct:]]`: `! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.`... but this doesn't include your weird quotes – MichaelChirico Jul 03 '15 at 19:02
  • @MichaelChirico: Yes, I understand. Great. “ and ” are not special characters for R; they are text characters. So if I use them without \, it works perfectly. My solution is to replace them with the " special character using gsub, and work from there. Very good. – Cyrille Jul 03 '15 at 19:04
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/82328/discussion-between-cyrille-and-michaelchirico). – Cyrille Jul 03 '15 at 19:06
  • The issue is resolved: “ and ” are not treated as special characters, but probably as normal [[:alpha:]] text by R. To handle this type of weird quotes, you must not put a \ before them when calling sub, gsub or stringr functions. – Cyrille Jul 03 '15 at 19:39
  • @Cyrille could you write that up as an answer? – Nick Kennedy Jul 03 '15 at 23:55
  • @NickK Yes, I would, but I must admit that I don't know how to do that. I will look in the help menu. – Cyrille Jul 05 '15 at 06:08

1 Answer

As suggested by @MichaelChirico, when you type "“" in the R console, R returns [1] "“", which means that R does recognize the weird quotes “ and ”.

From ?regex we can see that:

The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ?.

This list does not include "“" or "”". The error came from the backslash itself: \“ is not a valid escape sequence in an R string. So to handle this type of weird quotes with sub, gsub or stringr functions for text mining, you do not need to put a \ before them.
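A minimal sketch of the point above, using a made-up tweet (the text and the variable names are illustrative, not taken from the real data). The curly quotes are written as the Unicode escapes \u201c and \u201d, which is an assumption made here so that the script does not depend on the encoding of the source file; typing “ and ” directly in the pattern should behave the same way in a UTF-8 session.

```r
# Hypothetical tweet text in the "retweet with comment" format.
# \u201c is the opening curly quote and \u201d is the closing one.
tweet <- "\u201c@user_name: What can I get you to drink?\u201d Sweet tea, please."

# No backslash before the curly quotes: they are ordinary literal characters,
# so a plain bracket expression is enough to match them.
no_quotes <- gsub("[\u201c\u201d]", "", tweet)

# Alternatively, normalise both curly quotes to the ASCII double quote.
ascii_quotes <- gsub("[\u201c\u201d]", "\"", tweet)

print(no_quotes)
print(ascii_quotes)
```

Normalising to the ASCII double quote keeps the boundary of the quoted tweet visible, whereas deleting the quotes loses it, so which variant fits depends on the analysis that follows.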

As confirmed by @NickK, R treats the weird quotes as [[:punct:]].
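The check from the comments can be written as a small script. This is only a sketch: whether [[:punct:]] matches the curly quotes may depend on the locale and on the regex engine (for example perl = TRUE), so the first result is printed rather than assumed. The explicit character class in the second check names the two characters directly and does not rely on that classification.

```r
# Curly quotes as Unicode escapes, plus an ASCII double quote and a letter.
x <- c("\u201c", "\u201d", "\"", "a")

# POSIX class: in the thread, both curly quotes were reported as punctuation;
# on other systems the result for the first two elements may differ.
is_punct <- grepl("[[:punct:]]", x)
print(is_punct)

# Explicit class listing only the two curly quotes.
is_curly <- grepl("[\u201c\u201d]", x)
print(is_curly)
```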

For researchers doing data science on social media, specifically on Twitter data (tweets) collected through the public stream of the Twitter API, this tip helps with handling unstructured tweet text, especially the new Twitter interaction "retweet with comment", which has this format: “@user.screen_name: text of the original tweet” text of the comment.
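As a rough illustration of how that format could be split into separate variables, here is a sketch using base R only. The sample tweets, the column names and the pattern are assumptions for the example; in particular, the pattern assumes screen names contain only letters, digits and underscores, and that the tweet starts with the opening curly quote.

```r
# Hypothetical sample: one "retweet with comment" and one ordinary tweet.
tweets <- c(
  "\u201c@user_name: What can I get you to drink?\u201d Sweet tea, please.",
  "An ordinary tweet without a quoted retweet."
)

# Opening curly quote, @screen_name, colon, original text,
# closing curly quote, then the comment.
pattern <- "^\u201c@([A-Za-z0-9_]+): (.*)\u201d *(.*)$"

# Flag the tweets that follow the format, then pull out each captured group.
is_rt_comment <- grepl(pattern, tweets)

parsed <- data.frame(
  is_rt_comment = is_rt_comment,
  screen_name   = ifelse(is_rt_comment, sub(pattern, "\\1", tweets), NA_character_),
  original_text = ifelse(is_rt_comment, sub(pattern, "\\2", tweets), NA_character_),
  comment       = ifelse(is_rt_comment, sub(pattern, "\\3", tweets), NA_character_),
  stringsAsFactors = FALSE
)

print(parsed)
```

The logical column gives the quantitative variable (a count of retweets with comment), while the three text columns can be passed on to text mining functions separately.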

Cyrille
  • Just edited a little - the 'weird' quotes are actually regarded as `[[:punct:]]` - try `x <- c("“", "”", '"', "a"); grep("[[:punct:]]", x)`. – Nick Kennedy Jul 05 '15 at 07:22