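  1. Reading data from Twitter and saving it in MongoDB:

     library(twitteR)   # searchTwitter(), twListToDF()
     library(rmongodb)  # mongo.create(), mongo.bson.from.df(), mongo.insert.batch()

     # Fetch 10 tweets for the hashtag and flatten them into a data frame
     data.list <- searchTwitter('#demonetization ', n = 10)
     data.df <- twListToDF(data.list)
     # Convert each row to a BSON document and insert into twitter.filterstream
     temp <- mongo.bson.from.df(data.df)
     mongo <- mongo.create()
     DB_Details <- paste("twitter", "filterstream", sep = ".")
     mongo.insert.batch(mongo, DB_Details, temp)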
    
  2. Reading the data back from MongoDB and saving it in the dataset variable (all columns of the collection are stored in this variable):

     library(mongolite)  # provides mongo(), whose connection object has $find()
     mongo <- mongo(db = "twitter", collection = "filterstream", url = "mongodb://localhost")
     dataset <- mongo$find()
    
  3. When I print the whole dataset variable there is no problem (see OUTPUT-1), but when I print a single column from it, the output (see OUTPUT-2) differs from OUTPUT-1.

OUTPUT-1

  > **dataset**    

   --------------------------------------------------
    | id        | text              |
    --------------------------------------------------
    | 1         | <ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B1><U+0087>\nSome great jokes on #DeMonetization on my TL today.\n\nThank you, Modi ji. <ed><U+00A0><U+00BD><ed><U+00B1><U+0087>  |
    --------------------------------------------------
    | 2         | should be one              |
    --------------------------------------------------

OUTPUT-2

 > **dataset$text**   

    | id        | text              |
    --------------------------------------------------
    | 1         | \xed��\xed�\u0082\xed��\xed�\u0082\xed��\xed�\u0087\nSome great jokes on #DeMonetization on my TL today.\n\nThank you, Modi ji. \xed��\xed�\u0087  |
    --------------------------------------------------
    | 2         | should be one              |
    --------------------------------------------------

  4. Detecting the odd characters in OUTPUT-2 and getting rid of them is difficult. I can remove special characters (tags) and obtain clean text with a regex when the text column looks as it does in OUTPUT-1, but the text column in OUTPUT-2 looks quite different and I am not able to remove those characters from it.

  5. Why does the content suddenly change when I print a particular column from dataset? What am I doing wrong?
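
One plausible reading of the two outputs, offered as a sketch rather than a confirmed diagnosis: printing a data frame and printing a character vector can escape non-printable bytes differently, so OUTPUT-1 and OUTPUT-2 may be two renderings of the same stored bytes. The snippet below builds a hypothetical sample string from the byte values shown in OUTPUT-1, inspects it with `charToRaw()`, and drops everything that is not ASCII using base R's `iconv(..., sub = "")`. The helper name `strip_non_ascii`, the sample data frame, and the `clean_text` column are assumptions made for illustration only.

```r
# Hypothetical sample: the first byte sequence from OUTPUT-1
# (<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>) followed by plain text.
sample_text <- "\xed\xa0\xbd\xed\xb8\x82 Some great jokes on #DeMonetization"
dataset <- data.frame(id = 1, text = sample_text, stringsAsFactors = FALSE)

# charToRaw() shows the stored bytes without any print-time escaping,
# which makes it possible to compare values independent of how they print.
raw_bytes <- charToRaw(dataset$text[1])

# Hypothetical helper: drop every byte that cannot be converted to ASCII.
# Assumes the input is meant to be UTF-8; sub = "" deletes the
# non-convertible bytes instead of returning NA.
strip_non_ascii <- function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

dataset$clean_text <- strip_non_ascii(dataset$text)

print(raw_bytes)
print(dataset$clean_text)
```

Note that this cleanup is deliberately blunt: it also removes legitimate non-ASCII text (accented letters, emoji), and the exact behaviour of `iconv` can vary between platforms, so it is worth checking the result on a few real rows first.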

Akki
  • May be of interest. http://stackoverflow.com/questions/25468716/convert-byte-encoding-to-unicode/25531299#25531299 – hwnd Dec 02 '16 at 20:39

0 Answers