1. Reading data from Twitter and then saving it in MongoDB:
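library(twitteR)   # searchTwitter(), twListToDF()
library(rmongodb)  # mongo.create(), mongo.bson.from.df(), mongo.insert.batch()

data.list <- searchTwitter('#demonetization ', n=10)
data.df = twListToDF(data.list)
temp = mongo.bson.from.df(data.df)
mongo <- mongo.create()
# Namespace is "<database>.<collection>"; the database name must be a quoted string.
DB_Details <- paste("twitter", "filterstream", sep=".")
mongo.insert.batch(mongo, DB_Details, temp)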
2. Reading the data from MongoDB and saving it in the dataset variable (all columns of the table are stored in this variable):
library(mongolite)  # provides mongo() and the $find() method
mongo <- mongo(db = "twitter", collection = "filterstream", url = "mongodb://localhost")
dataset <- mongo$find()
3. When I print the contents of the dataset variable there is no problem (see OUTPUT-1), but when I print a single column, dataset$text, the output (see OUTPUT-2) differs from OUTPUT-1.
OUTPUT-1
> dataset
--------------------------------------------------
| id | text |
--------------------------------------------------
| 1 | <ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B1><U+0087>\nSome great jokes on #DeMonetization on my TL today.\n\nThank you, Modi ji. <ed><U+00A0><U+00BD><ed><U+00B1><U+0087> |
--------------------------------------------------
| 2 | should be one |
--------------------------------------------------
OUTPUT-2
> dataset$text
| id | text |
--------------------------------------------------
| 1 | \xed��\xed�\u0082\xed��\xed�\u0082\xed��\xed�\u0087\nSome great jokes on #DeMonetization on my TL today.\n\nThank you, Modi ji. \xed��\xed�\u0087 |
--------------------------------------------------
| 2 | should be one |
--------------------------------------------------
4. Detecting the odd characters in OUTPUT-2 and getting rid of them is difficult. I can remove special characters (tags) and obtain clean text with a regex when the text column looks like OUTPUT-1, but the text column in OUTPUT-2 looks quite different and I am not able to remove those characters.
5. Why does the content change when I print a single column from dataset? What am I doing wrong?
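
One way to check whether OUTPUT-1 and OUTPUT-2 hold the same underlying bytes and differ only in how they are displayed is to inspect the raw bytes directly. The following is a minimal sketch, not a confirmed fix: `x` is a hypothetical stand-in for one value of dataset$text, built from the bytes ED A0 BD ED B8 82 that appear in OUTPUT-1, and the `iconv()` step assumes that dropping every non-ASCII character (including legitimate ones) is acceptable.

```r
# Hypothetical stand-in for one value of dataset$text. The first six bytes
# are the ones shown as <ed><U+00A0><U+00BD><ed><U+00B8><U+0082> in OUTPUT-1.
bytes <- c(as.raw(c(0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x82)),
           charToRaw(" Some great jokes"))
x <- rawToChar(bytes)

# Inspect what is actually stored, independent of how print() renders it.
print(Encoding(x))   # encoding mark R has attached to the string
print(charToRaw(x))  # the underlying bytes

# Drop every byte that cannot be converted to ASCII, then trim whitespace.
# Assumption: losing all non-ASCII characters is acceptable for this text.
clean <- trimws(iconv(x, from = "UTF-8", to = "ASCII", sub = ""))
print(clean)
```

If `charToRaw()` returns the same bytes for a value taken from dataset and from dataset$text, the difference between the two outputs is only in how the printing code escapes those bytes.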