Reading a CSV file and tokenizing it

Question

I am a newbie in R. I have been reading a CSV file like this:

tweets <- read.csv("tweets.csv")

I need to remove all punctuation, convert the text to lowercase, and remove numbers, stop words, and extra whitespace from the data frame 'tweets' without converting it into a corpus or anything similar. Nothing fancy, just straightforward removal. Is there a library or function that could help with this?

Reading a CSV file and then processing/cleaning it are different steps. I would suggest breaking this into two questions, one for reading the CSV file if that is giving you trouble (please share error messages, and maybe a sample of the file) and another question focused on cleaning it (again, show a sample, and what you have tried). — Gregor Thomas, Oct 10 '17 at 16:47
If you've read the CSV file successfully, then don't mention it any more, just say "I have a data frame I need to clean". But still show what you've tried. Searching the R tag for ["remove punctuation" (click for link)](https://stackoverflow.com/search?q=%5Br%5D+remove+punctuation) and trying some of what you find would be a good start. Tool/package/library requests are off topic. — Gregor Thomas, Oct 10 '17 at 16:49
I tried this, but it is not working. Most of the other functions I am finding online do the same: `tw[] <- lapply(tw, function(x) { if (is.list(x)) { lapply(x, function(y) { tolower(gsub("[.,]", "", y)) }) } else { tolower(gsub("[.,]", "", x)) } })` followed by `tw`. I am getting this: `$tolower.as.matrix.tw.. [1] "" ""` — Adee Thyagarajan, Oct 10 '17 at 18:01
Great! Here's how to proceed: 1. Edit your question to get rid of the reading a CSV stuff unless that's a problem. 2. Share a little sample data. [(LOTS of tips here - make it copy/pastable, `dput(droplevels(head(tweets)))` is probably all you need to do)](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) 3. Edit that code you tried into your question - it's very hard to read in a comment. 4. Also add errors that you got to your question. Then you'll have a good, answerable question! — Gregor Thomas, Oct 10 '17 at 18:06
Thanks for providing me with the much needed push at that moment. A successful Twitter Sentiment Analysis done. — Adee Thyagarajan, Oct 10 '17 at 23:16

score 0 · Answer 1 · answered Oct 10 '17 at 19:49


The part that reads the CSV is what you have already written:

tweets <- read.csv("tweets.csv")

However, for dealing with punctuation and whitespace, the alternative to using a corpus is regular expressions, but that approach has limited application because it is not generic.

That is why we prefer a corpus: it is easier to apply to different sources.
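For reference, the regular-expression route could look something like the sketch below, using only base R. It is an illustration under assumptions, not a definitive implementation: the column name `text`, the sample rows, the helper `clean_text`, and the very short `stop_words` vector are all hypothetical and would need to be replaced with the real column and a proper stop-word list.

```r
# Hypothetical sample standing in for read.csv("tweets.csv");
# the column name `text` is an assumption about the file.
tweets <- data.frame(
  text = c("Hello, World! It's 2017...", "  R is   GREAT for the 100 tweets  "),
  stringsAsFactors = FALSE
)

# Tiny illustrative stop-word list (an assumption, not a standard list).
stop_words <- c("a", "an", "and", "for", "is", "it", "the", "to")

# Hypothetical helper: cleans a character vector with base R regular expressions.
clean_text <- function(x, stop_words) {
  x <- tolower(x)                       # convert to lowercase
  x <- gsub("[[:punct:]]+", "", x)      # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)      # remove numbers
  # remove whole-word stop words; \\b marks a word boundary
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("\\s+", " ", x)             # collapse runs of whitespace
  trimws(x)                             # drop leading/trailing whitespace
}

tweets$text <- clean_text(tweets$text, stop_words)
print(tweets$text)
```

The order of the steps matters: punctuation is stripped before stop words are matched, so a token like "It's" becomes "its" and is not caught by the stop word "it". This kind of edge case is part of why the regex approach is less generic than a corpus-based one.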


jatin singh


Thanks a lot for that. Was able to do it. – Adee Thyagarajan Oct 10 '17 at 23:17
