
I am super new to R (like, this is my second day on it), but have some experience with network analysis. I'm trying to prep some data for analysis, but I can't get it cleaned up. I need to remove all capital letters, symbols, and punctuation from a column of Twitter bios in my data. I've included a picture of the first part of the data.

I have tried code from similar posts, but it's not working, and I'm not sure if that's because my data isn't formatted the right way (it's in a CSV file). I've tried gsub, regular expressions, and a few other approaches from other posts. I'm sure I'm making some really basic mistakes, but I can't see what I'm doing wrong.

I was trying to add a picture of what I have, but I can't seem to do that. To give you an idea, I have a CSV file called twitterbios with three columns of data: "UserID", "bio", and "timestamp".

What I want is for all punctuation, capital letters, and symbols to be removed from the bios (column 2) of the twitterbios dataset. For example, one bio may say "I LOVE dogs!!! (Heart emoji)". I would want it to just say "i love dogs".

This may be too vague to be of any help, but I would be so grateful for any advice you can give me. Thanks!

[Screenshot: the data as shown in RStudio]

lwe
  • Welcome to SO. Please read https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Bill O'Brien Sep 10 '19 at 16:17
  • Hi, ``tolower()`` can be used for capital letters. You'll probably have to extract the character strings if you want to modify those. You can take a look at the ``grep()`` function and its variations by doing ``?grep``. I know you already took a look at it, but without a sample of the data (not an image) or the code you have already tried, we can't really help you. – Gainz Sep 10 '19 at 16:20
  • It's better to copy and paste the data than to use a screenshot. No one wants to manually recreate your data to put in an answer. – Trenton McKinney Sep 10 '19 at 16:30

1 Answer

Here is a reproducible example of how you could do this with the packages stringr and dplyr. I'm not as sure how to get rid of emojis, but perhaps you could replace everything that's not a letter, number, or space with an empty string.

library(stringr)
library(dplyr)

strings <- c("HeRe is Some teXT. WhO WRITES thIs WAY?",
             "---PunctuaTION IS not A CRIME!!!!")
strings %>%
  str_to_lower() %>%
  str_replace_all("[:punct:]", "")

# [1] "here is some text who writes this way"
# [2] "punctuation is not a crime"
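
For the emoji part, here is a minimal base R sketch of the "replace everything that's not a letter, number, or space" idea, applied to a data frame column. I don't have your file, so the `twitterbios` data frame below is a made-up stand-in with the three column names from the question, and `clean_bio` is just an illustrative helper name, not a function from any package. As far as I know, `[:punct:]` in stringr covers punctuation but can leave symbols and emoji behind, which is why this version uses a "keep only these characters" pattern instead.

```r
# Hypothetical stand-in for the real data. With the actual file you would
# load it first, e.g. read.csv("twitterbios.csv", stringsAsFactors = FALSE)
# (the file name is an assumption).
twitterbios <- data.frame(
  UserID = c(1, 2),
  bio = c("I LOVE dogs!!! \U0001F496", "Coffee & Code -- 24/7"),
  timestamp = c("2019-09-10", "2019-09-10"),
  stringsAsFactors = FALSE
)

# Illustrative helper: lowercase, then keep only a-z, 0-9, and spaces.
clean_bio <- function(x) {
  x <- tolower(x)                 # remove capital letters
  x <- gsub("[^a-z0-9 ]", "", x)  # drop punctuation, symbols, and emoji
  x <- gsub(" +", " ", x)         # collapse the runs of spaces left behind
  trimws(x)                       # strip leading and trailing spaces
}

# Overwrite the bio column with the cleaned text
twitterbios$bio <- clean_bio(twitterbios$bio)
twitterbios$bio
# [1] "i love dogs"     "coffee code 247"
```

Note that this pattern also strips accented and other non-ASCII letters, so it may be too aggressive if your bios are not all in English.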
Gregory