0

I looked around both here and elsewhere and found many similar questions, but none that exactly answers mine. I need to clean up naming conventions, specifically to replace or remove certain words and phrases in a single column (variable), not the entire dataset. I am migrating from SPSS to R. An example of the SPSS code is below, but I am not sure how to do the same in R.

For example:

"Acadia Parish" --> "Acadia" (removes Parish and space before Parish)

"Fifth District" --> "Fifth" (removes District and space before District)

SPSS syntax:

COMPUTE county=REPLACE(county,' Parish','').

There are only a few instances of this issue in a column with 32,000 cases. What needs replacing or removing varies, and the cases can repeat (there are dozens of instances of a phrase containing 'Parish'), so it is much faster to code exactly what needs to be removed or replaced. It is not as simple or clean as a regular expression that removes all spaces, all characters after a specific word or character, all special characters, etc. The match must also include leading spaces.

I have looked at replace(), gsub(), and other similar functions in R, but they all involve creating vectors, or it seems like they do. What I'd like is syntax that looks for the characters I specify, which can include leading or trailing spaces, and replaces them with something I specify, which can be nothing at all. If it does not find those characters, the case is left unchanged.

Yes, I will end up repeating the same syntax many times. It's probably easier to create a vector, but if possible I'd like the syntax I described, as there are other similar operations I need to do as well.

Thank you for looking.

Adam_S
  • `county <- gsub(" Parish", "", county)` – Ryan Morton Jan 26 '17 at 21:51
  • Your column *is* a vector. So using `gsub` that creates a modified vector is exactly what you want. – Gregor Thomas Jan 26 '17 at 21:54
  • Suggested duplicate: [In R, replace text within a string](http://stackoverflow.com/q/11936339/903061) – Gregor Thomas Jan 26 '17 at 21:57
  • Again, using `gsub` or similar and learning some regular expressions will probably be your best bet. Lots of resources, eg: http://regexr.com/ – adatum Jan 26 '17 at 21:57
  • @RyanMorton - that returns the error message "object 'county' not found." County is the variable name; I'm not sure what I'm doing wrong, because that looks right given Gregor's explanation. – Adam_S Jan 26 '17 at 22:01
  • @Gregor - not a duplicate. I want to specify words/characters to delete from all cases in a column, not specify a list of values and then what to delete from them. Your link is similar, but the opposite. I saw it earlier today and was unable to adapt it. – Adam_S Jan 26 '17 at 22:03
  • How does the data sit now? In a data frame? `dataFrameName$county` would be the way to call it if it's in a data frame. – Ryan Morton Jan 26 '17 at 22:07
  • @RyanMorton - yes, that was exactly it. Thank you so much for following up your answer. I'm sure this stuff is like first baby steps. I've written SPSS and Stata syntax for years but have no programming experience, so R is trickier. Thank you! – Adam_S Jan 26 '17 at 22:09
  • Sure, no problem. Best of luck! R is great! – Ryan Morton Jan 26 '17 at 22:12

3 Answers

3
> x <- c("Acadia Parish", "Fifth District")
> x2 <- gsub("^(\\w*).*$", "\\1", x)
> x2
[1] "Acadia" "Fifth"

Legend:

  • ^ Start of the string.
  • () Capture group.
  • \w* Zero or more word characters (letters, digits, underscore).
  • .* Zero or more of any character (here, the rest of the string).
  • $ End of the string.
  • \1 Backreference to the first capture group, used in the replacement.
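
Note that this pattern keeps only the first word, so a multi-word name would be cut down to its first word. A minimal sketch of an alternative that strips only a trailing suffix, assuming the suffixes to remove are " Parish" and " District" (the third sample value is made up for illustration and is not from the question):

```r
# Sample values; "East Baton Rouge Parish" is a made-up test case
x <- c("Acadia Parish", "Fifth District", "East Baton Rouge Parish")

# First-word pattern from above: drops everything after the first word
first_word <- gsub("^(\\w*).*$", "\\1", x)
first_word
# [1] "Acadia" "Fifth"  "East"

# Remove only a trailing " Parish" or " District", leading space included
x3 <- sub(" (Parish|District)$", "", x)
x3
# [1] "Acadia"           "Fifth"            "East Baton Rouge"
```

Values that do not end in one of the listed suffixes are returned unchanged.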
Petr Javorik
1

Maybe I'm missing something, but I don't see why you can't simply use alternation (`|`) in your regular expression and then trim the leftover white space.

string <- c("Arcadia Parish", "Fifth District")

bad_words <- c("Parish", "District") # Write all the words you want removed here!
bad_regex <- paste(bad_words, collapse = "|")

trimws( sub(bad_regex, "", string) )

# [1] "Arcadia" "Fifth" 
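
If the phrases should be matched literally, leading spaces included, and some should be replaced with text rather than removed, one option is to loop over a lookup of find/replace pairs. This is only a sketch: the `replacements` vector, the `cleaned` variable, and the "Parish of Orleans" value are hypothetical and not from the question.

```r
string <- c("Arcadia Parish", "Fifth District", "Parish of Orleans")

# Hypothetical lookup: names are the literal text to find,
# values are what to put in its place ("" removes the text)
replacements <- c(" Parish" = "", " District" = "", "Parish of " = "")

cleaned <- string
for (pattern in names(replacements)) {
  # fixed = TRUE matches the text literally, with no regex interpretation
  cleaned <- gsub(pattern, replacements[[pattern]], cleaned, fixed = TRUE)
}
cleaned
# [1] "Arcadia" "Fifth"   "Orleans"
```

Each pass behaves much like one SPSS `REPLACE` call, so adding another phrase only means adding another entry to the lookup.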
Chrisss
1
dataframename$varname <- gsub(" Parish","", dataframename$varname)
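
A small worked example of the same line, assuming a hypothetical data frame `df` with a `county` column and a second column `cases` (both invented here for illustration). It shows that only the named column changes and that values without " Parish" are left as they are:

```r
# Hypothetical data frame standing in for the real data
df <- data.frame(
  county = c("Acadia Parish", "Fifth District", "Orleans"),
  cases  = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Rough equivalent of SPSS: COMPUTE county=REPLACE(county,' Parish','').
# fixed = TRUE treats " Parish" as literal text rather than a regex
df$county <- gsub(" Parish", "", df$county, fixed = TRUE)
df$county
# [1] "Acadia"         "Fifth District" "Orleans"
```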
A. Suliman