How to use the gsub function to remove all special characters at the start of each record in the Address column?

Question

Address <- c("#20 W Irving ST","@1 East Street",
             "%222 Rockfard Avenue","-145 W Locust","& 99 East Locus")
Number <- c("A-1","A-2","A-3","A-4","A-5")
DF <- data.frame(Address,Number)

Is the first character of each string a special character? If so, perhaps just delete the first character. — Mark Miller, May 26 '16 at 22:22
@MarkMiller The data set has around 70,000 records, and I need to match the Address column for destination and origin. The data has some bad entries like the ones shown above, which start with a special character, so I am trying to remove only the special characters at the start of each address. — KGarg, May 27 '16 at 00:37
Are you saying some of the addresses have a special character at the beginning and some of the addresses do not? — Mark Miller, May 27 '16 at 00:43

score 2 · Answer 1 · answered May 27 '16 at 01:12

Just remove the run of punctuation or whitespace characters at the very start of the string. In regex terms:

gsub("^[[:punct:][:space:]]+", "", DF$Address)
#[1] "20 W Irving ST"      "1 East Street"       "222 Rockfard Avenue" "145 W Locust"       
#[5] "99 East Locus"
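
For a data set of around 70,000 records it may help to see which rows the pattern touches before overwriting the column. The following is a minimal sketch under some assumptions: it reuses the example addresses plus one clean address ("5 East Street", taken from the second answer), and the names `pattern` and `needs_cleaning` are only illustrative. Because the pattern is anchored with `^`, it can match at most once per string, so `sub()` should give the same result as `gsub()` here.

```r
Address <- c("#20 W Irving ST", "@1 East Street", "%222 Rockfard Avenue",
             "-145 W Locust", "& 99 East Locus", "5 East Street")
DF <- data.frame(Address, stringsAsFactors = FALSE)

# Leading run of punctuation and/or whitespace
pattern <- "^[[:punct:][:space:]]+"

# Flag the rows that start with such a run before changing anything
needs_cleaning <- grepl(pattern, DF$Address)

# Strip the leading run and write the result back to the column
DF$Address <- sub(pattern, "", DF$Address)

sum(needs_cleaning)  # how many rows were modified
DF$Address
```

Rows that already start with a letter or digit are left unchanged, so the same call works whether or not every record has a bad prefix.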

score 1 · Answer 2 · edited May 23 '17 at 12:31

Will this do what you want? This assumes the first character of every Address is a special character. Note also that for this code to work, each data line in the text passed to read.csv must start at the left margin. There cannot be any whitespace before the Address value; otherwise the first character is a space, and that space is what gets removed.

my.data <- read.csv(text = '

        Address,        Number
#20 W Irving ST,         A-1
@1 East Street,          A-2
%222 Rockfard Avenue,    A-3
-145 W Locust,           A-4
& 99 East Locus,         A-5

', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')

my.data

my.data$Address <- substr(my.data$Address, 2, nchar(my.data$Address))
my.data
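
To make the assumption concrete, here is a small sketch of what the `substr` call does to an address that does not start with a special character. The vector `x` is made up for illustration and is not part of the original data.

```r
x <- c("#20 W Irving ST", "5 East Street")

# Drops the first character unconditionally, special or not
stripped <- substr(x, 2, nchar(x))
stripped
# [1] "20 W Irving ST" " East Street"
```

The house number 5 is lost in the second address, which is why this version is only safe when every record really does begin with an unwanted character.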

If the special characters can occur anywhere in Address and you want to remove all of them, you can try one of the functions presented here:

Replace multiple arguments with gsub

I used the function written by Theodore Lytras with this line:

mgsub(c('#','@','%','-','&'), c('','','','',''), my.data$Address)

Note that with both approaches the address "& 99 East Locus" becomes " 99 East Locus", which now begins with a space.
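
Since `mgsub` is not a base R function, a rough base-R stand-in is sketched below. It is an assumption that a single bracket expression is an acceptable substitute for that helper when all the replacements are empty strings. The leftover leading space can then be dropped with `trimws()` (available in base R since version 3.2.0). The vector `x` and the name `no_specials` are illustrative only.

```r
x <- c("#20 W Irving ST", "& 99 East Locus", "-145 W Locust")

# Remove the five characters wherever they occur;
# "-" is placed last in the bracket so it is treated literally
no_specials <- gsub("[#@%&-]", "", x)

# Drop the leading space left behind in " 99 East Locus"
cleaned <- trimws(no_specials, which = "left")
cleaned
# [1] "20 W Irving ST" "99 East Locus"  "145 W Locust"
```

Keep in mind that the first step also removes these characters from the middle of an address, so a hyphenated street name would lose its hyphen.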

If some of the addresses have a special character as their first character and some do not, this might work:

my.data <- read.csv(text = '

        Address,        Number
#20 W Irving ST,         A-1
@1 East Street,          A-2
222 W Locust,            A-4
%222 Rockfard Avenue,    A-3
-145 W Locust,           A-4
5 East Street,           A-2
& 99 East Locus,         A-5

', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')

first.char <- substr(my.data$Address, 1, 1)

my.data$Address <- ifelse(first.char %in% c('#','@','%','-','&'),
                          substr(my.data$Address, 2, nchar(my.data$Address)),
                          my.data$Address)
my.data
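
The conditional removal can also be written as one anchored regular expression. This is only a sketch of an alternative, not the approach used in this answer: it removes a single leading character if it is one of the same five, and leaves every other address alone. The vector is built directly rather than through read.csv, and the names `cleaned` and `via.ifelse` are illustrative.

```r
Address <- c("#20 W Irving ST", "@1 East Street", "222 W Locust",
             "%222 Rockfard Avenue", "-145 W Locust", "5 East Street",
             "& 99 East Locus")

# Remove one leading character, but only if it is #, @, %, & or -
cleaned <- sub("^[#@%&-]", "", Address)

# The ifelse() version, for comparison
first.char <- substr(Address, 1, 1)
via.ifelse <- ifelse(first.char %in% c('#','@','%','-','&'),
                     substr(Address, 2, nchar(Address)),
                     Address)

identical(cleaned, via.ifelse)
# [1] TRUE
```

As with the `ifelse` version, " 99 East Locus" keeps its leading space, because only the first character is removed.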

@Mark Miller I tried it, but the output is the same as the input; it is not getting rid of the special characters. — KGarg, May 27 '16 at 00:57
All of the examples I presented are working on my computer. I do not know what the problem is. Perhaps put your code in your post and I can take a look. — Mark Miller, May 27 '16 at 01:01
@Mark Miller, it worked! I made a mistake. Thank you so much! — KGarg, May 27 '16 at 01:09
