How to delete rows that contain special characters in R

Question

summary(housingdata$City)

Output:

         Amsterdam Amsterdam-Zuidoost             BerlÃn 
             14791                167                  1 
            Berlin     çˆ±ä¸\u0081å ¡  ì—\u0090ë“ ë²„ëŸ¬ 
             13641                  4                  1 
                NA             Others          Stockholm 
                 0               8231                692 
              NA's 
                46

I tried the following code, but it doesn't seem to work:

housingdata$City[housingdata$City == 'NA'] <- NA
housingdata$City[housingdata$City == 'çˆ±ä¸\u0081å'] <- NA
housingdata$City[housingdata$City == 'BerlÃn'] <- NA
housingdata$City[housingdata$City == 'ì—\u0090ë“ ë²„ëŸ¬'] <- NA

What does "doesn't seem to work" mean exactly? You seem to be just setting values to NA (missing) and not attempting to actually delete rows. Do you want the size of your data.frame to shrink? It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick, Apr 13 '20 at 19:10
Yes, I could either delete the rows containing those values or set them to NA. In the example above, I tried to set them to NA, but nothing changes. – Tony Flager, Apr 13 '20 at 20:23

akrun · Accepted Answer · 2020-04-13T20:28:01.567

We can use `grepl` to keep only the rows where `City` consists entirely of ASCII letters, underscores, spaces, and hyphens:

subset(housingdata, grepl('^[A-Za-z_ -]+$', City))
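
A minimal sketch of both outcomes asked about in the question (dropping the rows, or keeping them and setting `City` to `NA`). The toy `housingdata` and its `Price` column are made up for illustration, and `Berl\u00c3n` merely stands in for one of the garbled entries:

```r
# Hypothetical toy data standing in for the real housingdata
housingdata <- data.frame(
  City  = factor(c("Amsterdam", "Berl\u00c3n", "Berlin",
                   "Amsterdam-Zuidoost", "\u00e7\u02c6\u00b1", NA)),
  Price = c(100, 200, 300, 400, 500, 600)
)

# Same pattern as above: the whole value must be made of
# ASCII letters, underscores, spaces, or hyphens
pattern <- "^[A-Za-z_ -]+$"

# Option 1: drop the offending rows.
# grepl() returns FALSE for NA, so rows with a missing City are dropped too.
keep <- grepl(pattern, housingdata$City)
cleanhousingdata <- housingdata[keep, ]

# Option 2: keep every row, but blank out the offending City values
housingdata$City[!grepl(pattern, housingdata$City)] <- NA
```

The hyphen is placed last inside the brackets so that it is read as a literal character. Written as `_- `, it is parsed as a range from `_` to a space, which is what triggers the "Invalid character range" error.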

edited Apr 13 '20 at 20:28

answered Apr 13 '20 at 19:08

Since it's in a frame, I recommend `gsub` or `grepl` instead of `grep(..., value=TRUE)`. – r2evans Apr 13 '20 at 19:09
Sorry, I am a beginner in R. How do I actually use the solution above for my problem? I tried the following: subset(housingdata, grepl('^[BerlÃn]+$', City)) and that doesn't seem to produce the output that I want. I would expect to delete the rows with unreadable characters or change the values to NA. And if I just copy the code above, the following message appears: "Error in grepl("^[A-Za-z_- ]+$", City) : invalid regular expression '^[A-Za-z_- ]+$', reason 'Invalid character range'" @r2evans – Tony Flager Apr 13 '20 at 20:26
@TonyFlager you are not using the same pattern as in my post. Your pattern seems to be using `Ã` to check for those characters – akrun Apr 13 '20 at 20:27
@TonyFlager can you try the pattern in the updated post now? – akrun Apr 13 '20 at 20:28
Ã–stermalm: 0; Amsterdam: 14792; Amsterdam-Zuidoost: 167; Badhoevedorp: 1; BANDHAGEN: 1; BerlÃn: 0; Berlin: 13640; BjÃ¶rkhagen: 0; Bruntsfield, Edinburgh: 0; çˆ±ä¸\u0081å ¡: 0; Ð‘ÐµÑ€Ð»Ð¸Ð½: 0; Ð\u0090Ð¼Ñ\u0081Ñ‚ÐµÑ€Ð´Ð°Ð¼: 0 – Tony Flager Apr 14 '20 at 08:39
The code works and seems to filter the observations out, but when I code "summary(cleanhousingdata$City)", I can still see those 0 observations. How do I remove them so that they don't turn up anymore? – Tony Flager Apr 14 '20 at 08:40
@TonyFlager You can use `summary(droplevels(cleanhousingdata$City))` – akrun Apr 14 '20 at 17:54
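
A small sketch of why the zero counts linger and how `droplevels` clears them, assuming `City` is a factor as the `summary` output suggests. The toy data is hypothetical, with `Berl\u00c3n` standing in for a garbled entry:

```r
# Hypothetical toy data; City is a factor, as in the question
housingdata <- data.frame(
  City = factor(c("Amsterdam", "Berlin", "Berl\u00c3n", "Berlin"))
)
cleanhousingdata <- subset(housingdata, grepl("^[A-Za-z_ -]+$", City))

# Subsetting keeps every original factor level, so the garbled
# level is still listed, now with a count of 0
before <- summary(cleanhousingdata$City)

# droplevels() on the data frame discards unused levels
# in all of its factor columns
cleanhousingdata <- droplevels(cleanhousingdata)
after <- summary(cleanhousingdata$City)
```

Calling `droplevels` on the whole data frame, rather than only inside `summary`, keeps the unused levels from reappearing in later tables or plots.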
