
summary(housingdata$City)

Output:

     Amsterdam Amsterdam-Zuidoost             Berlín 
         14791                167                  1 
        Berlin     爱ä¸\u0081å ¡  ì—\u0090ë“ ë²„ëŸ¬ 
         13641                  4                  1 
            NA             Others          Stockholm 
             0               8231                692 
          NA's 
            46 

I tried the following code, but it doesn't seem to work:

housingdata$City[housingdata$City == 'NA'] <- NA
housingdata$City[housingdata$City == '爱ä¸\u0081å'] <- NA
housingdata$City[housingdata$City == 'BerlÃn'] <- NA
housingdata$City[housingdata$City == 'ì—\u0090ë“ ë²„ëŸ¬'] <- NA
Tony Flager
  • What does "doesn't seem to work" mean exactly? You seem to be just setting values to NA (missing) and not attempting to actually delete rows. Do you want the size of your data.frame to shrink? It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Apr 13 '20 at 19:10
  • Yes, I could either try to delete the rows containing those values or set them to NA. In the example above, I tried to set them to NA but nothing changes. – Tony Flager Apr 13 '20 at 20:23

1 Answer


We can use `grepl` to keep only the rows whose City contains nothing but letters, underscores, spaces, and hyphens:

subset(housingdata, grepl('^[A-Za-z_ -]+$', City))
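
As a minimal sketch of how that filter behaves, here is a toy example. The `housing` data frame, its `Price` column, and the city values are made up for illustration and are not the asker's data. It shows both options raised in the question: dropping the rows, or keeping them and setting the unreadable names to NA. Character ranges such as `A-Z` can be locale-dependent in R, so treat the pattern as an assumption that suits plain ASCII city names.

```r
# Toy stand-in for housingdata (hypothetical values, not the asker's data).
housing <- data.frame(
  City  = factor(c("Amsterdam", "Amsterdam-Zuidoost", "Berl\u00edn",
                   "Berlin", "\u7231\u4e01\u5821", NA, "Stockholm")),
  Price = c(100, 200, 300, 400, 500, 600, 700)
)

# The hyphen sits last in the class so it is read literally, not as a range.
pattern <- "^[A-Za-z_ -]+$"

# Option 1: drop every row whose City is not made of ASCII letters, "_", " " or "-".
# grepl() returns FALSE for NA, so rows with a missing City are dropped as well.
clean <- subset(housing, grepl(pattern, City))

# Option 2: keep all rows but set the non-matching names to NA.
masked <- housing
masked$City[!grepl(pattern, masked$City)] <- NA

print(as.character(clean$City))
print(sum(is.na(masked$City)))
```

Note that this pattern also rejects legitimately accented names (for example "Berlín" spelled correctly), so it is a blunt filter rather than a mojibake detector.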
akrun
  • Since it's in a frame, I recommend `gsub` or `grepl` instead of `grep(..., value=TRUE)`. – r2evans Apr 13 '20 at 19:09
  • Sorry, I am a beginner in R. How do I actually use the solution above for my problem? I tried the following: subset(housingdata, grepl('^[BerlÃn]+$', City)) and that doesn't seem to produce the output that I want. I would expect to delete the rows with unreadable characters or change the values to NA. And if I just copy the code above, the following message appears: "Error in grepl("^[A-Za-z_- ]+$", City) : invalid regular expression '^[A-Za-z_- ]+$', reason 'Invalid character range'" @r2evans – Tony Flager Apr 13 '20 at 20:26
  • @TonyFlager you are not using the same pattern as in my post. Your pattern seems to be using `Ã` to check for those characters – akrun Apr 13 '20 at 20:27
  • @TonyFlager can you try the pattern now in the updated post – akrun Apr 13 '20 at 20:28
  • Östermalm Amsterdam 0 14792 Amsterdam-Zuidoost Badhoevedorp 167 1 BANDHAGEN Berlín 1 0 Berlin Björkhagen 13640 0 Bruntsfield, Edinburgh 爱ä¸\u0081å ¡ 0 0 Берлин Ð\u0090мÑ\u0081тердам 0 0 – Tony Flager Apr 14 '20 at 08:39
  • The code works and seems to filter the observations out, but when I run "summary(cleanhousingdata$City)", I can still see those levels listed with 0 observations. How do I remove them so that they don't show up anymore? – Tony Flager Apr 14 '20 at 08:40
  • @TonyFlager You can use `summary(droplevels(cleanhousingdata$City))` – akrun Apr 14 '20 at 17:54
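
The zero counts discussed in the last two comments come from how factors work: subsetting a factor keeps all of its original levels, even those that no longer occur. A small sketch, using a hypothetical `city` factor in place of `housingdata$City`:

```r
# Hypothetical factor standing in for housingdata$City.
city <- factor(c("Amsterdam", "Berlin", "Berl\u00edn", "Stockholm"))

# Subsetting a factor keeps every original level, even unused ones,
# which is why summary() still lists them with a count of 0.
kept <- city[grepl("^[A-Za-z_ -]+$", city)]
print(summary(kept))

# droplevels() discards the levels that no longer occur in the data.
print(summary(droplevels(kept)))
```

`droplevels()` also accepts a whole data frame, so `cleanhousingdata <- droplevels(cleanhousingdata)` should remove the unused levels from every factor column at once.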