Here is one column of my data frame, `df$City`:
(I have other columns, but I'm just showing one column for simplicity.)

City        
Seattle     
San Diego   
Bern       
SEATTLE
SEATTLE
BERN 

I want to do a frequency count on the cities. I want both "Seattle" and "SEATTLE" to be considered the same - basically, I want the frequency table calculation to be case insensitive.

If I use table(df), it gives me "Seattle" and "SEATTLE" as two different items. I tried to overcome this by calling toupper(df) before table(df).

However, I get the error: invalid multibyte string.

I checked the encoding of my file and it seems to be UTF-8 - I could be wrong - is there a way for me to check the encoding?

Does anyone know how I can get a frequency table that is case insensitive? It doesn't have to be using my approach.

Thanks in advance!!

user4918087

1 Answer


Look into iconv() to convert the strings to valid UTF-8. After that, you will probably have to use toupper() or tolower() to standardize the case, and maybe stringr::str_trim() to take care of extra whitespace.
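As a rough sketch of how those steps could fit together (not necessarily the only way to do it): the data frame below is made up to mirror the column in the question, the `from = "latin1"` source encoding is purely an assumption that you would need to adjust for your file, and base R's trimws() stands in for stringr::str_trim() so the snippet has no package dependencies. Note that the functions are applied to the column `df$City`, not to the whole data frame.

```r
# Hypothetical data mirroring the City column from the question
# (note the trailing space in "BERN ")
df <- data.frame(
  City = c("Seattle", "San Diego", "Bern", "SEATTLE", "SEATTLE", "BERN "),
  stringsAsFactors = FALSE
)

# Assumption: the source bytes are Latin-1; change `from` to match your file
city <- iconv(df$City, from = "latin1", to = "UTF-8")

# Strip leading/trailing whitespace (base R stand-in for stringr::str_trim)
city <- trimws(city)

# Standardize case so "Seattle" and "SEATTLE" fall into the same group
city <- toupper(city)

# Case-insensitive frequency table
freq <- table(city)
print(freq)
```

With this sample data, the three spellings of Seattle end up in one group and the two spellings of Bern in another. If you would rather keep the counts in the data frame, you could assign the cleaned vector back with `df$City <- city` before calling table().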

cory
Worth mentioning [this post](http://stackoverflow.com/questions/4993837/r-invalid-multibyte-string), which goes into some reasons why the `invalid multibyte string` error could come up. – MichaelChirico Jun 02 '15 at 20:51
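On the question of checking the encoding, one possible approach in base R is to test the strings with validUTF8() and inspect their declared encoding with Encoding(). The sketch below uses a made-up vector whose second element contains a single Latin-1 byte (0xFC), built from raw bytes so the example does not depend on how this page itself is encoded; the Latin-1 guess passed to iconv() is an assumption, not something known about the original file.

```r
# Hypothetical input: "Seattle" plus a string with a lone Latin-1 byte (0xFC),
# which is not valid UTF-8 on its own
bad <- rawToChar(as.raw(c(0x5a, 0xfc, 0x72, 0x69, 0x63, 0x68)))
x <- c("Seattle", bad)

# TRUE for elements that are valid UTF-8, FALSE for the ones that are not
valid <- validUTF8(x)
print(valid)

# Declared encoding of each element ("unknown" means none is marked)
print(Encoding(x))

# Assumption: the offending bytes are Latin-1; convert them to UTF-8
fixed <- iconv(x, from = "latin1", to = "UTF-8")
print(validUTF8(fixed))
```

Any element flagged FALSE by validUTF8() is a likely candidate for triggering `invalid multibyte string` in functions such as toupper() when R is running in a UTF-8 locale.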