I'm doing data cleaning. I use mutate in dplyr a lot, since it builds new columns step by step and I can easily see how each step goes.

Here are two examples where I get this error:

Error: incompatible size (%d), expecting %d (the group size) or 1

Example 1: Get town name from zipcode. Data is simply like this:

    Zip
1 02345
2 02201

I noticed that it fails when the data contains NA.

Without NA it works:

library(dplyr)
library(zipcode)
data(zipcode)

test = data.frame(Zip=c('02345','02201'),stringsAsFactors=FALSE)

test %>%
  rowwise() %>%
  mutate( Town1 = zipcode[zipcode$zip==na.omit(Zip),'city'] )

resulting in

Source: local data frame [2 x 2]
Groups: <by row>

    Zip   Town1
1 02345 Manomet
2 02201  Boston

With NA it doesn't work:

library(dplyr)
library(zipcode)
data(zipcode)

test = data.frame(Zip=c('02345','02201',NA),stringsAsFactors=FALSE)

test %>%
  rowwise() %>%
  mutate( Town1 = zipcode[zipcode$zip==na.omit(Zip),'city'] )

resulting in

Error: incompatible size (%d), expecting %d (the group size) or 1
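
A likely cause, shown here without dplyr: na.omit() drops the NA, so for the third row the lookup returns a zero-length vector, while mutate() under rowwise() expects exactly one value per row. The sketch below is only an illustration. It assumes a tiny made-up `lookup` table in place of the zipcode data (so that it runs without that package) and uses match(), which returns NA for a missing key instead of dropping it.

```r
# Hypothetical stand-in for the zipcode data set (two rows only).
lookup <- data.frame(zip  = c('02345', '02201'),
                     city = c('Manomet', 'Boston'),
                     stringsAsFactors = FALSE)

test <- data.frame(Zip = c('02345', '02201', NA), stringsAsFactors = FALSE)

# na.omit() on a lone NA leaves nothing, so the subset has length 0.
bad <- lookup[lookup$zip == na.omit(test$Zip[3]), 'city']
length(bad)   # 0

# match() keeps one result per input row; an NA Zip gives an NA town.
test$Town1 <- lookup$city[match(test$Zip, lookup$zip)]
test
```

Because match() is vectorised, this version needs neither rowwise() nor na.omit().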

Example 2: I want to remove the redundant state name that occurs in the Town column in the following data.

         Town State
1   BOSTON MA    MA
2 NORTH AMAMS    MA
3  CHICAGO IL    IL

This is how I do it: (1) split the string in Town into words, e.g. 'BOSTON' and 'MA' for row 1; (2) check whether any of these words matches the State of that row; (3) delete the matched word.

library(dplyr)
test = data.frame(Town=c('BOSTON MA','NORTH AMAMS','CHICAGO IL'), State=c('MA','MA','IL'), stringsAsFactors=FALSE)

test %>%
  mutate(Town.word = strsplit(Town, split=' ')) %>%
  rowwise() %>% # rowwise ensures every calculation only considers the current row
  mutate(is.state = match(State,Town.word ) ) %>%
  mutate(Town1 = Town.word[-is.state])

This results in:

         Town State Town.word is.state   Town1
1   BOSTON MA    MA  <chr[2]>        2  BOSTON
2 NORTH AMAMS    MA  <chr[2]>       NA      NA
3  CHICAGO IL    IL  <chr[2]>        2 CHICAGO

Meaning: in row 1, for example, is.state==2 says that the 2nd word in Town is the state name. After removing that word, Town1 is the correct town name.

Now I want to fix the NA in row 2, but adding na.omit causes an error:

test %>%
  mutate(Town.word = strsplit(Town, split=' ')) %>%
  rowwise() %>% # rowwise ensures every calculation only considers the current row
  mutate(is.state = match(State,Town.word ) ) %>%
  mutate(Town1 = Town.word[-na.omit(is.state)]) 

results in:

Error: incompatible size (%d), expecting %d (the group size) or 1

I checked the data type and size:

test %>%
  mutate(Town.word = strsplit(Town, split=' ')) %>%
  rowwise() %>% # rowwise ensures every calculation only considers the current row
  mutate(is.state = match(State,Town.word ) ) %>%
  mutate(length(is.state) ) %>%       
  mutate(class(na.omit(is.state)))

results in:

         Town State Town.word is.state length(is.state) class(na.omit(is.state))
1   BOSTON MA    MA  <chr[2]>        2                1                  integer
2 NORTH AMAMS    MA  <chr[2]>       NA                1                  integer
3  CHICAGO IL    IL  <chr[2]>        2                1                  integer

So it is an integer of length 1 in every row. Can somebody tell me what's wrong? Thanks.

YJZ

1 Answer

Can you just sub it out?

test %>%
    rowwise() %>%
    mutate(Town=sub(sprintf('[, ]*%s$', State), '', Town))
## Source: local data frame [3 x 2]
## Groups: <by row>
##
##          Town State
## 1      BOSTON    MA
## 2 NORTH AMAMS    MA
## 3     CHICAGO    IL

(This way also catches commas after the town, if that happens.)

NB: if you use ungroup() here with a rowwise_df (as this is), it will wipe the tbl_df class as well and output a plain data.frame. That is fine for your data, but it will flood your screen if you aren't careful and are looking at large amounts of data (as I've done countless times). (GitHub references #936 and #553.)
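
On the original error: na.omit(is.state) is empty for row 2, and negative indexing with an empty vector selects nothing rather than everything, so Town1 receives zero values for that row. Below is a minimal base-R sketch of the word-based approach that guards against this. `drop_state` is a hypothetical helper name introduced for illustration, not part of dplyr.

```r
test <- data.frame(Town  = c('BOSTON MA', 'NORTH AMAMS', 'CHICAGO IL'),
                   State = c('MA', 'MA', 'IL'),
                   stringsAsFactors = FALSE)

# Reproduce the zero-length result for row 2.
words <- strsplit('NORTH AMAMS', split = ' ')[[1]]
idx   <- match('MA', words)        # NA: no word equals the state
length(words[-na.omit(idx)])       # 0, not 2

# Hypothetical helper: drop every word equal to the state, keep the rest.
drop_state <- function(town, state) {
  w <- strsplit(town, split = ' ')[[1]]
  paste(w[w != state], collapse = ' ')
}

# mapply() calls the helper once per row and always returns one string.
test$Town1 <- unname(mapply(drop_state, test$Town, test$State))
test$Town1
```

Comparing words with `!=` avoids index arithmetic entirely, so a row without a matching word simply keeps all of its words.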

r2evans
  • Thanks a lot @r2evans! Does [, ] mean an optional comma + space? Does [ ] mean something optional? – YJZ Jun 10 '15 at 06:11
  • The square brackets group the space and comma together in a class, saying "one of these (two) characters", though it can be more than two and include ranges (for example, `[A-Za-z0-9]` means "one upper- or lower-case letter or a digit"). Regex is an art, and it's often difficult to find a good cheatsheet/reference out there. [Wikibooks-R](http://en.wikibooks.org/wiki/R_Programming/Text_Processing#Regular_Expressions) is a good reference. – r2evans Jun 10 '15 at 06:16
  • The `*` after it makes anything immediately preceding it optional, such as the square-bracket class `[, ]`. It is read as "0 or more". Using a `+` instead makes it "1 or more". Both of them allow repeating characters or classes of characters. – r2evans Jun 10 '15 at 06:27
  • Thanks for the NB @r2evans, you're a real expert! I'm with you about the display difference between data.frame and tbl_df – YJZ Jun 10 '15 at 06:30
  • Actually @r2evans there will be an issue if '*' makes '[, ]' optional. Imagine a town called PUMA in MA. The MA in PUMA will be deleted. I guess a mandatory space in the regular expression can guarantee that only a separate word matching the state abbreviation will be deleted. – YJZ Jun 10 '15 at 06:35
  • Good point, so you can make it a `+`. Frankly, the comma addition didn't seem required by your data, but I wasn't sure about the rest of your data. If you are confident there will be no commas, regexes *always* do better when you simplify the pattern-matching. – r2evans Jun 10 '15 at 06:35
  • I see, + means at least one time. Thanks @r2evans. It really takes a while to understand regular expressions. I'm glad I'm progressing. Is there somewhere I can see lots of examples? Thanks! – YJZ Jun 10 '15 at 06:42
  • Thanks @r2evans I was actually controlling the separators. I have replaced all punctuation with spaces, and removed redundant spaces. – YJZ Jun 10 '15 at 06:46
  • For examples, [wikibooks on R-programming](http://en.wikibooks.org/wiki/R_Programming/Text_Processing#Regular_Expressions) does well. – r2evans Jun 10 '15 at 06:47
  • I see. Thanks @r2evans! Why does it say [:digit:] (one bracket) in some places and [[:digit:]] (double bracket) in others? BTW, I still think I should figure out how to fix my original question. Splitting the string into words and analyzing them should be more robust than gsub. Imagine Town = 'BOSTON MA BOSTON MA' and I want to get rid of 'MA BOSTON MA'. What do you think? – YJZ Jun 10 '15 at 06:58
  • I think you have two more questions :-) – r2evans Jun 10 '15 at 07:15
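
The PUMA case raised in the comments can be checked directly. This is a small sketch using base sub(), with the state hard-coded as MA and the `towns` vector made up for illustration:

```r
towns <- c('BOSTON MA', 'BOSTON, MA', 'PUMA', 'NORTH AMAMS')

# '*' allows zero separators, so a trailing "MA" inside a word is removed.
star <- sub('[, ]*MA$', '', towns)

# '+' requires at least one comma or space before the state.
plus <- sub('[, ]+MA$', '', towns)

star   # "BOSTON" "BOSTON" "PU"   "NORTH AMAMS"
plus   # "BOSTON" "BOSTON" "PUMA" "NORTH AMAMS"
```

With `+`, the state is removed only when it stands as a separate trailing word.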