5

I have a data frame of 59720 obs. that looks like below. I want to assign a MARKETNAME to each observation from a lookup table.

> data (a)

     DAY  HOUR LEAD Row.Count     DATE    ITIME  HOMEPHONE            CITY  STATE ZIPCODE     ZONENAME
1 Monday 13:00    1      9430 7/1/2013 13:42:51            FORT LAUDERDALE     FL  33315       68
2 Monday 13:00    1      9432 7/1/2013 13:43:50 xxxxx9802x  PLEASANT GROVE     AL  35127       82
3 Monday 13:00    1      9434 7/1/2013 13:46:18 5xxxx85x10      ORO VALLEY     AZ  85737       54
4 Monday  0:00    1      9435 7/1/2013  0:04:34 50xxxx1x364          SPOKANE    WA  99204      211
5 Monday 11:00    1      9436 7/1/2013 11:45:43 951xxxxx20        RIVERSIDE    CA  92507       31
6 Monday 11:00    1      9437 7/1/2013 11:46:26 760xxxxx679            VISTA    CA  92081      539

I have a lookup table of zip codes with 43126 unique zip codes that looks like this:

> data (b)

MARKETNAME            ZIPCODE
NEW YORK              00501
NEW YORK              00544
SPRINGFIELD-HOLYOKE   01001
SPRINGFIELD-HOLYOKE   01002
SPRINGFIELD-HOLYOKE   01003
SPRINGFIELD-HOLYOKE   01004

I wanted to simply assign the MARKETNAME to my dataset "a" comparing the ZIPCODE in "b". So I used

> c <- merge(a, b, by="ZIPCODE") .

It returned 58,972 obs. which meant I lost 748 obs. I did not want to lose any record from a so I changed my code as follows:

> c <- merge (a, b, by = "ZIPCODE" , all.x=TRUE) .

Strangely this returned 61,652 obs. instead of my expectation which was returning 59,720 obs. as per original a data frame with some NAs.

As per the documentation,

"if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output."

My interpretation of this is definitely wrong. Can someone please explain what I am doing wrong and how I can accomplish this simple task?

I referred : How to merge data frames and change element values based on certain conditions?, Subsetting and Merging from 2 Related Data Frames in r, how to merge two unequal size data frame in R but none of them are akin to my problem.

Community
  • 1
  • 1
vagabond
  • 3,526
  • 5
  • 43
  • 76
  • 2
    One possibility: you may have repeated keys in data set `b`. Check `nrow(b) == length(unique(b$ZIPCODE))`. If you have `nrow(b) > length(unique(b$ZIPCODE))`, then for each matching key pair, you will multiple rows added to your merged data set. – ialm Jul 31 '14 at 23:00
  • 1
    Your zipcodes in "data(a)" (whatever that might be since it's not R code) have leading and training spaces in them. – IRTFM Jul 31 '14 at 23:31
  • I'll edit the q with the dput but I did clean spaces before reading it in. – vagabond Jul 31 '14 at 23:46
  • 1
    You can also specify `strip.white = TRUE` when reading the data with `read.table`. – talat Aug 01 '14 at 06:42
  • Related: http://stackoverflow.com/questions/8433523/creating-a-new-variable-from-a-lookup-table – landroni Aug 01 '14 at 09:12
  • @ialm nope. Taken care of that. It's all unique. – vagabond Aug 01 '14 at 10:14
  • @ialm you were right ! I have repeated keys. Two Marketnames share the same ZIP CODE. Except my problem has become more complex now. I need to use other keys. Additionally my merged data frame two problems: (1) There are NA values in DMANAMES which were not matched. (2) There are duplicates in my merged data frame- two records with different DMA NAMES but they are actually the same observation. This post has been helpful. http://stackoverflow.com/questions/11369961/subset-with-unique-cases-based-on-multiple-columns – vagabond Aug 01 '14 at 16:55

1 Answers1

3

I prefer join from plyr which by default is a left-join returning all matches of records in the first data frame.

c <- join(a, b, by="ZIPCODE")

Ricky
  • 4,616
  • 6
  • 42
  • 72