
More than the programming, I am lost on the right approach for this problem. I have 2 data frames, each with a market name column. Unfortunately the names vary by a few characters/symbols between the two columns, e.g. Albany.Schenectady.Troy = ALBANY, Boston.Manchester = BOSTON.

I want to standardize the market names in both data frames so I can perform merge operations later.

I thought of tackling the problem in two steps: 1) Create a vector of the unique market names from both tables and use that to create a lookup table. Something that looks like:

Table 1 Markets > "Albany.Schenectady.Troy" , "Albuquerque.Santa.Fe", "Atlanta" . . . .

Table 2 Markets > "SPOKANE" , "BOSTON" . . .

I tried `marketnamesvector <- paste(unique(Table1$Market, sep = "", collapse = ","))` but that doesn't produce the desired output.
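For reference, a rough sketch of one way to build that combined vector (this assumes the second table is named Table2 and its market column is also called Market; `as.character` guards against factor columns):

# Combine the unique market names from both tables into a single character vector
marketnamesvector <- unique(c(as.character(Table1$Market),
                              as.character(Table2$Market)))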

2) Change Market names in Table 2 to equivalent market names in Table 1. For any market name not available in Table 1, Table 2 should retain the same value in market name.

I know I could use a looping function like the one below, but I think I still need a lookup table.

# Replace each value of oldvalue found in data with the matching newvalue; other values are kept
replacefunc <- function(data, oldvalue, newvalue) {
  newdata <- data
  for (i in unique(oldvalue)) newdata[data == i] <- newvalue[oldvalue == i]
  newdata
}
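For illustration, a possible call with hypothetical lookup vectors pairing Table 2 spellings with their Table 1 equivalents (this assumes Market is a character column rather than a factor):

# Hypothetical lookup vectors: Table 2 spellings and the Table 1 names to use instead
oldnames <- c("ALBANY", "BOSTON")
newnames <- c("Albany.Schenectady.Troy", "Boston.Manchester")

# Markets not listed in oldnames keep their original value
Table2$Market <- replacefunc(Table2$Market, oldnames, newnames)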

Table 1: This table is 90 rows x 2 columns and has 90 unique market names.

                   Market    Leads   Investment         Leads1     Leads2   Leads3
1 Albany.Schenectady.Troy      NA           NA            NA        NA       NA
2    Albuquerque.Santa.Fe      NA           NA            NA        NA       NA
3                 Atlanta      NA           NA            NA        NA       NA
4                  Austin      NA           NA            NA        NA       NA
5               Baltimore      NA           NA            NA        NA       NA

Table 2: This table is 150K rows x 20 columns and has 89 unique market names.

> df

    Spot.ID    Date Hour     Time Local.Date Broadcast.Week Local.Hour Local.Time  Market
2   13072765 6/30/14    0 12:40 AM 2014-06-29              1         21    9:40 PM SPOKANE
261 13072946 6/30/14    5  5:49 AM 2014-06-30              1          5    5:49 AM  BOSTON
356 13081398 6/30/14   10 10:52 AM 2014-06-30              1          7    7:52 AM SPOKANE
389 13082306 6/30/14   11 11:25 AM 2014-06-30              1          8    8:25 AM SPOKANE
438 13082121 6/30/14    8  8:58 AM 2014-06-30              1          8    8:58 AM  BOSTON
469 13081040 6/30/14    9  9:17 AM 2014-06-30              1          9    9:17 AM  ALBANY
482 13080104 6/30/14   12 12:25 PM 2014-06-30              1          9    9:25 AM SPOKANE
501 13082120 6/30/14    9  9:36 AM 2014-06-30              1          9    9:36 AM  BOSTON
617 13080490 6/30/14   13  1:23 PM 2014-06-30              1         10   10:23 AM SPOKANE
vagabond
  • You could use the [regexpr](http://stackoverflow.com/questions/3539826/whats-the-r-equivalent-of-sqls-like-description-statement/3540002#3540002) function in a loop, to compute the similarity between each market name from Table 1 and all names from Table 2. Then, based on the similarity score, you could then replace the names from Table 1 by the closest equivalent from Table 2. It is not efficient, but it is straightforward to code; do you really need efficiency in this problem? – Jealie Oct 21 '14 at 20:50
  • Would it be possible for you to share all names from both files? From what I see from the present question is that you may want to do something like `gsub("\\..*$", "", x)` to take Albany and Albuquerque. If you can provide all names, I think you will receive more support. – jazzurro Oct 22 '14 at 00:02
  • Just as an idea, strsplit the Table 1 market names on period, then tolower() them and the Market names in Table 2. Then after calling setdiff() on the two columns you could rename the few Table 2s (which I assume are factors) quite easily. Having made the names the same, you can merge. – lawyeR Oct 22 '14 at 04:47
  • I actually created a lookup `data.frame` and performed a `merge` operation. I had to manually scan through the 90 names to match them accurately. I haven't tried the ideas mentioned in comments above yet but thank you for all possible suggestions. I will post the results. – vagabond Oct 22 '14 at 19:02
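For what it's worth, a rough sketch combining the suggestions from the comments above (strip everything after the first "." in the Table 1 names, lower-case both sides, and build a lookup table from the matches; the object and column names are assumptions):

t1 <- unique(as.character(Table1$Market))   # e.g. "Albany.Schenectady.Troy"
t2 <- unique(as.character(df$Market))       # e.g. "ALBANY"

# Use the first word of each Table 1 name, lower-cased, as a matching key
key1 <- tolower(gsub("\\..*$", "", t1))
key2 <- tolower(t2)

# For each Table 2 name, the Table 1 name whose first word matches (NA if none)
lookup <- data.frame(Table2Market = t2,
                     Table1Market = t1[match(key2, key1)],
                     stringsAsFactors = FALSE)

Any NA rows have no automatic match and would still need to be filled in by hand.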

1 Answer


Assume that the data is in data frames df1 and df2. The goal is to make the market names the same in both data frames; currently they differ slightly.

First, list the markets: use the following commands to get the sorted unique names in df1 and df2, then compare them with setdiff.

mk1 <- sort(unique(df1$market))   # unique market names in df1
mk2 <- sort(unique(df2$market))   # unique market names in df2
dmk12 <- setdiff(mk1, mk2)        # names that appear in df1 but not in df2
dmk21 <- setdiff(mk2, mk1)        # names that appear in df2 but not in df1

Use dmk12 and dmk21 to identify the markets whose names differ. Decide which names to use and how they match up; say we keep "Atlanta, GA" from df1 and rename "Atlanta" in df2 to match. Then use

df2[df2$market == "Atlanta", "market"] <- "Atlanta, GA"

The format is

df_to_change[df_to_change[, "column"] == "old data", "column"] <- "new data"

If you only have 90 names to correct, I would write out 90 change lines like the one above.

After adjusting all the names, recompute mk1 and mk2 with sort(unique(...)) and run the two setdiff calls again to confirm that all the names now match.
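Alternatively, as the asker notes in the comments, the corrections can be kept in a small lookup data frame and applied with a single merge; a minimal sketch, assuming character (not factor) market columns and hypothetical lookup column names:

# One row per df2 spelling, plus the standardized (df1) spelling to substitute
lookup <- data.frame(market     = c("Atlanta", "ALBANY"),
                     market_std = c("Atlanta, GA", "Albany.Schenectady.Troy"),
                     stringsAsFactors = FALSE)

df2 <- merge(df2, lookup, by = "market", all.x = TRUE)
# Keep the original name wherever there is no standardized equivalent
df2$market <- ifelse(is.na(df2$market_std), df2$market, df2$market_std)
df2$market_std <- NULL

Note that merge does not preserve the original row order, so re-sort afterwards if the order matters.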

  • I actually created a lookup `data.frame` and performed a `merge` operation. I had to manually scan through the 90 names to match them accurately. I haven't tried the ideas mentioned in comments above yet. – vagabond Oct 22 '14 at 19:01