
I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:

names(data.jan)

ID     AGE      CITY      GENDER

names(data.feb)

ID     AGE      CITY      GENDER

To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November.

Considering a customer X, there are three possible scenarios:

1. Customer X is listed in the January database, but he left and is not listed in February.
2. Customer X is listed in both the January and February databases.
3. Customer X entered the database in February, so he is not listed in January.

I am stuck on the following problem: I need to create a single database with all customers from both data frames and their respective information. However, for a customer that is listed in both data frames, I want to pick his information from his first entry, that is, January.

When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html

[image: table of merge options from the linked page]

data <- merge(data.jan,data.feb, by="ID", all=TRUE)

Regardless of which of all, all.x or all.y I choose, I get the same undesired output, called data:

data[1,]

ID     AGE.x      CITY.x      GENDER.x       AGE.y      CITY.y      GENDER.y
123      25         NY           M            25          NY            M

I think that what would work here is to merge both databases with this type of join:

[image: diagram of the desired join]

Then, merge the resulting data frame with data.jan using a full outer join. But I don't know how to code this in R.
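
For reference, this two-step plan can be sketched in base R without merge() at all; a minimal sketch, assuming the data.jan and data.feb data frames described above:

# customers that appear only in February (an anti-join on ID)
feb.only <- data.feb[!(data.feb$ID %in% data.jan$ID), ]

# stack them under the full January table: customers present in both
# months keep their January information, February-only rows are appended
data <- rbind(data.jan, feb.only)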

Thanks,

Bernardo

  • From manual for `merge`: "By default the data frames are merged on the columns with names they both have..." So you might want to drop the `by=` option, so it will merge on all common columns. – zx8754 Dec 04 '13 at 11:48
  • Are your columns always the same across all files, i.e. `ID AGE CITY GENDER`? – zx8754 Dec 04 '13 at 11:50
  • IMHO, merging all your data seems like a bad approach. You should instead concatenate all your files into one long data.frame, with many records per customer but just four variables: `dat <- do.call(rbind, all_your_files)`. You will still be able to do per-customer analysis using things like `aggregate`, `plyr`, or `data.table`. – flodel Dec 04 '13 at 12:01 (a sketch of this approach follows the comments)
  • Actually, I was trying to avoid building a large database, since each .csv contains more than 500,000 rows. Multiplying that by eleven months gives me a total of 5,500,000 rows with 13 columns (I tried to simplify the data in my question, but these are the real numbers). Probably my poor PC won't be able to run this in a reasonable time. – Bernardo Dec 04 '13 at 13:04
  • A few million rows shouldn't be problematic with data like this on even very modest computers. Since it's not hard to try, give it a shot and report back before deciding that's not the right route. – JBecker Dec 04 '13 at 13:46
  • Working on that! haha Binding the databases hasn't been a problem so far; I am concerned about aggregating them. Let's see how that goes and, if it doesn't work, I'll check the other proposed solutions. – Bernardo Dec 04 '13 at 13:56
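
A minimal sketch of flodel's stack-then-deduplicate idea, assuming the monthly files are read in chronological order (the file names here are hypothetical):

# read the monthly files in order, january first (hypothetical names)
files <- c("jan.csv", "feb.csv")  # ... through "nov.csv"
tables <- lapply(files, read.csv)

# stack all months into one long data frame
all.months <- do.call(rbind, tables)

# keep only each customer's first entry, so January wins over February
first.entry <- all.months[!duplicated(all.months$ID), ]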

3 Answers

d1 <- data.frame(x=1:9, y=1:9, z=1:9)
d2 <- data.frame(x=1:10, y=11:20, z=21:30) # example data
d3 <- merge(d1, d2, by="x", all=TRUE)      # full outer join on x


# keep the original columns from January (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from February (i.e. y.y, z.y)
d3[is.na(d3[,2]) ,][,2:3] <- d3[is.na(d3[,2]) ,][, 4:5]
#>  d3[, 1:3]
#    x y.x z.x
#1   1   1   1
#2   2   2   2
#3   3   3   3
#4   4   4   4
#5   5   5   5
#6   6   6   6
#7   7   7   7
#8   8   8   8
#9   9   9   9
#10 10  20  30

This may be tiresome for more than 2 months though, so perhaps you should consider @flodel's comments. Also note there are pitfalls when your original January data has NAs (and you still want the first month's data, NA or not, retained), although you never mentioned those in your question.
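
For what it's worth, the same fill-from-the-later-month step can be folded over many months with Reduce; a sketch in base R, assuming every month shares the columns x, y, z used above:

# coalesce two months: keep the older month's rows,
# fill the gaps from the newer month
first.wins <- function(older, newer) {
  m <- merge(older, newer, by = "x", all = TRUE)
  gaps <- is.na(m$y.x)                  # rows absent from the older month
  m[gaps, 2:3] <- m[gaps, 4:5]          # copy the newer y, z into them
  setNames(m[, 1:3], c("x", "y", "z"))  # drop the .y columns
}

# like the answer above, this treats an NA in y as "absent"
Reduce(first.wins, list(d1, d2))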

user1317221_G
  • Thanks for your answer! I will see if my PC can run flodel's suggestion. Regarding the NA cases, I haven't mentioned them because there are no such cases in any of my databases, but I will edit my question to clarify that! – Bernardo Dec 04 '13 at 13:12

Try:

data <- merge(data.jan,data.frame(ID=data.feb$ID), by="ID")

I haven't tested it since there's no sample data, but if you join on just the ID column from February, it should keep only the customers that appear in both frames, with their information taken from January.
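
A quick check with toy data (hypothetical values) illustrates the behaviour: only the IDs present in both months survive, and the columns come from January.

data.jan <- data.frame(ID=1:3, AGE=c(25, 30, 35), CITY=c("NY", "LA", "BOS"))
data.feb <- data.frame(ID=2:4, AGE=c(31, 36, 40), CITY=c("LA", "BOS", "CHI"))

merge(data.jan, data.frame(ID=data.feb$ID), by="ID")
#   ID AGE CITY
# 1  2  30   LA
# 2  3  35  BOS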

Troy

@user1317221_G's solution is excellent. If your tables are large (lots of customers), data.table might be faster:

library(data.table)
#  some sample data
jan <- data.table(id=1:10,  age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)

# key both tables on id for fast joins
setkey(jan, id)
setkey(feb, id)

# full outer join; January columns get the .x suffix
join <- data.table(merge(jan, feb, by="id", all=TRUE))
# where a customer is missing from January, fill in the February columns
join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with=F]]

Edit: This adds processing for multiple months.

f <- function(x,y) {
  setkey(x,id)
  setkey(y,id)
  join <- data.table(merge(x, y, by="id", all=TRUE))
  join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
  join[,names(join)[5:7]:=NULL]                # get rid of extra columns
  setnames(join,2:4,c("age","city","gender"))  # rename columns that remain
  return(join)
}

Reduce("f",list(jan,feb,mar))

Reduce(...) applies the function f(...) to the elements of the list in turn: first to jan and feb, then to that result and mar, and so on.
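
To run this over the real monthly files, the tables could be read with data.table's fread and folded in the same way; a sketch with hypothetical file names (note that f above assumes the columns are named id, age, city, gender):

# hypothetical file names for january..november, in order
files <- c("jan.csv", "feb.csv", "mar.csv")  # ... through "nov.csv"

# fread returns data.tables directly
months <- lapply(files, fread)

# fold the months in order; earlier months win
result <- Reduce(f, months)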

jlhoward