
I have some data that looks like this:

Date        PX_SETTLE   CHG_NET_1D  Date.1      PX_SETTLE.1 CHG_NET_1D.1      Date.3
12/7/1988   91.07       0.03        12/7/1988   90.93       -0.02         12/7/1988
12/8/1988   91.09       0.02        12/8/1988   N/A         -0.02         12/8/1988
12/9/1988   91.1        0.01        12/9/1988   90.88        N/A          12/9/1988
12/12/1988  91.1        0           12/12/1988  90.86       -0.02         12/12/1988
12/15/1988  91.12       0.02        12/15/1988  90.76       -0.1          12/15/1988

And this pattern goes on for several more columns.

As the data progresses, the dates start to come out of alignment (this is the raw data). What I want to do is: merge all the data by Date (Date, Date.1, Date.3, ..., Date.N) and also delete rows containing at least one N/A value.

I pulled this data from Bloomberg, so each row doesn't share the same date. Eventually you'll see a row like

12/7/1988   91.07       0.03        12/8/1988   90.93       -0.02         12/6/1988

In the past, I have just split the data frame into its constituent parts (individual commodities, each with a date, price, and net change), then merged them one by one using `merge`. I'd like to do this in one go, going forward.
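For that pairwise approach, here is a minimal sketch of doing it in one go by folding `merge` over a list of the constituent pieces with `Reduce` (the frames `df1`/`df2` and their values are made up for illustration):

```r
# Stand-ins for the split-out constituent commodities.
df1 <- data.frame(Date = c("12/7/1988", "12/8/1988"),
                  PX_SETTLE = c(91.07, 91.09),
                  CHG_NET_1D = c(0.03, 0.02))
df2 <- data.frame(Date = c("12/7/1988", "12/9/1988"),
                  PX_SETTLE = c(90.93, 90.88),
                  CHG_NET_1D = c(-0.02, NA))

# Outer join every piece on Date in a single call; merge() adds .x/.y
# suffixes to the clashing measure columns automatically.
merged <- Reduce(function(x, y) merge(x, y, by = "Date", all = TRUE),
                 list(df1, df2))
```

With `all = TRUE` every date that appears in any piece survives, with NAs filling the gaps.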

EDIT: OK, I edited the original date columns in Excel to be in a standard, unambiguous format. Further, I ran the following:

for (i in seq_along(DF)) {
  if (!grepl("Date", colnames(DF)[i])) {
    DF[, i] <- as.numeric(as.character(DF[, i]))
  } else {
    DF[, i] <- as.Date(as.character(DF[, i]))
  }
}

Now my dates are of class 'Date' and my numbers are of class 'numeric'. I am looking into using data.table.
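If you do go the `data.table` route, the keyed outer join looks like this (a sketch with made-up values, assuming the `data.table` package is installed):

```r
library(data.table)

# Two stand-in tables with the same measure column and overlapping dates.
dt1 <- data.table(Date = as.Date(c("1988-12-07", "1988-12-08")),
                  PX_SETTLE = c(91.07, 91.09))
dt2 <- data.table(Date = as.Date(c("1988-12-07", "1988-12-09")),
                  PX_SETTLE = c(90.93, 90.88))
setkey(dt1, Date)
setkey(dt2, Date)

# With both tables keyed on Date, merge() joins on the key by default;
# all = TRUE keeps dates that appear in only one table.
out <- merge(dt1, dt2, all = TRUE)
```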

Dylan_Larkin
  • It's not quite clear what the problem is. Is this the raw CSV, or the result of reading it with `read.csv`? If it's the result of reading, is that part correct, or does it need to be fixed before you proceed? Can you post what you're expecting to see, and some code? – shoover Jul 23 '15 at 17:43
  • editing ... please hold – Dylan_Larkin Jul 23 '15 at 17:49
  • Normally I'd just have our data scientist do this for me, but he is gone now. I wear many different hats, so I'm not able to dedicate as much time to this as I'd like. I'm pretty resourceful (I have a Master's in CS) and, normally, I'd just scour the internet until I found a hack. So, apologies if this seems rudimentary, as I am trying to do 20 things at once. – Dylan_Larkin Jul 23 '15 at 17:57
  • Alternatively, you could fix the CSV before importing it? If the format is exactly predictable (every row is A1,B1,C1,A2,B2,C2,A3,B3,C3), you could use your scripting/regex tool of choice to reshape into (A1,B1,C1\nA2,B2,C2\nA3,B3,C3). – Jeremy Needle Jul 23 '15 at 18:12
  • Examples of your current "one by one" solution would help us understand the problem. – ARobertson Jul 23 '15 at 19:00
  • I figured it out, thanks. – Dylan_Larkin Jul 23 '15 at 19:22
  • Am I right to think that each block of columns represents a different group or id (e.g., stock or commodity)? – ulfelder Jul 24 '15 at 11:34

1 Answer

You said in the comments that you'd figured this out, but this was a useful practice problem for me, so I thought I would go ahead and post an answer in case it's useful to others, and, hopefully, to find out if there are more efficient ways to do this.

I did two versions: one that assumes the sets of columns are different observations of the same thing, and another that assumes the sets of columns represent different groups that you're observing with the same measures.

Here's the data I used; the last row includes different dates to make sure the process is resilient to that situation:

Date    PX_SETTLE   CHG_NET_1D  Date.1  PX_SETTLE.1 CHG_NET_1D.1
12/7/1988   91.07   0.03    12/7/1988   90.93   -0.02
12/8/1988   91.09   0.02    12/8/1988   NA  -0.02
12/9/1988   91.1    0.01    12/9/1988   90.88   NA
12/12/1988  91.1    0   12/12/1988  90.86   -0.02
12/15/1988  91.12   0.02    12/15/1988  90.76   -0.1
12/16/1988  91.2    0.08    12/20/1988  90.96   0.20

Multiple measures of single entity

This version assumes that the sets of columns are different observations of the same thing (e.g., prices of a particular commodity on particular date across multiple exchanges), so in the end you want just one row per date with columns indexed by sets of observations (e.g., sources).

# If necessary, add an index value to first block of columns in df; see http://stackoverflow.com/questions/8898501/grepl-search-within-a-string-that-does-not-contain-a-pattern
names(df) <- c(paste0(names(df)[!grepl("\\.", names(df))], ".0"), names(df)[grepl("\\.", names(df))])
# Make vector that will serve as index to columns in df
indx <- gsub(".*\\.", "", names(df))
# create separate dfs in workspace based on that index; see http://stackoverflow.com/questions/27501615/splitting-a-dataframe-by-column-name-indices
list2env(
    setNames(
        lapply(split(colnames(df), indx), function(x) df[x]),
        paste('df', sort(unique(indx)), sep="_")),
    envir=.GlobalEnv)
# Create function to scrub each df by simplifying Date name and setting is as key
fixit <- function(dataframe) {
    require(data.table)
    colnames(dataframe) <- sub("Date.*", "Date", colnames(dataframe))
    dataframe$Date <- as.Date(dataframe$Date, format="%m/%d/%Y")
    DT <- setkey(data.table(dataframe), Date)
    return(DT)
}
# Make a list of data frames in the workspace with names that fit the pattern created a couple of steps above
dflist <- Filter(function(x) is(x, "data.frame"), mget(ls(pattern = "_[0-9]")))
# Apply the scrubbing function to the items in that list, returning scrubbed list
newdflist <- lapply(dflist, fixit)
# Merge the elements of that new list, keeping all values and keying on Date
newdf <- Reduce(function(...) merge(..., all=TRUE), newdflist)

Here's what the result of that process looks like. The suffixes on the non-date variables denote the different sets of observations:

> newdf
         Date PX_SETTLE.0 CHG_NET_1D.0 PX_SETTLE.1 CHG_NET_1D.1
1: 1988-12-07       91.07         0.03       90.93        -0.02
2: 1988-12-08       91.09         0.02          NA        -0.02
3: 1988-12-09       91.10         0.01       90.88           NA
4: 1988-12-12       91.10         0.00       90.86        -0.02
5: 1988-12-15       91.12         0.02       90.76        -0.10
6: 1988-12-16       91.20         0.08          NA           NA
7: 1988-12-20          NA           NA       90.96         0.20
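The question also asked to delete rows containing at least one N/A; on the merged result that's a one-liner with `complete.cases` (sketched here on a small stand-in for `newdf`):

```r
# Stand-in for the merged table above; the second row contains an NA.
merged <- data.frame(Date = as.Date(c("1988-12-07", "1988-12-08", "1988-12-09")),
                     PX_SETTLE.0 = c(91.07, 91.09, 91.10),
                     PX_SETTLE.1 = c(90.93, NA, 90.88))

# Keep only rows with no missing value in any column.
complete <- merged[complete.cases(merged), ]
```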

Common measures of multiple entities

Now let's assume that each block of columns represents a distinct entity, like a ticker symbol, measured the same way on overlapping sets of dates. Here we'll repeat most of the process above, except that now the scrubbing function won't bother with keying the list elements, because we want to allow them to repeat (across groups). Instead, the function will add an indexing variable and simplify all the variable names across list elements to a common format to facilitate the merging that comes at the end.

names(df) <- c(paste0(names(df)[!grepl("\\.", names(df))], ".0"), names(df)[grepl("\\.", names(df))])
indx <- gsub(".*\\.", "", names(df))
list2env(
    setNames(
        lapply(split(colnames(df), indx), function(x) df[x]),
        paste('df', sort(unique(indx)), sep="_")),
    envir=.GlobalEnv)
dflist <- Filter(function(x) is(x, "data.frame"), mget(ls(pattern = "_[0-9]")))
fixit2 <- function(dataframe) {
    z <- dataframe
    z[,"index"] <- unlist(strsplit(colnames(z)[1], "\\."))[2]
    colnames(z) <- sub("\\.[0-9]", "", colnames(z))
    return(z)
}
newdflist2 <- lapply(dflist, fixit2)
newdf2 <- Reduce(function(...) merge(..., all=TRUE), newdflist2)
newdf2 <- newdf2[order(newdf2$index, newdf2$Date),]

And here's what the output from that looks like:

> newdf2
         Date PX_SETTLE CHG_NET_1D index
2  1988-12-07     91.07       0.03     0
3  1988-12-08     91.09       0.02     0
6  1988-12-09     91.10       0.01     0
8  1988-12-12     91.10       0.00     0
10 1988-12-15     91.12       0.02     0
11 1988-12-16     91.20       0.08     0
1  1988-12-07     90.93      -0.02     1
4  1988-12-08        NA      -0.02     1
5  1988-12-09     90.88         NA     1
7  1988-12-12     90.86      -0.02     1
9  1988-12-15     90.76      -0.10     1
12 1988-12-20     90.96       0.20     1

Both approaches could be simplified with piping or nesting some of the later steps.
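For example, the first approach can be collapsed into a single expression that never writes the intermediate frames to the workspace (a sketch under the same column-naming assumptions as above, with made-up data):

```r
# Small stand-in for the wide Bloomberg pull.
df <- data.frame(Date = c("12/7/1988", "12/8/1988"),
                 PX_SETTLE = c(91.07, 91.09),
                 CHG_NET_1D = c(0.03, 0.02),
                 Date.1 = c("12/7/1988", "12/9/1988"),
                 PX_SETTLE.1 = c(90.93, 90.88),
                 CHG_NET_1D.1 = c(-0.02, NA),
                 stringsAsFactors = FALSE)

# Index each column by its numeric suffix; the unsuffixed block gets "0".
indx <- ifelse(grepl("\\.", names(df)), sub(".*\\.", "", names(df)), "0")

# Split into blocks, normalize names and Date class, then outer-join on Date.
pieces <- lapply(split(names(df), indx), function(cols) {
  z <- df[cols]
  names(z) <- sub("\\..*", "", names(z))
  z$Date <- as.Date(z$Date, format = "%m/%d/%Y")
  z
})
wide <- Reduce(function(x, y) merge(x, y, by = "Date", all = TRUE), pieces)
```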

ulfelder
  • It's the second case. – Dylan_Larkin Jul 24 '15 at 14:09
  • Actually, it's the first case, because I still want each column of SETTLE and NET CHANGE to stay independent of the others. All I really wanted to do was collapse the dates into one and align them. So your first solution would work better. – Dylan_Larkin Jul 24 '15 at 14:11
  • I'm getting "Error in as.Date.default(dataframe$Date, format = "%m/%d/%Y") : do not know how to convert 'dataframe$Date' to class “Date” " when trying to execute second-to-last line of code. Perhaps because I converted to YYYY-MM-DD in Excel before reading from CSV. – Dylan_Larkin Jul 24 '15 at 15:44
  • Or more likely because newdflist now contains every data frame that was in my original environment? – Dylan_Larkin Jul 24 '15 at 16:11
  • Yeah, I'm starting with a data frame in which the Date columns are formatted as string, not dates. If your initial data frame has them as dates, then you should be able to delete the line in the function that tries to convert them and then get it to work. – ulfelder Jul 24 '15 at 16:12
  • newdflist should only capture ones with "_[0-9]" in them. – ulfelder Jul 24 '15 at 16:13
  • Haha, ya unfortunately that's quite a few. – Dylan_Larkin Jul 24 '15 at 16:17