Possible solution for data cleaning in R

Question

I have a dataset which contains two date and time columns. I would like to line up the date and time columns in the same row and delete everything to the right of the second date and time column if they don't match. A sample of my dataset problem looks like the following.

Note that Time.1!=Time. A cleaned final version of the dataset should look like this:

I would like to remove everything to the right of the column Date.1 until both Date.1=Date and Time.1=Time.

I have wrote the following R code to do this:

rm(list = ls())
data<-read.csv(file="data.csv",header=T,stringsAsFactors=FALSE)
datacol<-ncol(data)
datarow<-nrow(data)

for (i in 1:datarow){
  for (j in 1:(datarow-1)){
    while(data$Date[i]!=data$Date.1[j] | data$Time[i]!=data$Time.1[j]){data[j,6:datacol]<-data[j+1,6:datacol]; 
    if (j<datarow) next; if (j == datarow) break; print(data[i,]);}
  }
}

However R just keeps running without stop even though I specified break when the length of the row is met. I would appreciate any advice on how to solve this issue.

Thank you.

Please read (1) [how do I ask a good question](http://stackoverflow.com/help/how-to-ask), (2) [How to create a MCVE](http://stackoverflow.com/help/mcve) as well as (3) [how to provide a minimal reproducible example in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#answer-5963610). Then edit and improve your question accordingly. I.e., provide a minimal input data set for use in R (e.g. the result of `dput(mydata)`, no screenshots) and the expected output. — lukeA, May 08 '16 at 11:07
Without a sample data set, it is difficult to help. You should note in your code that both of your `for` loops run through "datarow" and poor "datacol" is left out in the cold. — lmo, May 08 '16 at 11:19
Also notice that j maxes out at datarow-1, so that your break condition (`j == datarow`) is never realized. — lmo, May 08 '16 at 17:01

score 1 · Accepted Answer · answered May 08 '16 at 17:16

Consider subsetting by columns and then merging your dataset without need of nested for loops. Below reproduces data with filled in values for demonstration:

df <- read.table(text="Date     Time    Price   MA        X     Date.1      Time.1    Open      High    Low     Close   BMA     Vol
                       5/2/2016 5:00    9810+   98073.07  NA    5/2/2016    5:00      130035    130040  130020  130035  130020  7014
                       5/2/2016 0:30    98080   98071     NA    5/2/2016    4:30      130030    130035  130020  130035  130017  6631
                       5/2/2016 0:00    9809+   98070.05  NA    5/2/2016    4:00      130030    130035  130020  130030  130016  2465
                       5/1/2016 23:30   9809+   98066     NA    5/2/2016    3:30      130025    130035  130020  130030  130015  2918
                       5/1/2016 23:00   9809+   98063.03  NA    5/2/2016    3:00      130030    130030  130020  130020  130012  2289
                       5/1/2016 22:30   98080   98061     NA    5/2/2016    2:30      130035    130035  130030  130030  130010  4699
                       5/3/2016 12:30   7777    7777      NA    5/2/2016    0:30      130030    130035  130010  130015  130005  4424
                       5/3/2016 12:45   8888    8888      NA    5/2/2016    0:00      130030    130035  130025  130030  130005  1906
                       5/4/2016 13:30   78787   78787     NA    5/1/2016    23:30     130020    130035  130015  130030  130002  5429
                       5/4/2016 13:45   87878   87878     NA    5/1/2016    23:00     130015    130020  130015  130020  130001  540
                       5/4/2016 13:50   77888   88777     NA    5/1/2016    22:30     130010    130020  130010  130015  130000  1417
                 ", header = TRUE)

# SUBSET INTO TWO SEPARATE DFs
df1 <- df[c('Date', 'Time', 'Price', 'MA', 'X')]
df2 <- df[c('Date.1', 'Time.1', 'Open', 'High', 'Low', 'Close', 'BMA', 'Vol')]

# MERGE BY DATE AND TIME COLUMNS 
#  (NOTE: LOSE SECOND PAIR OF JOIN VARS --Date.1, Time.1-- WITH DF SORT)
finaldf <- merge(df1, df2, by.x = c('Date', 'Time'), by.y = c('Date.1', 'Time.1'))
finaldf

#       Date  Time Price MA       X  Open   High   Low    Close  BMA    Vol
# 1 5/1/2016 22:30 98080 98061.00 NA 130010 130020 130010 130015 130000 1417
# 2 5/1/2016 23:00 9809+ 98063.03 NA 130015 130020 130015 130020 130001  540
# 3 5/1/2016 23:30 9809+ 98066.00 NA 130020 130035 130015 130030 130002 5429
# 4 5/2/2016  0:00 9809+ 98070.05 NA 130030 130035 130025 130030 130005 1906
# 5 5/2/2016  0:30 98080 98071.00 NA 130030 130035 130010 130015 130005 4424
# 6 5/2/2016  5:00 9810+ 98073.07 NA 130035 130040 130020 130035 130020 7014

Possible solution for data cleaning in R

1 Answers1