I have a quite large data file (16 cols, over 600,000 rows) and I need to calculate the mean of the first 2 values and mean of the last two values after/before the blank rows (missing values due to instrument's internal calibration) - basically multiple individual tables within the data frame separated by blank rows. However, the data is rather slightly irregular, i.e. in the vast majority of cases there are 21 rows of data followed by 4 blank rows (i.e. to calculate the mean of 1st and 2nd value (skip 17 rows) AND 20th and 21st value...4 blank rows...mean of 26th and 27th (skip 17 rows) AND 45th and 46th...4 blanks...etc), but occasionally the number of blanks may vary so I need to employ a method that takes into account the position of the blank rows since they are the key to know which values are the fist two and last two in each column section (in the example made only for the mean of Conc
columns), but I keep on failing to adjust the script. I can't post pics yet so there's also a link to example of data in question. Any ideas?
Asked
Active
Viewed 124 times
-2

Norberto Serfozo
- 1
- 1
-
2Welcome to Stack Overflow! Please provide a sample of your data as text (`dput` helps with that) instead of an image. – Ralf Stubner Jul 10 '18 at 14:33
-
thanx. originally i intended to add a table not a pic, but im not familiar with this environment and as i found also earlier dput() suggestions to insert a table, i've tried, but it always gives me weird data structure. – Norberto Serfozo Jul 11 '18 at 10:16
-
See https://stackoverflow.com/q/5963269/8416610 for examples. – Ralf Stubner Jul 11 '18 at 10:19
2 Answers
0
This does the job for one column, you can use lapply to apply it to all
First create data frame with similar structure:
set.seed(1)
#random numbers
df=data.frame(start_time=1:100, end_time=2:101, conc370=runif(100))
#index for block seperating rows
m=seq(7,by=24,95)
# at each m delete 4 rows
df$conc370[c(m, m+1, m+2, m+3)]=NA
df should now have a similar structure to your data, so we can start:
# detect empty lines:
na_idx=is.na(df$conc370)
# the last line of the block of numbers:
end_idx=which(diff(na_idx)==1)
# the first line of the block:
start_idx=which(diff(na_idx)==-1)+1
# get the position where we want to compute the average
pos=sort(c(end_idx-1, start_idx))
# compute the average at the positions
mean_vals=sapply(pos, function(idx,x) mean(x[c(idx, idx+1)]), x=df$conc370 )
# add to df
df$mean=NA
df$mean[pos]=mean_vals

Ma Li
- 131
- 6
-
It gave me some numbers, but it is certainly incorrect result. Can u please explain why u included start_time and end_time and the m=seq(7,by=24,95) part? – Norberto Serfozo Jul 11 '18 at 10:11
-
Since you did not include the data, I created a data frame with a similar structure. The m=seq(7,by=24,95) is just to identify columns that are deleted in order to make blank lines. If you apply the lower part on your own data frame it should return the correct numbers. I'll add some comments to make it clear. – Ma Li Jul 11 '18 at 11:39
0
My solution isn't particularly elegant, but I was partly successful with solution, i.e. in the case no data is missing (blank cells) except for those already accounted for. However, it needs a little extra effort to work out the details:
data$index = c(1:dim(data)[1])
head(data)
data$missing = c(rep(1, dim(data)[1]))
Length = length(data$missing[is.na(data$Conc370)])
data$missing[is.na(data$Conc370)] = c(rep(0, Length))
data$Conc370.mean = c(rep(NA, dim(data)[1]))
Length = length(data$missing[is.na(data$Conc470)])
data$missing[is.na(data$Conc470)] = c(rep(0, Length))
data$Conc470.mean = c(rep(NA, dim(data)[1]))
Length = length(data$missing[is.na(data$Conc520)])
data$missing[is.na(data$Conc520)] = c(rep(0, Length))
data$Conc520.mean = c(rep(NA, dim(data)[1]))
Length = length(data$missing[is.na(data$Conc590)])
data$missing[is.na(data$Conc590)] = c(rep(0, Length))
data$Conc590.mean = c(rep(NA, dim(data)[1]))
Length = length(data$missing[is.na(data$Conc660)])
data$missing[is.na(data$Conc660)] = c(rep(0, Length))
data$Conc660.mean = c(rep(NA, dim(data)[1]))
Length = length(data$missing[is.na(data$Conc880)])
data$missing[is.na(data$Conc880)] = c(rep(0, Length))
data$Conc880.mean = c(rep(NA, dim(data)[1]))
Length = length(data$missing[is.na(data$Conc950)])
data$missing[is.na(data$Conc950)] = c(rep(0, Length))
data$Conc950.mean = c(rep(NA, dim(data)[1]))
i = 0
number = 0
repeat{
i = i + 1
if (data$missing[i] == 0){
repeat{
number = number + 1
if ((number == 1) & (i > 3)){ # if you are on the first missing row
data$Conc370.mean[i-1] = mean(c(data$Conc370[i-1], data$Conc370[i-2]))
data$Conc470.mean[i-1] = mean(c(data$Conc470[i-1], data$Conc470[i-2]))
data$Conc520.mean[i-1] = mean(c(data$Conc520[i-1], data$Conc520[i-2]))
data$Conc590.mean[i-1] = mean(c(data$Conc590[i-1], data$Conc590[i-2]))
data$Conc660.mean[i-1] = mean(c(data$Conc660[i-1], data$Conc660[i-2]))
data$Conc880.mean[i-1] = mean(c(data$Conc880[i-1], data$Conc880[i-2]))
data$Conc950.mean[i-1] = mean(c(data$Conc950[i-1], data$Conc950[i-2]))
}
if (data$missing[i + number] == 1){
data$Conc370.mean[i + number] = mean(c(data$Conc370[i + number], data$Conc370[i + number + 1]))
data$Conc470.mean[i + number] = mean(c(data$Conc470[i + number], data$Conc470[i + number + 1]))
data$Conc520.mean[i + number] = mean(c(data$Conc520[i + number], data$Conc520[i + number + 1]))
data$Conc590.mean[i + number] = mean(c(data$Conc590[i + number], data$Conc590[i + number + 1]))
data$Conc660.mean[i + number] = mean(c(data$Conc660[i + number], data$Conc660[i + number + 1]))
data$Conc880.mean[i + number] = mean(c(data$Conc880[i + number], data$Conc880[i + number + 1]))
data$Conc950.mean[i + number] = mean(c(data$Conc950[i + number], data$Conc950[i + number + 1]))
i = i + number - 1
number = 0
break
}
}
}
if (i >= dim(data)[1]){
break
}
}

Norberto Serfozo
- 1
- 1