1

I am in the process of developing a script to aggregate multiple system generated csv files. Below is the error I am running into and I believe this is occuring because the csv files are generated with 14 standardized column headers, but every so often there will be data in additional columns with no headers.

I am stuck on how to concatenate the no header column data in with the 14th column because they appear to be additional memos and need to be retained.

Error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : line 521 did not have 14 elements

Data in row 522:

> scan("1428477090.csv", "character", skip=521, n=1, sep="\n")
Read 1 item
[1] "207.4,64.6,1.6,70,0.970,169.50,281,0.4,68,175.40,0.37,2015/04/08,04:33:20,BIT DEPTH CHANGED TO 116.0 FEET,HOLE DEPTH CHANGED TO 116.0 FEET"

Code:

serverPath = "C:/Users/*****/Desktop/Pason/"
filenames = list.files(path = serverPath, pattern = '[.]csv')
idx=1
df = read.table(file = paste(serverPath, filenames[idx], sep = ""), header = T, sep =",", na.strings = "-999.25", check.names=F)

CSV format and data:

Hole Depth,Hook Load,Weight on Bit,Rotary RPM,Convertible Torque,On Bottom ROP,Total Pump Output,Differential Pressure,Standpipe Pressure,Rate Of Penetration,Time Of Penetration,YYYY/MM/DD,HH:MM:SS,Memos
2531.4,42.6,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:40:00,
2531.4,42.5,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:40:20,
2531.4,42.5,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:40:40,
2531.4,42.8,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:41:00,
HeyHeyJJ
  • 27
  • 1
  • 6
  • I would say you have more headers than columns. why do you say that you have extra columns with no headers? – rawr Jul 20 '15 at 15:35
  • The system generates the csv column headers but it will output data in columns 15, 16, 17, etc.... which do not have headers – HeyHeyJJ Jul 20 '15 at 15:40
  • maybe you can read them separately http://stackoverflow.com/questions/5788117/only-read-limited-number-of-columns-in-r and since you know the headers beforehand, it should be straight-forward – rawr Jul 20 '15 at 15:53

2 Answers2

0

1) I saved your CSV data to a file "a.csv" in a folder "data". Function read.csv works fine for me, just the last column was filled with NA's:

read.csv("./data/a.csv")

# Hole.Depth Hook.Load Weight.on.Bit Rotary.RPM Convertible.Torque On.Bottom.ROP Total.Pump.Output
# 1     2531.4      42.6             0          0                  0             0                 0
# 2     2531.4      42.5             0          0                  0             0                 0
# 3     2531.4      42.5             0          0                  0             0                 0
# 4     2531.4      42.8             0          0                  0             0                 0
# Differential.Pressure Standpipe.Pressure Rate.Of.Penetration Time.Of.Penetration YYYY.MM.DD HH.MM.SS
# 1               -1141.7                  0                   0                   0 2015/04/08 01:40:00
# 2               -1141.7                  0                   0                   0 2015/04/08 01:40:20
# 3               -1141.7                  0                   0                   0 2015/04/08 01:40:40
# 4               -1141.7                  0                   0                   0 2015/04/08 01:41:00
# Memos
# 1    NA
# 2    NA
# 3    NA
# 4    NA

# Warning message:
#     In read.table(file = file, header = header, sep = sep, quote = quote,  :
#                       incomplete final line found by readTableHeader on './data/a.txt'

2) I added your string as the last row in the file a.csv:

bad_string <- "207.4,64.6,1.6,70,0.970,169.50,281,0.4,68,175.40,0.37,2015/04/08,04:33:20,BIT DEPTH CHANGED TO 116.0 FEET,HOLE DEPTH CHANGED TO 116.0 FEET"

Following code works fine for me too:

serverPath = "./data/"
list.files(path = serverPath, pattern = '[.]csv')
idx=1
df = read.csv(file = paste(serverPath, filenames[idx], sep = ""),
              na.strings = "-999.25")

Except from the fact that in section "Memos" part of a string after the 14th comma in a row is missing.

3) There should be no commas under "Memos" section, so each row in a file (let's call it "bad_string") should contain 13 commas (as you have 14 columns). In one row I suggest replacing all commas, which number is above 13th, by semicolons (or other symbol) either manually or incorporating this code (below) to your analysis. I think, its possible to write something more efficient, but this one works too:

CtoS <- function(bad_string){
    # If in string bad_string there are more than 13 commas function 
    # CtoS (comma to semicolon), replaces all commas to semicolons which 
    # number is above 13.
    indices_of_commas <- which(strsplit(bad_string, "")[[1]]==",") # searching for indices of commas
    number_of_commas  <- length(indices_of_commas) # calculating number of commas
    if (number_of_commas >= 14) # if there are too many commas (i.e. additional commas in "Memos" section), they should be replaced:
    {
        indices_of_commas_to_replace <- c(indices_of_commas[14:number_of_commas])
        tmp<-unlist(strsplit(bad_string,""))
        tmp[indices_of_commas_to_replace]<-c(';')
        no_commas_in_Memos_section  <- paste0(tmp,collapse='')
        good_string <- no_commas_in_Memos_section
    }
    else {good_string <- bad_string; return(good_string)}
}


lines_from_file <- scan("./data/a.csv", "character", sep="\n")
# replace unnecessary commas by using function CtoS():
corrected_lines <- unlist(lapply(lines_from_file,CtoS))

There should be a way, to convert these string to data frame directly. Unfortunately, it's beyond my knowledge. My solution is here:

# NOTE!!! Always have have a copy of your original files in 
# a separate directory to prevent overwriting.

dir.create("./data copy/") # a new directory for processed files.

# Give name to a new file, that is different from  original filename.
# I gave the other extension (.txt instead of .csv) and created a new folder.
fileConnection<-file("./data copy/a.txt") # save to a new file.
writeLines(corrected_lines, fileConnection)
close(fileConnection)

Load your new file as a data frame:

df = read.csv(file = "./data copy/a.txt", na.strings = "-999.25")
print(df)

Column "Memos" after this procedure:

                                                             Memos
1                                                                 
2                                                                 
3                                                                 
4                                                                 
5 BIT DEPTH CHANGED TO 116.0 FEET;HOLE DEPTH CHANGED TO 116.0 FEET
GegznaV
  • 4,938
  • 4
  • 23
  • 43
  • My apologise, I have added the corrected data in the CSV format to my original post. The files open by default through Excel and I quickly copy/pasted the sample in my first post – HeyHeyJJ Jul 20 '15 at 17:25
  • Find my new answer to your question above. – GegznaV Jul 21 '15 at 16:59
0

Thank you all for your input. It was determined the extra comments did not need to be retained. I used the following code to omit the additional rows created from the no header file scan error:

#Remove rows with NA values
dfAllData <- na.omit(dfAllData)
HeyHeyJJ
  • 27
  • 1
  • 6
  • Be careful with this. It may also remove rows that you don't want removed Like important data rows that may contain just one NA but not in the last column, those will also be removed with `na.omit()` – Rich Scriven Aug 12 '15 at 13:26