Is there a way in R to join broken lines of csv file?

Question

I have a program that exports a csv file but doesn't quote new lines or use /n instead of /n/r. It uses the same end of line in the middle of records as it does at the end. The program does however use a comma delimiter between variables. How can I tell R to delete all the eol markers until the number of variables in data is reached?

My data would look like this:

name, rank, serial number, age, height, weight

mike, noob, 123456, 22, 6, 34.4

bob, officer, 345

323, 24, 6, 2

3.5

ted, officer, 34234, 2

5, 6, 35.2

How would I basically delete the CR after the 5 in row 2, after the 2 in row 3 and after the 2 in row 6? There should be 5 commas in each row and 6 variables. My data doesn't have the extra line between each row. I just couldn't get it to stop putting it all on one line without doing that. My data is 43 variables and is constantly generating new lines. Most of the time it is read in there are a few thousand lines. About 20% of them have the CR problem.

Also want to add that a new row will always start on a new row, it will not follow on the same line as the previous if that makes sense.

the data frame should look like this:

name, rank, serial number, age, height, weight

mike, noob, 123456, 22, 6, 34.4

bob, officer, 345323, 24, 6, 23.5

ted, officer, 34234, 25, 6, 35.2

This is what my data looks like if that helps. The first line is a header followed by what should be 6 records but read.csv and fread and everything else I tried gives me 10 records. The 6th record has the extra CR, but still has 42 variables. Just broken up into 5 lines.

EFPCName,EFUseAPPE,log pdl,pdl error,device pretty name,num pages,num sheets,copies printed,total pages printed,total sheets printed,total color pages printed,total bw pages printed,total tab pages printed,total sample pages printed,num copies,print status,instructions,notes1,notes2,username,noneutf8lastuser,non utf8 submitted by,title,size,logical printer,fiery,time,date,total rip duration,timestamp spooling,timestamp done spooling,timestamp waiting to rip,timestamp ripping,timestamp done ripping,timestamp waiting to print,timestamp printing,timestamp done printing,media weight,input slot,media size,media type,interpreter,

LZX  Laser 24 - 11 x 17 Tabloid,,postscript,,Canon,2,1,1,2,1,1,1,0,0,1,OK,,,,TeamMember,,TeamMember,78053.01.pdf,4004491,Canon hold,SERVER-Shredder,2013 06 07 19 37 13,2013 06 07 19 37 00,3,2013 06 07 19 37 23,2013 06 07 19 37 24,2013 06 07 19 38 02 118342,2013 06 07 19 38 02 118342,2013 06 07 19 38 09,2013 06 07 19 38 09,2013 06 07 19 38 38,2013 06 07 19 39 19 124419,,Tray5,Tabloid,Plain,PS,

LZX  Laser 24 - 11 x 17 Tabloid,,postscript,,Canon,2,1,1,2,1,1,1,0,0,1,OK,,,,TeamMember,,TeamMember,78053.01.pdf,4004520,none,SERVER-Shredder,2013 06 07 19 37 13,2013 06 07 19 37 00,,2013 06 07 19 44 07 926090,2013 06 07 19 44 07 926744,2013 06 07 19 44 07 926090,2013 06 07 19 44 07 926090,2013 06 07 19 44 07 926744,2013 06 07 19 44 07,2013 06 07 19 44 11,2013 06 07 19 44 53 141084,,Tray5,Tabloid,Plain,PS,

LZX  Laser 24 - 11 x 17 Tabloid,,postscript,,Canon,2,1,1,2,1,1,1,0,0,1,OK,,,,TeamMember,,TeamMember,78053.01.pdf,4004520,none,SERVER-Shredder,2013 06 07 19 37 13,2013 06 07 19 37 00,,2013 06 07 19 46 01 550964,2013 06 07 19 46 01 551451,2013 06 07 19 46 01 550964,2013 06 07 19 46 01 550964,2013 06 07 19 46 01 551451,2013 06 07 19 46 01,2013 06 07 19 46 05,2013 06 07 19 46 46 911557,,Tray5,Tabloid,Plain,PS,

LZX80  Color Copy Cover - 11 x 17 Tabloid,,postscript,,Canon,1,2,2,2,2,2,0,0,0,2,OK,,,,TeamMember,,TeamMember,78011.01.pdf,874486,Canon hold,SERVER-Shredder,2013 06 07 19 47 07,2013 06 07 19 47 00,3,2013 06 07 19 47 17,2013 06 07 19 47 17 507576,2013 06 07 19 47 47 960542,2013 06 07 19 47 47 960542,2013 06 07 19 47 51,2013 06 07 19 47 51,2013 06 07 19 47 54,2013 06 07 19 48 25 77595,,Tray3,Tabloid,Heavy5,PS,

LZX  Laser 24 - 11 x 17 Tabloid,,postscript,,Canon,2,1,1,2,1,1,1,0,0,1,OK,,,,TeamMember,,TeamMember,78053.01.pdf,4004520,none,SERVER-Shredder,2013 06 07 19 37 13,2013 06 07 19 37 00,,2013 06 07 19 48 04 501212,2013 06 07 19 48 04 502522,2013 06 07 19 48 04 501212,2013 06 07 19 48 04 501212,2013 06 07 19 48 04 502522,2013 06 07 19 48 04,2013 06 07 19 48 07,2013 06 07 19 48 48 188474,,Tray5,Tabloid,Plain,PS,

EX32  Laser 32 - 11 x 17 Tabloid,,pdf,,Canon,63,64,1,63,64,4,59,0,0,1,OK,Size: 11 x 17
Finishing: Coil Binding  Cutting  Punching
Pages: 
1-63  4/0  EX32  Laser 32 - 11 x 17  11 x 17
 ,Color 77992:01Employee Handbook REVISED_2up(NFC).pdf, McAllen TX,EFI Pace,,,Color 77992:01Employee Handbook REVISED_2up(NFC).pdf,518880,none,SERVER-Shredder,2013 06 07 20 01 52,2013 06 07 20 01 00,3,2013 06 07 20 02 41 495216,2013 06 07 20 02 44 780196,2013 06 07 20 02 41 871208,2013 06 07 20 02 41 871208,2013 06 07 20 02 45,2013 06 07 20 02 45,2013 06 07 20 03 25,2013 06 07 20 05 45 741386,,Tray4,Tabloid,Heavy1,PS,

score 1 · Answer 1 · answered Jul 27 '16 at 05:11

1

If you want to implicitly add blank fields when you have rows of unequal length, set fill = TRUE in your read.table call.

If that's not the question you are asking, can you be more clear and provide a reproducible example?

answered Jul 27 '16 at 05:11

danh

618
3
7

Yes, I don't want to just fill the end of the record with blanks and the beginning of the next with blanks. I need for the data frame to look like the edit to my question. – Michael Ciesielczyk Jul 27 '16 at 05:13
Is whitespace meaningful in your data or can you delete all of it? I mean - do you need spaces? – nya Jul 27 '16 at 06:25
Yes I need the white space inside the variables. – Michael Ciesielczyk Jul 27 '16 at 06:30
Welcome to stack overflow! Can you improve this question by sharing a [minimal example which is reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – shayaa Jul 27 '16 at 06:42

nya · Accepted Answer · 2016-07-27T14:10:44.960

1

This is what I have for now. See how this works on your data.

dat <- readLines("temp.txt") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], ",")) # extract variable names
nvar <- length(varnames)

k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))

while(k <= length(dat)){
    k <- k + 1
    if(dat[k] == "") {k <- k + 1
        print(paste("data line", k, "is an empty string"))
        if(k > length(dat)) {break}
    }
    temp <- dat[k]
    # checks if there are enough commas or if the line was broken
    while(length(gregexpr(",", temp)[[1]]) < nvar-1){
        k <- k + 1
        temp <- paste0(temp, dat[k])
    }
    temp <- unlist(strsplit(temp, ","))
    message(k)
    dat1 <- rbind(dat1, temp)
}

dat1 = dat1[-1,] # delete the empty initial row

The general idea is to keep collapsing text until there are enough commas in the string. Once that is achieved, the data is split at commas and added as a single row into a matrix. The code is horribly clunky and will be slow for large data files. It is the best I can do though.

For the original data example, the code works and creates a character matrix with 42 columns and 6 rows. For the smaller example, the code cannot handle the break in the last column.

edited Jul 27 '16 at 14:10

answered Jul 27 '16 at 07:35

nya

2,138
15
29

I'm getting this error : Error in if (dat[k] == "") { : missing value where TRUE/FALSE needed In addition: There were 50 or more warnings (use warnings() to see the first 50). This is also getting read in and displayed with a shiny app. There would be no way for user intervention. Is it possible to toss the line if it really did have some sort of problem? I have yet to see one that did but it could be possible I guess. – Michael Ciesielczyk Jul 27 '16 at 13:33
@MichaelCiesielczyk Did it create some rows in the dat1 table that were correctly identified with your real data? – nya Jul 27 '16 at 13:40
I just keep getting these: [1] "-1,,pdf,,Canon,1,10,10,10,10,10,0,0,0,10,OK,Size: 11 x 17Finishing: Cutting" This seems like a broken line. Type 'n' to quit and fix it yourself: If I hit enter it just gives the error on the next line. I hit enter up to 100. I think there were a couple of lines that looked right but most of the 100 I went through were not right. – Michael Ciesielczyk Jul 27 '16 at 14:01
The first line of my data after the headers is a malformed line if that helps. – Michael Ciesielczyk Jul 27 '16 at 14:04
@MichaelCiesielczyk Okay. I removed the user interaction from the code. Can the code output `dat1` table that is correct at least to a certain point? – nya Jul 27 '16 at 14:11
1

All the non malformed data comes in fine. The data with the extra line breaks are close. It seems they get off at the instructions field. They have both the instruction field and the notes field in the same 'cell' then the rest is off one. Very close! Is there a way to just brinf in the entire file, delete all the returns then count out every 43 commas and put a return in? Would that be even harder to sort out? – Michael Ciesielczyk Jul 27 '16 at 17:06
@MichaelCiesielczyk Is there an extra comma in the instructions field? – nya Jul 27 '16 at 17:11
Nope, just 2, 3 or sometimes 4 line breaks. The instructions field is the only one with the extra line breaks. – Michael Ciesielczyk Jul 27 '16 at 17:26
You code works perfectly! I was reading in a file that I had attempted to correct with a batch file on the device it is being read from. Thank you so much and just one more question. How do I bypass the first two lines of the file? there are two 'header' rows before the actual row of headers. – Michael Ciesielczyk Jul 27 '16 at 17:44
I'm glad to hear that. Use `dat <- dat[-1]` as the second line of the code. The number indicates whatever line you need to discard. – nya Jul 27 '16 at 22:57

Is there a way in R to join broken lines of csv file?

2 Answers2

Linked