I have a very large (1.5 GB) malformed CSV file that I need to read into R. While the file itself is a CSV, poorly-placed line returns inside the quoted last column break each record across several lines.
I have a reduced example attached; a truncated visual representation of it looks like this:
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000
-0.00000000 -0.00000000 0.00000000 0.00000000 0.00000000
0.00000000 0.00000000 0.00000000]
[ -0.00000000 -0.0000000 -0.00000000 -0.00000000 -0.0000000
-0.0000000 -0.0000000 0.00000000 0.00000000 -0.00000000
-0.00000000 0.00000000 0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[ 1.11111111 -1.11111111 -1.1111111 -1.1111111 1.1111111
1.11111111 1.11111111 1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222 2.22222222 -2.22222222 -2.22222222 2.2222222 -2.22222222
-2.22222222 -2.22222222 -2.22222222 2.22222222 2.22222222 2.22222222]
[-2.22222222 -2.22222222 2.22222222 2.2222222 2.22222222 -2.22222222
2.2222222 -2.2222222 2.22222222 2.2222222 2.222222 -2.22222222]
[-2.22222222 -2.2222222 2.22222222 2.2222222 2.22222222 -2.22222222
-2.22222222 -2.2222222 -2.22222222 2.22222222 2.2222222 2.22222222]
[-2.22222222 -2.22222222 2.2222222 2.2222222 2.2222222 -2.22222222
-2.222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 2.2222222 ]
[-2.22222222 -2.222222 2.22222222 2.22222222 2.22222222 -2.2222222
-2.2222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 -2.222222 ]
[ 2.22222222 -2.22222222 -2.222222 -2.222222 -2.2222222 -2.22222222
-2.222222 -2.22222222 2.2222222 -2.2222222 2.2222222 2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000 0.00000000 -0.00000000 0.000000 -0.00000000
-0.00000000 0.00000000 0.00000000]]"
All of the new lines are `\n`'s in the CSVs.
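As far as I can tell, the broken column is the only quoted field, so a record should be complete exactly when its double quotes balance. A quick check I've run on the abridged file to confirm that (a sketch; it assumes no stray quotes appear anywhere else):
awk '{ q += gsub(/"/, "&") } q % 2 == 0 { rec++; q = 0 } END { print rec, "complete records (header included)" }' Malformed_csv_Abridged.csv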
To get around loading it all into memory and attempting to parse it as a dataframe in other environments, I have been trying to print relevant snippets from the CSV to the terminal with the line returns removed, runs of blank spaces collapsed, and commas inserted between the values.
Like the following:
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"
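Rejoining each logical record onto one physical line seems doable by accumulating lines until the quotes balance, along these lines (again a sketch, assuming the broken column is the only quoted field):
awk '{ buf = (buf == "" ? $0 : buf " " $0); if (gsub(/"/, "&", buf) % 2 == 0) { print buf; buf = "" } }' Malformed_csv_Abridged.csv
That removes the line returns for every record, but it still leaves the spaces inside the brackets to be converted to commas.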
My main attempt pulls everything on a line from the opening `"[[` to the closing `]]"` with:
awk '/\"\[\[/{found=1} found{print; if (/]]"/) exit}' Malformed_csv_Abridged.csv | tr -d '\n\r' | tr -s ' ' | tr ' ' ','
outputting:
000000000,0000-00-00,0000-00-00,0,FIRST,TEXT,FOR,ZERO,"[[,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[,-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000,]]"
This gets close, but:
- It only prints the first instance, so I need a way to find the other instances.
- It inserts commas into the blank spaces before the characters I'm searching for (`"[[` and `]]"`), which I don't need it to do.
- It leaves some extra commas next to the brackets that I haven't found the right `tr` call to remove, due to the necessary escape characters.
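Putting those three issues together, the closest I've gotten is a per-record rewrite that only touches the quoted field, so the text column keeps its spaces and no commas land against the brackets. This is a sketch, not yet tested on the full 1.5 GB file, and it assumes the broken column is the only quoted field and the last one in each record:
awk '
{
    # accumulate physical lines until the double quotes balance,
    # i.e. until the broken quoted field is closed
    buf = (buf == "" ? $0 : buf " " $0)
    if (gsub(/"/, "&", buf) % 2 == 1) next

    # lines with no quoted field (e.g. the header) pass through untouched
    pos = index(buf, "\"")
    if (pos == 0) { print buf; buf = ""; next }

    # split off the quoted field so the text column keeps its spaces
    head  = substr(buf, 1, pos - 1)
    field = substr(buf, pos)

    # inside the quoted field only: drop spaces hugging the brackets,
    # put a comma between ][ pairs, then turn space runs into commas
    gsub(/\[ +/, "[", field)
    gsub(/ +\]/, "]", field)
    gsub(/\] *\[/, "],[", field)
    gsub(/ +/, ",", field)

    print head field
    buf = ""
}' Malformed_csv_Abridged.csv
Everything there is plain POSIX awk and it streams one record at a time, so it shouldn't need to hold the 1.5 GB file in memory.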