
I have a very large (1.5 GB) malformed CSV file that I need to read into R, and while the file itself is a CSV, the rows break across multiple physical lines after a certain point due to poorly-placed line returns.

I have a reduced example attached, but a truncated visual representation of that looks like this:

SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000   0.00000000  -0.00000000  -0.00000000   0.00000000
   -0.00000000  -0.00000000   0.00000000   0.00000000   0.00000000
    0.00000000   0.00000000   0.00000000]
 [ -0.00000000  -0.0000000   -0.00000000  -0.00000000  -0.0000000
   -0.0000000   -0.0000000    0.00000000   0.00000000  -0.00000000
   -0.00000000   0.00000000   0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[  1.11111111  -1.11111111  -1.1111111   -1.1111111    1.1111111
    1.11111111   1.11111111   1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222   2.22222222 -2.22222222 -2.22222222  2.2222222  -2.22222222
  -2.22222222 -2.22222222 -2.22222222  2.22222222  2.22222222  2.22222222]
 [-2.22222222 -2.22222222  2.22222222  2.2222222   2.22222222 -2.22222222
   2.2222222  -2.2222222   2.22222222  2.2222222   2.222222   -2.22222222]
 [-2.22222222 -2.2222222   2.22222222  2.2222222   2.22222222 -2.22222222
  -2.22222222 -2.2222222  -2.22222222  2.22222222  2.2222222   2.22222222]
 [-2.22222222 -2.22222222  2.2222222   2.2222222   2.2222222  -2.22222222
  -2.222222   -2.2222222  -2.2222222  -2.22222222  2.22222222  2.2222222 ]
 [-2.22222222 -2.222222    2.22222222  2.22222222  2.22222222 -2.2222222
  -2.2222222  -2.2222222  -2.2222222  -2.22222222  2.22222222 -2.222222  ]
 [ 2.22222222 -2.22222222 -2.222222   -2.222222   -2.2222222  -2.22222222
  -2.222222   -2.22222222  2.2222222  -2.2222222   2.2222222   2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000   0.00000000  -0.00000000   0.000000    -0.00000000
   -0.00000000   0.00000000   0.00000000]]"

The line breaks are all \n's in the CSVs.

To get around loading it all into memory and parsing it as a dataframe in another environment, I have been trying to print relevant snippets from the CSV to the terminal with the line returns removed, whitespace collapsed, and commas inserted between the values.

Like the following:

000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"

My main attempt pulls everything on the lines between the opening "[[ and the closing ]]" with:

awk '/\"\[\[/{found=1} found{print; if (/]]"/) exit}'  Malformed_csv_Abridged.csv | tr -d '\n\r' | tr -s ' ' | tr ' ' ','

outputting:

000000000,0000-00-00,0000-00-00,0,FIRST,TEXT,FOR,ZERO,"[[,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[,-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000,]]"

This gets close, but:

  1. It only prints the first instance, so I need a way to find the other instances.
  2. It inserts commas into the blank spaces next to the characters I'm searching for ("[[ and ]]"), which I don't need it to do.
  3. It leaves some extra commas next to the brackets that I haven't found the right tr call to remove, because of the escape characters involved.
TDeramus
  • probably best suited for unix.stackexchange: https://unix.stackexchange.com/questions/7425/is-there-a-robust-command-line-tool-for-processing-csv-files – njzk2 May 31 '23 at 20:16
  • also: https://stackoverflow.com/questions/4286469/how-to-parse-a-csv-file-in-bash – njzk2 May 31 '23 at 20:17
  • also, how is the file malformed? – njzk2 May 31 '23 at 20:18
  • Parsing full CSV in bash is complicated. You're best off using a real programming language that has a csv library for it. – Barmar May 31 '23 at 20:21
  • Your datafile seems to be parseable, but probably needs a real programming language. It makes no sense to the viewer, but the structure is consistent. What are you trying to achieve really ? – MyICQ May 31 '23 at 20:45
  • It's the returns. Whenever I try to load it in R or via something like csvtools, it adds 2 more columns that don't actually exist and substitutes the actual variables for what should be the column names. – TDeramus May 31 '23 at 20:49
  • will all datalines be split across at least 2 physical lines? or could some datalines be limited to a single physical line (and thus do not need to be pasted together)? – markp-fuso May 31 '23 at 20:54
  • And the structure really isn't that consistent. The number of cells within cells varies, as do the entries and strings for each entry (e.g. CAD, Result:TBA); I just had to redact A LOT of it with some easy examples. The short version of what I actually need is getting unique columns for each [] and doing mean, max, mode, etc. for each within-bracket item, but that is not something I want to deal with on this platform yet. – TDeramus May 31 '23 at 20:55
  • @markp-fuso 99% of them are more along the lines of 192k lines, which vary by factors not apparent in the data. Vectorizing all that into a single line and slowly stitching it back together is not an idea I'm against. – TDeramus May 31 '23 at 20:57
  • This might help as first step: `sed -n '1{ p; b }; /^[0-9]/,/"$/{ /^[0-9]/{ h; b }; H; /"$/{ x; s/\n//g; p } }' file.csv` – Cyrus May 31 '23 at 21:12
  • @Cyrus The output already reads into R better, which may be enough. However, it is re-writing the whole file and making a new one, which may not be optimal for what I'm doing (e.g. having to write a separate file). Just in case it's necessary, any suggestions on how to: 1. Call an individual line? Apologies, but it's not intuitive to me in your line what needs to be changed to print each instance. 2. Remove the white space? – TDeramus May 31 '23 at 21:39
  • Both GNU `sed -i` and GNU `awk -i inplace` create a new file behind the scenes with the changes and then move new file to old file. – Cyrus May 31 '23 at 23:51
  • Makes no difference then. Thanks for clarifying! – TDeramus Jun 01 '23 at 00:02
  • @TDeramus based on a few comments this is starting to sound like an [XY Problem](https://xyproblem.info/), ie, *how to reformat data* becomes *how to select a subset of rows from file*; I'd recommend you accept and/or upvote any of these (4) answers that address the 'reformat' question and then consider asking a new question re: *select a subset of rows from file* ... making sure to mention a) the language(s) you're using (`R`?) and b) if the resulting output needs to be reformatted – markp-fuso Jun 01 '23 at 13:19

5 Answers


I didn't understand your goal; the CSV file seems to me to be a correct CSV file. If you just want to remove the line breaks, you can use Miller and its clean-whitespace verb:

mlr --csv clean-whitespace Malformed.csv >Malformed_c.csv

to get this https://gist.githubusercontent.com/aborruso/538e964c0c84a8b27d4c3d3b61d23bb4/raw/1fa83f43238be4a6aeb9c743aaf2e4da36f6cc74/Malformed_c.csv

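If you also want commas between the numbers inside broken_column_var (your desired output above), a put step could be chained onto the same pipeline. This is only a sketch based on the sample data, using Miller's DSL gsub; the regexes and the output file name are mine, not something verified against the full 1.5 GB file:

mlr --csv clean-whitespace \
    then put '
      $broken_column_var = gsub($broken_column_var, "[[] +", "[");  # drop spaces right after [
      $broken_column_var = gsub($broken_column_var, " +]", "]");    # drop spaces right before ]
      $broken_column_var = gsub($broken_column_var, " +", ",");     # remaining space runs become commas
    ' Malformed.csv >Malformed_commas.csv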

aborruso

Assumptions:

  • the only field that contains double quotes is the last field (broken_column_var)
  • within the last field we do not have to worry about embedded/escaped double quotes (ie, for each data line the last field has exactly two double quotes)
  • all broken_column_var values contain at least one embedded linefeed (ie, each broken_column_var value spans at least 2 physical lines); otherwise we need to add some code to address both double quotes residing on the same line ... doable, but will skip for now so as to not (further) complicate the proposed code

One (verbose) awk approach to removing the embedded linefeeds from broken_column_var while also replacing spaces with commas:

awk '
NR==1              { print; next }                      # print header
!in_merge && /["]/ { split($0,a,"\"")                   # 1st double quote found; split line on double quote
                     head     = a[1]                    # save 1st part of line
                     data     = "\"" a[2]               # save double quote and 2nd part of line
                     in_merge = 1                       # set flag
                     next
                   }
 in_merge          { data = data " " $0                 # append current line to "data"
                     if ( $0 ~ /["]/ ) {                # if 2nd double quote found => process "data"
                        gsub(/[ ]+/,",",data)           # replace consecutive spaces with single comma
                        gsub(/,[]]/,"]",data)           # replace ",]" with "]"
                        gsub(/[[],/,"[",data)           # replace "[," with "["
                        print head data                 # print new line
                        in_merge = 0                    # clear flag
                     }
                   }
' Malformed.csv

This generates:

SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[1.11111111,-1.11111111,-1.1111111,-1.1111111,1.1111111,1.11111111,1.11111111,1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222,2.22222222,-2.22222222,-2.22222222,2.2222222,-2.22222222,-2.22222222,-2.22222222,-2.22222222,2.22222222,2.22222222,2.22222222],[-2.22222222,-2.22222222,2.22222222,2.2222222,2.22222222,-2.22222222,2.2222222,-2.2222222,2.22222222,2.2222222,2.222222,-2.22222222],[-2.22222222,-2.2222222,2.22222222,2.2222222,2.22222222,-2.22222222,-2.22222222,-2.2222222,-2.22222222,2.22222222,2.2222222,2.22222222],[-2.22222222,-2.22222222,2.2222222,2.2222222,2.2222222,-2.22222222,-2.222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,2.2222222],[-2.22222222,-2.222222,2.22222222,2.22222222,2.22222222,-2.2222222,-2.2222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,-2.222222],[2.22222222,-2.22222222,-2.222222,-2.222222,-2.2222222,-2.22222222,-2.222222,-2.22222222,2.2222222,-2.2222222,2.2222222,2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[-0.00000000,0.00000000,-0.00000000,0.000000,-0.00000000,-0.00000000,0.00000000,0.00000000]]"
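If memory is the concern (as discussed in the comments below), you don't have to materialize the whole reformatted file: the rebuilt records can be filtered as they stream, so only one subject's rows ever reach R. A rough sketch, assuming the script above is saved as reformat.awk (a name of your choosing) and that SubID is always the first comma-separated field:

awk -f reformat.awk Malformed.csv |
awk -F, 'NR==1 || $1=="000000000"' > subject_000000000.csv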
markp-fuso
  • Your assumption is correct. This seems to work perfectly! However, would you mind explaining the line(s) in it that call each instance should you have the time? That isn't immediately intuitive to me. – TDeramus May 31 '23 at 21:42
  • sorry, I'm not sure what you mean by *`call each instance`*; `awk` reads every line of the input file and applies the entire code block against each line of input; the general format: ` { action_block }`; the `next` command says we're finished processing the current line of input and to skip to the next line of input (and go to the top of the script to start processing that next line of input) – markp-fuso May 31 '23 at 21:46
  • `NR==1` => if the record number is 1 (ie, 1st line of file) then print it and go to the `next` line of input; `awk` treats `0` as 'false' and anything `!=0` as 'true' so ... `!in_merge` (is 'true' if `in_merge == 0`?) and `in_merge` (is 'true' if `in_merge != 0`) – markp-fuso May 31 '23 at 21:47
  • Ah. I ask because the actual file is roughly +2 million "lines" (thanks to the returns) with >200k unique individual entries. As such, I think it would be easier on resource memory to call specific instances (e.g., the first 9 lines which all came from the same person) and load them into a software dataframe bit by bit than trying to read an entire csv with the returns removed. – TDeramus May 31 '23 at 23:46
  • Hey @markp-fuso, so turns out some of the variables before the bracketed variable DO have quotes on them while some DON'T. I wrote some workflow to do that for EVERY variable for consistency, but now I need to adjust it to pull the values between the second instances of quotes. Would changing: { split($0,a,"\"") to { split($1,a,"\"") do it? – TDeramus Jun 08 '23 at 19:38
  • would need to see actual data; having said that ... I *think* what you want is to continue to split the *line* on double quotes and then pick the correct entry from the `a[]` array; try this to see what's in each array location: `split($0,a,"\""); for (i in a) print "a[" i "]=:" a[i] ":"` (I've added visual delimiters `:` around the data so you can tell if/when an entry is blank); I *think* this should give you an idea on how to access the 'second instance' ... ? – markp-fuso Jun 08 '23 at 22:41
  • picture it as the same as above, but with `,"FIRST TEXT FOR ZERO",` instead of `,FIRST TEXT FOR ZERO,` – TDeramus Jun 09 '23 at 13:54
  • Also, looks like I need to save 1-4 based on the loop you provided, with 3 & 4 containing the data? – TDeramus Jun 09 '23 at 14:42
  • I would suggest you ask a new question; the original question has received several answers based on a particular set of input data; by modifying the input data you're now (effectively) modifying the question which in turn is going to negate some (if not all) of the answers you've already received ... the correct approach is to ask a new question – markp-fuso Jun 09 '23 at 16:47

Use the double quote as the field separator: a complete record has 1 or 3 fields (an even number of double quotes), while a record still missing its closing quote has 2.

awk '
  BEGIN {FS = OFS = "\""}
  {$0 = prev $0; $1=$1}
  NF % 2 == 1 {print; prev = ""; next}
  {prev = $0}
  END {if (prev) print prev}
' file.csv
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000   0.00000000  -0.00000000  -0.00000000   0.00000000   -0.00000000  -0.00000000   0.00000000   0.00000000   0.00000000    0.00000000   0.00000000   0.00000000] [ -0.00000000  -0.0000000   -0.00000000  -0.00000000  -0.0000000   -0.0000000   -0.0000000    0.00000000   0.00000000  -0.00000000   -0.00000000   0.00000000   0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[  1.11111111  -1.11111111  -1.1111111   -1.1111111    1.1111111    1.11111111   1.11111111   1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222   2.22222222 -2.22222222 -2.22222222  2.2222222  -2.22222222  -2.22222222 -2.22222222 -2.22222222  2.22222222  2.22222222  2.22222222] [-2.22222222 -2.22222222  2.22222222  2.2222222   2.22222222 -2.22222222   2.2222222  -2.2222222   2.22222222  2.2222222   2.222222   -2.22222222] [-2.22222222 -2.2222222   2.22222222  2.2222222   2.22222222 -2.22222222  -2.22222222 -2.2222222  -2.22222222  2.22222222  2.2222222   2.22222222] [-2.22222222 -2.22222222  2.2222222   2.2222222   2.2222222  -2.22222222  -2.222222   -2.2222222  -2.2222222  -2.22222222  2.22222222  2.2222222 ] [-2.22222222 -2.222222    2.22222222  2.22222222  2.22222222 -2.2222222  -2.2222222  -2.2222222  -2.2222222  -2.22222222  2.22222222 -2.222222  ] [ 2.22222222 -2.22222222 -2.222222   -2.222222   -2.2222222  -2.22222222  -2.222222   -2.22222222  2.2222222  -2.2222222   2.2222222   2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000   0.00000000  -0.00000000   0.000000    -0.00000000   -0.00000000   0.00000000   0.00000000]]"
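If you also want the spaces between the numbers turned into commas (per the comments below), the merged records could be piped through one more awk pass. A sketch based on the sample records above, piping the output of the command above into:

awk 'BEGIN{FS=OFS="\""} NF==3 {gsub(/ +/,",",$2); gsub(/\[,/,"[",$2); gsub(/,\]/,"]",$2)} 1'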

For a language with a CSV library, I've found perl's Text::CSV useful for quoted newlines:

perl -e '
  use Text::CSV;
  my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1 });
  open my $fh, "<:encoding(utf8)", "file.csv" or die "file.csv: $!";
  while (my $row = $csv->getline ($fh)) {
    $row->[-1] =~ s/\n//g;
    $csv->say(STDOUT, $row);
  }
'
glenn jackman
  • First entry seems to read into R better, which may be enough. However, it is re-writing the whole file and making a new one, which may not be optimal for what I'm doing. Just in case it's necessary, any suggestions on how to: 1. Call an individual line? Apologies, but it's not intuitive to me what needs to be changed to print each instance. 2. Remove the white space? – TDeramus May 31 '23 at 21:37
  • "Call an individual line" -- what does it mean to "call" a line? Do you mean "I want to fetch record number 1234"? If so, then using awk means you have to process all of the preceding 1233 records first. – glenn jackman May 31 '23 at 21:59
  • "Remove the white space" -- which whitespace? – glenn jackman May 31 '23 at 21:59
  • I ask because the actual file is roughly +2 million "lines" with >200k unique entries. I think it would be easier on resource memory to call specific instances (e.g., the first 9 lines which all came from the same person) and load them into a software dataframe bit by bit than trying to read an entire csv with the returns removed. And by white space, I mean the spaces in between the numbers in the 6th column. Like: "[[1.11111111,-1.11111111,-1.1111111,-1.1111111,1.1111111,1.11111111,1.11111111,1.11111111]]" – TDeramus May 31 '23 at 23:49

This might work for you (GNU sed):

sed -E '1b
        :a;N;/"$/!ba
        s/"/\n&/
        h
        s/\n/ /2g
        s/.*\n//
        s/ +/,/g
        s/,\]/]/g
        s/\[,/[/g
        H
        g
        s/\n.*\n//' file

Forget the header line.

Gather up each record.

Introduce a newline before the last field.

Make a copy of the ameliorated record.

Replace the second and subsequent newlines with spaces.

Remove everything up to and including the introduced newline, leaving just the last field.

Replace spaces by commas.

Remove any introduced commas after or before square brackets.

Append the last field to the copy.

Make the copy current.

Remove everything between (and including) the introduced newlines.

N.B. This expects only the last field of each record to be double quoted.


Alternative:

sed -E '1b;:a;N;/"$/!ba;y/\n/ /;:b;s/("\S+) +/\1,/;tb;s/\[,/[/g;s/,\]/]/g' file

This gathers each record, turns its newlines into spaces, repeatedly replaces the run of spaces following the quoted field built so far with a comma, and finally strips the commas left just inside the brackets.
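Both versions write the repaired records to standard output, so you would normally redirect that to a new file; with GNU sed you can also edit in place, which (as noted in the comments on the question) still creates a new file behind the scenes. A sketch, assuming the script is saved to a file named fix.sed (a name of your choosing):

sed -E -f fix.sed Malformed.csv > Fixed.csv     # write a repaired copy
sed -E -i -f fix.sed Malformed.csv              # or edit in place (GNU sed)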
potong

You can use GoCSV's replace command to easily strip out newlines:

gocsv replace          \
  -c broken_column_var \
  -regex '\s+'         \
  -repl ' '            \
  input.csv

That normalizes all contiguous whitespace (\s+) to a single space.
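GoCSV writes the result to standard output, so for a file this size you would typically redirect it to a new file (the file names here are just placeholders):

gocsv replace -c broken_column_var -regex '\s+' -repl ' ' input.csv > output.csv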

A very small Python script can also handle this:

import csv
import re

ws_re = re.compile(r"\s+")

f_in = open("input.csv", newline="")
reader = csv.reader(f_in)

f_out = open("output.csv", "w", newline="")
writer = csv.writer(f_out)

writer.writerow(next(reader))  # transfer header

for row in reader:
    row[5] = ws_re.sub(" ", row[5])  # broken_column_var is the 6th column (index 5)
    writer.writerow(row)
Zach Young