
Attempting to create a scaled-down approach to reading a large (>150 GB) CSV file into an R script by chunking it into smaller pieces that can be read in sequentially.

The problem is that one of the column variables is a kind of nested cell, similar to:

ID1,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]"
ID2,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]"
ID3,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]"

I've had some success getting just the nested cell values by running:

cat file.csv | cut -d'"' -f2

But this generates the values in the final column for ALL rows, and I would like to be able to call each occurrence sequentially (e.g. everything between "[[ and ]]" for one ID) and flatten it into a row/vector of some kind in a single file via >>

I tried variations of this solution: How to print a single cell in a csv file using bash script or awk

But it looks like there are some embedded line breaks in there that prevent it from being read correctly (everything comes up as either just the first line via head, or blanks).

I'm sure there's a sed, awk, or grep call that can handle this but I'm drawing a blank.

Edit: It has been brought to my attention that it's unclear what I'm asking for. The short answer is: I want to extract everything between the two " characters for a single line/entry in the CSV.

So that I can pipe an output like:

ID3,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]"

to its own file, with a name containing some variation of ID3.
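Something along these lines is what I'm imagining (a sketch only: it assumes the ID is always the first comma-separated field, that the quoted field contains no embedded newlines, and file.csv stands in for the real file):

```shell
# Sketch: pull the single row whose first field matches a given ID into its
# own file. Assumes the ID is the first comma-separated field, no embedded
# newlines in the quoted field; file.csv is a placeholder name.
id="ID3"
awk -F',' -v id="$id" '$1 == id' file.csv > "${id}.csv"
```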

TDeramus
  • I think you're trying to chunk that one CSV file (representing a 150GB csv) into multiple files. Would it be easier to split by rows, so that one file has only so-many-rows of fully-wide unmodified data? – r2evans May 17 '23 at 18:01
  • Please clarify what the output should be given the input shown. – G. Grothendieck May 17 '23 at 18:09
  • The scan function can handle multiple lines per record if the format is entire regular. Although the new version of the question makes me think that might not be needed. – IRTFM May 17 '23 at 18:40
  • Avoid verbal descriptions. Best to show small sample input, required output from that same input, your current output, code and exact text of error messages. Precision is the key to evidence and ultimately solutions (-; . Good luck. – shellter May 17 '23 at 19:37
  • @r2evans yes, that is the idea. – TDeramus May 17 '23 at 19:41
  • If so, perhaps the shell command `split`? See https://stackoverflow.com/q/2016894/3358272. Once the file is split into multiple files, you'll need to prepend the first row from the first file to the beginning of all of the other files so that they have the header row as well. – r2evans May 17 '23 at 19:42
  • @G.Grothendieck The output should be something along the lines of: [[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]] or ID1,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]] – TDeramus May 17 '23 at 19:43
  • @shellter would love to but not always easy with sensitive data...... – TDeramus May 17 '23 at 19:43
  • Something I forgot to mention, the length of variable6 is not always uniform. Otherwise this would be much easier regarding @r2evans split suggestion. – TDeramus May 17 '23 at 19:44
  • You have included sample input (without confidential data, it seems). Please edit your question to show the required output from that sample data. Responding in comments loses the ability to have correct formatting. – shellter May 17 '23 at 20:28
  • Still not clear what your desired output is. Your first example output before the “or” is identical to the input. – Jeff Y May 17 '23 at 22:43
  • Are you saying you just want the whole of the quoted field printed? When your input has 3 lines, please don't just post 1 line of output unless you want to delete lines (and if that's the case tell us the criteria to select which lines to print or delete) - if you have 3 lines of input the post the exact output you want from those 3 lines. And please don't post expected output or any other information in comments where it can't be formatted and could be missed - [edit] your question to include everything that's relevant to your question. – Ed Morton May 17 '23 at 23:43
  • @EdMorton is this clearer? – TDeramus May 18 '23 at 14:16
  • No, not clearer. Show required sample output immediately after your sample input, (label it as such). Good luck. – shellter May 18 '23 at 15:51
  • @TDeramus not really. You show 2 separate lines of output (one with the whole input line, the other with part of it), just tell us which output you want. You said `From each individual line to it's own file.` but you don't show 3 separate output files for your 3 lines of input. So it's not clear if you want 3 output files (one per line of input), or if you just want to select 1 line of input at a time to pipe the awk output for that 1 line to some other tool, nor what each line of the output file(s) should contain, or if you want something completely different. – Ed Morton May 18 '23 at 20:40
  • For all intents and purposes, let's say I want this one: ID3,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]" Which has been updated in the original question. – TDeramus May 19 '23 at 13:11
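The split-by-rows idea suggested in the comments above could be sketched like this (assumptions: the file has a header row, no embedded newlines inside quoted fields, and file.csv / the chunk size are placeholders):

```shell
# Sketch of r2evans' split suggestion: chunk the body into 1000-row pieces,
# then prepend the header row to every chunk. Line-oriented, so it assumes
# no embedded newlines inside quoted fields; file.csv is a placeholder name.
tail -n +2 file.csv | split -l 1000 - chunk_
for f in chunk_*; do
    { head -n 1 file.csv; cat "$f"; } > "${f}.csv" && rm "$f"
done
```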

2 Answers


It's not clear what you're trying to do, but here's a start using any awk:

$ awk -F'"' -v OFS='\t' '{
    gsub(/\[/,"")
    n = split($2,a,/][],]*/)
    for (i=1; i<n; i++) {
        print NR, a[i]
    }
}' file
1       intvar6,intvar6,intvar6
1       intvar6,intvar6,intvar6,intvar6,intvar6
1       intvar6,intvar6,intvar6,intvar6
2       intvar6,intvar6,intvar6
2       intvar6,intvar6,intvar6,intvar6,intvar6
2       intvar6,intvar6,intvar6,intvar6

Try to build on that and then ask a specific question if you get stuck.

Ed Morton
  • What about without the loop? Say if I just wanted to do it for one ID for now? – TDeramus May 17 '23 at 18:40
  • Or does the loop work for a single ID? – TDeramus May 17 '23 at 19:04
  • You don't need a loop for a single item, just print the single item, e.g. if you want the 2nd item then `print a[2]`. I think the syntax is pretty obvious so it wouldn't hurt for you to try it and then ask a new, precise question if you can't get it to do whatever it is you're trying to do. – Ed Morton May 17 '23 at 23:37
  • The following quick practice seems to be giving me what (I believe) I want, but the output is a little too large to tell: awk -F'"' -v OFS='\t' '{ gsub(/\[/,"") n = split($1,a,/][],]*/) print NR, a[1] }' file.csv But it seems to be doing it for the whole file. – TDeramus May 18 '23 at 15:02
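The attempt quoted in the last comment has two small problems: with -F'"' the quoted field is $2, not $1, and the two statements need a ";" or newline between them. An NR test restricts processing to a single input line. A corrected sketch (NR == 3 and file.csv are placeholders):

```shell
# Corrected sketch of the one-liner from the comment above: the quoted
# field is $2 with -F'"' (not $1), statements need a separator, and
# NR == 3 limits it to a single input line (here, the third).
awk -F'"' -v OFS='\t' 'NR == 3 {
    gsub(/\[/, "")
    n = split($2, a, /][],]*/)
    for (i = 1; i < n; i++) print NR, a[i]
}' file.csv
```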

For non-trivial CSV data, including data with quoted fields, it's ultimately easier to use tools that actually understand the format, for example the utilities in the ever-handy csvkit package.

Example script that extracts the 7th field of each record in your data and saves each one in its own file:

#!/usr/bin/env bash

rm -f output*.txt
while IFS=$'\037' read -r -d $'\036' lineno record; do
    # Skip header line
    if [[ $lineno = line_number ]]; then
        continue
    fi
    printf "%s\n" "$record" > "output$lineno.txt"
done < <(csvcut -l -H -c 7 input.csv | csvformat -M $'\036' -D $'\037' -U3)

csvcut -l -H -c 7 will output each record's line number together with the 7th column of the original CSV file. The csvformat bit then puts an ASCII record-separator character (0x1E) between records instead of a newline, so embedded newlines in a record can't confuse line-oriented tools, puts a unit-separator character (0x1F) between fields instead of a comma, and disables quoting. The while loop reads one record at a time using those delimiters and saves each in its own sequentially numbered file for later processing.

Shawn
  • (For anything more complicated, I'd leave out the shell completely and write a script in perl or another language (Python, which csvkit is written in, is popular) with a CSV-reading library available. Even this is getting to that point with all the field and record separator shenanigans) – Shawn May 18 '23 at 00:17
  • If you want to extract just a particular record with a given ID, adding `csvgrep` into the pipeline would be one approach. – Shawn May 18 '23 at 00:21
  • Yeah it keeps tossing this error unfortunately: CSV contains a field longer than the maximum length of 131072 characters on line 1971. Try raising the maximum with the field_size_limit parameter, or try setting quoting=csv.QUOTE_NONE. – TDeramus May 18 '23 at 14:45
  • @TDeramus That's the `-z`/`--maxfieldsize` option. – Shawn May 18 '23 at 15:33
  • Given that it's variable across lines, is there a way to remove the limit altogether? – TDeramus May 18 '23 at 15:43
  • You'd have to check the documentation – Shawn May 18 '23 at 16:10
  • I unfortunately see nothing about it there. The core issue is likely that this file should never have been written as a .csv (I counted a little under 20k lines in a single cell for field 7), but it unfortunately was. Pushing it to bash to pull characters was the best compromise I could think of, because most CSV tools can't handle this. – TDeramus May 19 '23 at 13:10
  • Scratch that. It seems to work if I give it an arbitrarily high number, but I wish there were a way to have something more objective than that. – TDeramus May 19 '23 at 13:39
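One way to avoid picking an arbitrary -z value is to pass the largest size Python's csv module will accept on your platform: csvkit is Python-based, and its -z/--maxfieldsize feeds csv.field_size_limit. A sketch (sys.maxsize is accepted on most 64-bit Unix systems, but can overflow on platforms with a 32-bit C long; input.csv as before):

```shell
# Sketch: rather than guessing an arbitrarily high -z value, pass the
# largest size the underlying Python csv module accepts on this platform
# (works on most 64-bit Unix systems). input.csv is a placeholder name.
csvcut -z "$(python3 -c 'import sys; print(sys.maxsize)')" -l -H -c 7 input.csv
```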