
I have a CSV file which contains millions of rows. A few of the rows contain more data than the column data type can accommodate. For example, suppose the CSV file has only the two rows shown below (please note that ^_ is the delimiter):

A^_B^_C
AA^_BB^_CC

Now assume each column can accommodate only one character, so row 1 is correct but row 2 is not, and I want to find all lines of this kind.

So I thought that if I found the longest lines in the CSV file I would be good, and I tried the following, but it is not helping:

wc -L file

Please help me find the largest line/column in a CSV file. Another problem is that my delimiter is two characters, so I can't use the cut command either.
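(For context: wc -L prints only the length of the longest line, not which line it is, and it says nothing about individual fields. A rough awk sketch that at least reports the line number as well:

awk '{ if (length($0) > max) { max = length($0); maxNr = NR } }
END { print "line " maxNr " is longest at " max " characters" }' file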

Umesh K

3 Answers


You can try something like:

awk '
{
    for (i=1; i<=NF; i++)
        if (length($i) <= good) { continue }   # field fits within the limit
        else {
            print "Row "NR" contained data more than "good" in a single field"
            next
        }
    print "Row "NR" is valid"                  # reached only if every field fit
}' FS='\\^_' good="1" csv
Row 1 is valid
Row 2 contained data more than 1 in a single field

Explanation:

  • We set the field separator to \\^_ (note that we need to escape ^ since it is a special regex character).
  • We iterate over each field.
  • We check the length of each field against the variable (good) passed at run time.
  • If the field is within the limit, we continue to the next field, and so on until the last field of the row.
  • If any field is longer than good, we print a message with the row number using NR and skip to the next record.
  • If all the fields are within the limit, we print a message stating the row is valid. A quick check with a different good value follows below.
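Since good is passed at run time, the limit can be changed without touching the script. The same logic as a one-liner, shown with good=2 so both sample rows pass:

awk '{ for (i=1; i<=NF; i++) if (length($i) > good) { print "Row "NR" contained data more than "good" in a single field"; next } print "Row "NR" is valid" }' FS='\\^_' good="2" csv
Row 1 is valid
Row 2 is valid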

If I misunderstood your question, feel free to leave a comment.

jaypal singh
  • hi, thanks a lot for the answer. For the sake of simplicity I said there is only one length for each column, but the length is variable and I don't know the lengths of all the columns. I just want the longest column to be visible in my output, or the longest line with the culprits. – Umesh K Mar 19 '14 at 17:34
  • How do you define the longest column if you don't know what length to compare it with? If my acceptable length is, say, 10, I will report 11, 1000, 40000 and the longest. You can just do `awk '{print "Row "NR" is "length($0)" characters"}' FS='\\^_' csv` to print the length of each row. – jaypal singh Mar 19 '14 at 17:38
  • +1 this is a pretty good answer for a not-so-well-defined problem. – anubhava Mar 19 '14 at 17:40
awk -F'\\^_' -v OFS=':' '
{
    # scan every field, remembering the longest one seen so far
    for (i=1; i<=NF; i++) {
        if (length($i) > max) {
            max    = length($i)
            lineNr = NR
            line   = $0
            fldNr  = i
            fld    = $i
        }
    }
}
END {
    # report the line containing the longest field, then the field itself
    print lineNr, line
    print fldNr, fld
}
' file
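This keeps a running maximum across every field of every line and reports the winner once at the end. With the two sample lines from the question it prints:

2:AA^_BB^_CC
1:AA

i.e. the longest field is on line 2 (shown in full first), and it is field 1, AA.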
Ed Morton
  • +1: Now that I see your answer, the question has become more clear. Ironic, isn't it? `:P` – jaypal singh Mar 19 '14 at 20:10
  • Without the OP posting his expected output, we're all just guessing. – Ed Morton Mar 19 '14 at 20:11
  • True, but I hate the fact that you always end up making the best guess! `;)` – jaypal singh Mar 19 '14 at 20:12
  • That remains to be seen... Depends what `the largest line/column` means: the longest field, the line containing the longest field, the longest line, all the fields from the field number containing the longest field, or something completely different. – Ed Morton Mar 19 '14 at 20:14

Here's an answer that requires defining the acceptable column lengths in a single-line file that uses the same delimiter as the data (assuming different columns can have different acceptable lengths):

1^_1^_1

Using that file (which I called clengths) and using split() in a lazy way to get indexed elements:

awk -F'\\^_' '

  NR==FNR {split($0,clen,FS); next} # store the lengths

  {
    split($0,a,FS); # put the current line in an array
    for( i in a )
    { if( length(a[i]) > clen[i] ) print "["FNR","i"] = "a[i] }
  }
' clengths data

This outputs array-style indexes for the too-long data as [row, col], a.k.a. [line #, field #], starting at [1,1]:

 [2,1] = AA
 [2,2] = BB
 [2,3] = CC

Everything in the output is "too big" and is indexed to make finding it again easier.
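Since awk has already split each record into fields, the second split() isn't strictly required. A slightly tidier sketch of the same idea (same assumed clengths and data files) uses the built-in fields, which also guarantees the fields are checked in order (for (i in a) does not):

awk -F'\\^_' '
  NR==FNR { for (i=1; i<=NF; i++) clen[i] = $i; next }  # store the lengths
  {
    for (i=1; i<=NF; i++)
      if (length($i) > clen[i]) print "["FNR","i"] = "$i
  }
' clengths data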

n0741337