
I'm unable to count the number of records in a CSV file, remove the newline characters that occur inside a single record, and write the cleaned output to another file. The CSV file looks like this:

RandomName,FileName,Date,OwnerName
"f","df",10/12/1298,"dgds"
-13,"fg
dhd
fd
f",10/22/1029,"dvg 
tr
-456
3gf"
"123","fd13",13/23/1245,"13
sdg
fsdg"
dv,"Df",12/12/3455,"adf"

Expected Output

RandomName,FileName,Date,OwnerName
"f","df",10/12/1298,"dgds"
-13,"fgdhdfdf",10/22/1029,"dvgtr-4563gf"
"123","fd13",13/23/1245,"13sdgfsdg"
dv,"Df",12/12/3455,"adf"

The file is 132 GB in size. I'm currently using this solution:

perl -0777 -pe 's/((?:,"|(?!^)\G)[^",\n]*)\n/\1/g; s/,\n/,/' "${dir}" | wc -l

But it's throwing a kernel soft lockup error. I have shell/awk/perl on my server. My file can contain:

  1. any number of records
  2. a size of up to 132 GB
  3. special characters ($, @, #, *, -, _, %)
  4. newline characters occurring more than once within a single record

Kindly help me find a solution that prints the output to the console and writes it to another CSV file as well. Thanks in advance.

  • Welcome to Stack Overflow. So number of records means all lines which are not null and having values right? And I believe we need not to count empty line? Could you please confirm on same? – RavinderSingh13 Feb 13 '18 at 11:26
  • python acceptable? – sKwa Feb 13 '18 at 11:28
  • There'll be no empty lines. So we need not count that. – diksha ojha Feb 13 '18 at 11:38
  • 3
    *"Solution required for printing the output to console and to another csv file as well"* You seem to be under the impression that you're my manager. Please treat this site as you would if you were being offered high-quality programming effort free of charge. – Borodin Feb 13 '18 at 11:43
  • Could you show what output you expect from that sample input? It's not clear to me from your explanation. Also, does your Perl command work correctly for this small sample? (I.e. you have a simple scaling issue - it does look like it with `perl -0777`, which will attempt to read the entire file as a single line before operating on it). – Toby Speight Feb 13 '18 at 13:21
  • 1
    Also, what characters do you consider to be "special"? – Toby Speight Feb 13 '18 at 13:22
  • See https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for how to do it with awk. – Ed Morton Feb 13 '18 at 14:05
  • @TobySpeight Yes 0777 slurp files whole. But this solution is not working for big files. Special characters as in - ($,@,#,*,-,_,%) – diksha ojha Feb 14 '18 at 06:58
  • @sKwa I have python 2.6.6 – diksha ojha Feb 14 '18 at 06:59
  • @dikshaojha: question is **on hold**, I can't answer, but check [this](https://pastebin.com/3GUydhei) – sKwa Feb 14 '18 at 07:38
  • Reading entire files into the process tends to cause performance problems if you don't have enough memory for it. Process the file a row at a time, e.g. using a CSV library as recommended in Borodin's answer. – Toby Speight Feb 14 '18 at 10:18
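The row-at-a-time approach suggested in the comments above can be sketched with Python's standard `csv` module, which the comment thread indicates is available on the server (this sketch targets Python 3; the OP's 2.6 ships the same module with minor API differences). The `clean_csv` function name and file paths are illustrative, not from the original post:

```python
import csv

def clean_csv(in_path, out_path):
    """Stream the CSV row by row, strip newlines embedded in quoted
    fields, write the cleaned rows to out_path, and return the row count."""
    count = 0
    # newline='' lets the csv module handle line endings itself, which is
    # required for quoted fields that span multiple physical lines.
    with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # Drop CR/LF characters inside each field, matching the
            # expected output shown in the question.
            cleaned = [field.replace('\r', '').replace('\n', '') for field in row]
            writer.writerow(cleaned)
            count += 1
    return count
```

Only one row is held in memory at a time, so the 132 GB file size is not a problem, unlike `perl -0777`, which slurps the whole file. Note that `csv.writer` re-quotes only the fields that need it, so the quoting in the output may differ cosmetically from the input.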

1 Answer


You just need to use the Text::CSV_XS module with the binary option enabled. This allows quoted fields to contain control characters, including CR and LF. The _XS suffix indicates that the module has a substantial C component, so it provides the most efficient solution available without writing the whole thing in C.

This program expects the input file as a parameter on the command line.

You don't say anything about the output that you want, so I have used the Data::Dump module to display the result of using Text::CSV_XS to parse each row of your example data.

use strict;
use warnings 'all';

use Data::Dump 'pp';
use Text::CSV_XS;

my ( $csv_file ) = @ARGV or die "CSV File parameter missing";

open my $fh, '<', $csv_file or die qq{Unable to open "$csv_file" for input: $!};

my $csv = Text::CSV_XS->new( {
    binary => 1,    # allow embedded CR/LF inside quoted fields
} );

my $num_records = 0;

while ( my $row = $csv->getline( $fh ) ) {

    print pp($row), "\n\n";

    ++$num_records;
}

printf "Total of %d %s\n\n",
        $num_records,
        $num_records == 1 ? 'record' : 'records';

output

["RandomName", "FileName", "Date", "OwnerName"]

["f", "df", "10/12/1298", "dgds"]

[-13, "fg\ndhd\nfd\nf", "10/22/1029", "dvg \ntr\n-456\n3gf"]

[123, "fd13", "13/23/1245", "13\nsdg\nfsdg"]

["dv", "Df", "12/12/3455", "adf"]

Total of 5 records
Borodin
  • Thank you for replying. I don't think I'll be having permissions to download these modules onto server. Please help me with some alternate solution if possible. Thanks – diksha ojha Feb 14 '18 at 07:30
  • @Borodin: one remark, total rows `$num_records -1`, because of header. – sKwa Feb 14 '18 at 07:44
  • @diksha: It's very defeatist to presume failure before you've even tried. Are you even sure that `Text::CSV` isn't already installed? Your question is on hold, so you're unlikely to get a different answer now. – Borodin Feb 14 '18 at 12:28
  • @sKwa: The OP doesn't say exactly what they want counted. I'm sure they also want something more than `print pp($row)` but we don't know what. People mostly know how to subtract one from a variable if they need to. – Borodin Feb 14 '18 at 12:32
  • @diksha: Remember that any Pure Perl solution will be much slower than an `XS` module. Surely your company would prefer a module from an identifiable author on CPAN than some random, slow code from an unidentifiable contributor on SO? – Borodin Feb 14 '18 at 12:37