
I'm parsing through a file. The first thing I do is concatenate the first three fields and prepend them to each record. Then I want to scrub the data of any colons, single quotes, double quotes, or backslashes. Below is how I'm doing it, but is there a way to do it using the $line variable that would be more efficient?

# Read the lines one by one.
while ($line = <$FH>) {

    # Split the fields, concatenate the first three fields,
    # and add the result to the beginning of each line in the file.
    chomp($line);
    my @fields = split(/,/, $line);
    unshift @fields, join '_', @fields[0..2];

    # Scrub data of characters that cause scripting problems down the line.
    $_ =~ s/:/ /g  for @fields[0..39];
    $_ =~ s/\'/ /g for @fields[0..39];
    $_ =~ s/"/ /g  for @fields[0..39];
    $_ =~ s/\\/ /g for @fields[0..39];

    # ... further processing of @fields happens here ...
}
BigRedEO
  • You could do the substitutions before creating the array of elements. Another thing would be to use a character set for the substitution: `s/[:\'"\\]//g` instead of the 4 substitutions – LaintalAy Apr 28 '16 at 13:56
  • 2
    You should probably reverse your logic here, i.e. for each field, apply all of these substitutions. I think the *right* answer here though is you should be using a module like [`Text::CSV_XS`](https://metacpan.org/pod/Text::CSV_XS) and then you wouldn't need to do any sanitation. – Hunter McMillen Apr 28 '16 at 13:57
  • @HunterMcMillen Unfortunately, no Text modules available to me, nor will they be made available to me. – BigRedEO Apr 28 '16 at 14:07
  • 1
    They don't need to be "made available". You managed to install your script, so you can manage to install those scripts called modules too. – ikegami Apr 28 '16 at 14:53
  • @ikegami - I do not have privileges on the server with which I'm working to download any Perl modules. And those privileges will not be made available to me. – BigRedEO Apr 28 '16 at 15:00
  • 2
    Either you have privilege to put Perl code on it or you don't, so you have sufficient privileges to install a Perl module. – ikegami Apr 28 '16 at 15:03
  • 1
    @BigRedEO [You don't need root/admin privileges to install modules.](http://stackoverflow.com/q/3735836/176646) You make your life much more difficult by not using modules. Don't reinvent the wheel if you don't have to. – ThisSuitIsBlackNot Apr 28 '16 at 15:14
  • Unfortunately, I do need them in order for this server to connect outside our own network, and I've been told "No - you're not to add any modules." Which is fine, because I'm learning a lot of basics (having barely touched Perl once about 8 years ago) and making my script work without it. – BigRedEO Apr 28 '16 at 15:20
  • 1
    @HunterMcMillen: *"you should be using a module like Text::CSV_XS and then you wouldn't need to do any sanitation"* I wish that were true, but `Text::CSV` is too often seen to be a cure-all. The OP is deleting single and double quotes, colons and backslashes from the data, and the module won't do anything like that. Once in a while a CSV file comes along that is best parsed with `Text::CSV`, and it is usually the output from Microsoft Excel. The rest of the time, a simple `chomp` followed by `split /,/` is the far better option – Borodin Apr 28 '16 at 17:18
  • Firstly, if you're not using CPAN modules then you're cutting yourself off from most of Perl's power. That's a problem that you should spend some effort fixing. Secondly, [Text::ParseWords](perl11.org) is part of the standard Perl distribution (which means no-one needs to install it) and may well help you here. – Dave Cross Apr 29 '16 at 09:35
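For reference, here is a minimal sketch of what the Text::ParseWords suggestion above might look like. It is an illustration only, not code from the question: it reuses the question's $FH handle, assumes fields use conventional quoting, and keeps the question's "replace with a space" scrub.

use strict;
use warnings;
use Text::ParseWords;   # core module, nothing to install

while ( my $line = <$FH> ) {
    chomp $line;

    # parse_line splits on commas and understands quoted fields;
    # the 0 means "strip the surrounding quotes"
    my @fields = parse_line(',', 0, $line);

    unshift @fields, join '_', @fields[0..2];

    # Replace problem characters with spaces, as in the question
    s/[:'"\\]/ /g for @fields;
}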

2 Answers

2

What would be cleaner for me:

while($line = <$FH>) {
    chomp($line);

    # Scrub the whole line in a single pass, before splitting
    $line =~ s/[:\'"\\]/ /g;

    my @fields = split(/,/, $line);
    unshift @fields, join '_', @fields[0..2];
}

And as @HunterMcMillen said, if this is a standard CSV file it would be better to use a parsing module. It will be easier down the road.
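
For completeness, a minimal sketch of what that module-based approach could look like, assuming Text::CSV (or the faster Text::CSV_XS) were installable and the input is well-formed CSV; the unshift and the scrub are carried over from the question:

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag;

while ( my $row = $csv->getline($FH) ) {
    my @fields = @$row;
    unshift @fields, join '_', @fields[0..2];

    # The module handles quoting and splitting; the scrub is still up to you
    s/[:'"\\]/ /g for @fields;
}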

LaintalAy
  • That one would kill off a `"` in the middle of any field. – Sebastian Apr 28 '16 at 14:05
  • @Sebastian - It will get rid of any " character. That's what he's doing.... or am I missing your point? – LaintalAy Apr 28 '16 at 14:08
  • eballes - No Text modules available for me (and they won't be made available either). @Sebastian - Unclear what you're saying here. – BigRedEO Apr 28 '16 at 14:09
  • `1,2,foo "bar" baz,"don't try this at home"` would become `1, 2, foo bar baz, dont try this at home` - your solution on $line would basically drop all special chars from anywhere in the line - even in the middle of a field (column). – Sebastian Apr 28 '16 at 14:22
  • @Sebastian - that is exactly what I want. Some of these fields/columns are from user input and they put all these characters in that throw off other programs that will be using this data. Going to try this out eballes. – BigRedEO Apr 28 '16 at 14:24
  • @eballes - cr*p - just discovered I can't scrub the colons for the line because there are two DATETIME fields at the very end where I need to keep the colons, so I only need to delete the other three, then will have to delete the colons in all but those last two fields. – BigRedEO Apr 28 '16 at 15:01
  • 2
    `@BigRedEO`: You're not describing your problem properly. People are coming up with code that implements their best guess at what you mean, and you say "Yeah, but I have this too...". Imagine what @eballes is feeling. That's not a good way to get an answer. It sounds like you need to split every record into fields and specify a transformation for each one, and I think you must open a new question if you can't make progress yourself – Borodin Apr 28 '16 at 17:54
1

I am certain that I have seen a very similar question before, but my simple searches won't find it. What stands out is adding a new field before all of the rest that is a function of the original values.

You've described that best in Perl terms:

unshift @fields, join '_', @fields[0..2];

so the only step left is the removal of rogue characters: single and double quotes, colons, and backslashes.

Your code seems to work fine. The only changes I would make would be

  • Use the default variable `$_` properly. I think this is what newcomers hate most about Perl, and then come to love most once they understand it

  • Use `tr///d` instead of `s///`. It may add a little speed, but most of all it frees you from regex syntax when all you want is to say which characters to delete

I think this should do what you need

use strict;
use warnings 'all';

while ( <DATA> ) {

    chomp;
    my @fields = split /,/;

    unshift @fields, join '_', @fields[0..2];

    tr/:"'\\//d for @fields; # Delete colons, quotes, and backslash

    print join(',', @fields), "\n";
}

__DATA__
a:a,b"bb",c'ccc',ddd,e,f,g,h

output

aa_bbb_cccc,aa,bbb,cccc,ddd,e,f,g,h
Borodin
  • There will be colons in the last two fields of every record that must STAY in the record, hence my separating the colon scrub so that the last two fields are untouched. How would that change your answer above? – BigRedEO Apr 28 '16 at 17:12
  • 3
    @BigRedEO: Thanks. But remember that no one knew that your record was longer than 40 fields. You've only now revealed that you want different treatment for the 41st field onward. That's one reason why it's important to show sample input data and the corresponding required output data. I don't think you're going to get far with this statement of your problem. – Borodin Apr 28 '16 at 17:50
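
Picking up the thread above, here is a rough sketch (not from either answer) of how the colon scrub could be limited so that the last two fields keep their colons. The field layout and the sample data are assumptions; the record is assumed to end with two DATETIME fields.

use strict;
use warnings 'all';

while ( my $line = <DATA> ) {
    chomp $line;
    my @fields = split /,/, $line;
    unshift @fields, join '_', @fields[0..2];

    # Quotes and backslashes go from every field
    tr/"'\\//d for @fields;

    # Colons go from every field except the last two (assumed DATETIME)
    tr/://d for @fields[0 .. $#fields - 2];

    print join(',', @fields), "\n";
}

__DATA__
a:a,b"bb",c'ccc',ddd,2016-04-28 13:56:00,2016-04-28 14:07:00

With that sample line the output is `aa_bbb_cccc,aa,bbb,cccc,ddd,2016-04-28 13:56:00,2016-04-28 14:07:00`, leaving the trailing timestamps untouched.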