I'm not a programmer myself, but I developed a shell script that reads a positional file and, based on the single letter at position 16, copies the whole line to another file.

Example:

INPUT FILE
003402841000011A10CNPJ08963394000195
003402841000041B20CNPJ08963394000195 16012020XX5313720087903007
003402841000011A10CNPJ08963394000195
003402841000041B20CNPJ08963394000195 16012020XX5313720087903007

OUTPUT FILE A
003402841000011A10CNPJ08963394000195
003402841000011A10CNPJ08963394000195

OUTPUT FILE B
003402841000041B20CNPJ08963394000195 16012020XX5313720087903007
003402841000041B20CNPJ08963394000195 16012020XX5313720087903007

The code I currently have:

#!/usr/bin/env bash

ARQ_IN="$1";
DIR_OUT="C:/Users/etc/etc/";

while IFS= read -r line || [[ -n "$line" ]]; 
do 

SUBSTRING=$(echo $line| cut -c16);

if [ $SUBSTRING == "A" ]
then
    echo "$line" >> "$DIR_OUT"arqA.txt;
else
    if [ $SUBSTRING == "B" ]
    then
        echo "$line" >> "$DIR_OUT"arqB.txt;
    else
        if [ $SUBSTRING == "K" ]
        then
            echo "$line" >> "$DIR_OUT"arqK.txt;
        else
            if [ $SUBSTRING == "1" ]
            then
                echo "$line" >> "$DIR_OUT"arq1.txt;
            fi
        fi
    fi
fi


done < "$ARQ_IN"

Although this code works, it is not fast enough for my needs: the input file has around 400k records.

Can someone help me write new code or improve this one?

RavinderSingh13
  • You are right that you are looking for a different approach, like `awk`. Just for the learning process 2 remarks: In `SUBSTRING=$(echo $line| cut -c16)` you should use quotes around `$line`. You can often avoid nested if-then-else blocks with `case ... esac`. – Walter A Dec 02 '20 at 20:56
  • Probably not relevant in your input data, but if $SUBSTRING is a space, `[ $SUBSTRING == x ]` will throw an error. Always quote your variables, unless you know exactly when not to. – glenn jackman Dec 03 '20 at 03:11
  • Get out of the habit of using ALLCAPS variable names, leave those as reserved by the shell. One day you'll write `PATH=something` and then [wonder why](https://stackoverflow.com/q/27555060/7552) [your script is broken](https://stackoverflow.com/q/28310594/7552). – glenn jackman Dec 03 '20 at 03:12
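
The remarks in these comments can be illustrated with a small sketch. The `classify` function, its variable names, and the second sample record (which has a space at position 16) are made up for the illustration; it only demonstrates quoting, lowercase names and `case`, and is not meant as a faster solution.

```shell
#!/usr/bin/env bash
# Sketch only: classify is a hypothetical helper; it prints the target file name
# instead of writing to it.
classify() {
    local line=$1
    local letter
    # Quoting "$line" passes the record to cut exactly as it was read.
    letter=$(printf '%s\n' "$line" | cut -c16)
    # case replaces the nested if/else chain; the quoted "$letter" is safe
    # even when it is a space or empty.
    case "$letter" in
        A|B|K|1) printf 'arq%s.txt\n' "$letter" ;;
        *)       printf 'skip\n' ;;
    esac
}

classify '003402841000011A10CNPJ08963394000195'
classify '003402841000011 10CNPJ08963394000195'
```

Run as written, this prints `arqA.txt` and then `skip`; with an unquoted `[ $letter == "A" ]` test, the second record would instead trigger a shell error.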

2 Answers

4

This is a job for awk. Please try the following; I haven't tested it with a huge dataset, but it should definitely be faster than the OP's current approach. To prepend an absolute path to the output file name, we can pass a shell variable into an awk variable and use it when building the outputFile variable.

awk '
{
  close(outputFile)
  outputFile=("output_file_"substr($0,16,1))
  print >> (outputFile)
}
' Input_file

To save the files under a complete folder path, use the following; replace /tmp/test/ with your actual path.

DIR_OUT="/tmp/test/"
awk -v folder="${DIR_OUT}" '
{
  close(outputFile)
  outputFile=(folder"arq"substr($0,16,1)".txt")
  print >> (outputFile)
}
' Input_file
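
As a possible variation (a sketch under assumptions, not part of the answer's own code): the question only names the letters A, B, K and 1, so at most four output files are open at once and the `close()` call on every line could be dropped. The sample records, the `Z` record, the temporary directory and the names `dir_out`, `input_file` and `letter` are all assumptions made for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch only: builds a tiny sample input in a temporary directory.
dir_out="$(mktemp -d)/"
input_file="${dir_out}input.txt"
printf '%s\n' \
    '003402841000011A10CNPJ08963394000195' \
    '003402841000041B20CNPJ08963394000195 16012020XX5313720087903007' \
    '003402841000011A10CNPJ08963394000195' \
    '003402841000099Z10CNPJ08963394000195' > "$input_file"

awk -v folder="$dir_out" '
{
  letter = substr($0, 16, 1)               # 16th character; awk strings are 1-based
  if (letter ~ /^[ABK1]$/) {               # ignore any other record type
    print > (folder "arq" letter ".txt")   # each file stays open until awk exits
  }
}
' "$input_file"

cat "${dir_out}arqA.txt"
```

Note the behavioural difference: inside awk, `>` truncates each output file once when it is first opened and then keeps writing to it, so a second run replaces the files, whereas the `>>` used above keeps appending across runs.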
RavinderSingh13
  • Holy.. this worked perfectly. But I don't quite understand the logic; can you explain it? Also, for the Input_file I would like to use something like ... }' "$DIR_IN""$ARQ_IN" (file directory plus file name), because this script won't stay in the same folder as the input file. – Icaro Americo Dec 03 '20 at 16:52
  • @IcaroAmerico, you're welcome. I usually add detailed explanations to my solutions, but it was late at night, so I went to sleep after posting this :) I'll add some in the comments rather than edit the answer for now, cheers. I will add them in a few minutes. – RavinderSingh13 Dec 03 '20 at 16:56
  • @IcaroAmerico, here is the explanation of the above solution. `close(outputFile)` --> closes the current output file in the background to avoid a `too many opened files` error. `outputFile=("output_file_"substr($0,16,1))` --> builds the output file name from a fixed prefix plus the 16th character of the current line. `print >> (outputFile)` --> appends the current line to that output file. I hope this helps you, cheers. – RavinderSingh13 Dec 03 '20 at 17:04
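
For the follow-up about an input file that lives in a different folder from the script, one way to sketch it is to build the path in the shell and hand it to awk as the file argument. The names `dir_in` and `arq_in` are hypothetical, modelled on the comment above, and the temporary directories and sample records stand in for the real folders and data.

```shell
#!/usr/bin/env bash
# Sketch only: temporary folders stand in for the real input and output directories.
dir_in="$(mktemp -d)/"
dir_out="$(mktemp -d)/"
arq_in="input.txt"
printf '%s\n' \
    '003402841000011A10CNPJ08963394000195' \
    '003402841000041B20CNPJ08963394000195 16012020XX5313720087903007' > "${dir_in}${arq_in}"

# Same awk program as in the answer; only the file argument changes.
awk -v folder="$dir_out" '
{
  close(outputFile)
  outputFile=(folder"arq"substr($0,16,1)".txt")
  print >> (outputFile)
}
' "${dir_in}${arq_in}"

ls "$dir_out"
```

Because the directory variables end with a slash, plain concatenation inside one pair of double quotes is enough to form the full path.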
2

Yes, bash while-read loops can be pretty slow, plus there's no need to call out to cut to get a substring. Try this:

while IFS= read -r line || [[ -n "$line" ]]; do 
    # the offset is zero-based, so use 15 not 16
    letter=${line:15:1}
    case "$letter" in
        [ABK1]) echo "$line" >> "${DIR_OUT}arq${letter}.txt" ;;
    esac
done < "$ARQ_IN"
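
To see the off-by-one point from the comment in the loop, a quick check can compare the zero-based bash expansion with the one-based `cut`. The record is taken from the question; `from_bash` and `from_cut` are just illustrative names.

```shell
#!/usr/bin/env bash
line='003402841000041B20CNPJ08963394000195 16012020XX5313720087903007'

from_bash=${line:15:1}                          # offset 15 is the 16th character
from_cut=$(printf '%s\n' "$line" | cut -c16)    # cut counts from 1

printf 'bash: %s, cut: %s\n' "$from_bash" "$from_cut"
```

Both variables hold `B` for this record, but the bash expansion gets there without starting a subshell and an external process for every line.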

For a cascading if / else-if chain, use elif:

if some condition; then
    some action
elif some other condition; then
    some other action
...
else
    some default action
fi
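
Applied to the chain in the question, the `elif` form might look like the sketch below. The `target_for` helper is a made-up name for the illustration; it prints the file name a record would go to instead of appending to it.

```shell
#!/usr/bin/env bash
# Sketch only: target_for is a hypothetical helper, not part of the original script.
target_for() {
    local letter=${1:15:1}    # 16th character of the record (zero-based offset 15)
    if [[ $letter == "A" ]]; then
        echo "arqA.txt"
    elif [[ $letter == "B" ]]; then
        echo "arqB.txt"
    elif [[ $letter == "K" ]]; then
        echo "arqK.txt"
    elif [[ $letter == "1" ]]; then
        echo "arq1.txt"
    else
        echo "none"
    fi
}

target_for '003402841000011A10CNPJ08963394000195'
```

The single `fi` at the end closes the whole chain, so the indentation no longer grows with every extra letter.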
glenn jackman