Unpack and re-join fixed width text

Question

I have a fixed-width file as follows:

aaaaa003aaaaaaaaaaaaaaa
bbbbb002aaaaaaaaaa
ccccc004cccccccccccccccccccc

I need to get it in the form

aaaaa003aaaaa
aaaaa003aaaaa
aaaaa003aaaaa
bbbbb002aaaaa
bbbbb002aaaaa
ccccc004ccccc
ccccc004ccccc
ccccc004ccccc
ccccc004ccccc

My current script is inefficient for 11 million lines. How can I optimise it?

#!/bin/sh
# My first Script
echo "Unbulking"
IN=$1
OUT=$2
while IFS= read -r line;do
    HEAD=${line:0:8}
    BODY=$(echo $line | sed -r 's/.{8}//')
    BODYVAR=$(echo $BODY |fold -w 5)
    for i in ${BODYVAR}
    do
        echo $HEAD$i >> $OUT
    done
done < $IN
echo "Completed"

My logic needs to be along the lines:

#take the first 8 characters of a line and assign to str1
#take the last 3 characters of str1, cast to an integer and assign to num1
#multiply num1 by 5 and assign to num2
#take the num2 characters that follow char 8 and assign to str2
#cut str2 into chunks of 5 and assign to an array arr1
#concatenate str1 with each element of arr1
#return arr1 as a set of new lines
#repeat for every line in the file
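
The steps above can be sketched in awk roughly as follows. This is only an illustration under assumptions: `unpack` is a hypothetical helper name, the count is assumed to always sit in columns 6-8, and anything beyond count × 5 characters is assumed to be safe to drop (the post does not say what should happen to such extra text).

```shell
#!/bin/sh
# Sketch only: unpack is a hypothetical name, not part of the original script.
# Assumes an 8-character header whose last 3 characters are a chunk count.
unpack() {
    awk '{
        head = substr($0, 1, 8)          # str1: first 8 characters
        n    = substr(head, 6, 3) + 0    # num1: last 3 characters of str1 as a number
        body = substr($0, 9, n * 5)      # str2: at most num1 * 5 characters after the header
        for (i = 1; i <= length(body); i += 5)
            print head substr(body, i, 5)    # header followed by one 5-character chunk
    }' "$@"
}

# Count is 002, so only the first two chunks are used and "ccccc" is dropped.
printf '%s\n' 'aaaaa002aaaaabbbbbccccc' | unpack
```

With that input the sketch prints `aaaaa002aaaaa` and `aaaaa002bbbbb`. If the count can be trusted to match the data, it can be ignored entirely and the line simply cut into 5-character pieces.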

Welcome to the forums. Do you mean you need to split all characters after the 3 digits into groups of 4? Could you please clarify the logic behind the sample expected output in your question? – RavinderSingh13 May 07 '20 at 12:08
What happens if a line is `aaaaa003xyz` or `aaaaa003x(*16)`? – Kent May 07 '20 at 12:15

Ed Morton · Answer 1 · 2020-05-07T14:02:11.413

Don't try to manipulate text with a shell loop. The extreme slowness you've already noticed is just one of the issues you'll have; see "Why is using a shell loop to process text considered bad practice?" for that issue, and see https://mywiki.wooledge.org/Quotes, https://mywiki.wooledge.org/DontReadLinesWithFor, and "Correct Bash and shell script variable capitalization" for some of the other issues in the script you posted.

Using any awk in any shell on every UNIX box:

$ cat tst.awk
{
    head = substr($0,1,8)
    tail = substr($0,9)
    while ( tail != "" ) {
        print head substr(tail,1,5)
        tail = substr(tail,6)
    }
}

$ awk -f tst.awk file
aaaaa003aaaaa
aaaaa003aaaaa
aaaaa003aaaaa
bbbbb002aaaaa
bbbbb002aaaaa
ccccc004ccccc
ccccc004ccccc
ccccc004ccccc
ccccc004ccccc
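
As a quick check of the short-line case raised in the comments, here is the same loop wrapped in a shell function. `unbulk` is a hypothetical name used only for this illustration, and the input line is the one from Kent's comment.

```shell
#!/bin/sh
# Same logic as tst.awk, inlined; unbulk is a hypothetical wrapper name.
unbulk() {
    awk '{
        head = substr($0,1,8)
        tail = substr($0,9)
        while ( tail != "" ) {
            print head substr(tail,1,5)   # a final chunk may be shorter than 5
            tail = substr(tail,6)
        }
    }' "$@"
}

printf '%s\n' 'aaaaa003xyz' | unbulk
```

This prints the single line `aaaaa003xyz`: the loop never consults the `003` count, so a body shorter than 5 characters is emitted as one short chunk.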

oguz ismail · Answer 2 · 2020-05-07T14:40:01.670

Your entire script can be translated into gawk like this:

gawk 'BEGIN {
  FPAT=".{1,5}"
  OFS=""
}
{ head = substr($0,1,8)
  $0 = substr($0,9)
  for (i=1; i<=NF; i++)
    print head, $i
}' file

edited May 07 '20 at 14:40

answered May 07 '20 at 12:23

James Brown · Answer 3 · 2020-05-07T12:45:02.960

One for GNU awk. It splits the record on a string of digits, then prints the part before the digits (a[1]), the digits themselves (seps[1]), and the part after them (a[2]) in 5-character chunks:

$ gawk '{
    split($0,a,/[0-9]+/,seps)
    while(length(a[2])) {
        print a[1] seps[1] substr(a[2],1,5)
        a[2]=substr(a[2],6) 
    }
}' file

Output:

aaaaa003aaaaa
aaaaa003aaaaa
aaaaa003aaaaa
bbbbb002aaaaa
bbbbb002aaaaa
ccccc004ccccc
ccccc004ccccc
ccccc004ccccc
ccccc004ccccc

GNU awk only, as it uses the fourth parameter of split(), seps.

Update: Another version:

$ awk '{
    while(p=substr($0,9,5)) {
        print substr($0,1,8) p
        $0=substr($0,1,8) substr($0,14)
    }
}' file

edited May 07 '20 at 12:45

answered May 07 '20 at 12:17

Compact, but it doesn't follow the OP's requirement: the OP didn't say the first part `aaaaa` can't contain `0-9`. – Kent May 07 '20 at 12:21
Updated with another version (still didn't read the reqs, though :). – James Brown May 07 '20 at 12:46
