
I run a test environment where I created 40,000 test files with a lorem ipsum generator. The files are between 200 KB and 5 MB. I want to modify lots of random files: I change 5% of the lines by deleting 2 lines and inserting 1 line containing a base64 string.

The problem is that this procedure needs too much time per file. I tried to fix it by copying the test file to RAM and changing it there, but I still see a single thread using only one full core, and gawk shows the most CPU work. I'm looking for solutions but haven't found the right advice. I think gawk could do this in one step, but for big files the argument string gets too long when I check against "getconf ARG_MAX".

How can I speed this up?

    zeilen=$(wc -l < testfile$filecount.txt);
    
    durchlauf=$(($zeilen/20))
    zeilen=$((zeilen-2))
    for (( c=1; c<=durchlauf; c++ ))
    do
        zeile=$(shuf -i 1-$zeilen -n 1);
        
        zeile2=$((zeile+1))
        zeile3=$((zeile2+1))
        
        string=$(base64 /dev/urandom | tr -dc '[[:print:]]' | head -c 230)
        
        if [[ $c -eq 1 ]] 
        then
        gawk -v n1="$zeile" -v n2="$zeile2" -v n3="$zeile3" -v s="$string" 'NR==n1{next;print} \
        NR==n2{next; print} NR==n3{print s}1' testfile$filecount.txt > /mnt/RAM/tempfile.tmp
        else
        gawk -i inplace -v n1="$zeile" -v n2="$zeile2" -v n3="$zeile3" -v s="$string" 'NR==n1{next; print} \
        NR==n2{next; print} NR==n3{print s}1' /mnt/RAM/tempfile.tmp
        fi
       
    done
kumpel4
  • Sounds as if it's CPU limited, not I/O limited. Maybe you could use [multi-threading](https://stackoverflow.com/questions/2425870/multithreading-in-bash) to process files in parallel? – David784 Sep 18 '21 at 15:28
    gawk is not your problem. Calling gawk and other tools repeatedly in a shell loop is your problem. See [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) for details. [edit] your question to show a [mcve] with concise, testable sample input and expected output and explain your requirements so we can help you. You might also want to use English variable names when posting examples so more people can understand your code. – Ed Morton Sep 18 '21 at 15:32
  • Doing it in a single pass will indeed be dramatically much faster, and can be done with small, constant size arguments. – that other guy Sep 18 '21 at 15:48
  • `{next; print}` doesn't do what you probably think it does; `next` says to skip rest of `gawk` script, go back to start of `gawk` script and process next input => `print` is never processed; this explains why `{next; loop change same fileprint}` does not generate an error ... the `loop change same fileprint` is never read/processed; I'm assuming you want to skip the current line, read the next line and continue processing from the same point in the script in which case you probably want to replace `next` with `getline`, though 'next' should be sufficient with some change in the overall logic – markp-fuso Sep 18 '21 at 16:07
  • to Ed Morton: I thought of changing the variable names, but later I forgot.
    to markp-fuso: I copied it from another posting; awk is too hard to understand for a small job. "loop change same fileprint" is a copy error - not from me. I will delete it.
    – kumpel4 Sep 19 '21 at 08:48
  • markp-fuso shows the whole problem I described – kumpel4 Sep 19 '21 at 10:00
  • updated the answer with a piece of code (`gen_numbers()` function) to ensure we don't generate any consecutive line numbers (to be deleted); once a working copy of the new script is functioning as desired, and assuming you want to further reduce overall run time (to process 40K files), you could look at ideas to parallelize the new script; plenty of SO answers on 'parallel' scripting but if you have issues then consider asking a new question re: parallelize operations – markp-fuso Sep 19 '21 at 14:09
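
A minimal illustration of the `{next; print}` behavior described in the comments - `next` ends processing of the current record immediately, so a `print` placed after it never runs (hypothetical three-line input):

$ printf 'line1\nline2\nline3\n' | gawk 'NR==2{next; print "never reached"} {print}'
line1
line3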

3 Answers


I don't know what the rest of your script is doing, but the example below will give you an idea of how to vastly improve its performance.

Instead of this, which calls base64, tr, head, and awk on each iteration of the loop, with all of the overhead that implies:

for (( c=1; c<=3; c++ ))
do
    string=$(base64 /dev/urandom | tr -dc '[[:print:]]' | head -c 230)
    echo "$string" | awk '{print "<" $0 ">"}'
done
<nSxzxmRQc11+fFnG7ET4EBIBUwoflPo9Mop0j50C1MtRoLNjb43aNTMNRSMePTnGub5gqDWeV4yEyCVYC2s519JL5OLpBFxSS/xOjbL4pkmoFqOceX3DTmsZrl/RG+YLXxiLBjL//I220MQAzpQE5bpfQiQB6BvRw64HbhtVzHYMODbQU1UYLeM6IMXdzPgsQyghv1MCFvs0Nl4Mez2Zh98f9+472c6K+44nmi>
<9xfgBc1Y7P/QJkB6PCIfNg0b7V+KmSUS49uU7XdT+yiBqjTLcNaETpMhpMSt3MLs9GFDCQs9TWKx7yXgbNch1p849IQrjhtZCa0H5rtCXJbbngc3oF9LYY8WT72RPiV/gk4wJrAKYq8/lKYzu0Hms0lHaOmd4qcz1hpzubP7NuiBjvv16A8T3slVG1p4vwxa5JyfgYIYo4rno219ba/vRMB1QF9HaAppdRMP32>
<K5kNgv9EN1a/c/7eatrivNeUzKYolCrz5tHE2yZ6XNm1aT4ZZq3OaY5UgnwF8ePIpMKVw5LZNstVwFdVaNvtL6JreCkcO+QtebsCYg5sAwIdozwXFs4F4hZ/ygoz3DEeMWYgFTcgFnfoCV2Rct2bg/mAcJBZ9+4x9IS+JNTA64T1Zl+FJiCuHS05sFIsZYBCqRADp2iL3xcTr913dNplqUvBEEsW1qCk/TDwQh>

you should write this, which only calls each tool once and so will run orders of magnitude faster:

$ base64 /dev/urandom | tr -dc '[[:print:]]' |
    gawk -v RS='.{230}' '{print "<" RT ">"} NR==3{exit}'
<X0If1qkQItVLDOmh2BFYyswBgKFZvEwyA+WglyU0BhqWHLzURt/AIRgL3olCWZebktfwBU6sK7N3nwK6QV2g5VheXIY7qPzkzKUYJXWvgGcrIoyd9tLUjkM3eusuTTp4TwNY6E/z7lT0/2oQrLH/yZr2hgAm8IXDVgWNkICw81BRPUqITNt3VqmYt/HKnL4d/i88F4QDE0XgivHzWAk6OLowtmWAiT8k1a0Me6>
<TqCyRXj31xsFcZS87vbA50rYKq4cvIIn1oCtN6PJcIsSUSjG8hIhfP8zwhzi6iC33HfL96JfLIBcLrojOIkd7WGGXcHsn0F0XVauOR+t8SRqv+/t9ggDuVsn6MsY2R4J+mppTMB3fcC5787u0dO5vO1UTFWZG0ZCzxvX/3oxbExXb8M54WL6PZQsNrVnKtkvllAT/s4mKsQ/ojXNB0CTw7L6AvB9HU7W2x+U3j>
<ESsGZlHjX/nslhJD5kJGsFvdMp+PC5KA+xOYlcTbc/t9aXoHhAJuy/KdjoGq6VkP+v4eQ5lNURdyxs+jMHqLVVtGwFYSlc61MgCt0IefpgpU2e2werIQAsrDKKT1DWTfbH1qaesTy2IhTKcEFlW/mc+1en8912Dig7Nn2MD8VQrGn6BzvgjzeGRqGLAtWJWkzQjfx+74ffJQUXW4uuEXA8lBvbuJ8+yQA2WHK5>
Ed Morton

Assumptions:

  • generate $durchlauf (a number) random line numbers; we'll refer to a single number as n ...
  • delete lines numbered n and n+1 from the input file and in their place ...
  • insert $string (a randomly generated base64 string)
  • this list of random line numbers must not have any consecutive line numbers

As others have pointed out you want to limit yourself to a single gawk call per input file.

New approach:

  • generate $durchlauf (count) random numbers (see gen_numbers() function)
  • generate $durchlauf (count) base64 strings (we'll reuse Ed Morton's code)
  • paste these 2 sets of data into a single input stream/file
  • feed 2 files to gawk ... the paste result and the actual file to be modified
  • we won't be able to use gawk's -i inplace so we'll use an intermediate tmp file
  • when we find a matching line in our input file we'll 1) insert the base64 string and then 2) skip/delete the current/next input lines; this should address the issue where we have two random numbers that are different by +1

One idea to ensure we do not generate consecutive line numbers:

  • break our set of line numbers into ranges, eg, 100 lines split into 5 ranges => 1-20 / 21-40 / 41-60 / 61-80 / 81-100
  • reduce the end of each range by 1, eg, 1-19 / 21-39 / 41-59 / 61-79 / 81-99
  • use $RANDOM to generate numbers between each range (this tends to be at least a magnitude faster than comparable shuf calls)

We'll use a function to generate our list of non-consecutive line numbers:

gen_numbers () {

max=$1                             # $zeilen     eg, 100
count=$2                           # $durchlauf  eg, 5

interval=$(( max / count ))        # eg, 100 / 5 = 20

for (( start=1; start<max; start=start+interval ))
do
        end=$(( start + interval - 2 ))

        out=$(( ( RANDOM % interval ) + start ))
        [[ $out -gt $end ]] && out=${end}

        echo ${out}
done
}

Sample run:

$ zeilen=100
$ durchlauf=5
$ gen_numbers ${zeilen} ${durchlauf}
17
31
54
64
86

Demonstration of the paste/gen_numbers/base64/tr/gawk idea:

$ zeilen=300
$ durchlauf=3
$ paste <( gen_numbers ${zeilen} ${durchlauf} ) <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' ) 

This generates:

74      7VFhnDN4J...snip...rwnofLv
142     ZYv07oKMB...snip...xhVynvw
261     gifbwFCXY...snip...hWYio3e

Main code:

tmpfile=$(mktemp)

while/for loop ... # whatever OP is using to loop over list of input files
do
    zeilen=$(wc -l < "testfile${filecount}".txt)
    durchlauf=$(( $zeilen/20 ))

    awk '

    # process 1st file (ie, paste/gen_numbers/base64/tr/gawk)

    FNR==NR        { ins[$1]=$2                 # store base64 in ins[] array
                     del[$1]=del[($1)+1]        # make note of zeilen and zeilen+1 line numbers for deletion
                     next
                   }

    # process 2nd file

    FNR in ins     { print ins[FNR] }           # insert base64 string?

    ! (FNR in del)                              # if current line number not in del[] array then print the line

    ' <( paste <( gen_numbers ${zeilen} ${durchlauf} ) <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' )) "testfile${filecount}".txt > "${tmpfile}"

    # the last line with line continuations for readability:
    #' <( paste \
    #         <( gen_numbers ${zeilen} ${durchlauf} ) \
    #         <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' ) \
    #   ) \
    #"testfile${filecount}".txt > "${tmpfile}"

    mv "${tmpfile}" "testfile${filecount}".txt

done

Simple example of awk code in action:

$ cat orig.txt
line1
line2
line3
line4
line5
line6
line7
line8
line9

$ cat paste.out           # simulated output from paste/gen_numbers/base64/tr/gawk
1 newline1
5 newline5

$ awk '...' paste.out orig.txt
newline1
line3
line4
newline5
line7
line8
line9
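
If the total run time across all 40K files is still too long after switching to a single gawk call per file, the comments suggest processing the files in parallel. A minimal sketch, assuming the per-file steps above are wrapped in a hypothetical helper script modify_one.sh that takes one file name as its argument:

$ printf '%s\0' testfile*.txt |
    xargs -0 -n 1 -P "$(nproc)" ./modify_one.sh    # up to $(nproc) workers, one file per invocation

Each invocation should call mktemp itself so that parallel workers don't share a temp file.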
markp-fuso
  • Thanks for your hint. My post was above yours - I don't know why - now it follows - the tool is funny – kumpel4 Sep 19 '21 at 09:47

@markp-fuso, Wow, that's incredibly fast! But there is a mistake in the script. The file grows in size a little bit, which is something I have to avoid. I think if two random line numbers ($durchlauf) follow each other, then one line is not deleted. Honestly, I don't completely understand what your command is doing, but it works very well. I think for such a task I need more bash experience.

Sample output:

64
65
66
gOf0Vvb9OyXY1Tjb1r4jkDWC4VIBpQAYnSY7KkT1gl5MfnkCMzUmN798pkgEVAlRgV9GXpknme46yZURCaAjeg6G5f1Fc7nc7AquIGnEER>
AFwB9cnHWu6SRnsupYCPViTC9XK+fwGkiHvEXrtw2aosTGAAFyu0GI8Ri2+NoJAvMw4mv/FE72t/xapmG5wjKpQYsBXYyZ9YVV0SE6c6rL>
70
71
ouflak
kumpel4
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-ask). – Community Sep 19 '21 at 09:50
  • This isn't an answer, it should be a comment and if you need to provide formatted text like sample output then edit your question to include it. – Ed Morton Sep 19 '21 at 11:59
  • I'm trying to understand what your command does. Follow-up question: what does `del[$1]=del[($1)+1]` do? In `$1` is the current shuf line number; `del` is an array – kumpel4 Sep 19 '21 at 14:12
  • from your sample output I'm assuming the 2x line numbers (to delete) were `67` and `68`; for input `67` the script will delete lines `67` and `68` and insert `gOf0V...`; for input `68` the script will delete lines `68` and `69` and insert `AFwB9...`; combined that means lines `67`, `68` and `69` are deleted, which is what your output is showing; your script recounts the numbers of lines for each pass while my script only counts lines once; in a worst case scenario with 20x consecutive line numbers my script will delete 21 lines while your script will delete 40 lines ... – markp-fuso Sep 19 '21 at 14:13
  • I've updated the answer to ensure we don't generate any consecutive line numbers (by calling the `gen_numbers()` function); if this doesn't address the issue of `file grows in size a little bit` then we'll need more details ... – markp-fuso Sep 19 '21 at 14:14
  • `$1` is the first field in the input line, in this case the `shuf/line number`; yes, `del[]` is an array; if `$1` = `36` then we'll be creating array entries `del[36]` and `del[37]`; `awk` allows for some shortcuts like not assigning a value to an array, or assigning the same value to multiple variables, eg, `a=b=c=1` is comparable to `a=1; b=1; c=1`; so `del[$1]=del[($1)+1]` is comparable to `del[$1]; del[($1)+1]` – markp-fuso Sep 19 '21 at 14:19
  • Thanks markp-fuso - that's a good idea. I'm testing another way. The files aren't allowed to grow, but they can shrink. I tested both, and your function needs much more time, so I think I will go with the addition `del[$1]=del[($1)+2]`. Thanks a lot for your help. I appreciate it a lot. – kumpel4 Sep 19 '21 at 14:47
  • that change - `del[$1]=del[($1)+2]` - doesn't address the issue of multiple consecutive line numbers; e.g., for 4 input line numbers `1,2,3,4` this change will create array entries for `1,2,3,4,5,6` ... so 6 lines, not 8, will be deleted; you need to address the 'consecutive' line number issue at the point where the numbers are being generated; having said that ... I'm not sure I understand your comment 'your function needs much more time' ... would need more details on how that's being measured ... ?? – markp-fuso Sep 19 '21 at 14:59
  • fwiw, updated the `gen_numbers()` function to replace `shuf` with `$RANDOM`; a bit faster than comparable `shuf` calls – markp-fuso Sep 19 '21 at 15:57
  • add-on: sometimes I think, while working on big files, this error message kills the script: `-bash: start_pipeline: pgrp pipe: Too many open files -bash: echo: write error: Bad file descriptor -bash: start_pipeline: pgrp pipe: Too many open files -bash: cannot make pipe for process substitution: Too many open files -bash: start_pipeline: pgrp pipe: Too many open files` – kumpel4 Sep 19 '21 at 16:59
  • I suggest you create a new question for the `pgrp` issue; there are likely a couple options to get around the issue but that should be addressed with a new question; at this point this Q&A thread is getting a bit off topic from the original issue – markp-fuso Sep 19 '21 at 17:09
  • Maybe you are right – kumpel4 Sep 19 '21 at 17:57