
I need to create a text file that contains just a dot (".") on every line, repeated until a specific number of lines, stored in a variable, is reached. I'm currently using a while loop, but these files need to be around 0.5-5 million lines, so it takes longer than I'd like. Below is my current code:

j=0
while [[ $j != $length ]] 
do
  echo "." >> $file
  ((j++))
done

So my question is: is there a more efficient way to create a file of x lines that each contain the same character (or string), other than using a while loop?

Thanks,

Petrstep

5 Answers


You can use yes and head:

yes . | head -n "$length" > "$file"

This should be dramatically faster than repeatedly opening and closing the file to write two bytes at a time.
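
If you want to measure the difference on your own machine, a rough comparison might look like the following (hypothetical file names dots_loop.txt and dots_yes.txt; $length as in the question):

length=1000000
rm -f dots_loop.txt dots_yes.txt

# loop from the question (reopens the output file on every iteration)
time { j=0; while [[ $j != $length ]]; do echo "." >> dots_loop.txt; ((j++)); done; }

# yes | head (a single redirection)
time { yes . | head -n "$length" > dots_yes.txt; }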

that other guy
    The "repeatedly opening and closing" problem could also be solved by putting the redirection after `done`. – Barmar Apr 05 '22 at 19:39
  • On my system that improves runtime by 2x, while this solution improves it by ~1900x – that other guy Apr 05 '22 at 19:46
  • I wasn't arguing against your solution, I upvoted it. – Barmar Apr 05 '22 at 19:50
  • The only downside here is that `head` will not work if `$length` is an especially large value: `yes | head -n 99999999999999999999` will fail with `head: invalid number of lines: ‘99999999999999999999’: Value too large for defined data type`. Although this is definitely an edge case and probably won't be an issue – joshmeranda Apr 05 '22 at 20:04
  • The largest contemporary file systems like ZFS can not represent files larger than this method can generate on GNU/Linux, so that's not a practical problem. A bigger concern is macOS where it's limited to 4GB. – that other guy Apr 05 '22 at 21:10
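
For reference, a minimal sketch of the redirection-after-`done` variant mentioned in the comments above: the loop from the question stays the same, but the output file is opened only once for the whole loop (assumes $length and $file are set as in the question):

j=0
while [[ $j != $length ]]
do
  echo "."
  ((j++))
done > "$file"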

Using dd to write to the output file (this took less than 2 seconds):

time yes . | dd of=dotbig.txt count=1024 bs=1048576 iflag=fullblock
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.76116 s, 610 MB/s

real    0m1.814s
user    0m0.076s
sys     0m0.686s

Count of lines:

wc -l dotbig.txt
536870912 dotbig.txt

Contents sample:

head -n 3 dotbig.txt ; tail -n 3 dotbig.txt
.
.
.
.
.
.
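
Note that the dd command above writes a fixed 1 GiB of dots rather than a specific number of lines. Since each line is two bytes (a dot plus a newline), one way to adapt it to exactly $length lines would be something like the following sketch, assuming GNU dd (coreutils) for the count_bytes flag:

yes . | dd of="$file" bs=1M iflag=fullblock,count_bytes count=$((2 * length))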
LMC

Repetitive actions in bash (e.g., via a loop) are always going to be slow, if only because of the overhead of spinning up a new OS process for each command on each pass through the loop. In this case there's the additional overhead of opening and closing the output file on each pass through the loop.

You want to look for a solution that limits the number of OS processes that need to be created/destroyed (and, in this case, limits the number of times the output file is opened/closed). There are plenty of options depending on what software/tool/binary you want to use.

One awk idea:

awk -v len="${length}" 'BEGIN {for (i=1;i<=len;i++) print "."}' > newfile

While this does use a 'loop' within awk, we're only looking at a single OS process at the bash level, and we're only opening/closing the output file once.

markp-fuso

The most resource-intensive piece of this code is the redirection (echo "." >> $file). To get around this you will want to "build" a string and redirect to $file only once rather than $length times.

j=0
builder=""
while [[ $j != $length ]]
do
    builder+=$'.\n'   # append a dot and a newline
    ((j++))
done
printf '%s' "$builder" > $file   # printf avoids adding an extra trailing newline

However, you are still in a loop, which probably isn't the best use of resources. To get around this, let's take inspiration from this answer:

printf '.\n%.0s' $(seq $length) > $file

Note that here we use $(seq $length) rather than {1..$length}: bash performs brace expansion before variable expansion, so {1..$length} is not expanded to 1 2 3 4 5 6 7 8 9 10 when length is 10; it is left as the literal word {1..10} instead (see this question).
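
A quick illustration of that ordering:

length=5
echo {1..$length}      # brace expansion runs first, so this prints: {1..5}
echo $(seq "$length")  # prints: 1 2 3 4 5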

joshmeranda
  • one potential performance issue with `seq` is that for really large series you are going to eat up a) cpu cycles to generate the series and b) memory to store the series – markp-fuso Apr 05 '22 at 19:44
  • @markp-fuso wouldn't you have the same issues for almost any solution? they all have to create the sequence and store it in memory, or is `seq` just especially bad at it? Calling `seq` in a subprocess does mean that my solution would create an extra process which isn't ideal – joshmeranda Apr 05 '22 at 20:00
  • the other 2 answers do not use/generate a sequence and both end up being 10x faster than your answer (hint: try `length=1000000` (1 mil) and time the runs); bump `length` up to, say, 20 mil and run again and use your favorite tool to watch process/memory usage – markp-fuso Apr 05 '22 at 20:07

This doubles the size of the file each time. Maybe it's more efficient than some of the other solutions, maybe not. File "b" keeps doubling in size until another doubling would take it past $length lines. When $length is a power of 2, I think this would be pretty efficient.

let n=2
let length=1000000
echo '.' > a
cat a a > b
rm a
while [[ $((n*2)) -le $length ]]; do
  mv b a
  cat a a > b
  rm a 
  let n=n*2
done
# do something here to fill out the remaining length-n lines
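
One possible way to fill in that last step, borrowing the yes/head idea from the first answer (a sketch; assumes $length >= 2, so that after the loop b holds n lines, where n is the largest power of 2 not exceeding $length):

if (( length > n )); then
  yes . | head -n $((length - n)) >> b   # append the remaining lines
fi
# b now holds exactly $length lines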
Rusty Lemur