
I have around 250,000 files whose file names look like `read_\d\d.fasta`.

I get an "Argument list too long" error when running `cat *.fasta > all.fasta`.

Normally I use a `for` loop or `find` when I hit the "Argument list too long" error.

How can I use a `for` loop (or any other method) to concatenate this many files?

I have tried `for i in read*fasta ; do cat $i >> combined.$i ; done`. However, this doesn't concatenate the files; it just copies each file to its own `combined.<name>`.

I have looked at other answers about this error, but I don't see how `for`/`find` can be used here.

James Risner
SaltedPork
    `for` solution: `for f in *.fasta ; do cat "$f"; done > all.fasta` – Wiimm Feb 01 '23 at 11:29
  • 2
    `gnu find` solution: `find . -type f -name '*.fasta' -exec cat {} + > all.fasta`. Add options `-mindepth` and/or `-maxdepth`to limit directory depth. – Wiimm Feb 01 '23 at 11:32
  • 1
    `find` solution: `find . -type f -name '*.fasta' -print0 | xargs -0 cat > all.fasta` – Wiimm Feb 01 '23 at 11:34
  • See [Argument list too long error for rm, cp, mv commands](https://stackoverflow.com/q/11289551/4154375). – pjh Feb 01 '23 at 12:54
  • Keep in mind the race condition that allows `all.fasta` to be treated as a result of `*.fasta`, which could result in an infinite loop that fills your file system. – chepner Feb 01 '23 at 16:00
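A quick sketch of one way around the race condition mentioned in the last comment: write the combined file somewhere the `*.fasta` glob cannot match it, so `cat` never reads its own growing output (the `out` directory name here is just illustrative).

```shell
# Write the output outside the directory being globbed, so all.fasta
# can never be picked up as an input by *.fasta.
mkdir -p out
for f in *.fasta; do cat "$f"; done > out/all.fasta
```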

2 Answers


Try

printf '%s\0' *.fasta | xargs -0 cat -- > all.fasta
  • Using NUL characters as path delimiters (`\0` in the `printf` format string, `-0` option for `xargs`) means that this will work for arbitrary filenames, including ones that contain newlines.
  • The `--` after `cat` means that it will work even if some of the files have names that begin with `-`.
  • `printf` avoids the "Argument list too long" error because it is a shell built-in. Unlike external commands such as `cat`, built-ins are not launched via the `exec` system call, and the argument-length limit is enforced by `exec`.
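To see the pipeline in action without 250,000 files, here is a small scratch-directory demo (the `demo` directory name and the file count are illustrative, not part of the answer):

```shell
# Create 300 tiny one-line FASTA files, then concatenate them with the
# built-in printf feeding NUL-separated names to xargs.
mkdir -p demo && cd demo
for i in $(seq -w 1 300); do printf '>seq%s\n' "$i" > "read_$i.fasta"; done
printf '%s\0' *.fasta | xargs -0 cat -- > ../all.fasta
wc -l ../all.fasta   # expect 300 lines, one per input file
```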
pjh
  • I would expect `printf` to have the same problem, as it's another simple command that receives each result from `*.fasta` as a separate argument. – chepner Feb 01 '23 at 15:57
  • @chepner, `printf` doesn't have the problem because it is a shell built-in. The limitation is associated with the "exec" system call, which is used to run a `cat` command but not `printf`. As always, I tested the code before posting it. – pjh Feb 01 '23 at 17:31
  • I assumed you did, but I would consider this an undocumented optimization, not something to rely on. `printf`, built-in or not, is still part of a *simple* command, and should be assumed to have the same semantics as any other command. (It's not a special *compound* command like `for` or `[[ ... ]]`.) – chepner Feb 01 '23 at 17:36
  • @chepner, I disagree. `printf` is documented as a Bash built-in. "exec" is not used for built-ins. I think the code is safe in all versions of Bash, and always will be. – pjh Feb 01 '23 at 17:40
  • Ah, ok, sorry. I was confusing `printf` not needing `exec` with `printf` expanding the glob itself (which is, I think, what `for f in *.fasta` does). – chepner Feb 01 '23 at 17:41
  • @chepner, sorry for my lack of clarity. I've added a note to my answer to try to improve it. Thanks for taking the time to point out the issue. – pjh Feb 01 '23 at 18:03
find . -name read_\[0-9\]\[0-9\].fasta -print0 | xargs -0 cat > all.fasta

This combination of `find` and `xargs` safely finds these files without exceeding the `execv()` argument limit, and because `all.fasta` cannot match the `-name` pattern, there is no risk of an infinite self-read loop.

  • The `.` tells `find` to search your directory of 250k files.
  • `-name` matches your pattern, `read_\d\d.fasta`, written as the glob `read_[0-9][0-9].fasta`.
  • `-print0` null-terminates each name, which is safe for arbitrary filenames.
  • `xargs -0` reads the null-terminated names and runs `cat` on batches of them.

This works on Linux and FreeBSD (and presumably other *BSDs).
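For illustration, a scratch-directory run of this pipeline (file names and contents are made up) shows that only the `read_NN.fasta` names are collected, and that `all.fasta` itself can never match the `-name` pattern:

```shell
# Only read_NN.fasta files are matched; notes.txt and all.fasta are not.
mkdir -p scratch && cd scratch
printf 'one\n' > read_01.fasta
printf 'two\n' > read_02.fasta
printf 'noise\n' > notes.txt
find . -name 'read_[0-9][0-9].fasta' -print0 | xargs -0 cat > all.fasta
```

Quoting the pattern, as here, is equivalent to the backslash-escaping in the answer; both keep the shell from expanding it before `find` sees it.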

James Risner