
I have a directory containing ~300K text files that I would like to concatenate into a single file, separating the contents of consecutive files with a newline \n. For example:

file1 = 'i like apples'
file2 = 'john likes oranges'
output = 'i like apples\njohn likes oranges'

The problem is that due to the large number of files, commands like

awk '{print}' dir/* > combined.txt

throw an error about the argument list being too long. Is there a quick way to get around this issue? I have been trying to find a way to use piping but have been unsuccessful so far.

The text files do not end in a \n.
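
(For reference, the limit behind the "argument list too long" error is the kernel's cap on the total size of arguments passed to an exec'd command; a quick way to inspect it, assuming a Linux/POSIX system, is shown below.)

# The expanded dir/* glob exceeds this limit, so any external command
# handed the glob directly fails before it even starts.
getconf ARG_MAX    # print the limit in bytes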

Orest Xherija

3 Answers


To avoid the long command line, you can use a shell construct such as a for loop:

for f in dir/*; do cat "$f"; printf '\n'; done > combined.txt
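
If forking cat and printf once per file is too slow for ~300K files, a bash-only variant avoids the external processes (just a sketch; it assumes each file fits comfortably in memory, and $(<file) drops trailing newlines, which is fine here since the files have none):

# $(<"$f") reads the file without forking cat; printf is a bash builtin
# and '%s\n' appends exactly one newline after each file's contents.
for f in dir/*; do printf '%s\n' "$(<"$f")"; done > combined.txt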

If the order of files in the combined file doesn't matter, you can use find instead:

find dir -type f -exec sed -s '$s/$/\n/' {} + > combined.txt

This uses the + form of find -exec to minimize the number of times the command is run, while still avoiding command lines that are too long.

sed -s '$s/$/\n/' appends a newline to the last line of a file, which supplies the separator the files themselves lack; -s makes sure the change is applied to every file separately when multiple files are supplied as arguments, instead of treating them as one long stream.
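
As a quick illustration (a sketch, assuming GNU sed), this is what it does to the two example files from the question, neither of which ends in a newline:

printf 'i like apples' > file1          # no trailing newline
printf 'john likes oranges' > file2     # no trailing newline
sed -s '$s/$/\n/' file1 file2
# output: i like apples
#         john likes oranges
# i.e., each file's contents followed by exactly one \n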

Benjamin W.
  • This is almost there, but it does not add a `\n` separator between files. Is there a way to integrate that? – Orest Xherija Aug 03 '18 at 14:59
  • @OrestXherija Do you mean an extra newline? Your example shows just a single newline between the file contents. Should there be an empty line? – Benjamin W. Aug 03 '18 at 15:02
  • Sorry, no, I meant like the example dictates. – Orest Xherija Aug 03 '18 at 15:05
  • @OrestXherija Then it should be fine, `cat file1 file2` doesn't remove any linebreaks. Unless you have files that *don't* have a newline at the end of the last line and thus [don't adhere to the POSIX standard](https://stackoverflow.com/questions/729692/why-should-text-files-end-with-a-newline). – Benjamin W. Aug 03 '18 at 15:06
  • Ah, yes, that should be the problem. They don't end in `\n`. – Orest Xherija Aug 03 '18 at 15:10
  • @OrestXherija Can you edit that into the question? It's usually not a good idea to change the requirements, but the other answer already takes it into account and I'll update mine. – Benjamin W. Aug 03 '18 at 15:11
  • @PesaThe Well, the files *should* end with newlines anyway, and since they don't currently, it'll just fix the last file as well, so that's intended. The output file would have no newline at the end otherwise. – Benjamin W. Aug 03 '18 at 15:33
  • @BenjaminW. this did it (modulo the extra `\n` but I just removed that in my code). Thanks! – Orest Xherija Aug 03 '18 at 15:39
  • @BenjaminW. Oh, my bad. I compared the outputs of our solutions on 300K files with 5 lines each. And for some reason after a lot of lines (67356), the newline is sometimes (8 times in total) skipped with your solution. Any idea why that is? :O – PesaThe Aug 03 '18 at 15:41
  • @PesaThe Hmm, not really, but 67356 looks suspiciously close to 65535 (2^16-1)... maybe some internal sed limitation? – Benjamin W. Aug 03 '18 at 15:50
  • @BenjaminW. On a different system, the first newline is skipped at around line number 62k. It might indeed be some internal limitation. – PesaThe Aug 03 '18 at 16:00

One good way of working around a large list of files is to use find, which is standard on most distros these days. Something along these lines:

find ./dir -type f -exec bash -c 'cat "$1" >> combined.txt && echo "" >> combined.txt' _ {} \;

I did not test it, but this should work, and it has the advantage of never building an argument list containing all the files in dir. Note that it appends to combined.txt, so make sure that file is empty or absent before running it.
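
If the one-shell-per-file overhead becomes noticeable with ~300K files, a batched variant along the same lines (a sketch, assuming POSIX find and sh) hands many files to each shell via -exec ... + and redirects find's output as a whole instead of appending:

# Each sh invocation loops over the batch of files find passes it;
# the '_' placeholder becomes $0 inside the sh -c script.
find ./dir -type f -exec sh -c 'for f; do cat "$f"; echo; done' _ {} + > combined.txt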

shevron

Solution with GNU Parallel:

printf '%s\0' dir/* | parallel -0 'cat {}; echo' > combined.txt

Minor error: combined.txt will end in a \n, which the example output does not include.

My guess is, however, that you will be I/O constrained, so Benjamin W.'s solution may be faster.
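
If the trailing \n mentioned above matters, one way to drop it afterwards (a sketch, assuming GNU coreutils) is to cut the last byte off the result:

truncate -s -1 combined.txt    # shrink the file by one byte, removing the final \n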

Ole Tange