
Summary:

I need to count all unique lines in all .txt files in an HDFS instance.

The total size of the .txt files is ~450 GB.

I use this bash command:

hdfs dfs -cat /<top-level-dir>/<sub-dir>/*/*/*.txt | cut -d , -f 1 | sort --parallel=<some-number> | uniq | wc -l

The problem is that this command takes all free RAM and the HDFS instance exits with code 137 (out of memory).

Question:

Is there any way I can limit the RAM usage of this entire command to, let's say, half of what's free in the HDFS, OR somehow free memory while the command is still running?

Update:

I need to remove | sort | because it is a merge sort implementation and therefore has O(n) space complexity.

I can use only | uniq | without | sort |.

Martin
  • I'm not sure it'll solve your problem, but you could simplify your pipeline to `hdfs ... | sort -u -t, -k1,1 | wc -l` – Tom Fenech Aug 12 '19 at 11:18
  • 1
    @TomFenech: I wouldn't take `cut` out because it reduces the amount of data to sort. – Cyrus Aug 12 '19 at 11:32
  • Yeah, makes sense, in which case you could use `hdfs ... | cut -d, -f1 | sort -u | wc -l` – Tom Fenech Aug 12 '19 at 11:37
  • As long as **| sort |** is used, running out of memory is "guaranteed". I added some info - the total size of the **.txt** files is **~450GB**. Thanks for your comments. – Martin Aug 16 '19 at 12:53
  • Can you estimate what the output of `... | uniq` would look like? How many lines do you expect? How long can each line be and what does a typical line look like? Also, how much free memory do you have? – Socowi Aug 17 '19 at 18:29
  • There could be a way to count unique lines without having to sort. It depends on your answers to my questions from two weeks ago. Without your response I cannot help you. – Socowi Aug 31 '19 at 10:22

3 Answers

6

Some things you can try to limit sort's memory consumption (a combined invocation is sketched after this list):

  • Use sort -u instead of sort | uniq. That way sort has a chance to remove duplicates on the spot instead of having to keep them until the end.

  • Write the input to a file and sort the file instead of running sort in a pipe. Sorting pipes is slower than sorting files and I assume that sorting pipes requires more memory than sorting files:
    hdfs ... | cut -d, -f1 > input && sort -u ... input | wc -l

  • Set the buffer size manually using -S 2G. The buffer is shared between all threads. The size specified here roughly equals the overall memory consumption when running sort.

  • Change the temporary directory using -T /some/dir/different/from/tmp. On many Linux systems /tmp is a ramdisk, so be sure to use the actual hard drive.
    If the hard disk is not an option you could also try --compress-program=PROG to compress sort's temporary files. I'd recommend a fast compression algorithm like lz4.

  • Reduce parallelism using --parallel=N, as more threads need more memory. With a small buffer, too many threads are less efficient.

  • Merge at most two temporary files at once using --batch-size=2.
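
Combining these options, the whole pipeline might look something like the sketch below. The buffer size, thread count and temporary directory are placeholder values to be tuned for your system; the path pattern is the one from the question.

# sketch only: cap sort at ~2 GB, 2 threads, spill compressed temp files to a real disk
# (/mnt/bigdisk/sort-tmp is a placeholder path)
hdfs dfs -cat /<top-level-dir>/<sub-dir>/*/*/*.txt \
  | cut -d, -f1 \
  | sort -u -S 2G --parallel=2 -T /mnt/bigdisk/sort-tmp --compress-program=lz4 --batch-size=2 \
  | wc -l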


I assumed that sort was smart enough to immediately remove sequential duplicates in the unsorted input. However, from my experiments it seems that (at least) sort (GNU coreutils) 8.31 does not.
If you know that your input contains a lot of sequential duplicates as in the input generated by the following commands …

yes a | head -c 10m > input
yes b | head -c 10m >> input
yes a | head -c 10m >> input
yes b | head -c 10m >> input

… then you can drastically save resources on sort by using uniq first:

# takes 6 seconds and 2'010'212 kB of memory
sort -u input

# takes less than 1 second and 3'904 kB of memory
uniq input > preprocessed-input &&
sort -u preprocessed-input

Times and memory usage were measured using GNU time 1.9-2 (often installed in /usr/bin/time) and its -v option. My system has an Intel Core i5 M 520 (two cores + hyper-threading) and 8 GB memory.
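
For example, a single run can be measured like this (a minimal sketch; sort's output is discarded so only GNU time's report remains):

# -v prints, among other things, "Maximum resident set size (kbytes)"
/usr/bin/time -v sort -u input > /dev/null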

Socowi
  • thanks for your comment. Unfortunately, the total size of all **.txt** files is huge (**~450GB**). I have to remove **| sort |** from my command. Otherwise -> out of memory. – Martin Aug 16 '19 at 12:57
  • Did you really try all these things together, especially `-S …` and `-T …`? As long as you have >450GB of free *disk* space you should be able to use `sort` in a pipe. With an intermediate file you need twice as much free *disk* space. – Socowi Aug 18 '19 at 20:23
2

Reduce the number of sorts run in parallel.


From info sort:

--parallel=N: Set the number of sorts run in parallel to N. By default, N is set to the number of available processors, but limited to 8, as there are diminishing performance gains after that. Note also that using N threads increases the memory usage by a factor of log N.
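
Applied to the pipeline from the question, that could look like this (a sketch; 2 threads is an arbitrary choice, and -u replaces | uniq as suggested in the other answers):

# fewer threads -> less memory (factor of log N per the quote above)
hdfs ... | cut -d, -f1 | sort -u --parallel=2 | wc -l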

Cyrus
1

it runs out of memory.

From man sort:

--batch-size=NMERGE
              merge at most NMERGE inputs at once; for more use temp files
--compress-program=PROG
              compress temporaries with PROG; decompress them with PROG -d
-S, --buffer-size=SIZE
              use SIZE for main memory buffer
-T, --temporary-directory=DIR
              use DIR for temporaries, not $TMPDIR or /tmp; multiple options
              specify multiple directories

These are the options you could be looking into. Specify a temporary directory on disk and a buffer size, e.g. 1 GB. So something like sort -u -T "$HOME"/tmp -S 1G.

Also as advised in other answers, use sort -u instead of sort | uniq.
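
Put together with the pipeline from the question, that might look like the following sketch; the 1 GB buffer and "$HOME"/tmp are just example values, and the temporary directory must sit on a real disk with enough free space.

mkdir -p "$HOME"/tmp
hdfs ... | cut -d, -f1 | sort -u -T "$HOME"/tmp -S 1G | wc -l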

Is there any way I can limit the RAM usage of this entire command to, let's say, half of what's free in the HDFS

Kind of; use the -S option. For roughly half of the currently free memory you could use sort -S "$(( $(free -t | awk '/Total/{print $4}') / 2 ))" (free reports kibibytes, which is also what a bare number means to sort -S).

KamilCuk