Summary:
I need to count all unique lines across all of the .txt files in an HDFS instance.
The total size of the .txt files is ~450 GB.
I use this bash command:
hdfs dfs -cat /<top-level-dir>/<sub-dir>/*/*/*.txt | cut -d , -f 1 | sort --parallel=<some-number> | uniq | wc -l
The problem is that this command consumes all of the free RAM, and the process is killed with exit code 137 (out of memory).
Question:
Is there any way to limit the RAM usage of this entire command to, say, half of what is free on the HDFS host, or to somehow free up memory while the command is still running?
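For example, would something along these lines keep the pipeline inside a fixed memory budget? This is only a sketch, assuming GNU sort: -S/--buffer-size caps the in-memory sort buffer (a percentage of physical RAM is accepted), -T points the temporary spill files at a directory with enough free disk, and -u drops duplicates so a separate | uniq | is not needed. <dir-with-enough-disk> is a placeholder, not a path I actually have.

# cap sort at ~25% of RAM, spill to disk, and deduplicate in sort itself
hdfs dfs -cat /<top-level-dir>/<sub-dir>/*/*/*.txt | cut -d , -f 1 | sort --parallel=<some-number> -S 25% -T <dir-with-enough-disk> -u | wc -l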
Update:
I need to remove | sort | because it is a merge sort implementation and therefore has O(n) space complexity.
I can use | uniq | on its own, without | sort |.
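To make the variant I have in mind explicit (with the caveat that uniq only collapses adjacent duplicate lines, so the count is only correct if identical keys arrive next to each other in the stream):

# no global sort; uniq removes consecutive duplicates only
hdfs dfs -cat /<top-level-dir>/<sub-dir>/*/*/*.txt | cut -d , -f 1 | uniq | wc -l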