
I have an ndjson file (every line is a valid JSON object) with information that I want to run through a topic model. As it happens, this data is sorted a) by user and b) by time, so the structure of the overall ndjson file is far from random. However, for the purpose of pushing this file (which I will later chunk into n smaller ndjsons with 50k lines each) through the topic model, I need every feature that appears in the data to have the same probability of being in any given line. My idea for achieving this is to randomly re-order all the lines in the file. The file I'm working with has 11502106 lines and an overall size of 45 GB. I also have a gzipped version of the file, which is approximately 4 GB.
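
For the later chunking step, I'm planning something along the lines of GNU split (just a sketch; the chunk_ prefix and the .json suffix are placeholders I made up):

# split the shuffled file into pieces of 50,000 lines each, with 3-digit numeric suffixes
split -l 50000 -d -a 3 --additional-suffix=.json new_file_randomorder.json chunk_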

My idea for solving this problem was to use the shuf command (from GNU coreutils, shipped with Ubuntu) to extract the same number of lines as in the original file and to redirect the output to a new file. I did it like this:

nohup shuf -n 11502106 original_file.json > new_file_randomorder.json &

However, this process gets killed by the system after running for approximately 5 minutes. I'm guessing that I'm running out of memory (my machine has 16 GB of RAM). I'm running Ubuntu 16.04.
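
I haven't actually confirmed the memory theory yet; I assume something like the following would show whether the kernel's OOM killer terminated the process:

# look for OOM-killer messages in the kernel log (exact wording varies by kernel version)
dmesg | grep -iE 'out of memory|killed process'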

I realise that this could be a very complicated task, given that the file size exceeds the available memory.

If anyone has any ideas or potential solutions to this problem, that would be greatly appreciated! Thanks a lot in advance!

nikUoM
  • You need an index, man... – Severin Pappadeux Jul 28 '18 at 01:32
  • Hi, thanks for your comment @SeverinPappadeux - could you perhaps elaborate? – nikUoM Jul 28 '18 at 01:59
  • Well, you could try to build an index first, using the answer by Sanjay Manohar at https://stackoverflow.com/questions/6022384/bash-tool-to-get-nth-line-from-a-file. He shows how to build an index using `awk`, and that should work even if it takes time, because GNU awk doesn't read the whole file into memory. The resulting index file is relatively small, so you can shuffle it. Then, again using Sanjay Manohar's code, you can pull out the shuffled data by reading the original file through the shuffled index (a rough sketch of this approach follows these comments). – Severin Pappadeux Jul 28 '18 at 02:03
  • Basically at the end do something like `tail -c +$(awk 'NR==1' shuffled.idx) BIGJSON | head -1`, then `tail -c +$(awk 'NR==2' shuffled.idx) BIGJSON | head -1`, then line 3, ... – Severin Pappadeux Jul 28 '18 at 02:10
  • I quickly checked it, seems to work but I only tried small file, YMMV – Severin Pappadeux Jul 28 '18 at 02:17
  • How about GNU `sort -R`? GNU sort supports file-backed merge sort of files significantly larger than RAM. – that other guy Jul 28 '18 at 04:35
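
A rough sketch of the index-based approach from the comments above (untested at this scale; offsets.idx and shuffled.idx are made-up filenames, and the byte offsets assume Unix line endings):

# build a byte-offset index: the 1-based starting byte of every line, one offset per line.
# LC_ALL=C makes length() count bytes rather than characters; awk streams the file,
# so memory use stays small.
LC_ALL=C awk '{ printf "%.0f\n", offset + 1; offset += length($0) + 1 }' original_file.json > offsets.idx

# the index itself is only a couple of hundred MB, so shuf can handle it in memory
shuf offsets.idx > shuffled.idx

# pull the lines back out in shuffled order; tail -c +N seeks on regular files,
# but spawning two processes per line will be slow for 11.5 million lines
while read -r byte; do
    tail -c +"$byte" original_file.json | head -n 1
done < shuffled.idx > new_file_randomorder.json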

2 Answers


GNU sort has an -R option for shuffling. This might be convenient, but I believe it uses an O(n log n) algorithm.

On some systems, GNU sort is (or can be made) available as gsort.
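
Applied to the file in the question, the invocation might look roughly like this (just a sketch: -S caps sort's in-memory buffer, -T points the temporary merge files at a directory with enough free space, and the paths here are placeholders):

# shuffle with bounded memory; the -T directory needs roughly as much free space as the input file
nohup sort -R -S 2G -T /path/to/tmpdir original_file.json > new_file_randomorder.json &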

Here are some details from a run using a 55 MB input file with 15631278 lines. The times are in seconds. The -S option restrains the use of RAM.

# /usr/bin/time -lp gsort -S 5M -R < input.txt > /tmp/shuffled.txt
user        98.45
sys          1.05
  14118912  maximum resident set size
peak
  • Hi @peak - thanks for your answer. I ran this for approx. 2 days, and it worked. Thanks so much! – nikUoM Jul 31 '18 at 01:52
  • I don't know why the answer mentioning terashuf was voted down, but please note that aside from speed, it does have the advantage of handling identical lines properly (a small demo of the difference follows these comments). – peak Jul 31 '18 at 02:16
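
A small demo of that difference: sort -R orders lines by a hash of their contents, so duplicate lines come out adjacent, whereas shuf permutes lines independently:

# the two 'a' lines (and the two 'b' lines) stay together with sort -R, but not with shuf
printf 'a\nb\na\nb\n' | sort -R
printf 'a\nb\na\nb\n' | shuf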

Try terashuf - a C++ application. See https://github.com/alexandres/terashuf
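
A possible invocation, assuming the stdin/stdout interface and the MEMORY / TMPDIR environment variables described in the project's README (treat the variable names and values here as assumptions and check the README before relying on them):

# build from source, then shuffle via stdin/stdout; MEMORY (in GB) bounds RAM use and
# TMPDIR (which needs room for the temporary shuffle files) is assumed from the README
git clone https://github.com/alexandres/terashuf
make -C terashuf
MEMORY=8 TMPDIR=/tmp ./terashuf/terashuf < original_file.json > new_file_randomorder.json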

peak