I'm working on a machine translation project in which I have 4.5 million lines of text in two languages, English and German. I would like to shuffle these lines prior to dividing the data into shards on which I will train my model. I know the shuf command described here allows one to shuffle lines in one file, but how can I ensure that corresponding lines in the second file are also shuffled into the same order? Is there a command to shuffle lines in both files?

Vivek Subramanian

1 Answer

TL;DR

  • paste to join the two files line by line into a single file with two tab-separated columns
  • shuf to shuffle the lines of that single file
  • cut to split the columns back into separate files
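
The three steps can also be chained into one pipeline that skips the intermediate files. This is only a sketch, assuming a POSIX shell with paste, shuf and awk available, and assuming neither input contains a tab character. The output names shuf.en and shuf.de are placeholders, and awk stands in for the two cut calls so that both columns are written in a single pass.

```shell
#!/bin/sh
set -eu

# Sample inputs matching the walkthrough below; replace with the real corpus files.
printf 'a b c\nd e f\ng h i\n' > test.en
printf '1 2 3\n4 5 6\n7 8 9\n' > test.de

# paste joins line N of both files with a tab, shuf permutes the joined lines,
# and awk writes column 1 to shuf.en and column 2 to shuf.de.
paste test.en test.de \
  | shuf \
  | awk -F'\t' '{ print $1 > "shuf.en"; print $2 > "shuf.de" }'
```

Because each joined line travels through shuf as a unit, line N of shuf.en still corresponds to line N of shuf.de.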

Paste

$ cat test.en 
a b c
d e f
g h i

$ cat test.de 
1 2 3
4 5 6
7 8 9

$ paste test.en test.de > test.en-de

$ cat test.en-de
a b c   1 2 3
d e f   4 5 6
g h i   7 8 9
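
paste separates the two columns with a tab, so the later cut step only works if the original lines contain no tabs, and the pairing only makes sense if both files have the same number of lines. A quick pre-check along these lines may be worth running first. It is a sketch, not part of the original recipe, and the variable names are illustrative.

```shell
#!/bin/sh
set -eu

# Sample inputs matching the walkthrough; replace with the real corpus files.
printf 'a b c\nd e f\ng h i\n' > test.en
printf '1 2 3\n4 5 6\n7 8 9\n' > test.de

# Both files must have the same number of lines; otherwise paste pairs
# the surplus lines of the longer file with empty fields.
en_lines=$(wc -l < test.en)
de_lines=$(wc -l < test.de)
if [ "$en_lines" -ne "$de_lines" ]; then
    echo "line count mismatch: $en_lines vs $de_lines" >&2
    exit 1
fi

# Count lines that contain a literal tab. grep -c exits non-zero when the
# count is 0, hence the "|| true".
tab_lines=$(cat test.en test.de | grep -c "$(printf '\t')" || true)
echo "lines containing a tab: $tab_lines"
```

If the tab count is not zero, one option is to pass paste a different delimiter with -d and give cut the same character with -d, choosing a character that does not occur in the corpus.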

Shuffle

$ shuf test.en-de > test.en-de.shuf

$ cat test.en-de.shuf
d e f   4 5 6
a b c   1 2 3
g h i   7 8 9

Cut

$ cut -f1 test.en-de.shuf > test.en-de.shuf.en
$ cut -f2 test.en-de.shuf > test.en-de.shuf.de

$ cat test.en-de.shuf.en 
d e f
a b c
g h i

$ cat test.en-de.shuf.de
4 5 6
1 2 3
7 8 9
alvas