I'm working on a machine translation project in which I have 4.5 million lines of text in two languages, English and German. I would like to shuffle these lines before dividing the data into shards on which I will train my model. I know the shuf command described here allows one to shuffle lines in one file, but how can I ensure that corresponding lines in the second file are also shuffled into the same order? Is there a command to shuffle lines in both files?

Vivek Subramanian
1 Answer
TL;DR
- paste to combine the two files as separate columns of a single file
- shuf on the single file
- cut to split the columns
Paste
$ cat test.en
a b c
d e f
g h i
$ cat test.de
1 2 3
4 5 6
7 8 9
$ paste test.en test.de > test.en-de
$ cat test.en-de
a b c 1 2 3
d e f 4 5 6
g h i 7 8 9
Shuffle
$ shuf test.en-de > test.en-de.shuf
$ cat test.en-de.shuf
d e f 4 5 6
a b c 1 2 3
g h i 7 8 9
Cut
$ cut -f1 test.en-de.shuf > test.en-de.shuf.en
$ cut -f2 test.en-de.shuf > test.en-de.shuf.de
$ cat test.en-de.shuf.en
d e f
a b c
g h i
$ cat test.en-de.shuf.de
4 5 6
1 2 3
7 8 9
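
The three steps above can also be chained, with a check that the sentence pairs survived the shuffle. This is only a minimal sketch under a few assumptions: it writes the same toy test.en and test.de files into the current directory, it assumes neither file contains a tab (paste joins columns with a tab by default, so a tab inside a sentence would shift the columns that cut reads back), and expected.sorted and actual.sorted are hypothetical scratch file names used only for the check.

```shell
#!/bin/sh
# Toy sample files standing in for the real corpus (same content as above).
printf 'a b c\nd e f\ng h i\n' > test.en
printf '1 2 3\n4 5 6\n7 8 9\n' > test.de

# paste joins columns with a tab by default, so stop if any input line
# already contains one; cut -f would otherwise split in the wrong place.
if grep -q "$(printf '\t')" test.en test.de; then
    echo "input contains tabs; choose another delimiter" >&2
    exit 1
fi

# Paste, shuffle, then split the two columns back out.
paste test.en test.de | shuf > test.en-de.shuf
cut -f1 test.en-de.shuf > test.en-de.shuf.en
cut -f2 test.en-de.shuf > test.en-de.shuf.de

# Sanity check: ignoring order, the shuffled pairs must be exactly the
# original pairs. expected.sorted / actual.sorted are scratch files.
paste test.en test.de | sort > expected.sorted
paste test.en-de.shuf.en test.en-de.shuf.de | sort > actual.sorted
cmp -s expected.sorted actual.sorted && echo "pairs preserved"
```

On the real data, drop the two printf lines and point the commands at your own files. If the tab check fails, paste -d and cut -d both accept a different single-character delimiter, as long as that character does not occur in the text.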

alvas