
I need to split huge (>1 GB) CSV files containing 50K+ columns each on a daily basis.

I've found Miller to be an interesting and performant tool for such a task.

But I'm stuck on Miller's documentation.

How could I split one CSV into N smaller CSV files, where N is determined by the number of rows in my source file?

aborruso
franchb

2 Answers


Try with this script:

mlr --csv put -S 'if (NR % 10000 == 0) {$rule=NR} else {$rule = ""}' \
then fill-down -f rule \
then put -S 'if ($rule=="") {$rule="0"}' \
then put -q 'tee > $rule.".csv", $*' input.csv

Make a copy of your CSV in a new folder and run this script there: it will produce one CSV file for every 10,000 rows.
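Miller aside, the same bucketing logic can be sketched in plain awk; a minimal demo, assuming POSIX awk (the chunk_N.csv names and the tiny sample.csv input are invented for illustration — use rows=10000 to match the answer):

```shell
# Hypothetical plain-awk equivalent of the Miller pipeline above: repeat the
# header line at the top of every chunk and start a new chunk_N.csv every
# `rows` data rows (rows=3 here so the tiny demo produces two chunks).
printf 'a,b\n1,x\n2,x\n3,x\n4,x\n' > sample.csv    # tiny demo input

awk -v rows=3 '
    NR == 1 { hdr = $0; next }                     # remember the header
    (NR - 2) % rows == 0 {                         # start a new chunk
        if (file) close(file)
        file = sprintf("chunk_%d.csv", int((NR - 2) / rows))
        print hdr > file
    }
    { print > file }
' sample.csv
```

With rows=3 this writes chunk_0.csv (header plus the first three data rows) and chunk_1.csv (header plus the remaining row).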

aborruso
  • Strange, Miller should have a `split` verb https://miller.readthedocs.io/en/latest/reference-verbs/#split but the 6.0.0 release doesn't recognize it – Fravadona Jan 29 '22 at 23:39
  • @Fravadona it is not yet in the stable version, you can download the unstable from here, at the bottom of the page https://github.com/johnkerl/miller/actions/runs/1766863694 – aborruso Jan 30 '22 at 07:56
  • 1
    Thanks, tested it and it works; shouldn't you make an update to this answer? Now the problem has become quite trivial – Fravadona Jan 30 '22 at 16:46
  • 1
    @Fravadona I will do it, when there will be the stable release. Thank you – aborruso Jan 30 '22 at 19:06
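As the comments note, newer Miller releases ship a dedicated split verb that reduces the whole task to a one-liner; a sketch, assuming a Miller build that includes the verb (it is absent from the 6.0.0 release):

```shell
# Requires a Miller release with the `split` verb (post-6.0.0).
# Writes split_1.csv, split_2.csv, ..., each holding at most 10000 records.
mlr --csv split -n 10000 input.csv
```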

The answer from aborruso adds a new column, rule, to the output CSV files. If you want to avoid this, use emit with mapexcept instead of tee in the last step, like this:

mlr --csv put -S 'if (NR % 10000 == 0) {$rule=NR} else {$rule = ""}' \
then fill-down -f rule \
then put -S 'if ($rule=="") {$rule="0"}' \
then put -q 'emit > $rule.".csv", mapexcept($*, "rule")' input.csv
tje