Miller: selecting columns based on their last value

Question

I'm processing a TSV file with a hundred columns and I would like to pick some of them based on their last values (ie. the value on the last row).

For instance with the following data (and I let you imagine 96 more columns!):

Year    An  Ti  Gi  Na
2000    5   10  2   3
2010    3   2   5   7
2020    6   3   5   6

I would like to select all columns whose value on the last row is at least 5 (and, better, I would like to select all columns whose value on the last row is at least the value of the row Gi).

So in the end, I would like the following output (the column Ti is discarded because its value on the last row is 3 which is below our threshold).

Year    An  Gi  Na
2000    5   2   3
2010    3   5   7
2020    6   5   6

Do you have any ideas?

Thanks!

score 2 · Accepted Answer · answered Jul 23 '23 at 06:57

2

You could write this bash script

#!/bin/bash

gi_value=$(mlr --t2n tail -n 1 then cut -f Gi input.tsv)

gi_columns=$(mlr --t2n tail -n 1 then label 1 then reshape -r '^[a-zA-Z]' -o k,v then filter '$v >= '"$gi_value"'' then cut -f k then nest --ivar "," -f k input.tsv)

mlr --tsv cut -f Year,"$gi_columns" input.tsv

First extract the Gi value

mlr --t2n tail -n 1 then cut -f Gi input.tsv

to populate the gi_value var with 5 value.

Then extract the list of columns of the last row in wich the value is >= 5

mlr --t2n tail -n 1 \
then label 1 \
then reshape -r '^[a-zA-Z]' -o k,v \
then filter '$v >= '"$gi_value"'' \
then cut -f k \
then nest --ivar "," -f k input.tsv

This gives you in output An,Gi,Na.

The last step is to cut An,Gi,Na columns:

mlr --tsv cut -f Year,"$gi_columns" input.tsv

The output is

Year    An      Gi      Na
2000    5       2       3
2010    3       5       7
2020    6       5       6

answered Jul 23 '23 at 06:57

aborruso

4,938
3
23
40

1

Thanks! I hoped that it could be solved with a single call to `mlr` but that's already very useful! – mikael-s Jul 23 '23 at 14:39
@mikael-s I think you can write some DSL script, and put it inside miller "put" verb. This way it's easier for me – aborruso Jul 23 '23 at 14:43
1

@mikael-s using a single `mlr` call would need to store the whole file in RAM until you reach the last row, which is where you can determine what to keep or discard; that's not optimal and can lead to a crash if the input file doesn't fit in RAM. @aborruso: here's a different way to get the list of fields that satisfy the condition: `mlr --itsv tail -n 1 then put -q 'print joinv(get_keys(select($*, func(k,v) {return v >= $Gi})), ",")'` – Fravadona Aug 04 '23 at 09:14
1

@Fravadona unfortunately I'm bad at using arrays. Every time I read you and think "wow!" – aborruso Aug 04 '23 at 13:30

Fravadona · Answer 2 · 2023-08-06T16:17:20.830

You could use tac and a little DSL program for selecting/cutting the fields. The downside is that it will load the whole file into RAM (that's what I observed with mlr's current implementation):

mlr --tsv tac then put '
    is_null(@m) { @m = apply($*, func(k,v) { return {k: v >= $Gi}; }); }
    $* = select($*, func(k,v) { return @m[k]; });
' file.tsv

Year    An  Gi  Na
2020    6   5   6
2010    3   5   7
2000    5   2   3

remark: the order of the records are reversed in the output, so you might want to add an other tac to the chain for getting the original order back:

mlr --tsv tac then put '...' then tac file.tsv

Miller: selecting columns based on their last value

2 Answers2