I wonder if doing something outside of R might be better. Using awk, for instance, with this data in awkgroup.csv:
"something","group"
1,"A"
2,"A"
3,"A"
4,"A"
5,"A"
6,"B"
7,"B"
8,"B"
9,"B"
10,"B"
11,"C"
12,"C"
13,"C"
14,"C"
15,"C"
16,"D"
17,"D"
18,"D"
19,"D"
20,"D"
We can do:
$ awk -F, '$2==inp{line++;if(line<3)print($0)}; $2!=inp{inp=$2;line=0;print($0)};' awkgroup.csv > newdata.csv
$ cat newdata.csv
"something","group"
1,"A"
2,"A"
3,"A"
6,"B"
7,"B"
8,"B"
11,"C"
12,"C"
13,"C"
16,"D"
17,"D"
18,"D"
Basic walk-through (though I do not consider myself an awk wizard by any account):
- $2==inp (and its counterpart $2!=inp) tests whether the second column (our grouping variable) is unchanged (or has changed) since the previous row. inp starts out uninitialized, so it defaults to the empty string. NB: this assumes the data is ordered by group.
- line++;if(line<3)print($0) does the majority of the work: it increments line (our method of tracking lines within the current group), tests whether it is less than 3 (line is 0-based here), and prints if so. This gives us the top 3 lines per group.
- inp=$2;line=0;print($0) is similar, but runs on the first line within a group: it sets what we think the current group is by assigning to inp, resets the line counter, and always prints (since this is the first line within a group). The header row goes through this same rule, because its second field differs from the initially empty inp.
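The NB above matters: the one-liner only works when rows of the same group are adjacent. If they are not, a common alternative (a different technique from the one-liner above) is to count rows per group in an associative array. A minimal sketch, assuming the header is on the first line and the grouping variable is column 2; the array name seen and the file names are made up for illustration:

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)

# Hypothetical unsorted input: groups A and B are interleaved.
cat > "$workdir/unsorted.csv" <<'EOF'
"something","group"
1,"A"
6,"B"
2,"A"
7,"B"
3,"A"
8,"B"
4,"A"
9,"B"
EOF

# NR==1 keeps the header line.
# seen[$2] counts the rows met so far for each group value; a row is
# printed only while its group's count is still <= 3.
awk -F, 'NR==1 || ++seen[$2] <= 3' "$workdir/unsorted.csv" > "$workdir/top3_unsorted.csv"
cat "$workdir/top3_unsorted.csv"
```

This keeps one counter per distinct group in memory, which is usually fine, whereas the one-liner above only ever tracks the current group.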
I couldn't get this to work well in fread(cmd="..."), likely because it (for some reason) uses shell in lieu of the admittedly-also-broken (but not here) system. One can use system2 instead, control where the output goes with stdout=, and then read that file normally, perhaps:
system2("awk", c("-F, '$2==inp{line++;if(line<3)print($0);};$2!=inp{inp=$2;line=0;print($0)};'",
"awkgroup.csv"),
stdout="awkout.csv" )
fread("awkout.csv")
#     something  group
#         <int> <char>
#  1:         1      A
#  2:         2      A
#  3:         3      A
#  4:         6      B
#  5:         7      B
#  6:         8      B
#  7:        11      C
#  8:        12      C
#  9:        13      C
# 10:        16      D
# 11:        17      D
# 12:        18      D
FYI, system2 is not really any better than system: it just concatenates the quoted (good!) command= with all unquoted args (bad!):
command <- paste(c(shQuote(command), env, args), collapse = " ")
which is why I'm able to cheat a little and combine awk's -F, option and its program into a single element of the args= vector.
From here, you need to control two things about this:
- change $2 (in three places) to the column number of your grouping variable;
- change line<3 to be whatever limit you want (recall that it is 0-based, so <3 gives you 3 entries, not 2).
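Rather than editing the program text in three places, the two settings above can be passed in with awk's -v option. This is only a sketch of the same logic, not a tested drop-in: the variable names col and n are my own, and the sample file is a cut-down stand-in for awkgroup.csv (already ordered by group).

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)

# Small stand-in for awkgroup.csv, ordered by group.
cat > "$workdir/awkgroup.csv" <<'EOF'
"something","group"
1,"A"
2,"A"
3,"A"
4,"A"
6,"B"
7,"B"
8,"B"
9,"B"
EOF

# col = column number of the grouping variable
# n   = number of rows to keep per group
# (both names are illustrative, set through -v)
awk -F, -v col=2 -v n=2 '
  $col == inp { line++; if (line < n) print; next }  # same group: print while under the limit
  { inp = $col; line = 0; print }                    # new group (or header): reset and print
' "$workdir/awkgroup.csv" > "$workdir/top_n.csv"
cat "$workdir/top_n.csv"
```

Because the first row of each group is printed with line set to 0, line < n keeps exactly n rows per group, matching the 0-based counting described above. The next statement skips the second rule once the first has matched, which replaces the explicit $2!=inp test.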