I need to duplicate those levels whose frequency in my factor variable called groups
is less than 500.
> head(groups)
[1] 0000000 1000000 1000000 1000000 0000000 0000000
75 Levels: 0000000 0000001 0000010 0000100 0000110 0001000 0001010 0001100 0001110 0010000 0010010 0010100 0010110 ... 1111110
For example:
> table(group)
group
0000000 0000001 0000010 0000100 0000110 0001000 0001010 0001100 0001110 0010000 0010010 0010100 0010110 0011000 0011010 0011100
58674 6 1033 654 223 1232 31 222 17 818 132 32 15 42 9 9
0011110 0100000 0100001 0100010 0100100 0100101 0100110 0101000 0101010 0101100 0101110 0110000 0110010 0110100 0110110 0111000
1 10609 1 487 64 1 58 132 11 12 3 142 27 9 7 11
0111010 0111100 0111110 1000000 1000001 1000010 1000011 1000100 1000101 1000110 1001000 1001001 1001010 1001100 1001110 1010000
5 1 2 54245 10 1005 1 329 1 138 573 1 31 71 11 969
1010010 1010100 1010110 1011000 1011010 1011100 1011110 1100000 1100001 1100010 1100011 1100100 1100110 1101000 1101010 1101011
147 29 21 63 15 10 4 14161 6 770 1 142 96 260 23 1
1101100 1101110 1110000 1110001 1110010 1110100 1110110 1111000 1111010 1111100 1111110
34 16 439 2 103 13 26 36 13 8 5
Groups 0000001
, 0000110
, 0001010
, 0001100
... must be duplicated up to 500.
The ideal would be to have a "sample balanced data" of groups
that duplicate those levels often less than 500 and penalize the rest (Levels more than 500 frequency) until reaching 500.