Remove rows in a data frame based on the number of elements per factor in one column in R

Question

I have this dataset:

df <- structure(list(Species = c("Paranthropus robustus", "Paranthropus robustus", 
"Paranthropus robustus", "Australopithecus afarensis", "Australopithecus afarensis", 
"Australopithecus afarensis", "Australopithecus afarensis", "Australopithecus afarensis", 
"Paranthropus boisei", "Australopithecus afarensis", "Paranthropus boisei", 
"Australopithecus africanus", "Australopithecus africanus", "Australopithecus africanus", 
"Paranthropus robustus", "Australopithecus africanus", "Australopithecus africanus", 
"Paranthropus robustus", "Australopithecus africanus", "Paranthropus robustus", 
"Paranthropus robustus", "Paranthropus robustus", "Paranthropus robustus", 
"Paranthropus robustus", "Paranthropus robustus", "Paranthropus robustus", 
"Paranthropus robustus", "Paranthropus robustus", "Paranthropus robustus", 
"Paranthropus robustus", "Paranthropus robustus", "Paranthropus robustus", 
"Paranthropus robustus", "Paranthropus robustus", "Paranthropus robustus", 
"Paranthropus robustus", "Paranthropus robustus", "Australopithecus africanus", 
"Australopithecus afarensis", "Australopithecus afarensis", "Australopithecus afarensis", 
"Australopithecus afarensis", "Australopithecus afarensis", "Australopithecus afarensis", 
"Australopithecus afarensis", "Australopithecus afarensis", "Australopithecus afarensis", 
"Australopithecus afarensis", "Paranthropus boisei", "Paranthropus boisei", 
"Australopithecus africanus", "Australopithecus africanus", "Paranthropus boisei", 
"Paranthropus boisei", "Paranthropus robustus", "Paranthropus robustus", 
"Paranthropus robustus", "Paranthropus robustus", "Paranthropus robustus", 
"Paranthropus robustus", "Paranthropus robustus", "Paranthropus robustus", 
"Australopithecus africanus", "Paranthropus robustus", "Ardipithecus ramidus", 
"Ardipithecus ramidus", "Ardipithecus ramidus", "Homo habilis", 
"Homo habilis", "Paranthropus robustus"), `Site / Population` = c("Drimolen", 
"Drimolen", "Drimolen", "Nefuraytu: Woranso-Mille (Central Afar, Ethiopia)", 
"Laetolil", "Laetolil", "Laetolil", "Laetolil", "Lake Turkana", 
"Laetolil", NA, "Makapansgat", "Makapansgat", "Makapansgat", 
"Kroomdrai", "Taung", "Taung", "Kroomdrai", "Makapansgat", "Swartkrans", 
"Swartkrans", "Swartkrans", "Swartkrans", "Swartkrans", "Swartkrans", 
"Swartkrans", "Swartkrans", "Swartkrans", "Swartkrans", "Swartkrans", 
"Swartkrans", "Swartkrans", "Swartkrans", "Swartkrans", "Swartkrans", 
"Swartkrans", "Swartkrans", "Sterkfontein", "Hadar", "Hadar", 
"Hadar", "Hadar", "Hadar", "Hadar", "Hadar", "Hadar", "Hadar", 
"Hadar", "Koobi Fora", "East Turkana", "Makapansgat", "Makapansgat", 
"Peninj", "Peninj", "Swartkrans", "Swartkrans", "Swartkrans", 
"Swartkrans", "Swartkrans", "Swartkrans", "Swartkrans", "Swartkrans", 
"Sterkfontein", "Kroomdrai", "Aramis, Middle Awash", "Aramis, Middle Awash", 
"Aramis, Middle Awash", "Sterkfontein", "Sterkfontein", "Sterkfontein"
), Specimen = c("DNH 7", "DNH 8", "DNH 8", "NFR-VP-1/29", "LH-2", 
"LH-3", "LH-4", "LH-4", "KNM-WT 16005", "LH-16", "KNM-ER 15930", 
"MLD 2", "MLD 2", "Rev. Paper", "Rev. Paper", "Rev. Paper", "Rev. Paper", 
"Rev. Paper", "Rev. Paper", "SK 104", "SK 23", "SK 23", "SK 25", 
"SK 25", "SK 34", "SK 55b", "SK 55b", "SK 6", "SK 6", "SK 61", 
"SK 63", "SK 63", "SK 828", "SK 838", "SK 843", "SK 845", "SK 846", 
"Sts 52b", "AL 128-23", "AL 145-35", "AL 266-1", "AL 288-1i", 
"AL 333-74", "AL 333w-1", "AL 333w-1", "AL 333w-32,60", "AL 400-1a", 
"AL 400-1a", "KNM-ER 3230", "KNM-ER 729", "MLD 18/4/24", "MLD 40", 
"NMT-W64-160", "NMT-W64-160", "SK 1587", "SK 1648", "SK 34", 
"SK 34", "SK 843", "SK 858", "SK 876", "SK 876", "Stw 14", "TM 1517b", 
"ARA-VP-1/128", "ARA-VP-1/128", "ARA-VP-1/200", "Stw 151", "Stw 151", 
"Stw 566")), class = "data.frame", row.names = c(9L, 26L, 28L, 
385L, 398L, 408L, 416L, 417L, 428L, 432L, 444L, 545L, 546L, 549L, 
550L, 552L, 553L, 555L, 557L, 560L, 563L, 564L, 569L, 570L, 572L, 
577L, 578L, 581L, 582L, 587L, 588L, 589L, 591L, 592L, 595L, 598L, 
600L, 601L, 710L, 712L, 716L, 719L, 722L, 724L, 726L, 728L, 735L, 
738L, 744L, 748L, 753L, 758L, 791L, 794L, 802L, 804L, 806L, 809L, 
812L, 814L, 816L, 819L, 824L, 825L, 841L, 842L, 846L, 897L, 898L, 
899L))

Here is the output of head(df):

head(df)
                       Species                                 Site / Population    Specimen
9        Paranthropus robustus                                          Drimolen       DNH 7
26       Paranthropus robustus                                          Drimolen       DNH 8
28       Paranthropus robustus                                          Drimolen       DNH 8
385 Australopithecus afarensis Nefuraytu: Woranso-Mille (Central Afar, Ethiopia) NFR-VP-1/29
398 Australopithecus afarensis                                          Laetolil        LH-2
408 Australopithecus afarensis                                          Laetolil        LH-3

I want to filter on the first column (Species). If a category (e.g. Homo habilis) has fewer than 3 rows (which is the case here), I would like to remove all rows for that category. In other words, I would like to count the total number of rows per Species and drop every species whose count is less than 3.

How could I do it?

This has already been addressed in the duplicate question linked to this question. Note also that next time you ask a question, it's preferable to provide a minimal example. Here you could have just provided a data frame with multiple groups, including one with 3 or fewer rows, instead of multiple columns and long categories. Check here for more: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example - Maël, Feb 09 '23 at 09:17
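
For readers who land here without following the duplicate link, one possible approach is sketched below in base R. It is only an illustration, not necessarily the method used in the linked answer. The data frame df_small and the threshold min_rows are hypothetical stand-ins introduced for this example; df_small is a cut-down version of the data, not the full df from the question.

```r
# Hypothetical stand-in for the question's df: three species,
# one of which ("Homo habilis") has fewer than 3 rows.
df_small <- data.frame(
  Species = c("Paranthropus robustus", "Paranthropus robustus",
              "Paranthropus robustus", "Homo habilis", "Homo habilis",
              "Ardipithecus ramidus", "Ardipithecus ramidus",
              "Ardipithecus ramidus"),
  Specimen = c("DNH 7", "DNH 8", "DNH 8", "Stw 151", "Stw 151",
               "ARA-VP-1/128", "ARA-VP-1/128", "ARA-VP-1/200"),
  stringsAsFactors = FALSE
)

# Assumed threshold: keep only species with at least this many rows.
min_rows <- 3

# ave() applies length() within each Species group and returns the
# group size repeated for every row, aligned with the original rows.
group_size <- ave(seq_along(df_small$Species), df_small$Species, FUN = length)

# Keep the rows whose species reaches the threshold.
filtered <- df_small[group_size >= min_rows, ]

# Remaining row counts per species.
table(filtered$Species)
```

To try this on the data from the question, replace df_small with df. If dplyr is available, grouping with group_by(Species) and then calling filter(n() >= 3) should give the same rows.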
