Is there a way to count how many times each value occurs in a column (using R)?

I am trying to work out the number of paralogs per gene for my biology degree. So far I have a table that looks like this:

                  Gene name                               Paralog
14                 Crabp2                                 Crabp1
15                 Crabp2                                   Rbp2
16                 Crabp2                                   Rbp7
17                 Crabp2                                   Rbp1
18                 Crabp2                                  Fabp5
19                 Crabp2                                  Fabp7
20                 Crabp2                                   Pmp2
21                 Crabp2                                 Fabp12
22                 Crabp2                                Gm37389
23                 Crabp2                                  Fabp3
24                 Crabp2                                  Fabp9
25                 Crabp2                                  Fabp4
26                 Zfp653                             AC163623.1
27                 Zfp653                                 Zfp276
28                 Zfp653                                  Zfp91
29                 Zfp653                                 Zfp692
30             AC163623.1                                 Zfp653
31             AC163623.1                                 Zfp276
32             AC163623.1                                  Zfp91
33             AC163623.1                                 Zfp692
34                   Apom                                       
35                  Map10 

As you can see from the table, 'Crabp2' has many paralogs. A gene can have many paralogs, and some genes have none; I want to find out how many paralogs each gene has. I'm trying to get a table that looks something like this:

                  Gene            freq
                Crabp2              12
                Zfp653               4
            AC163623.1               4
                  Apom               0
                 Map10               0

The original table consists of 25,000 rows.

zx8754
Jack Dean

1 Answer

If there is exactly one row per paralog then

as.data.frame(table(na.omit(your_data[["Gene name"]])))

should work.

table() is the main function; na.omit() should get rid of non-paralog values (this may depend on how your data are coded); as.data.frame() just changes the output format.
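
If the "no paralog" rows are stored as empty strings or NA in the paralog column (an assumption about how the data are coded), counting rows per gene would give those genes 1 rather than 0. A minimal base R sketch of one way around that, using toy data; the names `paralogs`, `gene` and `paralog` are made up for illustration and are not from the original table:

```r
# Toy data mirroring the question; "" or NA marks "no paralog".
# The data frame and column names here are hypothetical.
paralogs <- data.frame(
  gene    = c("Crabp2", "Crabp2", "Crabp2", "Zfp653", "Zfp653", "Apom", "Map10"),
  paralog = c("Crabp1", "Rbp2", "Rbp7", "Zfp276", "Zfp91", "", NA),
  stringsAsFactors = FALSE
)

# TRUE where a paralog is actually present (neither NA nor blank).
has_paralog <- !is.na(paralogs$paralog) & nzchar(trimws(paralogs$paralog))

# Sum the TRUE values within each gene; genes with no paralog get 0.
counts <- tapply(has_paralog, paralogs$gene, sum)

# Convert the named result into a two-column data frame.
freq <- data.frame(Gene = names(counts), freq = as.vector(counts))
```

With this toy input, Crabp2 gets 3, Zfp653 gets 2, and Apom and Map10 both get 0. Summing a logical vector per group keeps the zero-paralog genes in the result, which dropping those rows before calling table() would not.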

Ben Bolker
  • You are ignoring that the OP wants Paralog != NA – G5W Jan 20 '18 at 20:25
  • @G5W he can do something like df <- filter(df, !is.na(column_name)) beforehand. also table(data$column) works too without the ugly double brackets. – Prometheus Jan 20 '18 at 20:28
  • I used the ugly double brackets because OP had a space in the column name and I think `[[ ]]` is more explicit/less magical than using back-ticks to protect. – Ben Bolker Jan 20 '18 at 20:28
  • @Prometheus: I stuck in `na.omit()` (I don't like to introduce tidyverse unless it's actually providing something beyond what can be done easily in base R) – Ben Bolker Jan 20 '18 at 20:29
  • yup. that makes sense. – Prometheus Jan 20 '18 at 20:31
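
For reference, the two ways of reaching a column whose name contains a space, as discussed in the comments, can be sketched like this. The data frame `df` is a hypothetical stand-in for the real table:

```r
# Hypothetical data frame with a space in the column name, as in the question.
# check.names = FALSE stops data.frame() from rewriting it to "Gene.name".
df <- data.frame(
  "Gene name" = c("Crabp2", "Crabp2", "Zfp653"),
  check.names = FALSE
)

# Double brackets take the name as an ordinary string.
by_brackets <- table(df[["Gene name"]])

# With $, the name has to be protected by back-ticks.
by_backticks <- table(df$`Gene name`)
```

Both calls tabulate the same column, so the choice between them is a matter of style.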