
I have a dataset of genetic data for a collection of bacterial strains, and I want to plot a heat map showing the prevalence of a number of alleles across my strains, grouped by a grouping variable.

My raw data is a large data frame with one row per strain, one column for the grouping variable, and one column per genetic determinant. I am struggling to create the heat map with ggplot because, as far as I understand, it needs the values to be plotted as a matrix, and I don't know how to transform my raw data into that form. My original data frame looks like this (an excerpt, for simplicity):

   Sample group A B C D E F
1       1    10 0 1 0 0 0 0
2       2    10 0 1 0 0 0 0
3       3    10 0 1 0 0 0 0
4       4    10 0 1 0 0 0 0
5       5    38 0 1 0 0 0 0
6       6    38 0 1 0 0 0 0
7       7    38 1 1 0 0 0 0
8       8    69 0 1 0 0 0 0
9       9    69 0 1 0 0 0 0
10     10    69 0 1 0 0 0 0
11     11    69 0 1 0 0 0 0
12     12    69 0 1 0 0 0 0
13     13    69 0 1 0 0 0 0
14     14    73 0 0 0 0 0 0
15     15    73 0 0 0 0 0 0
16     16    73 0 0 0 0 0 0
17     17    73 0 0 0 0 0 0
18     18    73 0 0 0 0 0 0
19     19    73 0 0 0 0 0 0
20     20    73 0 0 0 0 0 0
21     21    73 0 0 0 0 0 0
22     22    73 0 0 0 0 0 0
23     23    73 0 0 0 0 0 0
24     24    73 0 0 0 0 0 0
25     25    73 1 0 0 0 0 0
26     26    73 0 0 0 0 0 0
27     27    95 0 0 0 0 0 0
28     28    95 0 0 0 0 0 0
29     29    95 0 0 0 0 0 0
30     30    95 0 0 0 0 0 0
31     31    95 0 0 0 0 0 0
32     32   127 0 0 0 0 0 0
33     33   127 0 0 0 0 0 0
34     34   127 0 0 0 0 0 0
35     35   127 0 0 0 0 0 0

A-F are the allele variables; '0' means the allele is absent and '1' means it is present. For each group (10, 38, 69, 73, 95, 127), I want to count the occurrences of '1' and express them as a proportion of all observations in that group, i.e. '1'/('1'+'0'). For example, group 38 has one '1' in column A out of three strains, giving 1/3 ≈ 0.333. The result should be a matrix like this:

  group     A B C D E F
1    10 0.000 1 0 0 0 0
2    38 0.333 1 0 0 0 0
3    69 0.000 1 0 0 0 0
4    73 0.077 0 0 0 0 0
5    95 0.000 0 0 0 0 0
6   127 0.000 0 0 0 0 0
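
To make the calculation concrete, here is a minimal sketch of what I have in mind. It relies on the mean of a 0/1 column being the proportion of '1' values. The toy data frame `df` is rebuilt by hand from the excerpt above, and using base R `aggregate()` is only my assumption about one possible approach, not necessarily the best one for a large dataset.

```r
# Toy data rebuilt from the excerpt above (35 strains, 6 groups)
df <- data.frame(
  Sample = 1:35,
  group  = rep(c(10, 38, 69, 73, 95, 127), times = c(4, 3, 6, 13, 5, 4)),
  A = 0, B = 0, C = 0, D = 0, E = 0, F = 0
)
df$A[c(7, 25)] <- 1   # the two strains carrying allele A
df$B[1:13]     <- 1   # allele B is present in groups 10, 38 and 69

# Drop the Sample column, then average every remaining allele column per group.
# For 0/1 data the mean is count('1') / (count('1') + count('0')).
alleles_only <- df[, setdiff(names(df), "Sample")]
prev <- aggregate(. ~ group, data = alleles_only, FUN = mean)

print(round(prev, 3))
```

The `. ~ group` formula picks up every column other than `group`, so the allele names would not need to be typed out for a wider table.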

My dataset is very large, so calculating the values manually and typing in the matrix as in this example is not feasible. Is there a smart way to do this in R and then plot the result as a heat map?
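
For the plotting step, my understanding is that `geom_tile()` in ggplot2 works on a long-format data frame (one row per group/allele pair) rather than on a matrix. Below is a rough sketch under that assumption. The `prev` table is typed in from the example matrix above purely for illustration, and the variable names `long` and `prevalence` as well as the colours are my own choices.

```r
library(ggplot2)  # assumed to be installed

# Illustrative per-group prevalence table (values from the example matrix above)
prev <- data.frame(
  group = c(10, 38, 69, 73, 95, 127),
  A = c(0, 0.333, 0, 0.077, 0, 0),
  B = c(1, 1, 1, 0, 0, 0),
  C = 0, D = 0, E = 0, F = 0
)

# Reshape to long format with base R: one row per group/allele combination
alleles <- setdiff(names(prev), "group")
long <- data.frame(
  group      = factor(rep(prev$group, times = length(alleles))),
  allele     = rep(alleles, each = nrow(prev)),
  prevalence = unlist(prev[alleles], use.names = FALSE)
)

# One tile per group/allele pair, coloured by prevalence on a fixed 0-1 scale
p <- ggplot(long, aes(x = allele, y = group, fill = prevalence)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue", limits = c(0, 1)) +
  labs(x = "Allele", y = "Group", fill = "Prevalence")
print(p)
```

Treating `group` as a factor keeps the groups as discrete rows instead of spacing them along a numeric axis.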

Any help is much appreciated. Thank you

kruemelprinz
