7

I've just started using the geom_map function in ggplot2. After having read the 29 posts I could find on geom_map on here, I'm still running into the same problem.

My data frame is ridiculously big and contains over 2000 rows. It's basically data on a specific gene (TP53) that's been compiled from the WHO.

Please download it from here.

The header looks as follows:

> head(ARCTP53_SOExample)
  Mutation_ID MUT_ID hg18_Chr17_coordinates hg19_Chr17_coordinates ExonIntron Genomic_nt Codon_number
1          16   1789                7519192                7578467     5-exon      12451          155
2          13   1741                7519200                7578475     5-exon      12443          152
3          17   2143                7519131                7578406     5-exon      12512          175
4          14   2143                7519131                7578406     5-exon      12512          175
5          15   2168                7519128                7578403     5-exon      12515          176
6          12   3737                7517845                7577120     8-exon      13798          273
  Description c_description g_description       g_description_hg18 WT_nucleotide Mutant_nucleotide
1         A>G      c.463A>G  g.7578467T>C NC_000017.9:g.7519192T>C           A                   G
2         C>T      c.455C>T  g.7578475G>A NC_000017.9:g.7519200G>A           C                   T
3         G>A      c.524G>A  g.7578406C>T NC_000017.9:g.7519131C>T           G                   A
4         G>A      c.524G>A  g.7578406C>T NC_000017.9:g.7519131C>T           G                   A
5         G>T      c.527G>T  g.7578403C>A NC_000017.9:g.7519128C>A           G                   T
6         G>A      c.818G>A  g.7577120C>T NC_000017.9:g.7517845C>T           G                   A
  Splice_site CpG_site           Type Mut_rate WT_codon Mutant_codon WT_AA Mutant_AA ProtDescription
1          no       no        A:T>G:C    0.170      ACC          GCC   Thr       Ala         p.T155A
2          no      yes G:C>A:T at CpG    1.243      CCG          CTG   Pro       Leu         p.P152L
3          no      yes G:C>A:T at CpG    1.280      CGC          CAC   Arg       His         p.R175H
4          no      yes G:C>A:T at CpG    1.280      CGC          CAC   Arg       His         p.R175H
5          no       no        G:C>T:A    0.054      TGC          TTC   Cys       Phe         p.C176F
6          no      yes G:C>A:T at CpG    1.335      CGT          CAT   Arg       His         p.R273H
  Mut_rateAA   Effect Structural_motif Putative_stop Sample_Name Sample_ID Sample_source Tumor_origin Grade
1      0.170 missense NDBL/beta-sheets             0    CAS91-19        17       surgery      primary      
2      1.243 missense NDBL/beta-sheets             0     CAS91-4        14       surgery      primary      
3      1.280 missense            L2/L3             0    CAS91-13        12       surgery      primary      
4      1.280 missense            L2/L3             0     CAS91-5        15       surgery      primary      
5      0.054 missense            L2/L3             0     CAS91-1        16       surgery      primary      
6      1.335 missense          L1/S/H2             0     CAS91-3        13       surgery      primary      
  Stage TNM p53_IHC KRAS_status Other_mutations Other_associations
1              <NA>        <NA>            <NA>                   
2              <NA>        <NA>            <NA>                   
3              <NA>        <NA>            <NA>                   
4              <NA>        <NA>            <NA>                   
5              <NA>        <NA>            <NA>                   
6              <NA>        <NA>            <NA>                   
                                                                 Add_Info Individual_ID  Sex Age Ethnicity
1 Mutation only present in adjacent dysplastic area (Barrett's esophagus)            17 <NA>  NA          
2 Mutation only present in adjacent dysplastic area (Barrett's esophagus)            14 <NA>  NA          
3 Mutation only present in adjacent dysplastic area (Barrett's esophagus)            12 <NA>  NA          
4 Mutation only present in adjacent dysplastic area (Barrett's esophagus)            15 <NA>  NA          
5                                                                                    16 <NA>  NA          
6      Mutation absent from adjacent dysplasia area (Barrett's esophagus)            13 <NA>  NA          
  Geo_area Country            Development       Population   Region TP53polymorphism Germline_mutation
1              USA More developed regions Northern America Americas                                 NA
2              USA More developed regions Northern America Americas                                 NA
3              USA More developed regions Northern America Americas                                 NA
4              USA More developed regions Northern America Americas                                 NA
5              USA More developed regions Northern America Americas                                 NA
6              USA More developed regions Northern America Americas                                 NA
  Family_history Tobacco Alcohol Exposure Infectious_agent Ref_ID Cross_Ref_ID  PubMed Exclude_analysis
1                   <NA>    <NA>     <NA>             <NA>      4           NA 1868473            False
2                   <NA>    <NA>     <NA>             <NA>      4           NA 1868473            False
3                   <NA>    <NA>     <NA>             <NA>      4           NA 1868473            False
4                   <NA>    <NA>     <NA>             <NA>      4           NA 1868473            False
5                   <NA>    <NA>     <NA>             <NA>      4           NA 1868473            False
6                   <NA>    <NA>     <NA>             <NA>      4           NA 1868473            False
  WGS_WXS
1      No
2      No
3      No
4      No
5      No
6      No

In any case, I want to create a simple world-map which will color the countries, where this mutation has been studied, and if more or less "mutation signatures" come from these countries.

You might understand better what I'm trying to do if you see this:

summary(ARCTP53_SOExample$Country)
Australia                  Brazil                  Canada                   China 
                      1                     127                      76                     519 
       China, Hong-Kong Chinese Taipei (Taiwan)          Czech Republic                   Egypt 
                     52                      36                       9                       9 
                 France                 Germany                   India                    Iran 
                    195                      10                      63                     112 
                Ireland                   Italy                   Japan                   Kenya 
                     25                      30                     414                      11 
           South Africa                   Spain             Switzerland                Thailand 
                     13                       2                      24                      35 
        The Netherlands                      UK                 Uruguay                     USA 
                      6                      17                       6                     189 
                   NA's 
                     30 

So some countries come up multiple times in my data.frame.

So this is what I did in the hope of getting my desired map:

library(ggplot2)
library(maps)
world_map<-map_data("world")
ggplot(ARCTP53_SOExample)+geom_map(map = world_map, aes(map_id = Country,fill = Country),
+ colour = "black") +
+ expand_limits(x = world_map$long, y = world_map$lat)

And this is what I get: This map only contains the countries in my list...

Does anyone have any input on what I'm doing wrong?

Further, what I'd then like to do down the road, is add geom_bar() of the ExonIntron column onto the different countries. But, I'd like to try and generate the right map first?

Thanks a mill.

hrbrmstr
  • 77,368
  • 11
  • 139
  • 205
OFish
  • 474
  • 1
  • 9
  • 19

1 Answers1

13

The missing countries in the ARC… data frame == missing regions on the map which can be compensated for with base layer made from the world_map data frame:

library(maps)

world_map<-map_data("world")

gg <- ggplot(ARCTP53_SOExample)

# need one layer with ALL THE THINGS (well, all the regions)
gg <- gg + geom_map(dat=world_map, map = world_map, 
                    aes(map_id=region), fill="white", color="black")

# now we can put the layer we really want
gg <- gg + geom_map(map = world_map, 
                    aes(map_id = Country, fill = Country), colour = "black")

gg <- gg + expand_limits(x = world_map$long, y = world_map$lat)
gg <- gg + theme(legend.position="none")
gg

map1

I removed the legend since using a choropleth kinda assumes folks know geography.

NOTE: Using a distinct color per-region (country) is really not a good idea. Since you're really trying to only highlight where the mutation has been studied, a single color should suffice:

gg <- ggplot(ARCTP53_SOExample)
gg <- gg + geom_map(dat=world_map, map = world_map, aes(map_id=region), 
                    fill="white", color="black")
gg <- gg + geom_map(map = world_map, aes(map_id = Country), 
                    fill = "steelblue", colour = "black")
gg <- gg + expand_limits(x = world_map$long, y = world_map$lat)
gg <- gg + theme(legend.position="none")
gg

map2

Since you eventually want to tell the story of ExonIntron, you might want to consider using that as the color of the choropleth. I know nothing of genes, so I don't know if a gradient makes sense or if distinct colors are the way to go. I will posit that the the plethora of distinct colors created by the following code makes me think that you might want to do one gradient scale for intron and one for extron. Again, I'm not a gene person.

gg <- ggplot(ARCTP53_SOExample)
gg <- gg + geom_map(dat=world_map, map = world_map, aes(map_id=region), 
                    fill="white", color="black")
gg <- gg + geom_map(map = world_map, aes(map_id = Country, fill = ExonIntron), 
                    colour = "black")
gg <- gg + expand_limits(x = world_map$long, y = world_map$lat)
gg

map3

Some of the colors are either in really tiny regions, or they are in regions whose names don't match the names in world_map$region. You'll probably want to take a look at that. This:

wm.reg <- unique(as.character(world_map$region))
arc.reg <- unique(as.character(ARCTP53_SOExample$Country))

arc.reg %in% wm.reg
##  [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [14]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE

kinda shows some are missing.

You might also want to consider doing the legend differently (i.e. put it on the bottom) if you go with a legend vs build your own table of results.

UPDATE

I almost forgot. Since you (most likely) have no need to Antarctica, you should get rid of it since it's eating up quite a bit of valuable space:

world_map <- subset(world_map, region!="Antarctica")

gg <- ggplot(ARCTP53_SOExample)
gg <- gg + geom_map(dat=world_map, map = world_map, aes(map_id=region), 
                    fill="white", color="black")
gg <- gg + geom_map(map = world_map, aes(map_id = Country, fill = ExonIntron), 
                    colour = "black")
gg <- gg + expand_limits(x = world_map$long, y = world_map$lat)
gg <- gg + theme(legend.position="none")
gg

map4

(NOTE: I got rid of the legend since I really think you should re-think how you want the colors on the map and then use an extra table or plot to act as the legend)


FINAL UPDATE (per OP request in the comments below)

library(ggplot2)
library(maps)
library(plyr)
library(gridExtra)

ARCTP53_SOExample <- read.csv("dat.csv")

# reduce all the distinct exon/introns to just exon or intron

ARCTP53_SOExample$EorI <- factor(ifelse(grepl("exon", 
                                              ARCTP53_SOExample$ExonIntron, 
                                              ignore.case = TRUE), 
                                        "exon", "intron"))

# extract summary data for the two variables we care about for the map

arc.combined <- count(ARCTP53_SOExample, .(Country, EorI))
colnames(arc.combined) <- c("region", "EorI", "ei.ct")

# get total for country (region) and add to the summary info

arc.combined <- merge(arc.combined, count(arc.combined, .(region), wt_var=.(ei.ct)))
colnames(arc.combined) <- c("region", "EorI", "ei.ct", "region.total")

# it wasn't specified if the "EorI" is going to be used on the map so 
# we won't use it below (but we could, now)

# get map and intercourse Antarctica

world_map <- map_data("world")
world_map <- subset(world_map, region!="Antarctica")

# this will show the counts by country with all of the "chart junk" removed
# and the "counts" scaled as a gradient, and with the legend at the top

gg <- ggplot(arc.combined)
gg <- gg + geom_map(dat=world_map, map = world_map, aes(map_id=region), 
                    fill="white", color="#7f7f7f", size=0.25)
gg <- gg + geom_map(map = world_map, aes(map_id = region, fill = region.total), size=0.25)
gg <- gg + scale_fill_gradient(low="#fff7bc", high="#cc4c02", name="Tumor counts")
gg <- gg + expand_limits(x = world_map$long, y = world_map$lat)
gg <- gg + labs(x="", y="", title="Tumor contribution by country")
gg <- gg + theme(panel.grid=element_blank(), panel.border=element_blank())
gg <- gg + theme(axis.ticks=element_blank(), axis.text=element_blank())
gg <- gg + theme(legend.position="top")
gg

mapb

# BUT you might want to show the counts by intron/exon by country
# SO we do a separate map for each factor and combine them
# with some grid magic. This provides more granular control over
# each choropleth (in the event one wanted to tweak one or the other)

# exon

gg <- ggplot(arc.combined[arc.combined$EorI=="exon",])
gg <- gg + geom_map(dat=world_map, map = world_map, aes(map_id=region), 
                    fill="white", color="#7f7f7f", size=0.25)
gg <- gg + geom_map(map = world_map, aes(map_id = region, fill = ei.ct), size=0.25)
gg <- gg + scale_fill_gradient(low="#f7fcb9", high="#238443", name="Tumor counts")
gg <- gg + expand_limits(x = world_map$long, y = world_map$lat)
gg <- gg + labs(x="", y="", title="Tumor contribution by 'exon' & country")
gg <- gg + theme(panel.grid=element_blank(), panel.border=element_blank())
gg <- gg + theme(axis.ticks=element_blank(), axis.text=element_blank())
gg <- gg + theme(legend.position="top")
gg.exon <- gg

# intron

gg <- ggplot(arc.combined[arc.combined$EorI=="intron",])
gg <- gg + geom_map(dat=world_map, map = world_map, aes(map_id=region), 
                    fill="white", color="#7f7f7f", size=0.25)
gg <- gg + geom_map(map = world_map, aes(map_id = region, fill = ei.ct), 
                    colour = "#7f7f7f", size=0.25)
gg <- gg + scale_fill_gradient(low="#ece7f2", high="#0570b0", name="Tumor counts")
gg <- gg + expand_limits(x = world_map$long, y = world_map$lat)
gg <- gg + labs(x="", y="", title="Tumor contribution by 'intron' & country")
gg <- gg + theme(panel.grid=element_blank(), panel.border=element_blank())
gg <- gg + theme(axis.ticks=element_blank(), axis.text=element_blank())
gg <- gg + theme(legend.position="top")
gg.intron <- gg

# use some grid magic to combine them into one plot

grid.arrange(gg.exon, gg.intron, ncol=1)

mapb

hrbrmstr
  • 77,368
  • 11
  • 139
  • 205
  • You my dear sir are an absolute genius! Thank you so much. Now that I've learnt how to build the map, I of course have some further questions. First, as you can see in the summary(Country), some countries have contributed more tumors that were studied than others (that's what the summary is telling you). How could one use the "count" of the Country variable to be used as a colour to fill the respective area? Also - and this is a syntax issue - how would I define the ExonIntron to only be counted as either "Exon" or Intron" as you propose? Thanks so much! Really amazing! – OFish Apr 05 '14 at 07:08
  • 1
    done. you should be well on your way to a choropleth master :-) – hrbrmstr Apr 05 '14 at 11:40
  • 1
    forgot to add that i didn't "solve" your errant country/region names issue, so you *really* need to do that before finishing the choropleth. – hrbrmstr Apr 05 '14 at 12:41
  • Ok. This is absolutely amazing. I can't express my gratitude enough. I really understand now how you put the map together, which is so awesome (thank you also 'ggplot2'!), but I realize that I have a huge lack of knowledge in "basic" R syntax. Never seen the use of "grepl" and also think have only used the 'plyr' package once (not even sure how). Never used the 'merge' function either....holy cow. Where did you include the 'gridExtra' package in your code? Also regarding your last comment, I know...I'm just thinking of how to do that in a 'data.frame' that has >2000 rows. This is so awesome. – OFish Apr 06 '14 at 02:10
  • 1
    Glad it helped! The 'grid.arrange' function is in the `gridExtra` package. I made the code easier to grab over at github: https://gist.github.com/hrbrmstr/0a330b212b46517ffee9 and included a snippet of how you might go about seeing the errant region names. You can use `aggregate` from the base R package vs `count` from `plyr` but I think the `count` syntax is cleaner. – hrbrmstr Apr 06 '14 at 12:01
  • You're truly a star. I think this goes under one of the most helpful posts I've ever had on SO. May I just ask briefly: In the `merge()` and `count()` function, what are the "." for in the syntax? I checked the help function in R and it didn't explain that. Will def have to find some tutorials on the `plyr`package. Theoretically I would have a further query...but I have a bad conscience asking you for more help. What you've done so far is amazing. If we use any of this in our paper, I really want to acknowledge your help, so might need your proper name for acknowledgement purposes! – OFish Apr 07 '14 at 00:09
  • 1
    :-) so the `.(region)` syntax is syntactic sugar from the `plyr` package for to specifying column names. It's equivalent to `c("region")` (and the `c()` thing isn't necessary, but it's force of habit for me). you can hit me up at bob at rudis dot net for more info at any time (follow-up on SO is kinda hard, anyway :-). just glad to help. – hrbrmstr Apr 07 '14 at 00:26
  • I've also taken our discussion over to github, so not to crowd the SO space. Especially because your code re the errant region names has shown some interesting things. Thx. O. – OFish Apr 07 '14 at 00:28
  • Haha. You're killing me. Always got a new surprise lined up haven't you? Will send you a mes as easier I guess. Thanks again. Totally amazed by your skills. – OFish Apr 07 '14 at 00:46