4

I have been searching for this simple thing for hours, to no avail. I have a data frame in which one of the columns is the variable "country". I want two things:

  • Plot the most frequent countries, most frequent on top (partial solution found; EDIT: full solution found, so the question now focuses on limiting the output of the bar plot based on frequency);
  • Only show the top x most frequent countries, moving the rest into an 'Other' category.

I tried passing table() or summary() to ggplot, but that does not work. Is this even possible within ggplot, or should I use barchart? (I managed to do this with barchart, just using summary(df$something) and adding max = x.) I also want to stack the output by question (there are different questions about country).

Most frequent countries on top:

# sort() has no `increasing` argument; ascending order is decreasing = FALSE (the default)
ggplot(aDDs, aes(x = factor(answer,
                            levels = names(sort(table(answer), decreasing = FALSE))),
                 fill = question)) +
  geom_bar() +
  coord_flip()

Suggestions are very very welcome.

====== EDIT3: I continued working on the code based on the suggestion by @CMichael, but then ran into another, quite strange, problem. Because this 'ifelse' problem is a slightly different question from my original one, I have posted a separate question for it. Please check it here: R: ifelse function returns vector position instead of value (string)

====== EDIT:

The aDDs example is reproduced below - aDDs dataset can be downloaded here:

temp <- structure(list(student = c(2270285L, 2321254L, 75338L, 2071594L,1682771L, 1770356L, 2155693L, 3154864L, 3136979L, 2082311L),answer = structure(c(181L, 87L, 183L, 89L, 115L, 183L, 172L,180L, 175L, 125L), .Label = c("Congo", "Guinea-Bissau", "Solomon Islands","Central African Rep", "Comoros", "Equatorial Guinea", "Liechtenstein","Nauru", "Brunei", "Djibouti", "Kiribati", "Papua New Guinea","Samoa", "South Sudan", "Tajikistan", "Tonga", "Bhutan","Gabon", "Laos", "Lesotho", "Maldives", "Micronesia", "St Kitts and Nevis","Mozambique", "Niger", "Andorra", "Cape Verde", "Mauritania","Antigua and Deps", "Chad", "Guinea", "Malta", "Burundi","Eritrea", "Iceland", "Kyrgyzstan", "Turkmenistan", "Azerbaijan","Dominica", "Belize", "Malawi", "Mali", "Moldova", "Benin","Cuba", "Gambia", "Luxembourg", "St Lucia", "Angola", "Cambodia","Georgia", "Madagascar", "Oman", "Kosovo", "Kuwait", "Namibia","Bahrain", "Congo - Democratic Rep", "Montenegro", "Senegal","Sierra Leone", "Togo", "Botswana", "Fiji", "Libya", "Uzbekistan","Guyana", "Mongolia", "Somalia", "Zambia", "Estonia", "Ivory Coast","Myanmar", "Grenada", "Qatar", "Saint Vincent and the Grenadines","Tanzania", "Armenia", "Bahamas", "Belarus", "Burkina", "Liberia","Afghanistan", "Latvia", "Yemen", "Mauritius", "Albania","Barbados", "Iraq", "Macedonia", "Nicaragua", "Panama", "Slovenia","Lebanon", "Slovakia", "Kazakhstan", "Paraguay", "Korea South","Suriname", "Czech Republic", "Rwanda", "Haiti", "Lithuania","Israel", "Zimbabwe", "Cyprus", "Honduras", "Uruguay", "Syria","Finland", "Tunisia", "Taiwan", "Uganda", "Denmark", "Austria","Sri Lanka", "Vietnam", "Bosnia Herzegovina", "Thailand","Norway", "Trinidad and Tobago", "Switzerland", "Nepal","Sudan", "Jamaica", "Japan", "United Arab Emirates", "Bolivia","New Zealand", "Ethiopia", "Jordan", "Cameroon", "Croatia","Sweden", "Kenya", "Singapore", "Guatemala", "Ireland Republic","Saudi Arabia", "Bulgaria", "Malaysia", "Belgium", "Dominican Republic","Algeria", "El Salvador", 
"Bangladesh", "Serbia", "Ghana","Costa Rica", "Indonesia", "Hungary", "Venezuela", "Ecuador","Ukraine", "Romania", "Turkey", "China", "Morocco", "Russian Federation","Peru", "South Africa", "Argentina", "Portugal", "Iran","Poland", "Italy", "Chile", "France", "Germany", "Australia","Philippines", "Egypt", "Greece", "Nigeria", "Canada", "Pakistan","United Kingdom", "Mexico", "Colombia", "Brazil", "Netherlands","Spain", "India", "United States"), class = "factor"), question = c("C1-pres","C1-pres", "C1-pres", "C1-pres", "C1-pres", "C1-pres", "C1-pres","B1-pres", "B1-pres", "B1-pres")), .Names = c("student","answer", "question"), row.names = c("156", "203", "280", "347","412", "478", "534", "1649651", "1649691", "1649763"), class = "data.frame")
Thieme Hennis
  • Please post a reproducible example. (What is `aDDs`?) You can use `reproduce(aDDs)`. Instructions are here: [How to make a great R reproducible example](http://bit.ly/SORepro) – Ricardo Saporta Feb 06 '14 at 07:58
  • thanks for the tip: I already wondered about this. Very useful plugin. – Thieme Hennis Feb 06 '14 at 09:53
  • @ThiemeHennis you haven't provided data for aDDs!!!! – marbel Feb 06 '14 at 12:21
  • @MartínBel now I did! - see link under EDIT – Thieme Hennis Feb 06 '14 at 12:33
  • I made a followup question because the 'ifelse' question is somewhat different than the original one. Both need an answer still, but I think that if I have the answer to the ifelse problem, I'll probably manage. Link to new question: http://stackoverflow.com/questions/21604525/r-ifelse-function-returns-vector-position-instead-of-value-string – Thieme Hennis Feb 06 '14 at 13:37

2 Answers

3

For the filtering question you should introduce a new column:

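# as.character() keeps the country names; on a factor, ifelse() would return the integer level codes
data$filteredCountry = ifelse(data$value > threshold, as.character(data$country), "other")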

Now you can use filteredCountry as your x in the aesthetics.
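A minimal sketch of why the column type matters here, assuming `country` is stored as a factor (the toy vector below is made up for illustration and is not the asker's data). `ifelse()` strips attributes, so a factor in the "yes" branch comes back as its integer level codes:

```r
# Hypothetical toy data: a factor of country names.
# Levels are sorted alphabetically: Brazil = 1, India = 2, Spain = 3.
country <- factor(c("Spain", "India", "Spain", "Brazil"))

# ifelse() drops the factor attributes, so the "yes" branch yields the
# underlying integer codes (turned into strings by the "other" branch).
codes <- ifelse(country == "Spain", country, "other")

# Converting to character first keeps the labels.
labels <- ifelse(country == "Spain", as.character(country), "other")

print(codes)   # "3" "other" "3" "other"
print(labels)  # "Spain" "other" "Spain" "other"
```

If the plot shows numbers instead of country names, checking `class()` of the column is a quick first diagnostic.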

The data ordering question pops up every now and then (e.g., ggplot2: sorting a plot). You need to order your country factor levels by the underlying values. Your reorder command seems to sort by country name again; I would expect something like reorder(country, frequency), but sample data would help.
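As a rough illustration of what ordering factor levels by the underlying values looks like, here is a sketch on a made-up frequency table (the `counts` data frame and its numbers are assumptions, not taken from the question):

```r
# Hypothetical summary table: one row per country with its frequency.
counts <- data.frame(
  country = c("Spain", "India", "Brazil"),
  freq    = c(5, 12, 8)
)

# reorder() sorts the factor levels by the second argument, ascending.
counts$country <- reorder(counts$country, counts$freq)
asc <- levels(counts$country)

# Negating the values puts the most frequent level first.
desc <- levels(reorder(counts$country, -counts$freq))

print(asc)   # "Spain" "Brazil" "India"
print(desc)  # "India" "Brazil" "Spain"
```

ggplot2 draws discrete axes in level order, so the reordered factor is what controls the bar order.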

UPDATE: With the data now provided, it becomes obvious that you need to create a summary dataset:

library(plyr)
data <- read.table("aDDs.csv", sep = ",", header = TRUE)
# one row per country, with freq = number of answers naming it
summary <- ddply(data, .(answer), summarise, freq = length(answer))

This yields the data frame summary with one entry for each country (181 in total). Now you can do the filtering and the reordering:

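threshold <- quantile(summary$freq, 0.9)
# as.character() is needed: on a factor, ifelse() returns the integer level codes instead of the names
summary$filteredCountry <- ifelse(summary$freq > threshold, as.character(summary$answer), "other")
summary$filteredCountry <- reorder(summary$filteredCountry, -summary$freq)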

Now you can plot:

library(ggplot2)
p <- ggplot(data = summary, aes(x = filteredCountry, y = freq))
p <- p + geom_bar(aes(fill = filteredCountry), stat = "identity")
p
CMichael
  • Not successful yet. I made a new dataframe 'temp' that summarizes the answer(country) column with a maximum of 12 results. I used that new dataframe in the ifelse function to create a new column. However, something very strange happens: rather than adding the respective aDDs$answer value (a country name in my case), it adds a number (and I am not sure what this number means). If I add "string" instead of aDDs$answer, then string is copied to the respective cell. It just does not work with aDDs$answer. See the EDIT2 in the main question. – Thieme Hennis Feb 06 '14 at 12:11
  • hi @CMichael - I executed your code and although it is a different approach, it returns the same output under the summary$filteredCountry column: for TRUE it returns a value between 1 and 181 (the number of countries) and for FALSE it returns "other" (which is correct). The resulting plot also does not contain Country names, rather it contains country numbers. There just seems to be a problem with the `ifelse` function: it does not return the actual cell value, rather the row.number (or position?) – Thieme Hennis Feb 06 '14 at 13:51
  • Oh then there must be a glitch with the data type. will check later again. – CMichael Feb 06 '14 at 14:02
  • true. Found a solution in the other question I posted: something I had tried before but probably wrong `aDDs$answer <- as.character(aDDs$answer)` – Thieme Hennis Feb 06 '14 at 14:12
  • But then your top x countries question part is answered? – CMichael Feb 06 '14 at 14:16
  • using my own code yes, but I tried a few times with your code, but not yet successfully. Your code also takes the new df (without the question column) so no stacked bars are created. See my own answer. Thanks for helping and pointing me into the right direction! – Thieme Hennis Feb 06 '14 at 14:27
1

Thanks to suggestions from @CMichael and answers to another, related post here on SO, I managed to create a stacked and ordered bar plot using ggplot:

Create a list with the most frequent country names

# summary() on a factor keeps the (maxsum - 1) most frequent levels plus "(Other)"
temp <- names(summary(aDDs$answer, maxsum = 12))
aDDs$answer <- as.character(aDDs$answer) # IMPORTANT! Here was the problem: turn into character values

Create a new column that filters the top results

aDDs$top <- ifelse(
        aDDs$answer %in% temp, ## condition: match aDDs$answer with row.names in summary df 
        aDDs$answer, ## then it should be named as aDDs$answer
        "Other" ## else it should be named "Other"
      )
aDDs$top <- as.factor(aDDs$top) # factorize the output again

Plot

# sort() has no `increasing` argument; ascending order is decreasing = FALSE (the default)
ggplot(aDDs, aes(x = factor(top,
                            levels = names(sort(table(top), decreasing = FALSE))),
                 fill = question)) +
  geom_bar() +
  coord_flip()

And here is the output (it still needs some tweaking, but it is what I wanted):

[Image: demo-solar]
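The steps in this answer could also be wrapped in a small helper. The sketch below is only one possible way to do it in base R; `lump_top_n` and the sample `answers` vector are hypothetical names introduced for illustration, not part of the original code:

```r
# Hypothetical helper: keep the n most frequent values, lump the rest.
lump_top_n <- function(x, n, other = "Other") {
  x <- as.character(x)                      # avoid factor integer codes
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  lumped <- ifelse(x %in% keep, x, other)
  # Order levels by ascending frequency so that, after coord_flip(),
  # the most frequent level is drawn at the top.
  factor(lumped, levels = names(sort(table(lumped))))
}

# Made-up answers: India x4, Spain x3, Brazil x1, Peru x1.
answers <- c("Spain", "India", "India", "Brazil", "India",
             "Spain", "Peru", "India", "Spain")
top <- lump_top_n(answers, n = 2)

print(levels(top))  # "Other" "Spain" "India"
print(table(top))
```

The result could then be assigned to `aDDs$top` and plotted with `aes(x = top, fill = question)` as above. Note that "Other" is ordered by its own frequency like any other level; pinning it to the bottom would need an extra step.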

Thieme Hennis
  • :) Nice to see that it worked out - I did not realize the embedded question on the stacked barchart for the questions... But even better that you put it together yourself! – CMichael Feb 06 '14 at 15:54