
Suppose I have the following data set collected from a hypothetical survey:

name    age    homeowner    favorite_color    pets
Bill    45     Yes          Blue              (cat, dog, fish)
Mary    33     Yes          Red               (cat, dog)
Joe     55     Yes          Blue              (cat, bird, fish)
Sue     38     No           Green             (fish, bird)

Each person can give multiple responses for the types of pets they own.

Is there an easy way to create a scatterplot of the following with ggplot2?

x axis = homeowner
y axis = favorite_color
col = pets

Essentially, I'm looking to plot three categorical variables. I'm having trouble figuring out how best to extract the nested vector data for pets. For the sake of simplicity, let's say each person is only allowed to have one of each kind of pet.

At the intersection of (Yes, Blue), I am looking to see a jittered plot with:

  • 2 points for cat in the same color
  • 2 points for fish in the same color
  • 1 point for bird
  • 1 point for dog

Any help you can provide would be very much appreciated. I'm very new to R.

CJW
  • I'm thinking along the lines of `grep` to extract the pets into their own columns, then `reshape` to merge them into one column, so each person would have several rows, each with one pet. – Dan Slone Mar 13 '17 at 19:25
  • Welcome to StackOverflow. Please take a look at these tips on how to produce a [minimum, complete, and verifiable example](http://stackoverflow.com/help/mcve), as well as this post on [creating a great example in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – lmo Mar 13 '17 at 19:25
  • If it's actually a nested `list` column, then `tidyr::unnest` works well. If it's just a string, then I would `tidyr::separate` it and then `gather` it into long format. Please share some data reproducibly (see lmo's links) if you need more help. – Gregor Thomas Mar 13 '17 at 19:28
  • favorite color was arbitrarily chosen for y axis - could be any categorical string. – CJW Mar 13 '17 at 19:38

1 Answer

# Rebuild your data
survey <- data.frame(name = c("Bill", "Mary", "Joe", "Sue"),
                     age = c(45, 33, 55, 38),
                     homeowner = c(rep("Yes", times = 3), "No"),
                     favorite_color = c("Blue", "Red", "Blue", "Green"),
                     pets = c("(cat, dog, fish)",
                              "(cat, dog)",
                              "(cat, bird, fish)",
                              "(fish, bird)"))

# Specify all kinds of pets you have (someone else may have a better way here)
all_pets <- c("cat", "dog", "fish", "bird")

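# Build one row per (person, pet) pair
name <- NULL
pets <- NULL
for (i in seq_len(nrow(survey))) {
  for (j in seq_along(all_pets)) {
    # Does person i's pet string mention pet type j?
    if (grepl(all_pets[j], survey$pets[i], fixed = TRUE)) {
      name <- append(name, as.character(survey$name[i]))
      pets <- append(pets, all_pets[j])
    }
  }
}
new_survey <- data.frame(name, pets)
# Join the long pet rows back onto the survey by name
merged_survey <- merge(survey, new_survey, by = "name")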
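
As the comment on `all_pets` hints, the pet types do not have to be typed by hand. Here is a minimal base-R sketch, assuming every `pets` value is a parenthesised, comma-separated string as in the sample data. The names `pet_list` and `long_pets` are made up for this illustration, and only the two relevant columns are rebuilt:

```r
# Stand-in for the survey data (only the columns this sketch needs)
survey <- data.frame(
  name = c("Bill", "Mary", "Joe", "Sue"),
  pets = c("(cat, dog, fish)", "(cat, dog)", "(cat, bird, fish)", "(fish, bird)"),
  stringsAsFactors = FALSE
)

# Strip the parentheses, then split each string on a comma plus optional spaces
pet_list <- strsplit(gsub("[()]", "", survey$pets), ", *")

# One row per (name, pet) pair: repeat each name once per pet it owns
long_pets <- data.frame(
  name = rep(survey$name, times = lengths(pet_list)),
  pets = unlist(pet_list),
  stringsAsFactors = FALSE
)

# The distinct pet types, derived from the data instead of hardcoded
all_pets <- unique(long_pets$pets)
```

`long_pets` holds the same name/pet pairs as `new_survey` (possibly in a different row order), so it could be passed to `merge` in its place. Splitting on commas also avoids the substring matches that `grepl` allows, such as "cat" matching a hypothetical "caterpillar" entry.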

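Now merged_survey should have the information you need. Because both data frames have a pets column, merge keeps the original string as pets.x and the one-pet-per-row column as pets.y. We can now plot it with ggplot2.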

library(ggplot2)
# pets.y is the one-pet-per-row column created by merge()
g <- ggplot(merged_survey, aes(x = homeowner, y = favorite_color))
g + geom_point(aes(color = pets.y),
               position = position_jitter(width = 0.1, height = 0.1))


The position_jitter function jitters the points randomly on every call, so your points may not land in exactly the same positions as mine. You can adjust the jitter width and height by changing the numbers in position_jitter. All labels can be changed as well, but that is off-topic here.
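
If a repeatable plot matters, one option is the `seed` argument of `position_jitter` (available in ggplot2 3.0.0 and later, as far as I know), and `labs` can rename the axes and legend. This is a sketch only: `plot_data` is a small hypothetical stand-in for `merged_survey`, and the seed value and label text are arbitrary choices:

```r
library(ggplot2)

# Hypothetical stand-in with the three columns the plot uses
plot_data <- data.frame(
  homeowner      = c("Yes", "Yes", "Yes", "Yes", "No"),
  favorite_color = c("Blue", "Blue", "Blue", "Red", "Green"),
  pets.y         = c("cat", "cat", "fish", "dog", "bird")
)

p <- ggplot(plot_data, aes(x = homeowner, y = favorite_color)) +
  # A fixed seed makes the jitter identical on every run
  geom_point(aes(color = pets.y),
             position = position_jitter(width = 0.1, height = 0.1, seed = 42)) +
  # Replace the default column-name labels
  labs(x = "Homeowner", y = "Favorite color", color = "Pet")

p
```

Swapping `plot_data` for `merged_survey` should give the same layout as the plot above, just with fixed point positions.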

ytu