Scatterplot showing straight lines instead of scatter, how to fix?

Question

I am analyzing a data set of race, occupation, and income data for the Philadelphia region.

I was hoping to use ggplot for various data visualizations, but I am having serious trouble getting even a single one to look normal. Every plot looks incredibly crowded. I am doing something wrong, maybe with ggplot, maybe with the factoring, but I am not sure what.

This is my latest one, an attempted scatterplot.

ggplot(cps_data2, aes(x = INCWAGE_factor,
                      y = RACE_factor)) + 
xlab('Individual Income') + 
ylab('Race') +
geom_point()

That gives me this:

Here's my data set information. (See example of how I factored my variables).

cps_data2<-cps_data2 %>%
  mutate(INCWAGE_factor = as_factor(INCWAGE))

$ RACE_factor   : Factor w/ 9 levels "White","Black/African American/Negro",..: 1 1 2 2 1 8 2 1 1 2 ...
  ..- attr(*, "label")= chr "Race [general version]"

$ OCC_factor    : Factor w/ 429 levels "0","10","20",..: 42 302 1 22 254 291 1 112 418 1 ...
  ..- attr(*, "label")= chr "Occupation"

$ INCWAGE_factor: Factor w/ 654 levels "0","20","50",..: 521 283 1 529 328 311 1 1 283 1 ...
  ..- attr(*, "label")= chr "Wage and salary income"

$ SEX_factor    : Factor w/ 2 levels "Male","Female": 2 1 2 1 2 1 1 2 1 2 ...
  ..- attr(*, "label")= chr "Sex"

$ CITY_factor   : Factor w/ 1157 levels "Not in identifiable city (or size group)",..: 814 814 814 814 814 814 814 814 814 814 ...
  ..- attr(*, "label")= chr "City"

$ AGE_factor    : Factor w/ 46 levels "Less than 1 year old",..: 14 12 37 18 35 14 39 41 37 36 ...
  ..- attr(*, "label")= chr "Age"

The plot looks as expected given that every variable is a factor. [It would help](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to see some or all of `cps_data2` and to have an idea of the desired output. — neilfws, Apr 08 '19 at 22:42
It might be useful to look at `geom_jitter`, `geom_boxplot`, `geom_density`, the `ggridges` or `ggbeeswarm` packages to compare the distributions of these groups. — Jon Spring, Apr 08 '19 at 22:50

Peter_Evan · Answer 1 · 2019-04-08T22:56:57.293

score 0

Both your X and Y variables are factors, which are categorical. The y axis is as you would expect for a categorical variable such as race.

One improvement is to convert your x-axis variable to numeric: cps_data2$INCWAGE_factor <- as.numeric(as.character(cps_data2$INCWAGE_factor))
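
A minimal base-R sketch of why the conversion goes through as.character(): calling as.numeric() directly on a factor returns the internal level codes rather than the values stored in the labels. The vector below is toy data standing in for INCWAGE_factor, not the real column.

```r
# Toy factor whose levels are numbers stored as text (an assumption
# standing in for INCWAGE_factor; not the real data)
income_factor <- factor(c("0", "20", "50", "20", "1500"))

# as.numeric() alone returns the internal level codes, not the incomes
codes <- as.numeric(income_factor)

# Going through as.character() first recovers the actual values
income <- as.numeric(as.character(income_factor))

print(codes)
print(income)
```

If the original unfactored column is still available, converting that column directly is likely simpler than round-tripping through a factor.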

If you want to see your points more clearly, look at geom_jitter(), which adds a small amount of random noise to each point's position when plotting. See below:

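library(tidyverse)

# toy data: 200 rows, two categories, y drawn from a normal distribution with mean 50
z <- data.frame(x = rep(c('one','two','one','two'), 50), y = rnorm(200, mean = 50))

# without jitter
ggplot(z, aes(x, y)) + geom_point()

# with jitter: geom_jitter() replaces geom_point(), otherwise every point is drawn twice
ggplot(z, aes(x, y)) + geom_jitter()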

[Figures: No Jitter | With Jitter]

Of course, there are probably other ways to explore your data when a categorical variable is involved. As others have noted, box-and-whisker plots (geom_boxplot()) are a common choice.

edited Apr 08 '19 at 22:56

answered Apr 08 '19 at 22:49

Peter_Evan


I factored my variables because prior to doing so, I got this error: Don't know how to automatically pick scale for object of type haven_labelled. Defaulting to continuous. I also got this plot, which also seems wrong? https://imgur.com/a/tdr5aq8. I guess I am just having trouble figuring out which plot is best to compare distribution (I started typing as the other responses rolled in - I will try those out, thanks) – rhelpless Apr 08 '19 at 23:03
If you are using `haven` to import data from SAS, Stata, etc., I would encourage you to first try importing directly with `read_sas()` for SAS, `read_dta()` for Stata, etc. You may find your data imports OK with minimal tidying to be done afterwards. This would forgo using the `labelled` function, which can throw errors like the one you are seeing (but I am guessing here). Honestly, though, I would likely use something like the `ggridges` package @Jon Spring has shown. – Peter_Evan Apr 08 '19 at 23:20
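
For the haven_labelled message mentioned in the comments above, one possible approach is to convert each labelled column to the type ggplot should treat it as before plotting. The sketch below builds two hypothetical toy columns with haven::labelled() (the values and labels are made up for illustration, not taken from cps_data2) and converts one to a factor and the other to plain numbers.

```r
library(haven)

# Hypothetical labelled categorical column, standing in for something like RACE
race <- labelled(
  c(100, 200, 100, 651),
  labels = c("White" = 100, "Black" = 200, "Asian only" = 651),
  label = "Race [general version]"
)

# Hypothetical labelled numeric column, standing in for something like INCWAGE
incwage <- labelled(c(0, 20000, 52000, 31000), label = "Wage and salary income")

# Categorical variable: use the value labels as factor levels
race_factor <- as_factor(race)

# Continuous variable: drop the labelling and keep the numbers
incwage_num <- as.numeric(zap_labels(incwage))

print(race_factor)
print(incwage_num)
```

With columns prepared this way, income can go on a continuous axis and race on a discrete one.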

Jon Spring · Accepted Answer · 2019-04-08T23:11:50.760

Here's a plot similar to yours using mtcars, where mpg is a numeric variable (as your income variable presumably should be):

ggplot(mtcars, aes(mpg, as.factor(cyl))) + 
  geom_point()

Here's an approach that might work for you, using the ggridges package.

ggplot(mtcars, aes(mpg, as.factor(cyl))) + 
  ggridges::geom_density_ridges()

Here's an approach using geom_boxplot, which is typically oriented vertically and needs to be flipped around here.

ggplot(mtcars, aes(as.factor(cyl), mpg)) + geom_boxplot() + coord_flip()
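
As a possible alternative, assuming ggplot2 version 3.3.0 or later is installed, the boxplot can be drawn horizontally by putting the factor on the y axis directly, without coord_flip(). A short sketch on the same mtcars data:

```r
library(ggplot2)

# With the factor on y and the numeric variable on x, ggplot2 >= 3.3.0
# infers a horizontal orientation, so coord_flip() is not needed
p <- ggplot(mtcars, aes(mpg, as.factor(cyl))) +
  geom_boxplot()

print(p)
```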

score 0 · Answer 3 · answered Apr 09 '19 at 10:40

Sometimes a simple fix is to add transparency to the points to give an idea of density. Here is your code with two additional arguments:

ggplot(cps_data2, aes(x = INCWAGE_factor,
                      y = RACE_factor)) + 
xlab('Individual Income') + 
ylab('Race') +
geom_point(shape = 1, alpha = 0.2)
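
The code above still maps a factor to the x axis, so transparency may help more once income is numeric. The following is only a sketch on simulated data (the toy data frame and its column names are made up here, since cps_data2 is not available); it combines the numeric conversion with transparency and a little vertical jitter.

```r
library(ggplot2)

set.seed(1)  # reproducible toy data; not the real cps_data2
toy <- data.frame(
  race = factor(sample(c("A", "B", "C"), 300, replace = TRUE)),
  income_factor = factor(round(rlnorm(300, meanlog = 10, sdlog = 1)))
)

# Convert the factor back to numbers so the x axis is continuous
toy$income <- as.numeric(as.character(toy$income_factor))

# Hollow, semi-transparent points; jitter only vertically so income is not distorted
p <- ggplot(toy, aes(x = income, y = race)) +
  geom_jitter(shape = 1, alpha = 0.2, width = 0, height = 0.2) +
  xlab("Individual Income") +
  ylab("Race")

print(p)
```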
