I'm having trouble following an example provided by my professor. We're meant to work through the provided examples to understand the code and how it is implemented, and then do a different assignment based on the topics the examples cover.

I'm having problems implementing a scatter plot from the example. The example uses the Adult dataset from the UCI Machine Learning Repository and has the following code.

#install.packages("ggplot2")
library(ggplot2)

#import data
adult = read.csv("adult.DATA", header = FALSE, stringsAsFactors = TRUE)
summary(adult)
colnames(adult)

#remove similar columns and rename
adult_trim = adult[,-c(3,4,11,12)]
names(adult_trim) <- c("Age", "WorkClass", "Education", "Marital.Status", "Occupation", "Relationship", "Race",
                   "Sex", "Hours.per.Week", "Native.Country", "Income")

#remove empty values & Race/NativeCountry
adult_trim <- adult_trim[rowSums(adult_trim == "?") ==0, -c(7,10), drop = FALSE]

The problem is in the following scatterplot. The data doesn't have any header values for column names, so it imports as V1, V2, etc.

adult$V4 = as.factor(as.character(adult$V4))
levels(adult$V4)
plot(
  jitter(as.numeric(adult$V4),0.5) ~ jitter(as.numeric(adult$V4), 0.5),
  data = adult_trim,
  xlab = "Income",
  ylab = "Education",
  pch = 19, 
  cex = 1, 
  bty = "n",
  xlim = c(1:2),
  col = rgb(180,0,180,30, maxColorValue = 255)
 )

When trying to implement this plot on my machine it just gives me an error.

Warning message:
In plot.formula(jitter(as.numeric(adult$V4), 0.5) ~ jitter(as.numeric(adult$V4),  :
  c("the formula 'jitter(as.numeric(adult$V4), 0.5) ~ jitter(as.numeric(adult$V4), ' 
 is treated as 'jitter(as.numeric(adult$V4), 0.5) ~ 1'", "the formula '    0.5)' 
 is treated as 'jitter(as.numeric(adult$V4), 0.5) ~ 1'")

It's supposed to look like this graph but with education https://i.stack.imgur.com/EPfhX.png but I'm just getting the error. Also is there any reason this decides to use the original "adult" instead of "adult_trim"?

Any help or explanation would be appreciated.

1 Answer

Also is there any reason this decides to use the original "adult" instead of "adult_trim"?

It uses the original adult instead of adult_trim because in the jitter function you explicitly specify adult$V4. Your use of adult there overrides the data = adult_trim argument later on. With the data argument provided, you should just use the column name and rely on the data argument to point plot to the correct data frame to look in to find the column.
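As a minimal sketch of that lookup rule, using two made-up data frames (`df_a`, `df_b`, and the column `x` are hypothetical, not part of the Adult data): a bare column name in a formula is looked up in `data`, while an explicit `df$column` reference takes its values from the data frame it names and ignores `data`.

```r
# Two hypothetical data frames that share a column name but hold different values
df_a <- data.frame(x = c(1, 2, 3))
df_b <- data.frame(x = c(10, 20, 30))

# Bare column name: the formula looks x up in the data argument (df_b)
from_data <- model.frame(~ x, data = df_b)[[1]]

# Explicit df_a$x: the values come from df_a even though data = df_b
from_dollar <- model.frame(~ df_a$x, data = df_b)[[1]]

print(from_data)    # values taken from df_b
print(from_dollar)  # values taken from df_a
```

`plot()` with a formula builds its data the same way, which is why `adult$V4` pulls from `adult` regardless of `data = adult_trim`.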

The problem is in the following scatterplot. The data doesn't have any header values for column names, so it imports as V1, V2, etc.

But you also show code to replace the default column names in adult_trim. After you run the line

names(adult_trim) <- c("Age", "WorkClass", "Education", "Marital.Status", "Occupation", "Relationship", "Race",
                   "Sex", "Hours.per.Week", "Native.Country", "Income")

then adult_trim has those column names, and it doesn't remember anything about V1, V2, V3, V4, etc.
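A small illustration of that point, using a hypothetical two-column data frame in place of the real import (the values are made up for the example):

```r
# Hypothetical stand-in for a header-less import with default names V1, V2
df <- data.frame(V1 = c(39, 50), V2 = c("Private", "State-gov"))

# Replace the default names, as the question's code does for adult_trim
names(df) <- c("Age", "WorkClass")

print(names(df))            # only the new names remain
print("V1" %in% names(df))  # FALSE: the old name no longer exists
print(is.null(df$V1))       # TRUE: asking for a missing column with $ gives NULL
```

Note that `adult` itself is untouched by the renaming, so `adult$V4` still works there; only `adult_trim` has the new names.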

When you use a formula (with ~) inside plot(), you should use yvalues ~ xvalues. You have

jitter(as.numeric(adult$V4),0.5) ~ jitter(as.numeric(adult$V4), 0.5)

which uses jitter(as.numeric(adult$V4),0.5) for both the x and y values, refers to the wrong data frame (overriding the data = argument), and uses an old column name. I would instead try
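This is most likely also where the "is treated as ... ~ 1" warning comes from. As a rough sketch with a hypothetical data frame `df` (this is my understanding of how `model.frame()` handles formulas, not something specific to the Adult data): when the same expression appears on both sides of `~`, the model frame ends up with only one variable, so there is nothing left to plot on the x axis.

```r
# Hypothetical data frame with two numeric columns
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Same expression on both sides: the right-hand side adds no new variable
same_sides <- model.frame(a ~ a, data = df)

# Different expressions: one response column and one predictor column
two_vars <- model.frame(a ~ b, data = df)

print(names(same_sides))  # a single variable
print(names(two_vars))    # response and predictor
```

With only the response available, `plot()` has no x variable and falls back to treating the formula as `y ~ 1`.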

plot(
  jitter(as.numeric(Education), 0.5) ~ jitter(as.numeric(Income), 0.5),
  data = adult_trim,
  xlab = "Income",
  ylab = "Education",
  pch = 19, 
  cex = 1, 
  bty = "n",
  xlim = c(1:2),
  col = rgb(180,0,180,30, maxColorValue = 255)
 )

It's also too bad that people are still teaching beginners base plots instead of ggplot. What I'd really recommend is

library(ggplot2)
ggplot(adult_trim, aes(x = Income, y = Education)) +
  geom_point(position = "jitter", color = "hotpink3", alpha = 0.2)

And lastly, there are important differences between warnings (which your code shows) and errors (which you say you have, but don't). A warning means your code executed, but there may have been problems, so it warns you to check carefully. An error means that your code could not be executed: nothing was changed, and you need to fix it before it will run.
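A short sketch of the difference, using base R calls unrelated to the plotting code (the message strings are made up for the example):

```r
# A warning: the call still runs and returns a value (NA here).
# suppressWarnings() only hides the message; the result is still produced.
converted <- suppressWarnings(as.numeric("abc"))
print(is.na(converted))

# An error: the call stops, so no value is produced unless the error is caught
outcome <- tryCatch(
  stop("something went wrong"),
  error = function(e) "no result, the code did not finish"
)
print(outcome)
```

In the first case you get a result that you should double-check; in the second you only get a value because `tryCatch()` supplied one.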

Gregor Thomas
  • Thank you I was able to get it working. If you don't mind I encountered another problem. The same code and everything is used but then it executes "adult_trim$Income_Binary = ifelse(adult_trim$Income == ">50K", 1, 0)" and just fills the vector with only 0s and doesn't do the comparison correctly. Any idea as to what it might be? By default the code has Income as levels 1 & 2 and needs to convert to binary for training the dataset – a mitchell Jun 30 '22 at 21:10
  • Numbers in R don't have quotes and letters in them. (Except for `e`, which is used for scientific notation.) "50k" is a character string, not a number, and character strings are ordered alphabetically. If you want to treat income as a number, you need to convert it to numeric. [Here's a question where someone had a similar problem](https://stackoverflow.com/q/36806215/903061). – Gregor Thomas Jul 01 '22 at 10:45