Where is this extra data coming from? (in R plot)

Question

I'm a biologist, but I had to teach myself Python and R while working at different places a few years ago. A situation came up at my current job that R would be really useful for, so I cobbled together a program. Surprisingly, it does just what I'd like, EXCEPT that the graphs it generates have an extra bar at the beginning.

I've entered no data to correspond to that first bar.

I'm hoping this is some simple error in how I've set the plot parameters. Could it be because I'm using plot instead of boxplot? Is it plotting the headings? More worrisome is the possibility that while reading in and merging my 3 data frames I'm creating some sort of artifact data, which would also affect the statistical tests and make me very sad, though I don't see anything like this when I have it write the matrix to a file. I greatly appreciate any help!

Here's what it looks like, and then the function it calls (in another script). (I'm really not a programmer, so I apologize if the following code is miserable.) The goal is to compare our data (which is in columns 10-17 of a csv) to all of the data in a big sheet of clinical data in turn. Then, if there is a significant correlation (the p value is less than .05), to graph the two against each other. This gives me a fast way to find if there's something worth looking further into in this big data set.

first <- read.csv(labdata)
second <- read.csv(mrntoimacskey)
third <- read.csv(imacsdata)
firsthalf<-merge(first,second)
mp <-merge(firsthalf, third, by="PATIENTIDNUMBER")

setwd(aplaceforus)
pfile2<- sprintf("%spvalues", todayis)
setwd("fulldataset")
for (m in 10:17) {
  n <- m - 9
  pretty <- pretties[n]
  for (i in 1:length(colnames(mp))) {
    tryCatch(sigsearchA(pfile2, mp, m, i, crayon = pretty),
             error = function(e) {cat("ERROR :", conditionMessage(e), "\n")})
    tryCatch(sigsearchC(pfile2, mp, m, i, crayon = pretty),
             error = function(e) {cat("ERROR :", conditionMessage(e), "\n")})
  }
}

sigsearchA <- function(n, mp, y, x, crayon = "deepskyblue") {
  # anova, plots if significant. takes name of file, name of database,
  # and the count of the columns to use for x and y
  stat <- oneway.test(mp[[y]] ~ mp[[x]])
  pval <- stat[3]
  heads <- colnames(mp)
  a <- heads[y]
  b <- heads[x]
  ps <- c(a, b, pval)
  write.table(ps, file = n, append = TRUE, sep = ",", col.names = FALSE)
  feedback <- paste(c("Added", b, "to", n), collapse = " ")
  if (pval <= 0.05 & pval > 0) {
    # horizontal labels
    callit <- paste(c(a, b, ".pdf"), collapse = "")
    val <- sprintf("p=%.5f", pval)
    pdf(callit)
    plot(mp[[x]], mp[[y]], ylab = a, main = b, col = crayon)
    mtext(val, adj = 1)
    dev.off()
    # with vertical labels, in case of many groups
    callit <- paste(c(a, b, "V.pdf"), collapse = "")
    pdf(callit)
    plot(mp[[x]], mp[[y]], ylab = a, main = b, las = 2, cex.axis = 0.7, col = crayon)
    mtext(val, adj = 1)
    dev.off()
  }
  print(feedback)
}


graphics.off()

Welcome to Stack Overflow! Can you please include data that will provide us with a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) ? — Ben Bolker, Jun 24 '16 at 16:13
Several sets do have NA values. Would they appear unlabeled like that? — labrat, Jun 24 '16 at 16:21
If I told it to replace NA with NULL, would that be any better? If I outright deleted them, wouldn't that make it compare lists of different lengths without preserving the positions of the data (and thus compare one patient's age to a different patient's test instead of his own)? — labrat, Jun 24 '16 at 16:34
Could there be one value of the x variable equal to an empty string (`""`) or a space character (`" "`)? — eipi10, Jun 24 '16 at 17:09
So, without blanks, NA and nulls, the bar goes away! But I get the same blank space on the axis. Is that information diagnostic for anybody? — labrat, Jun 24 '16 at 20:09

eipi10 · Answer 1 · 2016-06-27T14:41:24.897

I can't be absolutely certain without a reproducible example, but it looks like the x-variable in your plot (let's call it `x`, and let's assume your data frame is called `df`) has at least one row with an empty string (`""`) or maybe a space character (`" "`), and `x` is also coded as a factor. Even if you remove all of the `""` values from the data frame, the level for that value will still be part of the factor coding and will show up in plots. To remove the level, do `df$x = droplevels(df$x)` and then run your plot again.

For illustration, here's an analogous example with the built-in iris data frame:

# Shows that Species is coded as a factor
str(iris)

# Species is a factor with three levels
levels(iris$Species)

# There are 50 rows for each level of Species
table(iris$Species)

# Three boxplots, one for each level of Species
boxplot(iris$Sepal.Width ~ iris$Species)

# Now let's remove all the rows with Species = "setosa"
iris = iris[iris$Species != "setosa",]

# The "setosa" rows are gone, but the factor level remains and shows up
#  in the table and the boxplot
levels(iris$Species)
table(iris$Species)    
boxplot(iris$Sepal.Width ~ iris$Species)

# Remove empty levels
iris$Species = droplevels(iris$Species)

# Now the "setosa" level is gone from all plots and summaries
levels(iris$Species)
table(iris$Species)
boxplot(iris$Sepal.Width ~ iris$Species)
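
The same idea applied to blank values, as a rough sketch rather than a drop-in fix: the data frame `df` and its columns `x` and `y` below are made up to stand in for the merged data, and it assumes the blanks are either empty strings or whitespace only. Setting those entries to `NA` instead of deleting rows keeps every row in place, so each `y` value still lines up with its own `x`.

```r
# Hypothetical data standing in for the merged data frame
df <- data.frame(
  x = factor(c("", "low", "high", " ", "low", "high")),
  y = c(1.2, 3.4, 5.6, 2.1, 3.9, 5.0)
)

# The blank entries are real factor levels, so they get their own bar
levels(df$x)

# Mark empty or whitespace-only entries as NA; no rows are removed,
# so the positions of the remaining data are unchanged
df$x[trimws(as.character(df$x)) == ""] <- NA

# The blank levels are now unused; drop them from the factor coding
df$x <- droplevels(df$x)

# Only "high" and "low" remain as levels; the NA rows are counted separately
levels(df$x)
table(df$x, useNA = "ifany")

# Rows with NA in x are left out of the plot, and no empty slot is drawn
boxplot(y ~ x, data = df)
```

If the blank slot is still on the axis after a step like this, checking `levels()` on the column being plotted should show whether an unused level is left over.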

Thank you! I tried this but had the same problem. In the end, I think the trouble was some cells were blank and some had NA. I fixed it by manually replacing all blanks with NA in the original spreadsheets. — labrat, Jul 01 '16 at 14:13
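
For anyone who would rather not edit the spreadsheets by hand, a possible alternative is to have `read.csv` treat blank cells as `NA` at read time through its `na.strings` and `strip.white` arguments. This is only a sketch: the CSV text and column names below are invented for illustration, and `stringsAsFactors = TRUE` is set explicitly because newer versions of R no longer convert text columns to factors by default.

```r
# Hypothetical CSV content standing in for one of the spreadsheets
csv_text <- "id,group,score
1,a,2.5
2,,3.1
3,b,NA
4, ,4.0
5,a,3.3"

# Default read: blank cells in a text column become their own factor levels
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)
levels(raw$group)

# Treat empty and whitespace-only cells as NA while reading
clean <- read.csv(text = csv_text, stringsAsFactors = TRUE,
                  na.strings = c("", " ", "NA"), strip.white = TRUE)
levels(clean$group)
sum(is.na(clean$group))
```

With the blanks read in as `NA`, the factor never gets a blank level in the first place, so there is nothing to drop afterwards.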
