Time Series Plot reproducing incorrect values in R

Question

I am attempting to plot a simple time series of the dataset found here: https://datamarket.com/data/set/22qf/monthly-champagne-sales-in-1000s-p273-montgomery-fore-ts#!ds=22qf&display=line

Here is the code that I am using:

setwd("~/Desktop")
sales <- read.csv("~/Desktop/monthly-champagne-sales-in-1000s.csv", header=FALSE)
attach(sales)
msale <- ts(sales, frequency=12, start=c(1950,1))
plot(msale)
plot <- ts(V1,V2)

Both of my attempts to plot the time series above have failed. The sales column contains values in the 200-5000 range, but the plots show values between 5 and 80. I suspected something was wrong with the sales column, so I ran the following in the console

View(sales$V2)

the results yielded this:

structure(c(23L, 20L, 22L, 21L, 27L, 31L, 16L, 15L, 25L, 60L, 81L, 88L, 18L, 17L, 30L, 37L, 46L, 35L, 29L, 13L, 41L, 61L, 84L, 91L, 33L, 28L, 53L, 39L, 48L, 51L, 36L, 8L, 40L, 76L, 89L, 92L, 79L, 32L, 44L, 63L, 64L, 65L, 43L, 9L, 70L, 80L, 90L, 2L, 42L, 59L, 55L, 54L, 67L, 71L, 50L, 11L, 75L, 86L, 95L, 4L, 52L, 49L, 62L, 57L, 73L, 69L, 39L, 14L, 78L, 85L, 3L, 7L, 19L, 24L, 38L, 45L, 26L, 51L, 56L, 12L, 77L, 83L, 93L, 6L, 47L, 34L, 58L, 68L, 74L, 72L, 66L, 10L, 82L, 87L, 94L, 5L), .Label = c("", "10651", "10803", "11331", "12670", "13076", "13916", "1573", "1643", "1659", "1723", "1738", "1759", "1821", "2212", "2282", "2475", "2541", "2639", "2672", "2721", "2755", "2851", "2899", "2922", "2927", "2946", "3006", "3028", "3031", "3036", "3088", "3113", "3162", "3230", "3260", "3266", "3370", "3523", "3528", "3595", "3633", "3663", "3718", "3740", "3776", "3934", "3937", "3957", "3965", "3986", "4016", "4047", "4121", "4154", "4217", "4276", "4286", "4292", "4301", "4474", "4510", "4514", "4520", "4539", "4633", "4647", "4676", "4677", "4739", "4753", "4874", "4968", "5010", "5048", "5211", "5221", "5222", "5375", "5428", "5764", "5951", "6424", "6838", "6873", "6922", "6981", "7132", "7614", "8314", "8357", "9254", "9842", "9851", "9858", "Monthly champagne sales (in 1000's) (p.273: Montgomery: Fore. & T.S.)"), class = "factor")

Can someone explain what structure(c(...)) is and why it is skewing the data so extensively?

@karthikbharadwaj sorry, I am new to this site. What do you mean by that? — chaimocha, Jun 02 '16 at 23:24
Your numbers are not numbers, but a factor because there's a text entry in one of the levels, maybe because there actually was a header. Convert to an integer vector with `as.integer(as.character(x))` or fix the original issue. — alistaire, Jun 02 '16 at 23:26
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example?rq=1 — karthikbharadwaj, Jun 02 '16 at 23:30
@alistaire Okay, I removed the rows containing the text headers using `sales=sales[-1,]` and `sales=sales[-nrow(sales),]`, then ran `as.integer(as.character(V2))`. It printed a list of numerical values with the last one being NA, but the plots still look the same. Did I make another error? — chaimocha, Jun 02 '16 at 23:40
Are you plotting the integers you created? Wrapped into one, it should be something like `plot(ts(as.integer(as.character(sales$V2)), frequency=12, start=c(1950,1)))` — alistaire, Jun 02 '16 at 23:45
@karthikbharadwaj I'm still not sure what needs to be fixed. I pasted the code exactly as I have it in R. — chaimocha, Jun 02 '16 at 23:46
@alistaire Thank you! I thought the code you gave me changed the actual column in R. This worked perfectly. Sorry, I am very new to R. If you don't mind me asking another question, is there any way that I could change the values in the actual dataset "sales" so that I could run the time series plot using the msale line in my code above? — chaimocha, Jun 02 '16 at 23:49
The problem seems to be in read.csv(). Have a good read of this: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html and look closely at colClasses. You might want to give R hints about what it's importing. Something like: `read.table("your file name", sep=",", header=FALSE, colClasses=c("character", "numeric"))` with the colClasses terms set up for your data. — Jason, Jun 02 '16 at 23:53
Also: when read.csv() starts giving me headaches, I sometimes switch to read.table(). It does less smart things for you, but since it runs with fewer assumptions, there are fewer bugs caused by wrong assumptions. — Jason, Jun 02 '16 at 23:57
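
The factor behaviour described in the comments above can be seen in a small sketch. The values here are made up for illustration and are not read from the actual CSV file; the point is only that converting a factor straight to integer returns its internal level codes, which is most likely why the plot showed small numbers instead of the sales figures.

```r
# Hypothetical stand-in for the sales column: numbers stored as text,
# plus one stray text entry (as a header or label row would produce)
x <- factor(c("2851", "2672", "10651", "some label"))

# as.integer() on a factor returns the internal level codes (1, 2, 3, ...),
# not the numbers printed in the labels
codes <- as.integer(x)

# Converting to character first recovers the numbers;
# the text entry cannot be parsed and becomes NA
values <- suppressWarnings(as.integer(as.character(x)))

print(codes)
print(values)
```

Neither call modifies `x` itself; the result has to be assigned back (for example `sales$V2 <- ...`) before the data frame is plotted.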

score 0 · Accepted Answer · answered Jun 03 '16 at 00:00

If you look at the CSV file, there's an extra row at the bottom with labels that is causing your data to get read in as character (and therefore factor) instead of integer. Set nrows in read.csv to stop it from including that row.

df <- read.csv('monthly-champagne-sales-in-1000s.csv', nrows = 96)

# clean up names
names(df) <- c('month', 'sales')

# plot with something like
plot(ts(df$sales, frequency=12, start=c(1950,1)), ylab = 'sales')
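
For the follow-up question in the comments about fixing the data frame itself, here is a minimal sketch of both routes. The CSV text, its column names, and the trailing "Total,n/a" row are invented for illustration and only imitate a file with a non-numeric last row; `stringsAsFactors = TRUE` is set explicitly to reproduce the factor behaviour regardless of R version.

```r
# Made-up CSV text: header, three data rows, and a trailing non-numeric row
csv_text <- "Month,Sales
1950-01,2851
1950-02,2672
1950-03,2755
Total,n/a"

# Reading every row pulls the trailing text in, so Sales becomes a factor
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Route 1: stop reading before the trailing row
df <- read.csv(text = csv_text, nrows = 3)

# Route 2: repair a data frame that has already been loaded
fixed <- raw
fixed$Sales <- suppressWarnings(as.integer(as.character(fixed$Sales)))
fixed <- fixed[!is.na(fixed$Sales), ]  # drop the row that failed to parse

# Either way the column is now numeric and can be turned into a series
msale <- ts(fixed$Sales, frequency = 12, start = c(1950, 1))
print(msale)
```

With the real file, the same idea would apply to the `sales` data frame: overwrite the column with the converted values, drop the NA row, and then build the `ts` object from that column.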
