9

I have some time-series data that I'm fitting a loess curve in ggplot2, as seen attached. The data takes the shape of an "S" curve. What I really need to find out is the date where the data starts to level off, which looks to be right around time '550' or '600'

Is there some kind of quantitative way that this can be marked off in the graph?

A link to the dataset: http://dl.dropbox.com/u/75403/stover_data.txt

A dput() of the dataset:

structure(list(date = c(211L, 213L, 215L, 217L, 218L, 221L, 222L, 
223L, 224L, 225L, 226L, 228L, 229L, 230L, 231L, 232L, 233L, 234L, 
235L, 236L, 237L, 238L, 239L, 240L, 241L, 242L, 244L, 246L, 247L, 
248L, 249L, 250L, 251L, 253L, 254L, 255L, 256L, 258L, 259L, 260L, 
261L, 262L, 263L, 264L, 265L, 266L, 267L, 268L, 269L, 270L, 271L, 
272L, 273L, 274L, 275L, 276L, 277L, 278L, 279L, 281L, 282L, 283L, 
285L, 286L, 287L, 288L, 290L, 291L, 292L, 293L, 294L, 295L, 296L, 
297L, 298L, 299L, 300L, 301L, 302L, 304L, 305L, 306L, 307L, 308L, 
309L, 310L, 311L, 312L, 313L, 314L, 315L, 316L, 317L, 318L, 319L, 
320L, 321L, 322L, 323L, 324L, 325L, 326L, 327L, 328L, 329L, 330L, 
331L, 332L, 333L, 334L, 335L, 336L, 337L, 338L, 339L, 340L, 341L, 
342L, 343L, 344L, 345L, 346L, 347L, 348L, 349L, 350L, 351L, 352L, 
353L, 354L, 355L, 356L, 357L, 358L, 359L, 360L, 361L, 362L, 363L, 
364L, 365L, 366L, 367L, 368L, 369L, 370L, 371L, 372L, 373L, 374L, 
375L, 376L, 377L, 378L, 379L, 380L, 381L, 382L, 383L, 384L, 385L, 
386L, 387L, 388L, 389L, 390L, 391L, 392L, 393L, 394L, 395L, 396L, 
397L, 398L, 399L, 400L, 402L, 404L, 405L, 406L, 407L, 408L, 410L, 
411L, 413L, 414L, 415L, 416L, 418L, 419L, 420L, 421L, 422L, 423L, 
424L, 425L, 426L, 427L, 428L, 429L, 430L, 431L, 432L, 433L, 434L, 
435L, 436L, 437L, 438L, 439L, 440L, 441L, 442L, 443L, 444L, 445L, 
446L, 447L, 448L, 449L, 450L, 451L, 452L, 453L, 455L, 456L, 457L, 
458L, 459L, 460L, 461L, 462L, 463L, 464L, 465L, 466L, 467L, 468L, 
469L, 470L, 471L, 472L, 473L, 474L, 475L, 476L, 477L, 478L, 479L, 
480L, 481L, 482L, 483L, 484L, 485L, 486L, 487L, 488L, 489L, 490L, 
491L, 492L, 493L, 494L, 495L, 496L, 497L, 498L, 499L, 500L, 501L, 
502L, 503L, 504L, 505L, 506L, 507L, 508L, 509L, 510L, 511L, 512L, 
513L, 514L, 515L, 516L, 517L, 518L, 519L, 520L, 521L, 522L, 523L, 
524L, 527L, 528L, 529L, 530L, 531L, 532L, 533L, 534L, 535L, 536L, 
537L, 538L, 539L, 540L, 541L, 544L, 545L, 546L, 547L, 548L, 549L, 
550L, 551L, 552L, 553L, 554L, 555L, 556L, 557L, 558L, 559L, 560L, 
561L, 562L, 563L, 564L, 565L, 566L, 567L, 568L, 569L, 570L, 571L, 
572L, 573L, 574L, 575L, 576L, 577L, 578L, 579L, 580L, 581L, 582L, 
583L, 587L, 588L, 589L, 590L, 591L, 592L, 593L, 594L, 595L, 596L, 
597L, 598L, 599L, 600L, 601L, 602L, 603L, 604L, 605L, 606L, 607L, 
608L, 609L, 610L, 611L, 612L, 613L, 614L, 615L, 616L, 617L, 618L, 
619L, 620L, 621L, 622L, 623L, 624L, 625L, 626L, 627L, 628L, 629L, 
630L, 631L, 632L, 634L, 635L, 636L, 637L, 638L, 639L, 640L, 641L, 
642L, 643L, 644L, 645L, 646L, 647L, 648L, 649L, 650L, 651L, 652L, 
653L, 654L, 655L, 656L, 657L, 658L, 659L, 660L, 661L, 662L, 663L, 
664L, 665L, 666L, 667L, 668L, 669L, 670L, 671L, 672L, 673L, 674L, 
675L, 676L, 677L, 678L, 679L, 680L, 681L, 684L, 685L, 686L, 687L, 
688L, 689L, 690L, 691L, 692L, 693L, 694L, 695L, 696L, 697L, 698L, 
699L, 700L, 701L, 702L, 703L, 704L, 705L, 706L, 707L, 708L, 709L, 
710L, 711L, 712L, 713L, 714L, 715L, 716L, 717L, 718L, 719L, 720L, 
721L, 722L, 723L, 724L, 725L, 726L, 727L, 728L, 729L, 730L, 731L, 
732L, 733L, 734L, 735L, 736L, 737L, 738L, 739L, 740L, 741L, 742L, 
743L, 744L, 745L, 746L, 747L, 748L, 749L, 750L, 751L, 752L, 753L, 
754L, 755L, 756L, 757L, 758L, 759L, 760L, 761L, 762L, 763L, 764L, 
765L, 766L, 767L, 768L, 769L, 770L, 771L, 772L, 773L, 774L, 775L, 
776L, 777L, 778L, 781L, 782L, 783L, 784L, 785L, 786L, 787L, 788L, 
789L, 790L, 791L, 792L, 793L, 794L, 795L, 796L, 797L, 798L, 799L, 
800L, 801L, 802L, 803L, 804L, 805L, 806L, 807L, 808L, 809L, 810L, 
811L, 812L, 813L, 814L, 815L, 816L, 817L, 818L, 819L, 820L, 821L, 
822L, 823L, 824L, 825L, 826L, 827L, 828L, 829L, 830L, 831L, 832L, 
833L, 834L, 835L, 836L, 837L, 838L, 839L, 840L, 841L), org_count = c(2L, 
1L, 3L, 1L, 1L, 1L, 2L, 1L, 1L, 3L, 2L, 5L, 3L, 2L, 1L, 4L, 1L, 
1L, 10L, 10L, 4L, 5L, 4L, 1L, 2L, 2L, 1L, 1L, 3L, 1L, 1L, 2L, 
1L, 3L, 6L, 4L, 2L, 1L, 3L, 1L, 2L, 4L, 4L, 6L, 3L, 2L, 6L, 12L, 
13L, 14L, 8L, 7L, 5L, 11L, 11L, 1L, 40L, 13L, 1L, 2L, 4L, 2L, 
5L, 2L, 1L, 2L, 3L, 5L, 1L, 3L, 4L, 1L, 4L, 7L, 12L, 3L, 3L, 
2L, 2L, 2L, 2L, 2L, 3L, 4L, 2L, 5L, 6L, 4L, 5L, 6L, 3L, 6L, 4L, 
16L, 79L, 61L, 31L, 43L, 40L, 38L, 25L, 22L, 29L, 22L, 5L, 6L, 
11L, 6L, 6L, 8L, 7L, 4L, 7L, 11L, 4L, 18L, 10L, 13L, 10L, 8L, 
12L, 14L, 11L, 22L, 13L, 16L, 16L, 6L, 5L, 11L, 17L, 11L, 11L, 
16L, 15L, 13L, 16L, 15L, 12L, 16L, 14L, 9L, 15L, 18L, 20L, 13L, 
15L, 21L, 16L, 6L, 22L, 20L, 13L, 19L, 15L, 23L, 19L, 18L, 21L, 
21L, 12L, 15L, 41L, 26L, 14L, 12L, 11L, 11L, 9L, 9L, 8L, 7L, 
5L, 2L, 7L, 6L, 2L, 3L, 4L, 2L, 2L, 1L, 7L, 3L, 3L, 4L, 2L, 3L, 
1L, 2L, 1L, 2L, 2L, 2L, 6L, 5L, 7L, 8L, 6L, 5L, 8L, 6L, 5L, 5L, 
4L, 4L, 8L, 5L, 3L, 6L, 6L, 6L, 6L, 5L, 6L, 4L, 1L, 4L, 2L, 5L, 
1L, 2L, 1L, 1L, 1L, 2L, 3L, 5L, 1L, 1L, 3L, 3L, 4L, 3L, 4L, 6L, 
6L, 1L, 2L, 3L, 6L, 4L, 7L, 17L, 6L, 5L, 2L, 4L, 6L, 8L, 1L, 
3L, 2L, 4L, 4L, 2L, 3L, 4L, 3L, 3L, 7L, 9L, 6L, 14L, 12L, 12L, 
6L, 15L, 33L, 19L, 13L, 17L, 12L, 16L, 10L, 7L, 7L, 6L, 20L, 
20L, 8L, 14L, 9L, 22L, 21L, 6L, 6L, 8L, 54L, 44L, 22L, 21L, 14L, 
13L, 64L, 34L, 26L, 21L, 61L, 43L, 47L, 42L, 37L, 57L, 46L, 38L, 
33L, 32L, 51L, 76L, 36L, 31L, 45L, 35L, 27L, 17L, 17L, 12L, 7L, 
77L, 69L, 18L, 28L, 37L, 35L, 40L, 47L, 36L, 37L, 33L, 17L, 24L, 
13L, 19L, 28L, 22L, 27L, 49L, 37L, 25L, 30L, 35L, 20L, 16L, 20L, 
10L, 15L, 67L, 35L, 32L, 28L, 48L, 66L, 76L, 68L, 38L, 16L, 18L, 
37L, 29L, 37L, 53L, 31L, 30L, 20L, 48L, 36L, 35L, 31L, 33L, 16L, 
13L, 32L, 56L, 47L, 32L, 39L, 20L, 27L, 53L, 62L, 60L, 49L, 41L, 
17L, 25L, 26L, 42L, 33L, 48L, 34L, 25L, 24L, 51L, 31L, 44L, 37L, 
27L, 17L, 35L, 32L, 34L, 28L, 28L, 28L, 28L, 53L, 48L, 58L, 49L, 
25L, 25L, 34L, 33L, 63L, 75L, 112L, 74L, 29L, 36L, 36L, 42L, 
42L, 44L, 49L, 16L, 24L, 27L, 47L, 40L, 37L, 33L, 13L, 25L, 31L, 
45L, 40L, 53L, 51L, 30L, 41L, 43L, 60L, 46L, 39L, 24L, 39L, 48L, 
59L, 43L, 71L, 31L, 21L, 37L, 45L, 41L, 45L, 34L, 19L, 19L, 25L, 
45L, 40L, 28L, 33L, 19L, 25L, 25L, 31L, 25L, 29L, 31L, 30L, 27L, 
40L, 31L, 25L, 42L, 29L, 18L, 11L, 27L, 34L, 35L, 59L, 32L, 23L, 
22L, 29L, 38L, 39L, 35L, 47L, 21L, 16L, 33L, 22L, 15L, 18L, 16L, 
20L, 16L, 36L, 44L, 58L, 35L, 21L, 20L, 14L, 55L, 34L, 30L, 40L, 
27L, 34L, 31L, 47L, 53L, 42L, 59L, 55L, 41L, 43L, 29L, 26L, 32L, 
40L, 33L, 28L, 27L, 47L, 40L, 52L, 48L, 58L, 38L, 35L, 29L, 37L, 
19L, 19L, 22L, 15L, 16L, 21L, 31L, 25L, 31L, 23L, 32L, 30L, 80L, 
45L, 49L, 32L, 18L, 29L, 35L, 23L, 27L, 21L, 21L, 29L, 43L, 106L, 
58L, 117L, 49L, 28L, 24L, 43L, 49L, 34L, 23L, 28L, 16L, 21L, 
45L, 37L, 29L, 32L, 26L, 16L, 18L, 26L, 24L, 21L, 18L, 16L, 23L, 
10L, 19L, 24L, 29L, 11L, 26L, 15L, 14L, 19L)), .Names = c("date", 
"org_count"), class = "data.frame", row.names = c(NA, -599L))

Graph: enter image description here

Code:

> p<-qplot(date,org_count, data=christi)

> p+stat_smooth(method="loess",size=1.5)
Uwe
  • 41,420
  • 11
  • 90
  • 134
user1062293
  • 133
  • 3
  • 7
  • This question might be better suited for [CrossValidated](http://stats.stackexchange.com). You might even be able to find an existing answer there - [here's a related question I asked](http://stats.stackexchange.com/questions/173/time-series-for-count-data-with-counts-20), and [searching for the R package strucchange](http://stats.stackexchange.com/search?q=strucchange) and reviewing the [structural-change](http://stats.stackexchange.com/questions/tagged/structural-change) and [change-point](http://stats.stackexchange.com/questions/tagged/change-point) tags might prove helpful. – Matt Parker Nov 23 '11 at 16:45
  • I think this is more an R question than a stats question, no? He's not asking if there is a structural break; he's interested in descriptive purposes. – Xu Wang Nov 23 '11 at 16:50

2 Answers2

7

If you are asking for a way of determining the point where the curve is a maximum (i.e. flat), this is the same as finding the point where the slope of the line is at its maximum (from basic calculus).

First, read your data:

christi <- read.table("http://dl.dropbox.com/u/75403/stover_data.txt", sep="\t", header=TRUE)

Next, use loess to fit a smoothed model:

fit <- loess(org_count~date, data=christi)

Then, predict the values in your range of x-values (with predict.loess), determine the slope (diff is close enough`), and find the

x <- 200:800
px <- predict(fit, newdata=x)
px1 <- diff(px)

which.max(px1)
[1] 367

Since the start value of x is 200, this means the curve is flat at position 200+367=567.


If you wanted to plot this:

par(mfrow=c(1, 2))
plot(x, px, main="loess model")

plot(x[-1], px1, main="diff(loess model)")
abline(v=567, col="red")

enter image description here

Andrie
  • 176,377
  • 47
  • 447
  • 496
4

It all depends on what you mean by "where the data starts to level off". You need to put this into math. LOESS curves can be really bumpy depending on what bandwidth you use. You might want to modify the line below the comment marked "line A" to specify what you mean. For example, you might want to look at more than just the first previous value. You could, for example, look at the sum of the previous 5 the_diff values.

library(ggplot2)
christi <- read.table("stover_data.txt",header=TRUE)
the_fit  = loess(org_count ~ date, data=christi)
pred = predict(the_fit, christi, se=FALSE) #could change data with larger grid
with_loess <- cbind(christi,pred)

p<-qplot(date,org_count, data=christi)
the_plot <- p+stat_smooth(method="loess",size=1.5)

the_diff <- diff(with_loess$pred)

tolerance <- .1

#line A: the following line is what you want to modify.
vline <- min(with_loess$date[the_diff < -tolerance])
new_plot <- the_plot + geom_vline(xintercept=vline)

plot with vline

for example, you could do the following (replace the last part of the above code starting with the the_diff line)

the_diff <- diff(with_loess$pred, lag=20)

tolerance <- 1
vline <- min(with_loess$date[the_diff < -tolerance])
new_plot <- the_plot + geom_vline(xintercept=vline)

enter image description here

Also note that you might want to shift the the_diff vector depending on what you mean by "start to level off" (ie in the future it's going to level off, or it's already starting to level off, etc.). Also remember that the_diff is a shorter by the length of its lag argument than your dataset.

Xu Wang
  • 10,199
  • 6
  • 44
  • 78
  • Ideally, I'd like to be able to demarcate just before the slope of the loess curve is 0. So a line right around date=600. I'm a little ignorant of timeseries though. So lag is the number of days back the data will work and what does tolerance affect? – user1062293 Nov 23 '11 at 19:28
  • @user1062293 tolerance is trying to turn into math what you mean by "starts to level off" because "starts to level off" is not mathematically defined. You don't have to know anything about timeseries. `lag` is the number of days back it differences. So if lag is 20, then diff subtracts, say, the value corresponding to day 430 from the value corresponding to day 450. This is needed because LOESS is very bumpy to it goes up and down a lot. If you put lines whenever the LOESS curve went down, you would have lines all over the place! – Xu Wang Nov 24 '11 at 02:01
  • If you just want to put a line at date=600, change `geom_vline(xintercept=vline)` to `geom_vline(xintercept=600)`. Does the above explanation make sense? – Xu Wang Nov 24 '11 at 02:02
  • I think you'd be better off switching to some sort of change point model, e.g. [strucchange](http://cran.r-project.org/web/packages/strucchange/index.html) – hadley Nov 24 '11 at 02:20