
I'm processing rather large amounts of data. I ran several tests at a few fixed record counts (1 million, 10 million, and 100 million) and measured execution time with time(1). The results are in the following CSV (the columns are: number of records, extra processing enabled, elapsed time, user time, sys time, all times in seconds):

1000000,false,4.29,13.62,0.48
1000000,true,8.78,28.28,0.89
10000000,false,69.17,229.20,8.26
10000000,true,106.89,343.34,11.78
100000000,false,1053.46,3058.38,126.66
100000000,true,1255.68,4011.54,143.87
1000000,false,8.40,27.86,1.01
1000000,true,12.59,40.75,1.44
10000000,false,92.84,309.81,10.85
10000000,true,125.52,410.81,14.06
100000000,false,963.49,2935.52,116.03
100000000,true,1435.18,4238.75,154.30
1000000,false,9.12,29.94,1.14
1000000,true,12.90,42.21,1.48
10000000,false,96.32,321.50,11.65
10000000,true,122.68,400.36,13.92
100000000,false,872.66,2876.10,109.40
100000000,true,1170.53,3771.05,131.80
1000000,false,11.07,36.70,1.28
1000000,true,13.21,43.15,1.44
10000000,false,94.08,312.17,11.42
10000000,true,126.83,411.92,14.10
100000000,false,870.20,2861.60,109.60
100000000,true,1138.72,3692.30,127.56
1000000,false,8.60,28.48,1.04
1000000,true,13.14,42.88,1.48
10000000,false,87.76,290.91,10.50
10000000,true,118.03,382.60,12.80
100000000,false,858.91,2822.96,106.71
100000000,true,1190.48,3857.58,133.79
1000000,false,8.91,29.59,1.00
1000000,true,12.91,42.01,1.55
10000000,false,89.62,296.94,11.00
10000000,true,116.50,378.21,12.77
100000000,false,870.43,2858.22,109.46
100000000,true,1126.05,3641.41,127.34
1000000,false,9.46,31.40,1.20
1000000,true,11.12,36.28,1.17
10000000,false,87.26,289.12,10.78
10000000,true,115.46,372.48,12.70
100000000,false,1044.48,3029.55,121.52
100000000,true,1393.75,4083.24,147.38
1000000,false,9.75,30.62,1.24
1000000,true,14.79,45.33,1.52
10000000,false,99.32,317.52,12.20
10000000,true,150.65,428.98,16.02
100000000,false,916.92,2979.20,115.72
100000000,true,1119.58,3619.34,126.22
1000000,false,8.85,29.42,1.04
1000000,true,12.47,40.42,1.40
10000000,false,94.12,312.18,11.27
10000000,true,121.16,393.87,13.56
100000000,false,884.21,2898.08,110.16
100000000,true,1131.85,3655.16,128.92
1000000,false,8.86,29.51,1.08
1000000,true,12.32,40.12,1.21
10000000,false,89.75,298.62,10.80
10000000,true,114.46,371.82,12.69
100000000,false,868.67,2842.56,109.55
100000000,true,1139.24,3680.05,127.93

How can I predict the time to process, for example, a billion records? I'm going to use R so that I can visualize the data as well.
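
Here is how I load and plot the data in R so far (the file name `timings.csv` and the column names are my own choice, not anything standard):

```r
# Read the timing results; the file has no header row, so column
# names are assigned here ("timings.csv" is an assumed file name).
timings <- read.csv("timings.csv", header = FALSE,
                    col.names = c("records", "extra", "elapsed", "user", "sys"))

# Plot elapsed time against problem size, coloring points by
# whether extra processing was enabled.
plot(timings$records, timings$elapsed,
     col = ifelse(timings$extra, "red", "blue"),
     xlab = "Number of records", ylab = "Elapsed time (s)")
legend("topleft", legend = c("extra processing", "no extra processing"),
       col = c("red", "blue"), pch = 1)
```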

Petr Razumov
  • @ZheyuanLi: "On a multi-processor machine, a multi-threaded process or a process forking children could have an elapsed time smaller than the total CPU time - as different threads or processes may run in parallel." http://stackoverflow.com/a/556411/3656424 – Petr Razumov Jun 26 '16 at 11:52
  • @ZheyuanLi Oh, I didn't think that was important. But if it actually is: I do the data processing in [Golang](https://golang.org/) using [goroutines](https://golang.org/doc/effective_go.html#goroutines). – Petr Razumov Jun 26 '16 at 12:08
  • The question belongs on stats.stackexchange.com – user31264 Jun 26 '16 at 12:32

1 Answer


There is nothing to predict from your current data. Although you have many observations, they were collected at only three unique problem sizes: 1 million, 10 million, and 100 million.

Your data, when plotted, look like this:

*(plot: elapsed time against number of records, with and without extra processing)*

We need a regression model to make predictions, but with such data this cannot be done reliably. You need to collect data at many more problem sizes, e.g. 1, 2, 3, 4, 5, ..., 99, 100 million. For each size, collect data both with and without extra processing. Only then can we estimate how the processing time grows with problem size: for example, is the growth linear or quadratic?
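
Once you have data at many sizes, fitting and comparing candidate models in R is straightforward. A minimal sketch, assuming the results are in a data frame `timings` with columns `records`, `extra`, and `elapsed` (these names are illustrative):

```r
# Candidate models: elapsed time growing linearly or quadratically
# with problem size, with the extra-processing flag as a covariate.
fit_linear <- lm(elapsed ~ records * extra, data = timings)
fit_quad   <- lm(elapsed ~ poly(records, 2) * extra, data = timings)

# With data at many sizes, this F-test indicates whether the
# quadratic term actually improves the fit.
anova(fit_linear, fit_quad)

# Extrapolate to 1 billion records; note that predicting far
# outside the observed range is inherently unreliable.
predict(fit_quad,
        newdata = data.frame(records = 1e9, extra = FALSE),
        interval = "prediction")
```

A log-log plot can also help here: a straight line with slope k suggests growth on the order of n^k.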

Zheyuan Li
  • But how could I make a prediction once I collect more data? Sorry, my experience with data science/R is very limited. – Petr Razumov Jun 26 '16 at 11:58
  • As ZheyuanLi suggested, you need samples at regular intervals across x. First, you need to determine the resolution: should you be sampling every hundredth data point, every millionth, or max/10? Once you have figured out the sample rate, you need to figure out the model. First, visually inspect a scatter plot; second, try to fit an appropriate model to the curve. At that point you will get more relevant help over at http://stats.stackexchange.com/ – noumenal Jun 26 '16 at 12:16
  • @ZheyuanLi Ok, I'm going to update my test-running script to gather more data. You said that my data is not well designed. What do you mean by that? It lacks data for 1, 2, 3, 4, 5, ..., 99, 100 million records, as you noted in your answer, or is there anything else? – Petr Razumov Jun 26 '16 at 12:18
  • Okay, thanks all! I'm going to accept this answer, collect more data, and then ask a question on [stats.se]. – Petr Razumov Jun 26 '16 at 12:53
  • @ZheyuanLi I did more tests and re-asked my question here http://stats.stackexchange.com/q/220996/121311 – Petr Razumov Jun 28 '16 at 09:36