Poorly fitting curve in natural log regression

Question

I'm fitting a logarithmic curve to 20+ data sets using the equation

y = intercept +  coefficient * ln(x)

Generated in R via

output$curvePlot <- renderPlot ({
    x=medianX
    y=medianY
    Estimate = lad(formula = y~log(x),method = "EM")
    logEstimate = lad(formula = y~log(x),method = "EM")
    plot(x,predict(Estimate),type='l',col='white')
    lines(x,predict(logEstimate),col='red')
    points(x,y)
    cf <- round(coef(logEstimate),1)
    eq <- paste0("y = ", cf[1],
        ifelse(sign(cf[2])==1, " + ", " - "), abs(cf[2]), " * ln(x) from 0 to ",xmax)
    mtext(eq,3,line=-2,col = "red")
    output$summary <- renderPrint(summary(logEstimate))
    output$calcCurve <- 
        renderPrint(round(cf[2]*log(input$calcFeet)+cf[1]))

    })

The curve consistently "crosses twice" on the data; fitting too low at low/high points on the X axis, fitting too high at the middle of the X axis.

I don't really understand where to go from here. Am I missing a factor or using the wrong curve?

The dataset is about 60,000 rows long, but I condensed it into medians. Medians were selected due to unavoidable outliers in the data, particularly a thick left tail, caused by our instrumentation.

x,y
2,6.42
4,5.57
6,4.46
8,3.55
10,2.72
12,2.24
14,1.84
16,1.56
18,1.33
20,1.11
22,0.92
24,0.79
26,0.65
28,0.58
30,0.34
32,0.43
34,0.48
36,0.38
38,0.37
40,0.35
42,0.32
44,0.21
46,0.25
48,0.24
50,0.25
52,0.23

Full methodology for context:

Samples of dependent variable, velocity (ft/min), were collected at various distances from fan nozzle with a NIST-calibrated hot wire anemometer. We controlled for instrumentation accuracy by subjecting the anemometer to a weekly test against a known environment, a pressure tube with a known aperture diameter, ensuring that calibration was maintained within +/- 1%, the anemometer’s published accuracy rating.

We controlled for fan alignment with the anemometer down the entire length of the track using a laser from the center of the fan, which aimed no more than one inch from the center of the anemometer at any distance.

While we did not explicitly control for environmental factors, such as outdoor air temperature, barometric pressure, we believe that these factors will have minimal influence on the test results. To ensure that data was collected evenly in a number of environmental conditions, we built a robot that drove the anemometer down the track to a different distance every five minutes. This meant that data would be collected at every independent variable position repeatedly, over the course of hours, rather than at one position over the course of hours. As a result, a 24 hour test would measure the air velocity at each distance over 200 times, allowing changes in temperature as the room warmed or cooled throughout the day to address any confounding environmental factors by introducing randomization.

The data was collected via Serial port on the hot wire anemometer, saving a timestamped CSV that included fields: Date, Time, Distance from Fan, Measured Temperature, and Measured Velocity. Analysis on the data was performed in R.

Testing: To gather an initial set of hypotheses, we took the median of air velocity at each distance. The median was selected, rather than the mean, as outliers are common in data sets measuring physical quantities. As air moves around the room, it can cause the airflow to temporarily curve away from the anemometer. This results in outliers on the low end that do not reflect the actual variable we were trying to measure. It’s also the case that, sometimes, the air velocity at a measured distance appears to “puff,” or surge and fall. This is perceptible by simply standing in front of the fan, and it happens on all fans at all distances, to some degree. We believe the most likely cause of this puffing is due to eddy currents and entrainment of the surrounding air, temporarily increasing airflow. The median result absolves us from worrying about how strong or weak a “puff” may feel, and it helps limit the effects on air speed of the air curving away from the anemometer, which does not affect actual air velocity, but only measured air velocity. With our initial dataset of medians, we used logarithmic regression to calculate a curve to match the data and generated our initial velocity profiles at set distances. To validate that the initial data was accurate, we ran 10 monte carlo folding simulations at 25% of the data set and ensured that the generated medians were within a reasonable value of each other.

Validation: Fans were run every three months and the monte carlo folding simulations were observed. If the error rate was <5% from our previous test, we validated the previous test.

Where does the `lad` function come from? It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. I'm not sure I see a programming question here. If you need help with model fitting, you should probably ask at [stats.se] instead. — MrFlick, Apr 08 '19 at 15:45
You should provide your data so we can figure out what is wrong. — Jet, Apr 08 '19 at 15:45
I'm also confused, in your first equation you have `y` as a function of `ln(x)`. But your R equation you have `x` as a function of `ln(y)`. Your plot label has `y` on the x-axis---not clear if the data is logged or not---but the equation on the plot is back to `ln(x)`. Makes me think you're mixing up your x and your y in at least one place, and that could be the problem. — Gregor Thomas, Apr 08 '19 at 15:49
Providing some data would indeed be useful. From what I see, it fitted the curve as best as it could with the equation you provided. Maybe the data does not have logarithmic behavior? Have you tried other equations? — Gilles-Philippe Paillé, Apr 08 '19 at 15:58
Thanks for the data. Can you also please address the other comments, (a) are you mixing up your variables? Can you say definitively which is the predictor (on the x-axis), which is the response (on the y-axis) and which one you want to log? and (b) Where is the `lad()` function from? When I search [rdocumentation.org for `lad`](https://www.rdocumentation.org/search?q=lad) I see functions of that name in the packages `MTE`, `L1pack`, `lidR`, `Blossom`, and more. Which one are you using? — Gregor Thomas, Apr 08 '19 at 17:04
Working through these piecemeal. lad is the least absolute deviation regression (Cade and Richards 1996). — smithandteam, Apr 08 '19 at 17:06
Re y/x, the axes are reversed in the plot for visualization. points(y,x). I had forgotten that I did that. I'm editing the question with axes un-reversed. — smithandteam, Apr 08 '19 at 17:12

Gilles-Philippe Paillé · Answer 1 · 2019-04-09T15:16:12.737

There is no problem with the code itself, you found the best possible fit using a logarithmic curve. I double-checked using Mathematica, and I obtain the same results.

The problem seems to reside in your model. From the data you provided and the description of the origin of the data, the logarithmic function might not the best model for your measurements. The description indicates that the velocity must be a finite value at x=0, and slowly tends towards 0 while going to infinity. However, the negative logarithmic function will be infinite at x=0 and negative after a while.

I am not a physicist, but my intuition would tend towards using the inverse-square law or using the exponential function. I tested both, and the exponential function gives way better results:

Poorly fitting curve in natural log regression

1 Answers1