-3

I am trying to predict which gear a vehicle is being driven in. I have Engine_Speed and vehicle_Speed columns in the data set:

[Image: the data set]

I have tried the k-means clustering algorithm, but it didn't succeed.

Which algorithm do I have to use? And how do I implement it using Python?

Peter Mortensen
Sanjiv
  • 2
    If you have features but no labels, which seems to be the case, you have to consider unsupervised learning methods. The one that most people learn first, and the one that runs fastest, is k-means. This doesn't mean it's the best, nor does it mean it will give you good results, but you have to apply an unsupervised learning method, so that's a good place to start. – DejaVuSansMono Jun 01 '20 at 15:44
  • 1
    In what way didn't the k-means clustering algorithm succeed? – Peter Mortensen Jul 21 '20 at 12:46
  • ([Question formation](https://www.youtube.com/watch?v=t4yWEt0OSpg&t=1m49s)) – Peter Mortensen Jul 21 '20 at 12:47

2 Answers

2

Looking at the vehicle speed in relation to the engine speed, the different slopes should give the different gears.


My initial reaction would be to say that this is a linear regression problem. You don't have enough data for anything else. Looking at the data, though, we can see that it is actually two linear regression problems:

[Figure: Engine speed vs. vehicle speed]

There is an inflection point at about 700 revs, so you should design a cutoff that selects one of two regression lines, depending on whether you are above or below the cutoff.

To determine the regression in Python, you can use any number of packages. In scikit-learn it looks like this: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

The example given there, using the Python console, is

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.score(X, y)
1.0
>>> reg.coef_
array([1., 2.])
>>> reg.intercept_
3.0000...
>>> reg.predict(np.array([[3, 5]]))
array([16.])

Obviously, you need to put your own data in X and y; in fact, you would want two pairs of arrays, one for each section of your graph. You would also have two `reg = LinearRegression().fit(X, y)` expressions, and an if statement deciding which reg to use, depending on the input. The inflection point is at the intersection of your two regression lines.
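A minimal sketch of that two-regression idea follows. The data values, the `CUTOFF` constant, and the `predict_vehicle_speed` helper are made up for illustration and are not taken from the question's data set; the cutoff of 700 is simply the value read off the plot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up illustration data, NOT the data from the question
engine_speed = np.array([200, 300, 400, 500, 600, 800, 900, 1000, 1100, 1200], dtype=float)
vehicle_speed = np.array([1.5, 2.0, 2.5, 3.0, 3.5, 4.2, 4.4, 4.6, 4.8, 5.0])

CUTOFF = 700.0  # assumed break point, read off the plot by eye

# Split the data at the cutoff and fit one regression line per section
low = engine_speed < CUTOFF
reg_low = LinearRegression().fit(engine_speed[low].reshape(-1, 1), vehicle_speed[low])
reg_high = LinearRegression().fit(engine_speed[~low].reshape(-1, 1), vehicle_speed[~low])


def predict_vehicle_speed(rpm):
    """Use the regression line that belongs to the input's side of the cutoff."""
    reg = reg_low if rpm < CUTOFF else reg_high
    return float(reg.predict(np.array([[rpm]]))[0])


print(predict_vehicle_speed(450))
print(predict_vehicle_speed(1000))
```

The cutoff is hard-coded here to keep the sketch short; it could instead be computed as the intersection of the two fitted lines.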

The two regression lines have the form y = m1 x + c1 and y = m2 x + c2, where m1, m2 are the gradients of the lines and c1, c2 the intercepts. At the point of intersection, m1 x + c1 = m2 x + c2, so x = (c2 - c1) / (m1 - m2). If you don't want to do the maths, then you can use Shapely:

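from shapely.geometry import LineString

# A, B and C, D are placeholder end points (x, y) of the two lines;
# replace them with two points on each of your regression lines.
A, B = (0, 0), (10, 10)
C, D = (0, 10), (10, 0)

line1 = LineString([A, B])
line2 = LineString([C, D])

# Note: this only finds an intersection that lies within both segments
int_pt = line1.intersection(line2)
point_of_intersection = int_pt.x, int_pt.y

print(point_of_intersection)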

(taken from this answer on Stack Overflow: How do I compute the intersection point of two lines?)
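If you would rather do the maths than use Shapely, solving m1 x + c1 = m2 x + c2 for x takes only a few lines. This is a sketch; the `intersection` helper and the example gradients and intercepts are made up for illustration.

```python
def intersection(m1, c1, m2, c2):
    """Return the point where y = m1*x + c1 and y = m2*x + c2 cross."""
    if m1 == m2:
        raise ValueError("parallel lines do not intersect in a single point")
    x = (c2 - c1) / (m1 - m2)
    return x, m1 * x + c1


# Made-up gradients and intercepts, for illustration only
x, y = intersection(0.005, 0.5, 0.002, 2.6)
print(x, y)
```

With a scikit-learn model fitted on a single feature, the gradient is available as `reg.coef_[0]` and the intercept as `reg.intercept_`.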


After discussion with Sanjiv, here is the updated code (adapted from here: https://machinelearningmastery.com/clustering-algorithms-with-python/)

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

matplotlib.use('TkAgg')

df = pd.read_excel("GearPredictionSanjiv.xlsx", sheet_name='FullData')

# The ratio is undefined when the vehicle is standing still, so drop those rows
df = df[df['Vehicle_speed'] > 0].copy()

# Gear indicator: engine speed divided by vehicle speed
df['Ratio'] = (df['Engine_speed'].round() / df['Vehicle_speed']).round()

# KMeans expects a 2D input, hence the double brackets
X = df[['Ratio']]

# One cluster per gear (five gears assumed)
model = KMeans(n_clusters=5)

# Fit the model
model.fit(X)

# Assign a cluster to each example
yhat = model.predict(X)

# Plot
plt.scatter(yhat, X['Ratio'], c=yhat, cmap=plt.cm.coolwarm)

# Show the plot
plt.show()
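The cluster numbers that k-means returns are arbitrary, so they still have to be mapped to gears. One possible way, sketched below, is to sort the cluster centres by ratio and use the rule that a higher ratio means a lower gear. The ratios here are randomly generated around three assumed gear ratios purely for illustration; they are not the real data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up ratios scattered around three assumed gear ratios (illustration only)
rng = np.random.default_rng(0)
ratios = np.concatenate([rng.normal(centre, 1.0, 30) for centre in (120.0, 70.0, 45.0)])
X = ratios.reshape(-1, 1)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Sort cluster ids by descending centre: the highest ratio becomes gear 1
order = np.argsort(-model.cluster_centers_.ravel())
cluster_to_gear = {int(cluster): gear for gear, cluster in enumerate(order, start=1)}

# Translate each sample's cluster id into a gear number
gears = np.array([cluster_to_gear[int(c)] for c in model.labels_])
print(cluster_to_gear)
```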

Cody Gray - on strike
AlDante
  • I thought I had to use a clustering algorithm for that? – Sanjiv Jun 01 '20 at 07:34
  • I mean, this will give you the relationship between engine speed and vehicle speed, but it's not going to tell you what gear the vehicle is in, which is what the original question seems to be. – DejaVuSansMono Jun 01 '20 at 15:41
  • 1
    My apologies, I assumed that the gear was the column marked vehicle speed, as it has values from 2 to 5. If you are supposed to use clustering, then you need to make an assumption as to how many gears there are. Five would be a reasonable assumption, but you won't be able to distinguish between 1 and 2, as you have no data for 1, so you will actually need a four cluster model. It still makes more sense to me as a regression problem where there has been a translation error somewhere for vehicle speed. Where did the problem come from? – AlDante Jun 01 '20 at 20:14
  • @AlDante: Gear 1 is there; I have shared only part of the data. There is one rule: `ratio = engine_speed / vehicle_speed`. A higher ratio denotes gear 1, and so on. – Sanjiv Jun 02 '20 at 06:05
  • @AlDante: Any idea how to implement this using Python? – Sanjiv Jun 02 '20 at 06:11
  • 1
    Hi Sanjiv, the graph of Vehicle Speed vs. Engine speed is exactly what you have listed as the ratio. Each different slope is a different gear. It is incomprehensible to me why there is no vehicle speed increase between about 750 revs and 900 revs - you would only get that in neutral. To implement in python, calculate the slopes of the 3 working gears. Then, for any new pair of (vehicle speed, engine speed), compute the ratio as you said and then pick the gear with the nearest ratio. You can also say that anything more than about 10% away from your gear ratios is an error, or another gear. – AlDante Jun 02 '20 at 06:42
  • 1
    Also, please post the full data available. – AlDante Jun 02 '20 at 06:42
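The approach AlDante describes in the comments above (compute the ratio, pick the gear with the nearest ratio, and treat anything more than about 10% away as an error) could be sketched as follows. The values in `GEAR_RATIOS` are hypothetical placeholders; the real ones would have to be estimated from the slopes in the full data set.

```python
# Hypothetical gear ratios (engine_speed / vehicle_speed); placeholders only
GEAR_RATIOS = {1: 120.0, 2: 70.0, 3: 45.0, 4: 32.0, 5: 25.0}
TOLERANCE = 0.10  # "about 10%", as suggested in the comment


def predict_gear(engine_speed, vehicle_speed, gear_ratios=GEAR_RATIOS, tolerance=TOLERANCE):
    """Return the gear whose ratio is nearest, or None if nothing is close enough."""
    if vehicle_speed <= 0:
        return None  # ratio is undefined when the vehicle is not moving
    ratio = engine_speed / vehicle_speed
    gear, reference = min(gear_ratios.items(), key=lambda item: abs(item[1] - ratio))
    if abs(ratio - reference) / reference > tolerance:
        return None  # too far from every known gear: error or another gear
    return gear


print(predict_gear(2400, 20))
print(predict_gear(1350, 30))
```

Returning `None` for out-of-tolerance ratios is one design choice among several; such points could equally be logged or assigned to a separate "unknown" class.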
0

The question is somewhat confusing.

I assume you want to infer the vehicle speed from the engine speed. In that case, there is only one feature in this dataset (engine speed), and the class label is the vehicle speed. A simple if-then-else could actually solve the problem, but for the sake of answering your question with a machine learning approach (e.g., a decision tree), I will show how to solve this as a classification problem using scikit-learn in Python.

import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score

# scikit-learn expects features as a 2D array: np.reshape(array, (-1, 1))
# turns the 1D list into a single-column 2D array
engine_speed = np.reshape([1124, 974, 405, 865, 754, 200], (-1, 1))
# Class labels can stay 1D
vehicle_speed = np.array([5, 4, 3, 4, 4, 2])

test_engine_speed = np.reshape([1000, 900, 800, 700, 600, 500, 400], (-1, 1))
test_vehicle_speed = np.array([5, 4, 4, 4, 4, 3, 3])

clf = tree.DecisionTreeClassifier()
clf = clf.fit(engine_speed, vehicle_speed)

y_pred = clf.predict(test_engine_speed)

print(accuracy_score(test_vehicle_speed, y_pred))
print(test_vehicle_speed)  # true labels
print(y_pred)              # predicted labels

I hope this is helpful.

  • Actually, the requirement is this: I have two columns, `Engine_speed` and `Vehicle_speed`. Based on the records in these two columns, I have to find which gear my vehicle is being driven in. **The logic is `Ratio = Engine_speed / Vehicle_speed`**: the higher the ratio, the lower the gear. – Sanjiv Jun 17 '20 at 15:12