
I have this plot

[image: current plot]

Now I want to add a trend line to it, how do I do that?

The data looks like this:

[image: screenshot of the data]

I wanted to just plot how the median listing price in California has gone up over the years so I did this:

# Get California data: keep only rows whose ZipName contains ", CA"
state_ca = []
state_median_price = []
state_ca_month = []
for state, price, date in zip(data['ZipName'], data['Median Listing Price'], data['Month']):
    if ", CA" in state:
        state_ca.append(state)
        state_median_price.append(price)
        state_ca_month.append(date)

Then I converted the string state_ca_month to datetime:

# Convert state_ca_month to datetime
from datetime import datetime
state_ca_month = [datetime.strptime(x, '%m/%d/%Y %H:%M') for x in state_ca_month]

Then I plotted it:

# Plot trends
import matplotlib.pyplot as plt
plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
plt.plot(state_ca_month, state_median_price)
plt.show()

I thought of adding a trendline or some type of line but I am new to visualization. If anyone has any other suggestions I would appreciate it.

Following the advice in the comments, I get this scatter plot:

[image: scatter plot]

I am wondering if I should further format the data to make a clearer plot to examine.

Wolfy
  • Why are you using bars instead of a scatter plot for data that you want a trendline for? – Reedinationer Mar 22 '19 at 17:09
  • Even if you would have any trendline command - what kind of trend would you expect derived from these data? Do you think this characteristic looks as if one of its most important properties was a meaningful trend? – SpghttCd Mar 22 '19 at 17:11
  • @Reedinationer if you look closer: that's not a bar plot - it's a normal line plot... – SpghttCd Mar 22 '19 at 17:12
  • @SpghttCd yes that's the term I was looking for. I'm sure everybody could understand my intent with the previous question though... – Reedinationer Mar 22 '19 at 17:13
  • @SpghttCd Okay perhaps just a line fitted to the top of the bars? – Wolfy Mar 22 '19 at 17:14
  • Wolfy try replacing `plt.plot(state_ca_month, state_median_price)` with `plt.scatter(state_ca_month, state_median_price)` so it only shows points and doesn't draw lines between them. I think this will give a much more clear view of your data to start from – Reedinationer Mar 22 '19 at 17:46
  • It appears you are plotting (I assume house) prices for a whole range of zip codes over time. I've found moving averages are a good way to see trends over time, but they'd only make sense for each zip code, i.e. one trend line per zip code. There is a moving average answer [here](https://stackoverflow.com/questions/14313510/how-to-calculate-moving-average-using-numpy). You can generate a trend line for one data series. To get **one** line you need **one** average data series from all the data series. – Tls Chris Mar 22 '19 at 21:24
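A minimal sketch of the suggestion in the last comment, assuming the data is available as a pandas DataFrame: collapse all California zip codes into one value per month, then smooth that single series with a moving average. The DataFrame below is a made-up stand-in; only the column names come from the question, and the choice of the median and of a 2-month window is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the question's `data`; only the column names
# are taken from the question, the values are invented.
data = pd.DataFrame({
    "ZipName": ["Los Angeles, CA", "San Diego, CA", "Austin, TX",
                "Los Angeles, CA", "San Diego, CA", "Austin, TX",
                "Los Angeles, CA", "San Diego, CA", "Austin, TX"],
    "Median Listing Price": [500000, 400000, 300000,
                             520000, 410000, 305000,
                             540000, 430000, 310000],
    "Month": ["01/01/2018 00:00", "01/01/2018 00:00", "01/01/2018 00:00",
              "02/01/2018 00:00", "02/01/2018 00:00", "02/01/2018 00:00",
              "03/01/2018 00:00", "03/01/2018 00:00", "03/01/2018 00:00"],
})

# Keep only the California rows and parse the month strings.
ca = data[data["ZipName"].str.contains(", CA", regex=False)].copy()
ca["Month"] = pd.to_datetime(ca["Month"], format="%m/%d/%Y %H:%M")

# Collapse all zip codes into one value per month (here: the median).
monthly = ca.groupby("Month")["Median Listing Price"].median().sort_index()

# Smooth the single series with a moving average (window size is arbitrary).
smoothed = monthly.rolling(window=2, min_periods=1).mean()
```

To draw it, something like `plt.scatter(ca["Month"], ca["Median Listing Price"])` for the raw points followed by `plt.plot(smoothed.index, smoothed.values)` should put one line over the cloud of points.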

1 Answer


If by "trend line" you mean a literal line, then you probably want to fit a linear regression to your data. sklearn provides this functionality in Python.

From the example hyperlinked above:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()


# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Coefficient of determination (R^2): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

To clarify, "the overall trend" is not a well-defined thing. Many times, by "trend", people mean a literal line that "fits" the data well. By "fits the data", in turn, we mean "predicts the data." Thus, the most common way to get a trend line is to pick a line that best predicts the data that you have observed. As it turns out, we even need to be clear about what we mean by "predicts". One way to do this (and a very common one) is by defining "best predicts" in such a way as to minimize the sum of the squares of all of the errors between the "trend line" and the observed data. This is called ordinary least squares linear regression, and is one of the simplest ways to obtain a "trend line". This is the algorithm implemented in sklearn.linear_model.LinearRegression.
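As a rough sketch of how this could apply to the data in the question: a regression needs numeric inputs, so the dates have to be turned into numbers first. The version below uses each date's ordinal (a day count), which is one possible choice and not the only one. The two lists are hypothetical stand-ins that reuse the question's variable names; the values are invented.

```python
from datetime import datetime

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-ins for the question's lists; values are made up.
state_ca_month = [datetime(2018, m, 1) for m in range(1, 7)]
state_median_price = [450000, 455000, 462000, 470000, 474000, 481000]

# Express each date as a day count so it can be used as a regression input.
X = np.array([d.toordinal() for d in state_ca_month], dtype=float).reshape(-1, 1)
y = np.array(state_median_price, dtype=float)

# Ordinary least squares: minimise the sum of squared errors between line and data.
regr = LinearRegression().fit(X, y)

# Points on the fitted line, one per original date.
trend = regr.predict(X)

# Slope of the line in price units per day.
slope_per_day = regr.coef_[0]
```

Plotting `trend` against `state_ca_month` with `plt.plot` on top of the scatter plot gives the literal trend line.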

Him
  • I am not sure how this helps, I am not trying to predict anything. I just wanted to visualize the data to show an overall trend or whatever. – Wolfy Mar 22 '19 at 18:12
  • Yes, correct so far. But I don't understand why, in times when _ml_, _ai_, _dl_... are not only on everybody's lips but are also sold as the versatile solution for everything, a _machine learning lib_ gets loaded and promoted even for a _simple linear regression_. NumPy has done this fine for years, and SciPy does it as well. I don't want to offend anyone, but this seems a fitting moment to ask this crazy little thing called the internet: WTF...? – SpghttCd Mar 22 '19 at 19:08
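For reference, a minimal sketch of the NumPy-only route mentioned in the last comment, using `numpy.polyfit` with degree 1. The lists are hypothetical stand-ins that reuse the question's variable names, and centring the day counts before fitting is my own choice to keep the fit well conditioned, not something from the thread.

```python
from datetime import datetime

import numpy as np

# Hypothetical stand-ins for the question's lists; values are made up.
state_ca_month = [datetime(2018, m, 1) for m in range(1, 7)]
state_median_price = [450000, 455000, 462000, 470000, 474000, 481000]

# Dates as day counts, prices as floats.
x = np.array([d.toordinal() for d in state_ca_month], dtype=float)
y = np.array(state_median_price, dtype=float)

# Centre x: date ordinals are large (around 7e5), which conditions the fit poorly.
x0 = x - x.mean()

# Degree-1 polynomial fit is an ordinary least squares straight line.
slope, intercept = np.polyfit(x0, y, deg=1)

# Points on the fitted line, one per original date.
trend = slope * x0 + intercept
```

Because `x0` is centred, `intercept` here is just the mean price, and `slope` is the change in price per day. `plt.plot(state_ca_month, trend)` draws the line over the scatter.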