I have a dataframe called data from which I am trying to identify any outlier prices.

The data frame head looks like:

         Date  Last Price
0  29/12/2017     487.74
1  28/12/2017     422.85
2  27/12/2017     420.64
3  22/12/2017     492.76
4  21/12/2017     403.95

I have found some code, which I need to adjust slightly for my data, that loads the data, standardises the time series with a scaler, and fits a one-class SVM to flag anomalies. The code looks like:

    rawData = pd.read_csv(path)
    data = rawData['Last Price']
    scaler = StandardScaler()
    np_scaled = scaler.fit_transform(data)
    data = pd.DataFrame(np_scaled)
    # train oneclassSVM 
    outliers_fraction = 0.01
    model = OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma=0.01)
    model.fit(data)
    data['anomaly3'] = pd.Series(model.predict(data))

    fig, ax = plt.subplots(figsize=(10,6))
    a = data.loc[data['anomaly3'] == -1, ['date_time_int', 'Last Price']] #anomaly

    ax.plot(data['date_time_int'], data['Last Price'], color='blue')
    ax.scatter(a['date_time_int'],a['Last Price'], color='red')
    plt.show();

    def getDistanceByPoint(data, model):
        distance = pd.Series()
        for i in range(0,len(data)):
            Xa = np.array(data.loc[i])
            Xb = model.cluster_centers_[model.labels_[i]-1]
            distance.set_value(i, np.linalg.norm(Xa-Xb))
        return distance

However, I get the error message:

ValueError: Expected 2D array, got 1D array instead:
array=[487.74 422.85 420.64 ... 461.57 444.33 403.84].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

and I am unsure where I need to reshape the array.

For reference, here is the traceback:

 File "<ipython-input-23-628125407694>", line 1, in <module>
    runfile('C:/Users/stacey/Downloads/techJob.py', wdir='C:/Users/stacey/Downloads')

  File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)

  File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/staceyDownloads/techJob.py", line 92, in <module>
    main()

  File "C:/Users/stacey/Downloads/techJob.py", line 56, in main
    np_scaled = scaler.fit_transform(data)

  File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\sklearn\base.py", line 464, in fit_transform
    return self.fit(X, **fit_params).transform(X)

  File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\sklearn\preprocessing\data.py", line 645, in fit
    return self.partial_fit(X, y)

  File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\sklearn\preprocessing\data.py", line 669, in partial_fit
    force_all_finite='allow-nan')

  File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\sklearn\utils\validation.py", line 552, in check_array
    "if it contains a single sample.".format(array))

ValueError: Expected 2D array, got 1D array instead:
array=[7687.77 7622.88 7620.68 ... 5261.57 5244.37 5203.89].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
halfer
Stacey
  • Python always gives you a traceback showing the source of the problem. Please copy it into your question. – John Zwinck Nov 26 '19 at 11:47
  • Here is your problem. You can find the exact line in the traceback: `File "C:/Users/stacey/Downloads/SIGtechJob.py", line 56, in main np_scaled = scaler.fit_transform(data)` – CAPSLOCK Nov 26 '19 at 12:08
  • Do you understand what `sklearn` means by `features`? It has certain conventions for the data inputs. It's a good idea to study those; otherwise you could end up patching your code, one error at a time, without understanding why. – hpaulj Nov 26 '19 at 17:33
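
To illustrate the convention the last comment refers to: scikit-learn estimators expect a 2D input of shape (n_samples, n_features), whereas selecting a single column from a data frame gives a 1D Series. The following is only a minimal sketch; it reuses the five prices from the question's data frame head as sample values and is not taken from the asker's script.

```python
import pandas as pd

# The five prices shown in the question's data frame head.
data = pd.DataFrame({"Last Price": [487.74, 422.85, 420.64, 492.76, 403.95]})

# Single brackets select a 1D Series: this is what triggers the ValueError.
series = data["Last Price"]
print(series.shape)

# reshape(-1, 1) turns it into a 2D array: 5 samples, 1 feature.
column = series.to_numpy().reshape(-1, 1)
print(column.shape)

# Double brackets select a one-column DataFrame, which is already 2D.
frame = data[["Last Price"]]
print(frame.shape)
```

Either `column` or `frame` has the two-dimensional layout that the error message asks for.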

1 Answer

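You should be able to fix the error by replacing this line:

    np_scaled = scaler.fit_transform(data)

with this:

    np_scaled = scaler.fit_transform(data.values.reshape(-1,1))

At that point `data` is a 1D Series, while `StandardScaler` expects a 2D array of shape (n_samples, n_features); `reshape(-1,1)` turns the prices into a single-feature column.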
CAPSLOCK
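
As a rough end-to-end sketch of the fix in context, the snippet below applies the reshape and then fits the same `OneClassSVM` settings used in the question. The prices are synthetic (generated with a fixed seed purely for illustration, since the original CSV is not available), and the name `raw` is an assumption introduced here, not something from the asker's script.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the CSV in the question (values are made up).
rng = np.random.default_rng(0)
raw = pd.DataFrame({"Last Price": rng.normal(loc=450.0, scale=30.0, size=200)})

# Reshape the 1D column into (n_samples, 1) before scaling.
scaler = StandardScaler()
np_scaled = scaler.fit_transform(raw["Last Price"].values.reshape(-1, 1))

# Same model settings as in the question.
model = OneClassSVM(nu=0.01, kernel="rbf", gamma=0.01)
model.fit(np_scaled)

# predict returns +1 for inliers and -1 for outliers.
raw["anomaly3"] = model.predict(np_scaled)
outliers = raw.loc[raw["anomaly3"] == -1, "Last Price"]

print(np_scaled.shape)
print(len(outliers))
```

Keeping the scaled array in its own variable, rather than overwriting the data frame with `pd.DataFrame(np_scaled)`, means the `'Last Price'` column is still there when it comes to selecting and plotting the flagged rows.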