I am a newcomer to the field of Machine Learning, and I have an Excel sheet with this structure:

Columns = {date, ..., Inflation}

The first column is the date, the following columns are numeric, and the last column is inflation, which is a decimal number.

date ... Inflation
01/06/2016 ... -0.07363739
01/07/2016 ... -0.07363741

The problem is that I was asked to apply some classification algorithms (Naive Bayes, kNN, SVM, and maybe others as well) to this forecast data and compare their accuracy.

What I don't understand is how to treat this data from a classification perspective.

I did some time series analysis on the data with R and it worked, but I still can't apply the classification algorithms:

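library(readxl)  # provides read_excel()
dft <- read_excel("./data.xlsx", sheet = 1)
# Monthly series starting in June 2016
df <- ts(dft$inflation, frequency = 12, start = c(2016, 6))
plot.ts(df)
# Simple exponential smoothing: no trend (beta) and no seasonal (gamma) component
fit <- HoltWinters(df, beta = FALSE, gamma = FALSE)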

How can I work with this data for classification in R? Any help is appreciated.

Data sample : https://drive.google.com/open?id=0B1gJg-F8Gb76a1N3NVBXNFd1bjg

lazurens
  • You should be more specific, what exactly do you want to predict? Classification works, as the name suggests, with classes. If you want to predict a continuous variable, you are doing regression. Please add more details about what exactly your problem is. – meow Jul 09 '17 at 00:07
  • The target variable is 'inflation', so that is what we need to predict. Can I share part of the data? – lazurens Jul 09 '17 at 04:13

1 Answer

You could share some sample lines of your data. Basically, what you have is a regression problem, because the target is continuous. So you can either categorize the target, i.e. bin it into classes, or use regression approaches such as linear regression, penalized regression, support vector regression, etc.

In R you can categorize your variable manually (there are also packages for this) as follows:

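cut_off_high = 0.88
cut_off_low = 0.55

# Split the rows into three groups by inflation value
high_inflation = sample_dataframe[which(sample_dataframe$inflation > cut_off_high),]
medium_inflation = sample_dataframe[which(sample_dataframe$inflation > cut_off_low & sample_dataframe$inflation <= cut_off_high),]
# <= so that rows exactly equal to cut_off_low are not dropped
low_inflation = sample_dataframe[which(sample_dataframe$inflation <= cut_off_low),]

# Replace the numeric value with its class label
high_inflation$inflation = "High"
medium_inflation$inflation = "Medium"
low_inflation$inflation = "Low"

# Recombine the groups into one labelled data frame
binned_dataframe = rbind(low_inflation, medium_inflation, high_inflation)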

This is just an example so that you understand the idea of binning. In practice you'd want to use something like the approach described in "Categorize continuous variable with dplyr".
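As a shorter alternative to the manual binning above, here is a minimal sketch using base R's cut(). The data frame is synthetic and the cut-offs are the same arbitrary values used in the example, so treat all of them as placeholders rather than sensible thresholds for real inflation data.

```r
# Synthetic stand-in for the real data; the values are made up.
set.seed(42)
sample_dataframe <- data.frame(inflation = runif(20, min = 0, max = 1.5))

# Hypothetical cut-offs, same as in the manual example.
cut_off_low  <- 0.55
cut_off_high <- 0.88

# cut() assigns each value to an interval. With the default right = TRUE
# the intervals are (-Inf, 0.55], (0.55, 0.88] and (0.88, Inf).
sample_dataframe$inflation_class <- cut(
  sample_dataframe$inflation,
  breaks = c(-Inf, cut_off_low, cut_off_high, Inf),
  labels = c("Low", "Medium", "High")
)

# Count how many rows fall into each class.
table(sample_dataframe$inflation_class)
```

The result is a factor column, which is the form most classification functions in R expect for the target.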

I hope this answers your question of how you could use classification on your dataset. However, since you seem to be new to ML, I'd suggest you stick to some simple regression algorithms first, which also lets you avoid multi-class classification problems.
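If you do go the classification route, a rough sketch of one of the algorithms from the question (kNN) could look like the following. It assumes the class package is available (it is a recommended package shipped with most R installations), and it uses synthetic data with hypothetical predictors x1 and x2 and made-up cut-offs. The split is chronological (first 70% of rows for training), which is an assumption that seemed reasonable for time-ordered data.

```r
library(class)  # provides knn()

# Synthetic data; x1, x2 and the coefficients are invented for illustration.
set.seed(7)
n <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$inflation <- 0.7 + 0.3 * toy$x1 + rnorm(n, sd = 0.05)

# Bin the continuous target into three classes (hypothetical cut-offs).
toy$label <- cut(toy$inflation,
                 breaks = c(-Inf, 0.55, 0.88, Inf),
                 labels = c("Low", "Medium", "High"))

# Chronological split: first 70% of rows for training, the rest for testing.
train_idx <- seq_len(round(0.7 * n))

# Scale predictors using the training set's mean and standard deviation.
train_x <- scale(toy[train_idx, c("x1", "x2")])
test_x  <- scale(toy[-train_idx, c("x1", "x2")],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Classify each test row by majority vote among its 5 nearest neighbours.
pred <- knn(train = train_x, test = test_x, cl = toy$label[train_idx], k = 5)

# Accuracy = share of test rows whose predicted class matches the true class.
accuracy <- mean(pred == toy$label[-train_idx])
accuracy
```

Computing the same accuracy figure for each algorithm on the same test rows is one way to compare them.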

An easy starter would be:

linear_regression_model = lm(inflation ~ variable_name_1 + variable_name_2 + .. + variable_name_n, data = your_data_frame)
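To make the call above concrete, here is a small runnable sketch with two predictors. The data is synthetic, and the column names simply reuse the placeholder names from the formula, so nothing here reflects the real spreadsheet.

```r
# Synthetic data frame; names mirror the placeholders in the formula above.
set.seed(1)
n <- 50
your_data_frame <- data.frame(
  variable_name_1 = rnorm(n),
  variable_name_2 = rnorm(n)
)
# Made-up relationship between the predictors and the target, plus noise.
your_data_frame$inflation <- 0.5 * your_data_frame$variable_name_1 -
  0.2 * your_data_frame$variable_name_2 + rnorm(n, sd = 0.1)

# Fit an ordinary least squares model.
linear_regression_model <- lm(inflation ~ variable_name_1 + variable_name_2,
                              data = your_data_frame)

# Coefficients, standard errors and R-squared.
summary(linear_regression_model)

# Predict inflation for two new, hypothetical rows.
new_rows <- data.frame(variable_name_1 = c(0, 1), variable_name_2 = c(0, -1))
predictions <- predict(linear_regression_model, newdata = new_rows)
predictions
```

If every remaining column should be used as a predictor, the formula `inflation ~ .` is a common shorthand.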

However, if you go beyond simple models you will have to deal with hyperparameters, cross-validation, etc., which you should understand before applying them (you should also understand what a given model does in order to know which one to apply).

Stack Overflow is not a substitute for education, so I'd strongly suggest that you learn the fundamentals reasonably well before experimenting with models you do not yet understand.
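To give an idea of what cross-validation means in practice, here is a minimal base R sketch of 5-fold cross-validation for a linear model. The data, the predictor names x1 and x2, and the choice of k = 5 are all assumptions for illustration. Note that random folds ignore time ordering, so for a real time series a rolling or chronological scheme may be more appropriate.

```r
# Synthetic data; x1, x2 and the coefficients are invented for illustration.
set.seed(3)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$inflation <- 0.4 * d$x1 - 0.1 * d$x2 + rnorm(n, sd = 0.2)

# Randomly assign each row to one of k folds of (nearly) equal size.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: train on the other folds, test on the held-out fold,
# and record the root mean squared error (RMSE).
rmse_per_fold <- sapply(seq_len(k), function(i) {
  train <- d[fold != i, ]
  test  <- d[fold == i, ]
  fit   <- lm(inflation ~ x1 + x2, data = train)
  sqrt(mean((test$inflation - predict(fit, newdata = test))^2))
})

# Average held-out error across the folds.
mean(rmse_per_fold)
```

The averaged error estimates how the model performs on data it was not fitted on, which is the number to compare across models or hyperparameter settings.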

If you have a specific question, feel free to ask though.

meow
  • Thanks for your answer and for explaining so many things in detail. I edited the question and included the data. I also get your point about binning to turn the continuous variable into a categorical one and then proceeding with building the model. I hope the data makes the process of building a predictive model clearer as well. Thanks! – lazurens Jul 09 '17 at 11:47