
I want to make a regression model from this dataset (the first two columns are the independent variables and the last one is the dependent variable). I have imported the dataset using `dataset = pd.read_csv('data.csv')`. I have built models before, but never with a date-format column as an independent variable, so how should I handle these dates to build the regression model? Also, how should I handle the 0 values in the dataset? My dataset (in .csv format) looks like:

Month/Day, Sales,  Revenue
01/01,     0,      0
01/02,     100000, 0
01/03,     400000, 0
01/06,     300000, 0
01/07,     950000, 1000000
01/08,     10000,  15000
01/10,     909000, 1000000
01/30,     12200,  12000
02/01,     950000, 1000000
02/09,     10000,  15000
02/13,     909000, 1000000
02/15,     12200,  12000

I don't know how to handle this date format or the 0 values.

codegear
    Please see https://stackoverflow.com/help/mcve for how to post good SO questions. What have you tried so far? What was the output? What is your desired output? That said, you can use `pandas.read_csv()` to read a CSV file, and `pandas.DataFrame.corr()` to find correlations. I do not think this is a ML or DL problem. – Evan Jan 12 '18 at 16:17
  • Possible duplicate of [Use .corr to get the the correlation between two columns](https://stackoverflow.com/questions/42579908/use-corr-to-get-the-the-correlation-between-two-columns) – Evan Jan 12 '18 at 16:19
  • Sir, I have imported the file using `pd.read_csv()`. I know how to do feature scaling, model selection, and imputation, but I have never built a model using a date as an independent variable. How should I convert the date so we can build the model without errors? – codegear Jan 12 '18 at 16:24
  • Have you tried `pandas.to_datetime`? https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html – Evan Jan 12 '18 at 16:43
  • Sir, I am actually new to data science. I have searched Google all day and only found how to handle dates of the form dd-mm-YYYY, not mm/dd. Can you help? – codegear Jan 12 '18 at 16:44

2 Answers


Here's a start. I saved your data into a file and stripped all the whitespace.

import pandas as pd
df = pd.read_csv('20180112-2.csv')
df['Month/Day'] = pd.to_datetime(df['Month/Day'], format = '%m/%d')
print(df)

Output:

    Month/Day   Sales  Revenue
0  1900-01-01       0        0
1  1900-01-02  100000        0
2  1900-01-03  400000        0
3  1900-01-06  300000        0
4  1900-01-07  950000  1000000
5  1900-01-08   10000    15000
6  1900-01-10  909000  1000000
7  1900-01-30   12200    12000
8  1900-02-01  950000  1000000
9  1900-02-09   10000    15000
10 1900-02-13  909000  1000000
11 1900-02-15   12200    12000

The year defaults to 1900 since it is not provided in your data. Changing it is really a separate question, but see: Pandas: Change day

# No extra import needed; the column already holds datetime objects.
df['Month/Day'] = df['Month/Day'].apply(lambda d: d.replace(year=2017))
print(df)

Output:

    Month/Day   Sales  Revenue
0  2017-01-01       0        0
1  2017-01-02  100000        0
2  2017-01-03  400000        0
3  2017-01-06  300000        0
4  2017-01-07  950000  1000000
5  2017-01-08   10000    15000
6  2017-01-10  909000  1000000
7  2017-01-30   12200    12000
8  2017-02-01  950000  1000000
9  2017-02-09   10000    15000
10 2017-02-13  909000  1000000
11 2017-02-15   12200    12000

Finally, to find the correlation between columns, just use df.corr():

print(df.corr())

Output:

            Sales   Revenue
Sales    1.000000  0.953077
Revenue  0.953077  1.000000
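Once the dates are parsed, they still need to become numeric before a regression model can consume them. A minimal sketch of one common approach (my own addition, not part of the answer above; `day_num` is a name I made up, and `numpy.polyfit` stands in for whatever regression model you prefer):

```python
import numpy as np
import pandas as pd
from io import StringIO

# Reconstruct the cleaned data inline so the example is self-contained.
csv = """Month/Day,Sales,Revenue
01/02,100000,0
01/03,400000,0
01/07,950000,1000000
01/08,10000,15000
01/10,909000,1000000
02/01,950000,1000000"""
df = pd.read_csv(StringIO(csv))
df['Month/Day'] = pd.to_datetime(df['Month/Day'], format='%m/%d')

# Turn the datetime into a plain integer: days since the first observation.
df['day_num'] = (df['Month/Day'] - df['Month/Day'].min()).dt.days

# Fit a simple linear trend, Revenue ~ day_num.
slope, intercept = np.polyfit(df['day_num'], df['Revenue'], 1)
```

Tree-based models can often use such an integer feature directly; linear models may additionally benefit from calendar features like those in the other answer.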
Evan
  • Sir, can you give an idea of how to handle the Month/Day variable so it can be used together with the other integer columns? – codegear Jan 13 '18 at 03:23
  • Can you explain what you mean more clearly? Or post a new question / search for old questions addressing what you need? – Evan Jan 13 '18 at 08:25
  • Sir, imagine you needed to build a model from these data to predict revenue. What steps would you take? Mainly, what would you do with the date: convert it to some value, simply ignore it, or something else? There are also missing dates, so how can we predict those values? – codegear Jan 13 '18 at 13:47
  • Most models can use dates or time series data. Dealing with missing values or zero values is the decision of the analyst or data scientist creating the model. To drop data containing zeros, see the solution here: https://stackoverflow.com/questions/22649693/drop-rows-with-all-zeros-in-pandas-data-frame `df = df.loc[(df!=0).all(axis=1)]` – Evan Jan 13 '18 at 15:43
  • Should I change the date to an encoded value, or should I change it into a string, integer, or float? Please help me using the above dataset as an example. – codegear Jan 13 '18 at 17:51

How to handle missing data?

There are a number of ways to replace it: with the mean, the median, a moving-average window, or even an RF-based approach (or similar methods such as MICE). For the 'Sales' column you can try any of these. For the 'Revenue' column it is better not to use any of them, especially if you have many missing values (it will harm the model). Just remove the rows with missing values in the 'Revenue' column.

By the way, a few ML methods accept missing values directly: XGBoost and, to some extent, trees/forests. For the latter you may replace zeros with some very distinct value like -999999.
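A minimal sketch of the two strategies above (my own addition; treating 0 as missing is an assumption, and mean imputation stands in for any of the listed methods):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sales':   [0, 100000, 400000, 300000, 950000],
    'Revenue': [0, 0, 15000, 0, 1000000],
})

# Treat zeros as missing values.
df = df.replace(0, np.nan)

# 'Sales' (a feature): impute with the column mean.
df['Sales'] = df['Sales'].fillna(df['Sales'].mean())

# 'Revenue' (the target): drop rows where it is missing.
df = df.dropna(subset=['Revenue'])
```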

What to do with the data?

Many things related to feature engineering can be done here:

1. Day of week
2. Weekday or weekend
3. Day in month (number)
4. Pre- or post-holiday
5. Week number
6. Month number
7. Year number
8. Indicators of domain factors (for example, with fruit sales data you can add some boolean columns related to seasonality)
9. And so on...

Almost every feature here should be preprocessed via one-hot encoding.

And, of course, remove correlated features if you use linear models.
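A sketch of a few of the features above, using `pd.get_dummies` for the one-hot step (the feature names are my own):

```python
import pandas as pd

# Parse a few sample dates from the question (year defaults to 1900).
dates = pd.to_datetime(['01/02', '01/07', '02/01'], format='%m/%d')
df = pd.DataFrame({'date': dates})

# Engineer calendar features from the date column.
df['day_of_week']  = df['date'].dt.dayofweek          # 0 = Monday
df['is_weekend']   = (df['day_of_week'] >= 5).astype(int)
df['day_in_month'] = df['date'].dt.day
df['month']        = df['date'].dt.month

# One-hot encode the categorical features.
encoded = pd.get_dummies(df, columns=['day_of_week', 'month'])
```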

avchauzov
  • So for day and month I have to encode using a one-hot encoder, and should I use dummy variables? If I do, there will be many columns – codegear Jan 13 '18 at 17:50
  • Sure, you can use dimensionality reduction techniques for the dummy columns. – avchauzov Jan 13 '18 at 22:17
  • Additionally, you can preprocess each column to keep only classes with a rate >= 5%; all rarer values are generalized into an 'other' class. – avchauzov Jan 13 '18 at 22:20