5

enter image description here

Left-side CSV file has five columns .application column has several app types delimited with ;. Depending on the app, device and district type, I want to predict the target. But I want to first convert the file in to the right-hand side dataframe to apply machine learning.

How can I do this using python?

Venkatachalam
  • 16,288
  • 9
  • 49
  • 77
Charith Ellepola
  • 302
  • 2
  • 13
  • Where are the values for the table on the right coming from? If it is coming from a separate CSV file you could create an empty data frame and then read the values into their respective columns. – Mitchell Leefers May 01 '19 at 17:25
  • Possible duplicate of [How can I one hot encode in Python?](https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python) – G. Anderson May 01 '19 at 17:30
  • Righ side table values are taken from a Hive sql table. I have a seperate csv file for that . – Charith Ellepola May 01 '19 at 17:41

2 Answers2

5

You need to apply multi-hot encoding for application column and one hot encoding for other columns.

Here is my solution!

>>> import pandas as pd
>>> import numpy as np

>>> df = pd.DataFrame({'number': np.random.randint(0,10,size=5),
                  'device': np.random.choice(['a','b'],size=5),
                  'application': ['app2;app3','app1','app2;app4', 'app1;app2', 'app1'],
                  'district': np.random.choice(['aa', 'bb', 'cc'],size=5)})

>>> df

    application device  district    number
0   app2;app3   b         aa    3
1   app1        a         cc    7
2   app2;app4   a         aa    3
3   app1;app2   b         bb    9
4   app1        a         cc    4

from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Assuming appl names are separated by ;
mhv = mlb.fit_transform(df['application'].apply(lambda x: set(x.split(';'))))
df_out = pd.DataFrame(mhv,columns=mlb.classes_)

enc = OneHotEncoder(sparse=False)
ohe_vars = ['device','district'] # specify the list of columns here
ohv = enc.fit_transform(df.loc[:,ohe_vars])
ohe_col_names = ['%s_%s'%(var,cat) for var,cats in zip(ohe_vars, enc.categories_) for cat in cats]

df_out.assign(**dict(zip(ohe_col_names,ohv.T)))

df_out

enter image description here

Areza
  • 5,623
  • 7
  • 48
  • 79
Venkatachalam
  • 16,288
  • 9
  • 49
  • 77
0

You can first read in the csv the way it is and then do all of the editing. So first..

   df = pd.read_csv("data.csv")

then take the data in the application column and make it into a list. Then take the list and join it into a string and split it back up at the ";" deliminator.

   col = list(df.application)
   col_join = ";".join(col)
   col_split = col_join.split(";")

After you have that list you can loop through and delete the duplicates...

   i=0
   while i < len(col_split):
      j=i+1
      while j < len(col_split):
           if col_split[i] == col_split[j]:
               del(col_split[j])
      else:
        j += 1
    i += 1

Now you have a list of all of the unique application entries that you can read into a data frame to use as columns.

Mitchell Leefers
  • 170
  • 1
  • 13
  • Thank you very much Mitchell . But the issue here is i do not know how many features will be there in a column . The above table is just an example to show the format. the CV has more than 10K rows with numerous app types. – Charith Ellepola May 01 '19 at 17:53
  • @CharithEllepola I have edited my post and I believe this is what you are looking for. – Mitchell Leefers May 01 '19 at 19:09