I have training data like the sample below, where all the information is packed into a single column. The dataset has over 300,000 rows.
id features label
1 name=John Matthew;age=25;1.=Post Graduate;2.=Football Player; 1
2 name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games; 1
3 name=David;age=12;1:=High School;2:=Cricketer;native=america; 2
4 name=George;age=11;1:=High School;2:=Carpenter;married=yes 2
.
.
300000 name=Kevin;age=16;1:=High School;2:=Driver;Smoker=No 3
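So each features value is just a run of semicolon-separated key=value pairs, and a single row should parse into a plain dict, e.g. (a quick sketch, not part of my real code):

s = "name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;"
parsed = dict(p.split("=", 1) for p in s.split(";") if p)
# {'name': 'John Matthew', 'age': '25', '1.': 'Post Graduate', '2.': 'Football Player'}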
Now I need to convert this training data into the format below:
id name age 1 2 Interest married Smoker
1 John Matthew 25 Post Graduate Football Player NaN NaN NaN
2 Mark clark 21 Under Graduate NaN Video Games NaN NaN
.
.
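If I already had one parsed dict per row, pandas would fill the missing keys with NaN by itself, which is exactly the shape I want (hypothetical toy data, just to show the target):

import pandas as pd

records = [
    {"name": "John Matthew", "age": "25", "1": "Post Graduate", "2": "Football Player"},
    {"name": "Mark clark", "age": "21", "1": "Under Graduate", "Interest": "Video Games"},
]
print(pd.DataFrame(records))  # missing keys show up as NaN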
Is there an efficient way to do this? I tried the code below, but it took 3 hours to complete:
#Getting the proper features from the features column
from collections import Counter
import numpy as np

cols = {}
for choices in set_label:
    collection_list = []
    array = train["features"][train["label"] == choices].values
    for value in array:  # iterate over every row; range(1, len(array)) skipped the first one
        var_split = value.split(";")
        try:
            # each piece is "key=value"; drop the empty piece left by a trailing ";"
            d = dict(s.split("=", 1) for s in var_split if s)
            collection_list.extend(d.keys())
        except ValueError:
            pass  # malformed row, skip it
    count = Counter(collection_list)
    for k, v in count.most_common(5):
        key = k.replace(":", "").replace(" ", "_").lower()
        cols[key] = v

columns_add = list(cols.keys())
train = train.reindex(columns=np.append(train.columns.values, columns_add))
print(train.columns)
print(train.shape)
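For reference, the key collection above can also be written as a single Counter pass per label (a sketch using the same train and set_label, untimed):

from collections import Counter

for choices in set_label:
    features = train.loc[train["label"] == choices, "features"]
    count = Counter(
        # note: keys are normalized before counting here, unlike the loop above
        p.split("=", 1)[0].replace(":", "").replace(" ", "_").lower()
        for value in features
        for p in value.split(";")
        if "=" in p
    )
    for key, v in count.most_common(5):
        cols[key] = v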
#Adding the values for the newly created columns
for row in train.itertuples():
    new_dict = {}
    value = train.loc[row.Index, "features"]
    v_split = value.split(";")
    try:
        # parse the "key=value" pairs, ignoring the empty piece left by a trailing ";"
        dummy_dict = dict(s.split("=", 1) for s in v_split if s)
        for k, v in dummy_dict.items():
            new_key = k.replace(":", "").replace(" ", "_").lower()
            new_dict[new_key] = v
    except ValueError:
        pass  # malformed row, skip it
    for k, v in new_dict.items():
        if k in train.columns:
            train.loc[row.Index, k] = v
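Would something along these lines be the right direction instead of the row-by-row loop: parse every features string once, let pandas build the wide frame from the list of dicts, and join it back? A rough sketch on the original three-column train, untested at scale:

import pandas as pd

def parse_features(value):
    # "k=v;k=v;..." -> dict with normalized keys
    pairs = (p.split("=", 1) for p in value.split(";") if "=" in p)
    return {k.replace(":", "").replace(" ", "_").lower(): v for k, v in pairs}

parsed = pd.DataFrame([parse_features(v) for v in train["features"]], index=train.index)
train = train.join(parsed)  # assumes the parsed column names don't clash with existing ones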
Is there a useful function I can apply here for more efficient feature extraction?