4

I have a dataframe with a column for app user-agents. What I need to do is to identify the particular app from this column. For example,

NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0 will be categorized in Words With Friends.

iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24 will be Sudoku by FingerArts etc.

I will have another dataframe with the strings I need to match. For example,

Keyword                 Game 
NewWordsWithFriends     Words With Friends
com.fingerarts.sudoku   Sudoku by FingerArts

How do I do the lookup like this for a pandas dataframe? The dataframe for example is like

user    date                 user-agent
 A      2015-09-02 13:45:56  NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0
 B      2015-08-31 23:04:21  iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24

I want a new column GameName after the lookup.

sfactor
  • 12,592
  • 32
  • 102
  • 152

2 Answers2

1

One possible way to achieve this would be:

import pandas as pd                                                              

# some example data
qry = pd.DataFrame.from_dict({"Keyword": ["NewWordsWithFriends",                 
                                          "com.fingerarts.sudoku"],              
                              "Game": ["Words With Friends",                     
                                       "Sudoku by FingerArts"]})                 

df = pd.DataFrame.from_dict({"user-agent" : ["NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0",     
                                             "iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24"]})

keywords = qry.Keyword.tolist()                                                  
games = qry.Game.tolist()                                                        

def select(x):                                                                   
    for key, game in zip(keywords, games):                                       
        if key in x:                                                             
            return game                                                          

df["GameName"] = df["user-agent"].apply(select)  

This will give:

In [41]: df
Out[41]: 
                                          user-agent              GameName
0  NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15...    Words With Friends
1  iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sud...  Sudoku by FingerArts

If you need to do that for large dataset you need to test the performance of this solution and see if it is fast enough for your purpose.

If not, perhaps optimize for example the way the strings are tested:

Having an outer loop for all possible games and then use .apply to return the results per game per column could speed things up as it will avoid the loop over all games in each call to select() etc.

To identify the bottlenecks you could use line_profiler (see How can I profile python code line-by-line?).

chris-sc
  • 1,698
  • 11
  • 21
1
df = pd.DataFrame({'date' : ['2015-09-02 13:45:56' , '2015-08-31 23:04:21'] , 'user-agent' : ['NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0' , 'iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24']  })

map_df = pd.DataFrame({'Keyword' :  ['NewWordsWithFriends' , 'com.fingerarts.sudoku'], 'Game' : [ 'Words With Friends', 'Sudoku by FingerArts'] })

mapping = {vals[1] : vals[0] for vals in  map_df.values}


regex = '|'.join([keyword.replace('.' , '\.') for keyword in map_df['Keyword']])

def get_keyword(user_agent):
    matches = re.findall(regex ,user_agent)
    return matches[0] if len(matches) > 0 else np.nan


df['GameName'] = df['user-agent'].apply(get_keyword)

df['GameName'] = df['GameName'].map(mapping)

another implementation for get_keyword function could be

def get_keyword(user_agent):
    for keyword in map_df['Keyword']:
        if keyword in user_agent:
            return keyword

also another way to obtain the mapping is to create a series

mapping = pd.Series(map_df['Game'].values , index = map_df.Keyword )
Nader Hisham
  • 5,214
  • 4
  • 19
  • 35