0

Would really appreciate some help with the following problem. I'm intending on using Pandas library to solve this problem, so would appreciate if you could explain how this can be done using Pandas if possible.

I want to take the following excel file:

Before

and:

  • 1)convert the 'before' file into a pandas data frame
  • 2)look for the text in 'Site' column. Where this text appears within the string in the 'Domain' column, return the value in 'Owner' Column under 'Output'.
  • 3)the result should look like the 'After' file. I would like to convert this back into CSV format.

After

So essentially this is similar to an excel vlookup exercise, except its not an exact match we're looking for between the 'Site' and 'Domain' column.

I have already attempted this in Excel but im looking at over 100,000 rows, and comparing them against over 1000 sites, which crashes excel.

I have attempted to store the lookup list in the same file as the list of domains we want to classify with the 'Owner'. If there's a much better way to do this eg storing the lookup list in a separate data frame altogether, then that's fine.

Thanks in advance for any help, i really appreciate it.

Colin

Colin Blyth
  • 83
  • 1
  • 7
  • 1
    Possible duplicate: http://stackoverflow.com/questions/38291908/excel-vlookup-equivalent-in-pandas –  Feb 02 '17 at 18:47
  • 1
    Possible duplicate of [vlookup in Pandas using join](http://stackoverflow.com/questions/25493625/vlookup-in-pandas-using-join) – Zeugma Feb 02 '17 at 20:10

1 Answers1

0

I think the OP's question differs somewhat from the solutions linked in the comments which either deal with exact lookups (map) or lookups between dataframes. Here there is a single dataframe and a partial match to find.

import pandas as pd
import numpy as np

df = pd.ExcelFile('data.xlsx').parse(0)
df = df.astype(str)
df['Test'] = df.apply(lambda x: x['Site'] in x['Domain'],axis=1)
df['Output'] = np.where(df['Test']==True, df['Owner'], '')
df

The lambda allows reiteration of the in test to be applied across the axis, to return a boolean in Test. This then acts as a rule for looking up Owner and placing in Output.

joshi123
  • 835
  • 2
  • 13
  • 33
  • thanks joshi123. I've tried the above and the 'Output' column doesnt populate in my bigger data set. I could be wrong, but i think the above only looks at a single row at a time, searching for the 'Site' string within the 'Domain' column (on the same row), and if TRUE, it returns the string in column 'Owner'. However, what i want to do is look at 'domain' string in the first row, then scan down ALL strings in the 'Site' column and look for a partial match. Can you amend the above to achieve this? Thanks – Colin Blyth Feb 03 '17 at 14:12
  • i should add that in my dummy excerpt of the data, i have perhaps oversimplified the problem by having the 'site' which partially matches the 'domain' appearing on the same row. – Colin Blyth Feb 03 '17 at 14:20