
I have two dataframes. The first has a column of strings to search in; the second has a column of substrings to look for, plus the category columns that I should add to the first dataframe when a string contains that substring.

df1:

id    url  
111   vk.com/audio
222   twitter.com/chats

df2:

url   Maincategory   Subcategory
vk.com   Social Network    entertainment
twitter.com   Social Network   entertainment

If the url columns matched exactly, I would use

df1['Main Category'] = df1.url.map(df2.set_index('url')['Maincategory'])

But that doesn't work for substring matching, so instead I use

mapping = dict(df2.set_index('url')['Maincategory'])

def map_to_substring(x):
    # return the category of the first key that occurs in x
    for key, value in mapping.items():
        if key in x:
            return value
    return None

df1['Maincategory'] = df1['url'].apply(map_to_substring)

But when the dataframes are large, this takes too much time. How can I improve this approach to make it faster?

Petr Petrov
  • If you're matching with the domain name, it could be worthwhile to add a column to your dataframe using `urlparse`. You could do exact matching on the `netloc`. Of course this won't work for arbitrary substrings, but it might work in your case. Reference: https://docs.python.org/2/library/urlparse.html – Mikk Jan 19 '17 at 13:30
  • @Mikk not always domain – Petr Petrov Jan 19 '17 at 13:43
  • *Note*: There is a solution [described by @unutbu](https://stackoverflow.com/a/48600345/9209546) which is more efficient than using `pd.Series.str.contains`. If performance is an issue, then this may be worth investigating. – jpp May 06 '18 at 22:18

1 Answer


It is not entirely clear what you are asking, but you should use the pandas `str.contains` method: http://pandas.pydata.org/pandas-docs/stable/text.html
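As a minimal sketch (using hypothetical sample data mirroring the question's `df1`/`df2`), the per-row Python loop can be replaced by a loop over the few keys, matching each key against the whole column at once with the vectorized `str.contains`:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [111, 222],
                    'url': ['vk.com/audio', 'twitter.com/chats']})
df2 = pd.DataFrame({'url': ['vk.com', 'twitter.com'],
                    'Maincategory': ['Social Network', 'Social Network']})

# loop over the (few) keys in df2, but test each key against the
# whole url column in one vectorized pass
for key, cat in zip(df2['url'], df2['Maincategory']):
    mask = df1['url'].str.contains(key, regex=False)
    df1.loc[mask, 'Maincategory'] = cat

print(df1)
```

This is fast when `df2` is small relative to `df1`, because the Python-level loop runs once per key rather than once per row.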

As a general rule, you can loop over the keys from the second dataframe and search for each one in the first with a vectorized operation; I don't think there is a faster solution than this.
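Alternatively, the whole loop can be collapsed into a single pass: build one regex alternation from the keys and use the vectorized `str.extract` to pull out whichever key matched, then map it to its category. A sketch, again assuming the question's column names:

```python
import re
import pandas as pd

df1 = pd.DataFrame({'id': [111, 222],
                    'url': ['vk.com/audio', 'twitter.com/chats']})
df2 = pd.DataFrame({'url': ['vk.com', 'twitter.com'],
                    'Maincategory': ['Social Network', 'Social Network']})

# one capture group containing all keys, escaped so '.' is literal
pattern = '(' + '|'.join(re.escape(k) for k in df2['url']) + ')'

# extract the first key that occurs in each url, then map it
matched = df1['url'].str.extract(pattern, expand=False)
df1['Maincategory'] = matched.map(df2.set_index('url')['Maincategory'])
print(df1)
```

Rows whose url contains none of the keys get `NaN`, which can then be filled as needed.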

ℕʘʘḆḽḘ