
My question is about efficiently using a regex pattern to find matches between two pandas DataFrames extracted from Excel files.

First, thanks for taking the time to look at my issue. I am very new to Python, and even more so to using it for large datasets. I know that I should turn to Stack Overflow only as a very last resort.

Now I'm hitting my head against a wall over something that's probably simple.

The problem: I have two Excel files:

- One is just 1 column × 300 rows of original URLs.

- The other can be very large: 20k or more translated URLs.

The purpose: for each original URL, I have to find its translated counterpart in the big Excel file.

The only common point between the two URLs is an 8-digit number somewhere in the URL.

I have thought about merging, but it won't work since it is a partial match (to the best of my knowledge). I have found a potentially interesting solution, .where, which could allow me to do exactly what I want. However, I get this error:

master_list["translated"] = crawlfr.url.where(
    number_search.search(master_list).group(0) == number_search.search(crawl_fr).group(0),
    master_list.url)

TypeError: expected string or buffer

As I understand this error, the issue comes from the regex functions, which don't accept anything but a string.
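For what it's worth, the error can be reproduced in isolation: the same compiled pattern works on a plain string but raises a TypeError when handed a whole DataFrame (a minimal sketch with a made-up URL):

```python
import re
import pandas as pd

number_search = re.compile(r"\d{8}")

# Works: search() receives a plain string
print(number_search.search("example.com/page-12345678.html").group(0))

# Fails: search() receives a whole DataFrame, not a string
df = pd.DataFrame({"url": ["example.com/page-12345678.html"]})
try:
    number_search.search(df)
except TypeError as err:
    print("TypeError:", err)
```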

In fact, when I compare two strings by extracting the match objects in a function like this, it works:

def skl_finder(master_list, crawl_fr):
    skl_search = re.compile(r"\d{8}")
    if skl_search.search(master_list).group(0) == skl_search.search(crawl_fr).group(0):
        return skl_search.search(master_list).group(0)

I think this question is very close to what I want to do, but it did not get any reply: Pandas: Comparing two dataframes with identical data structure but differences in data using pattern matching

import regex as re
import pandas as pd

crawl_loc = "translated_file_set.xlsx"
master_list_loc = "original_fileset.xlsx"

crawlfr = pd.read_excel(crawl_loc, parse_cols="E")
master_list = pd.read_excel(master_list_loc)

number_search = re.compile(r"\d{8}")

# This is the line that raises the TypeError
master_list["translated"] = crawlfr.url.where(
    number_search.search(master_list).group(0) == number_search.search(crawlfr).group(0),
    master_list.url)

master_list.to_excel("result_file.xlsx")
  • Not clear what your issue is: speed? IMO you could pre-compute a search key, that is, extract the digits in both files and convert them to an integer as a new column, then search on them. – georgexsh Sep 15 '17 at 10:19
  • Both speed and how to apply a regex between two columns. As mentioned, I get errors with regex and I can't find a working solution to use pattern matching between two columns in this case. – Thomas meriaux Sep 15 '17 at 10:24
  • Transfer the Excel data into a list of strings and compute the search key on them; that will be easier than comparing with pandas, which you are not familiar with. – georgexsh Sep 15 '17 at 12:20
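georgexsh's pre-computed-key suggestion could look like this (a sketch with invented URLs in place of the real files):

```python
import re

# Invented translated URLs standing in for the big file
urls = ["site.com/fr/produit-33334444", "site.com/fr/produit-11112222"]

# Pre-compute the 8-digit key once per URL, stored as an int
key_to_url = {int(re.search(r"\d{8}", u).group(0)): u for u in urls}

print(key_to_url[11112222])
```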

1 Answer


Make sure your url columns are dtype str.

Try creating new columns with the 8-digit id for each (note that str.extract needs a capturing group):

crawlfr['url_id'] = crawlfr.url.str.extract(r"(\d{8})", expand=False)
master_list['url_id'] = master_list.url.str.extract(r"(\d{8})", expand=False)

Then merge on url_id (join would match on the other frame's index, not the column):

crawlfr.merge(master_list, on='url_id')
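Put together, a self-contained sketch of this approach (with invented in-memory data standing in for the two Excel files):

```python
import pandas as pd

# Hypothetical stand-ins for the two Excel files
master_list = pd.DataFrame({"url": ["site.com/en/prod-11112222",
                                    "site.com/en/prod-33334444"]})
crawlfr = pd.DataFrame({"url": ["site.com/fr/produit-33334444",
                                "site.com/fr/produit-11112222"]})

# str.extract needs a capturing group; expand=False returns a Series
master_list["url_id"] = master_list["url"].str.extract(r"(\d{8})", expand=False)
crawlfr["url_id"] = crawlfr["url"].str.extract(r"(\d{8})", expand=False)

# Match original and translated URLs on the shared 8-digit id
result = master_list.merge(crawlfr, on="url_id",
                           suffixes=("_original", "_translated"))
print(result)
```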