
I am a python newbie, trying to figure out a problem using pandas.

I have two .csv files that I have imported as pandas dataframes.

One of these files has columns for an ID number and Start and End coordinates:

ID  Start  End
1   45     99
3   27     29
6   13     23
19  11     44

My second file has columns for a code and Start and End coordinates as well:

Code  Start  End
ss13d  67    100
dfv45  55    100
aal33  101   222
mm0ww  24    28

I want to find start and end coordinates that overlap between both of these files in no particular order, so that the result would look something like this:

ID  Start  End  Code  Start  End
1   45     99   ss13d 67     100
1   45     99   dfv45 55     100
3   27     29   mm0ww 24     28

I have tried using pandas.merge(), but from what I understand the dataframes need to have columns in common. In this case those would be my Start and End columns, but I can't merge on them since they are the very columns being compared.

For now I have at least figured out the logic behind how I would locate overlaps:

import pandas as pd

df = pd.read_csv(r'file1.csv')
df2 = pd.read_csv('file2.csv')

c = (df['Start'] <= df2['Start']) & (df['End'] >= df2['Start']) | (df['Start'] <= df2['End']) & (df['End'] >= df2['End'])

but I haven't had any luck getting anything to work.

Could someone point me in the right direction? Neither concat nor merge seems to work for me in this situation.

1 Answer


So to start out, you should probably rename your columns so that you can tell which belongs to which dataframe; it'll make things easier when comparing them later.

df1 = df1.rename(columns={'Start': 'Start_1', 'End': 'End_1'}) 
df2 = df2.rename(columns={'Start': 'Start_2', 'End': 'End_2'}) 

Next, if you want to merge two dataframes, but don't have any column in common, you can simply create one:

df1["key"] = 0
df2["key"] = 0

Then you can merge on that column and drop it again:

joined_df = pd.merge(df1, df2).drop(columns=['key'])

Finally, you can filter your rows to keep only the pairs whose coordinate ranges overlap, for example like this:

joined_df[(joined_df["Start_1"] <= joined_df["End_2"]) & (joined_df["Start_2"] <= joined_df["End_1"])]

(Just a tip: use & and | to combine filters, and always put parentheses around each comparison, since & and | bind more tightly than comparisons like < and >.)
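For instance, with the joined_df from above, a filter written without parentheses typically fails with "The truth value of a Series is ambiguous", because & is evaluated before the comparisons; with parentheses each comparison is evaluated first and then combined element-wise:

# without parentheses this typically raises ValueError: The truth value of a Series is ambiguous
# mask = joined_df["Start_2"] > joined_df["Start_1"] & joined_df["Start_2"] < joined_df["End_1"]

# with parentheses each comparison produces a boolean Series, then & combines them row by row
mask = (joined_df["Start_2"] > joined_df["Start_1"]) & (joined_df["Start_2"] < joined_df["End_1"])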
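Putting the steps together, here is a minimal end-to-end sketch. The file names file1.csv/file2.csv and the inclusive overlap condition are assumptions based on the sample data in the question:

import pandas as pd

# read both files (column names as in the question: ID/Start/End and Code/Start/End)
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

# rename the coordinate columns so they stay distinguishable after the merge
df1 = df1.rename(columns={'Start': 'Start_1', 'End': 'End_1'})
df2 = df2.rename(columns={'Start': 'Start_2', 'End': 'End_2'})

# shared dummy key -> merging on it pairs every row of df1 with every row of df2
df1['key'] = 0
df2['key'] = 0
joined_df = pd.merge(df1, df2, on='key').drop(columns=['key'])

# keep only pairs whose intervals overlap (boundaries inclusive)
overlaps = joined_df[(joined_df['Start_1'] <= joined_df['End_2']) &
                     (joined_df['Start_2'] <= joined_df['End_1'])]
print(overlaps)

On the sample data this keeps the three rows shown in the desired output (ID 1 with ss13d, ID 1 with dfv45, ID 3 with mm0ww).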

Hope this helps and good luck with pandas!

sigma1510
  • oh thanks! so from what I can see this has created a new column called "Key" on position zero, yes? This will be filled with a NaN value? – Daniil Ruvinskii Apr 08 '20 at 20:49
  • No, it has created in each dataframe a column named "key" which has value 0 in every row. Merge pairs each row from df1 which has value x in "key" with each row in df2 which has value x in "key". Since x=0 everywhere here, we are basically doing a cartesian product: every row from df1 is paired with every row from df2. Then we can filter and throw away anything that doesn't overlap. If you're using a notebook (jupyter notebook, google colab, etc...) you can print out intermediary steps, so it'll be easier to see what's happening. – sigma1510 Apr 08 '20 at 20:52
  • oh my god thank you so much, you are brilliant! One last question: when I try to merge like so: joined_df = pd.merge(df1, df2).drop(columns=['key']) it still says I have no common columns to merge on. Perhaps I need to use some options, like right=? – Daniil Ruvinskii Apr 08 '20 at 21:20
  • Hmm, I can't replicate that error. You can try df1.merge(df2, how='outer'), it should be equivalent, and you don't absolutely need to drop "key"; it's just clean-up, so you can leave that line out while debugging. Can you check both dfs have the key column before merging? – sigma1510 Apr 08 '20 at 21:29
  • the merging process is causing my system to freeze, I guess it's causing some sort of memory overflow. How could I work around this? – Daniil Ruvinskii Apr 09 '20 at 08:53
  • I have looked into this, and my merge command causes my session to use all available ram and crash. – Daniil Ruvinskii Apr 09 '20 at 09:58
  • Exactly how big are your files? – sigma1510 Apr 09 '20 at 12:04
  • not large- only 800kb and 1.3mb respectively. I know for a fact I am doing something wrong, I just can't quite figure it out. – Daniil Ruvinskii Apr 09 '20 at 12:15
  • Can you try and take, say, the top 10 lines from each file, and try out your functions? That way you can check if the problem is the code or the size of the data. – sigma1510 Apr 09 '20 at 12:17
  • it worked with the top ten lines, giving me a csv file with 81 rows and ignoring my filter I think. – Daniil Ruvinskii Apr 09 '20 at 12:26
  • So your problem is the memory size, then. Here are some other people who had the same problem, you can pick which solution you prefer: https://stackoverflow.com/questions/47386405/memoryerror-when-i-merge-two-pandas-data-frames https://stackoverflow.com/questions/55121153/how-to-perform-a-merge-of-too-large-dataframes?noredirect=1&lq=1 https://stackoverflow.com/questions/53699012/performant-cartesian-product-cross-join-with-pandas – sigma1510 Apr 09 '20 at 12:31
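For completeness, here is a minimal sketch of one way around the memory blow-up, along the lines of the linked questions but not taken from them: build and filter the cross join in chunks so the full cartesian product never has to sit in memory at once. The chunk_size value is an arbitrary placeholder to tune to the available RAM.

import pandas as pd

df1 = pd.read_csv('file1.csv').rename(columns={'Start': 'Start_1', 'End': 'End_1'})
df2 = pd.read_csv('file2.csv').rename(columns={'Start': 'Start_2', 'End': 'End_2'})
df2['key'] = 0

chunk_size = 1000  # placeholder; tune to the available RAM
results = []
for start in range(0, len(df1), chunk_size):
    # cross join only this slice of df1 against all of df2
    chunk = df1.iloc[start:start + chunk_size].copy()
    chunk['key'] = 0
    joined = pd.merge(chunk, df2, on='key').drop(columns=['key'])
    # filter immediately so only overlapping pairs are kept in memory
    joined = joined[(joined['Start_1'] <= joined['End_2']) &
                    (joined['Start_2'] <= joined['End_1'])]
    results.append(joined)

overlaps = pd.concat(results, ignore_index=True)
print(overlaps)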