
Say I have a larger dataframe A and a smaller dataframe B, where B is a subset of A. Both dataframes share a matching key column, say it's called key.

I want to create a new dataframe, say C, which keeps only the rows in A that are not in B. For example, if A contains 1000 rows and B contains 200 rows, then C should contain 1000 - 200 = 800 rows.

What is the best way of doing this? Using either dataframes or numpy arrays would work.
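
To make the goal concrete, here is a minimal sketch of the result I am after. It assumes pandas, a column literally named `key`, and keys that are unique within A. The toy data, the `value` column, and the two approaches shown (`isin` filtering and a left merge with `indicator=True`) are illustrative assumptions, not necessarily the best way.

```python
import pandas as pd

# Toy data (hypothetical): B is a subset of A, matched on "key".
A = pd.DataFrame({"key": [1, 2, 3, 4, 5], "value": ["a", "b", "c", "d", "e"]})
B = A[A["key"].isin([2, 4])]

# Approach 1: keep the rows of A whose key does not appear in B.
C = A[~A["key"].isin(B["key"])]

# Approach 2: left merge with an indicator column, then keep the rows
# that were found only in A ("left_only").
merged = A.merge(B[["key"]].drop_duplicates(), on="key", how="left", indicator=True)
C_merge = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# With unique keys, C has len(A) - len(B) rows.
print(len(A), len(B), len(C))
```

Both approaches should give the same rows here. If keys can repeat in A, the row count of C is no longer simply `len(A) - len(B)`.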

Many thanks!

Leockl
  • This could be helpful: https://stackoverflow.com/questions/53645882/pandas-merging-101 – sushanth Jun 01 '20 at 04:46
  • `pandas` https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.subtract.html#pandas-dataframe-subtract – notacorn Jun 01 '20 at 04:47
  • `pyspark` https://stackoverflow.com/questions/29537564/spark-subtract-two-dataframes – notacorn Jun 01 '20 at 04:47
  • @notacorn, the 2 links you have given me are not really what I am after. They appear to just subtract element values, whereas I am after removing the rows of A that also appear in B. – Leockl Jun 01 '20 at 10:26
  • Did you even open up the links?... "will return a new DataFrame containing rows in dataFrame1 but not in dataframe2" – notacorn Jun 01 '20 at 14:51
  • Thanks @notacorn. I can see the statement "will return a new DataFrame containing rows in dataFrame1 but not in dataframe2" in the 2nd link you provided above, but isn't this for Spark? I am using Python. – Leockl Jun 05 '20 at 10:18

0 Answers