How to extract a subset of a bigger dataset

Asked Jun 01 '20 at 04:43

Active Jun 01 '20 at 04:43

Viewed 67 times

Say I have a larger dataframe A and a smaller dataframe B, whose rows are a subset of the rows in A. Both dataframes share a matching key column, say it's called key.

I want to create a new dataframe, say C, which keeps only the rows of A that are not in B. E.g. if A contains 1000 rows and B contains 200 rows, then C should contain 1000 - 200 = 800 rows.

What is the best way of doing this? Using either dataframes or numpy arrays would work.

Many thanks!
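
For reference, a minimal sketch of the operation described above, assuming pandas and that the key column identifies rows uniquely. The example data and the value column are hypothetical; only the names A, B, C and key come from the question. It filters A with a boolean mask built from `Series.isin`:

```python
import pandas as pd

# Hypothetical data: A has 1000 rows, B is a 200-row subset of A.
A = pd.DataFrame({"key": range(1000), "value": range(1000)})
B = A.iloc[:200].copy()

# Keep only the rows of A whose key does not appear in B
# (an anti-join on the `key` column).
C = A[~A["key"].isin(B["key"])].reset_index(drop=True)

print(len(C))  # 800
```

If the key spans several columns, a single `isin` on one column is not enough and a merge-based approach is likely the safer choice.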

asked Jun 01 '20 at 04:43

Leockl


This could be helpful, https://stackoverflow.com/questions/53645882/pandas-merging-101 – sushanth Jun 01 '20 at 04:46
`pandas` https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.subtract.html#pandas-dataframe-subtract – notacorn Jun 01 '20 at 04:47
`pyspark` https://stackoverflow.com/questions/29537564/spark-subtract-two-dataframes – notacorn Jun 01 '20 at 04:47
@notacorn, the 2 links you have given me are not really what I am after. They appear to just subtract element values, whereas I am after removing duplicate rows. – Leockl Jun 01 '20 at 10:26
Did you even open up the links? "will return a new DataFrame containing rows in dataFrame1 but not in dataframe2" – notacorn Jun 01 '20 at 14:51
Thanks @notacorn. I can see the statement "will return a new DataFrame containing rows in dataFrame1 but not in dataframe2" in the 2nd link you provided, but this is for Spark? I am using Python. – Leockl Jun 05 '20 at 10:18
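
To make the distinction raised in these comments concrete: pandas `DataFrame.subtract` performs element-wise arithmetic, so it does not drop rows. A row-wise "in A but not in B" result, similar in spirit to what the quoted Spark answer describes, can be sketched in pandas with a left merge and `indicator=True`. The sample data below is hypothetical and the approach is one option, not necessarily the best one for every dataset:

```python
import pandas as pd

# Hypothetical data: B holds the rows of A with key 2 and 4.
A = pd.DataFrame({"key": [1, 2, 3, 4, 5], "value": list("abcde")})
B = A[A["key"].isin([2, 4])]

# Left-merge A against the distinct keys of B; the `_merge` column
# records whether each row of A found a match in B.
merged = A.merge(
    B[["key"]].drop_duplicates(), on="key", how="left", indicator=True
)

# Rows marked "left_only" exist in A but not in B.
C = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(C["key"].tolist())  # [1, 3, 5]
```

Passing a list of column names to `on` extends the same pattern to a key made of several columns.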
