
I have two really big data frames (let's say A and B). My objective is to make a new dataframe (C) from A and B with an additional Boolean column: True if the row is in B, False if it is not.

I get all unique identities from the smaller one (B) and store them in a list (its size is 73739559), then I do element matching with the pandas apply function, but it crashes frequently:

df['responsive'] = df.apply(lambda row: row.FullPhoneNumber in FullPhoneNumber, axis = 1)
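(For comparison, the same membership test can be expressed with the vectorized `Series.isin`, which avoids calling a Python lambda per row; converting the lookup list to a `set` also makes each test O(1). This is a minimal sketch on tiny hypothetical data, not the real frames:)

```python
import pandas as pd

# Hypothetical small frame standing in for A.
df = pd.DataFrame({'FullPhoneNumber': ['111', '222', '333']})

# Hypothetical unique identities from B, stored as a set for O(1) lookups.
phone_set = {'111', '333'}

# Vectorized membership test: no per-row Python call, unlike df.apply.
df['responsive'] = df['FullPhoneNumber'].isin(phone_set)
```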

Now I'm trying the following code:

import csv

f_res = open(main_path + "responsive.csv", 'a')
f_irr = open(main_path + "irresponsive.csv", 'a')
res_writer = csv.writer(f_res)
irr_writer = csv.writer(f_irr)
df['responsive'] = False

for idx, row in df.iterrows():
  if row['FullPhoneNumber'] in FullPhoneNumber:
    # assign via df.loc; setting it on the iterrows copy is lost
    df.loc[idx, 'responsive'] = True
    res_writer.writerow(row)
  else:
    irr_writer.writerow(row)

But it's too slow. I'm looking for something faster, as the data is over 10 GB.
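(For reference, one way to keep memory bounded on a 10 GB input is to read it in chunks and split each chunk with a vectorized boolean mask instead of a per-row loop. This is a sketch: the file name, column name, and chunk size are placeholders, and a tiny stand-in CSV is created here just so the snippet is self-contained:)

```python
import pandas as pd

# Tiny stand-in for the 10 GB input; in practice, point read_csv at the real file.
pd.DataFrame({'FullPhoneNumber': ['111', '222', '333', '444']}).to_csv('big_A.csv', index=False)

# Hypothetical unique identities from B, kept as a set for O(1) lookups.
phone_set = {'111', '333'}

responsive_rows = 0
for chunk in pd.read_csv('big_A.csv', dtype=str, chunksize=2):
    # Vectorized membership test over the whole chunk at once.
    mask = chunk['FullPhoneNumber'].isin(phone_set)
    chunk[mask].to_csv('responsive.csv', mode='a', header=False, index=False)
    chunk[~mask].to_csv('irresponsive.csv', mode='a', header=False, index=False)
    responsive_rows += int(mask.sum())
```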

  • Is there any reason you are using pandas for this task? It seems it can be achieved with the standard library `csv`, which doesn't necessarily keep the whole data in memory like pandas does. This might fix any crashing issue – Nicolò Gasparini Aug 24 '21 at 09:09
  • Since you need only unique identities you may try to use a `set` instead of a `list`. Check [this SO topic](https://stackoverflow.com/questions/2831212/python-sets-vs-lists) – gimix Aug 24 '21 at 09:09
  • You need to provide a [mcve]. Is `FullPhoneNumber` a list? Why did you use a list instead of a set? – juanpa.arrivillaga Aug 24 '21 at 09:14
  • After matching the unique identities I have to fetch that row and write it into some other file – Ashba jawed Aug 24 '21 at 11:40
  • I'm using pandas as I need to drop some columns – Ashba jawed Aug 24 '21 at 11:42

0 Answers