
Hello everyone, I am kind of new to this and I don't know what to do anymore. The problem I have is the following: I have a dataset with 2 columns and 325,729 rows. I generate 10,000 random numbers, all different, and create a bit vector of size 325,729 with a 1 at each of the 10,000 positions and 0 everywhere else. Now I need to loop over the dataset and check each row: if at least one of the row's two values is among the 10,000 random numbers, I keep the row; otherwise I drop it.

The problem is that it takes forever: the last time it ran for 3 hours and did not finish. I don't know what to do anymore at this point.

Here is the code I am running.

# Import the data : 

import numpy as np
import pandas as pd

df22 = pd.read_table('web-NotreDame.txt', header=None)

# Create the bitvector and the random variables

# note: randint samples with replacement, so duplicates are possible
data123 = np.random.randint(0, 10000, size=10000)
data123.sort()

print(len(data123))
uniques = np.unique(data123)
print(len(uniques))

data1234 = [0] * 325729

for val in uniques:
    data1234[val] = 1

# Dropping the rows

for ind in df22.index:
    if data1234[df22[0][ind]] != 1 and data1234[df22[1][ind]] != 1:
        # note: drop() returns a new frame; without reassignment
        # or inplace=True this result is discarded
        df22.drop(ind)
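For reference, the row-by-row drop loop can be replaced by a single vectorized filter. Below is a minimal sketch on a tiny hypothetical frame (the column values and the `selected` array are made up for illustration, standing in for the real dataset and the 10,000 random numbers): keep a row when either column's value is in the selected set, which is what the loop above is trying to do.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real two-column dataset (values are hypothetical).
df22 = pd.DataFrame({0: [0, 1, 2, 3, 4], 1: [5, 1, 6, 3, 7]})

# Stand-in for the 10,000 selected random numbers.
selected = np.array([1, 3])

# Keep a row when either column's value is in the selected set;
# isin() builds the whole boolean mask in one vectorized pass.
mask = df22[0].isin(selected) | df22[1].isin(selected)
df22 = df22[mask]

print(df22)  # only the rows with original index 1 and 3 remain
```

On 325,729 rows this should finish in well under a second, because the mask is computed column-wise instead of one row at a time.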

If someone could give me a hand by telling me how I can make the program at least finish, I would appreciate it. By the way, yes, I checked other solutions on Stack Overflow, but they did not work out, so this is my last resort. Thank you in advance for your help.

Patrick Artner
Alex97
  • Doesn't this lead to duplicate random numbers? – Patrick Artner Sep 27 '21 at 09:54
  • I used unique, so from 10,000 I have 7,000 to 8,000 unique numbers without duplicates; this is not the problem – Alex97 Sep 27 '21 at 09:55
  • Why do you not use `df.sample(n=10000)` to get the random 10k rows and remove all the random stuff? ==> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) – Patrick Artner Sep 27 '21 at 09:55
  • Ok, I will use that too, but the problem is not there; the problem is in the last for loop – Alex97 Sep 27 '21 at 09:56
  • Yes, ok, but this is not the point of my problem; for me it can be 10k or 8k or 20k, it is just a number. My problem is the dropping part, not the random part – Alex97 Sep 27 '21 at 09:58
  • 1
    It is not good to remove data in loop. You could initialize an array with the index of data to be dropped and then loop on this array to drop data. – Ptit Xav Sep 27 '21 at 10:00
  • Ptit Xav, I got confused: by saving the indexes that have to be dropped and then using that array inside another for loop to drop the rows, isn't that the same thing, just done in a different for loop? – Alex97 Sep 27 '21 at 10:03
  • 1
    If youz have 10 rows and remove the 2, 5 and 9th - after removing the 2nd the former 5th will now be on place 4 and the former 9th will be on place 8 - if you now remove the 5th there will no longer be a 9th one because that one moved to 8th now. Removing stuff in a loop on a pandas dataframe is inefficient - there are better ways to do it. Please prepare a hardcoded dataframe of 10 elements, prepare a "bit-array" of also 10 elements and drop like 4 of them. As all the random and exact numbers are not your problem, remove them and make this a [mre] that gives inputs+outputs and whats wrong – Patrick Artner Sep 27 '21 at 10:06
  • maybe look into [how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression](https://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression) and [random-row-selection-in-pandas-dataframe](https://stackoverflow.com/questions/15923826/random-row-selection-in-pandas-dataframe) if you haven't yet. – Patrick Artner Sep 27 '21 at 10:07
  • `df22.drop(index=uniques)`, or `df22.drop(index=df22.index[uniques])` if the index is not the same as the row numbers. – Cimbali Sep 27 '21 at 10:11
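On the duplicates raised in the first comment: `np.random.randint` samples with replacement, so some of the 10,000 draws repeat. A short sketch using `Generator.choice` with `replace=False` gives exactly 10,000 distinct ids, assuming they are meant to span the full 0..325728 range that the bit vector covers:

```python
import numpy as np

# Draw 10,000 distinct ids from 0..325728 (sampling without replacement),
# instead of np.random.randint, which can repeat values.
rng = np.random.default_rng()
uniques = rng.choice(325_729, size=10_000, replace=False)
uniques.sort()

print(len(uniques))             # 10000
print(len(np.unique(uniques)))  # also 10000: no duplicates
```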

0 Answers