Thanks to the help from some users of this sites.
My code seems to work fine, but it's taking too long..
I'm trying to compare two data frames.(df1
has 1,291,250 rows / df2
has 1,286,692 rows)
if df1.iloc[0,0] == df2.iloc[0,0]
and df1.iloc[0,1] == df2.iloc[0,1]
, then compare df1.iloc[0,2]
, df2.iloc[0,2]
.
If the first(df1.iloc[0,2]
) is larger, I want to put the first index into the list, and if the second(df2.iloc[0,2]
) is larger, I want to put the second index into the list.
Example DataFrame
In [1]: df1 = pd.DataFrame([[0, 1, 98], [1, 1, 198], [2, 2, 228]], columns = ['A1', 'B1', 'C1'])
In [2]: df1
Out[3]:
A1 B1 C1
0 0 1 98
1 1 1 198
2 2 2 228
In [4]: df2 = pd.DataFrame([[0, 1, 228], [1, 2, 110], [2, 2, 130]], columns = ['A2', 'B2', 'C2'])
In [5]: df2
Out[6]:
A2 B2 C2
0 0 1 228
1 1 2 110
2 2 2 130
In [7]: def find_high(df1, df2) # def function code is below
Out[8]: ([2], [0]) # The result what i want
This is just simple example. my data is bigger than this
my code is:
for i in range(60):
setattr(mod, f'df_1_{i}', np.array_split(df1, 60)[i])
getattr(mod, f'df_1_{i}').to_pickle(f'df_1_{i}')
import glob
files = glob.glob('df_1_*')
def find_high_pre(df1 = files, df2):
subtract_df2 = []
subtract_df1 = []
same_data = []
for df1_index, line in enumerate(df1.to_numpy()):
for df2_idx, row in enumerate(df2.to_numpy()):
if (line[0:2] == row[0:2]).all():
if line[2] < row[2]:
subtract_df2.append(df2_idx)
break
elif line[2] > row[2]:
subtract_df1.append(df1_idx)
break
else:
continue
break
return df1.iloc[subtract_df1].index.tolist(), df2.iloc[subtract_df2].index.tolist(), df1.iloc[same_data].index.to_list();
data_1 = []
for i in files:
e_data = pd.read_pickle(i)
num_cores = 30
df_split = np.array_split(e_data, num_cores)
data_1 += parmap.map(find_high_pre, df_split, pm_pbar=True, pm_processes =num_cores)