
Thanks to the help from some users of this site, my code seems to work fine, but it's taking too long.


I'm trying to compare two DataFrames (df1 has 1,291,250 rows; df2 has 1,286,692 rows).

If df1.iloc[0,0] == df2.iloc[0,0] and df1.iloc[0,1] == df2.iloc[0,1], then compare df1.iloc[0,2] and df2.iloc[0,2].

If the first value (df1.iloc[0,2]) is larger, I want to put the first index into a list, and if the second value (df2.iloc[0,2]) is larger, I want to put the second index into a list.

Example DataFrame

In [1]: df1 = pd.DataFrame([[0, 1, 98], [1, 1, 198], [2, 2, 228]], columns = ['A1', 'B1', 'C1'])
In [2]: df1
Out[2]:
         A1  B1   C1
      0   0   1   98
      1   1   1  198
      2   2   2  228

In [4]: df2 = pd.DataFrame([[0, 1, 228], [1, 2, 110], [2, 2, 130]], columns = ['A2', 'B2', 'C2'])
In [5]: df2
Out[5]:
         A2  B2   C2
      0   0   1  228
      1   1   2  110
      2   2   2  130

In [7]: find_high(df1, df2)  # function code is below
Out[7]: ([2], [0])  # the result I want

This is just a simple example; my real data is much bigger than this.

My code is:

import glob

import numpy as np
import pandas as pd
import parmap

# split df1 into 60 chunks and pickle each one
for i, chunk in enumerate(np.array_split(df1, 60)):
    chunk.to_pickle(f'df_1_{i}')

files = glob.glob('df_1_*')


def find_high_pre(df1, df2):
    subtract_df2 = []
    subtract_df1 = []
    same_data = []
    for df1_idx, line in enumerate(df1.to_numpy()):
        for df2_idx, row in enumerate(df2.to_numpy()):
            if (line[0:2] == row[0:2]).all():
                if line[2] < row[2]:
                    subtract_df2.append(df2_idx)
                elif line[2] > row[2]:
                    subtract_df1.append(df1_idx)
                else:
                    same_data.append(df1_idx)
                break  # stop scanning df2 once the key pair matched

    return (df1.iloc[subtract_df1].index.tolist(),
            df2.iloc[subtract_df2].index.tolist(),
            df1.iloc[same_data].index.tolist())


data_1 = []
for f in files:
    e_data = pd.read_pickle(f)
    num_cores = 30
    df_split = np.array_split(e_data, num_cores)
    # parmap passes df2 as an extra positional argument to find_high_pre
    data_1 += parmap.map(find_high_pre, df_split, df2,
                         pm_pbar=True, pm_processes=num_cores)
asked by tony, edited by Rachel Gallen
    "slow" you are using nested for loop that means `1,291,250 * 1,286,692` – deadshot Mar 06 '21 at 14:26
  • 1
    Showing examples of your dfs and expected output using small cases would make it easier to answer. See [how to make good reprexes in pandas](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – ggorlen Mar 06 '21 at 15:27

1 Answer

My code seems to work fine, but it's taking too long.

Chances are that replacing your nested for loops with a DataFrame.merge operation will take less time:

keys = ['A', 'B']
df1.columns = [*keys, 'C1']
df2.columns = [*keys, 'C2']
df = df1.reset_index().set_index(keys).merge(
     df2.reset_index().set_index(keys), on=keys)
# now we have a merged dataframe like this:
#      index_x   C1  index_y   C2
# A B                            
# 0 1        0   98        0  228
# 2 2        2  228        2  130
# therefrom we can easily extract the wanted indexes
data = [df.loc[df['C1'] > df['C2'], 'index_x'].values,
        df.loc[df['C1'] < df['C2'], 'index_y'].values]
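
As a quick sanity check, here is a minimal self-contained sketch that runs the merge approach on the sample frames from the question (columns pre-renamed to the shared keys) and reproduces the desired `([2], [0])`:

```python
import pandas as pd

# Sample frames from the question, renamed so both share the merge keys A and B.
df1 = pd.DataFrame([[0, 1, 98], [1, 1, 198], [2, 2, 228]], columns=['A', 'B', 'C1'])
df2 = pd.DataFrame([[0, 1, 228], [1, 2, 110], [2, 2, 130]], columns=['A', 'B', 'C2'])

keys = ['A', 'B']
# reset_index() keeps each frame's original row index as a column,
# which the merge renames to index_x / index_y on collision.
df = df1.reset_index().set_index(keys).merge(
     df2.reset_index().set_index(keys), on=keys)

data = [df.loc[df['C1'] > df['C2'], 'index_x'].values.tolist(),
        df.loc[df['C1'] < df['C2'], 'index_y'].values.tolist()]
print(data)  # [[2], [0]]
```

Only the (0, 1) and (2, 2) key pairs exist in both frames, so the merge keeps exactly those two rows before the comparisons are applied.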
answered by Armali