2

I am trying to merge three different large data frame(1400,000 rows), two dataframe are normal, and the third dataframe are from this mask = (df['a'].lt(25) & df['a'].gt(10)) | df['b'].gt(0.2) | df['c'].gt(500) df[mask] & df['e'].eq(0)`, accoring to my below sample data,

a      b        c       dt                   e   h i j k
35   0.1      234   2020/6/15 14:27:00       0   ........
1    0.1      554   2020/6/15 15:28:00       1   ........
2    0.2      654   2020/6/15 16:29:00       0   ........
23   0.4      2345  2020/6/15 17:26:00       0   ........
34   0.8      245   2020/6/15 18:25:00       0   ........
8    0.9      123   2020/6/15 18:26:00       0
7    0.1      22    2020/6/15 18:27:00       0
2    0.3      99    2020/6/15 18:28:00       0
219  0.2      17    2020/6/15 19:26:00       0

Below code will get to many useless and duplicated columns, is there any way to merge three different large data?

import pandas as pd
from functools import reduce
df1 = pd.read_csv('test1.csv')
df2 = pd.read_csv('test2.csv')
df = pd.read_csv('test.csv', usecols = ['a', 'b', 'c', 'dt', 'e'])
mask = (df['a'].lt(25) & df['a'].gt(10)) | df['b'].gt(0.2) | df['c'].gt(500)
df['x'] = mask.astype(int)
dfs = [df1, df2, df]
df_full = reduce(lambda left,right: pd.merge(left,right, on=['id']), dfs)
Elsa
  • 1
  • 1
  • 8
  • 27
  • Not easy answering because no sample data for `df1, df2`. Also if problem is `will get to many useless and duplicated columns` is possible filter non necessary columns with `usecols` like `f1 = pd.read_csv('test1.csv', usecols=['need col1','need col2'])` ? – jezrael Jul 01 '20 at 06:55
  • @jezrael do you know how to merge three data frame or not? if yes, you don't need my sample data of df1 and df2. – Elsa Jul 01 '20 at 07:15
  • Your code for merging is good, so ask for it. – jezrael Jul 01 '20 at 07:16
  • the code refers to this link, https://stackoverflow.com/questions/23668427/pandas-three-way-joining-multiple-dataframes-on-columns – Elsa Jul 01 '20 at 07:19
  • yes, code is good, problem is in data if `will get to many useless and duplicated columns` – jezrael Jul 01 '20 at 07:20

1 Answers1

0

Could you add some information about the dataframes? Do they have the same columns? Or does each frame have different columns and different number of rows? I am not asking about the data itself but its structure.

Also, I'm not too sure about this, but try to include 'id' in the columns to be read from the third dataframe, because it looks like you are trying to merge all three dataframes on that index.

usecols = ['id', 'a', 'b', 'c', 'dt', 'e']

(Edit: I tried merging without that 'id' and it didn't work for me.)

import pandas as pd

df1 = pd.read_csv('test1.csv', delimiter=";")
df2 = pd.read_csv('test2.csv', delimiter=";")
df = pd.read_csv('test3.csv', delimiter=";", usecols = ['id', 'a', 'b', 'c', 'dt', 'e'])

mask = (df['a'].gt(10) & df['a'].lt(25)) | df['b'].gt(0.2) | df['c'].gt(500)
df['x'] = mask.astype(int)
dfs = [df1, df2, df]

df_merged = df1.merge(df2, on=['id'])
df_merged = df_merged.merge(df, on=['id'])

I tried with my own CSV files:

  • test1.csv has some columns (id, a, b, c)
  • test2.csv has different columns (id, d, e, f)
  • test3.csv has the same columns as in your question: (id, a, b, c, dt, e)
  • df1 and df2 have many duplicated columns, and how to remove columns of df1 and df2? – Elsa Jul 01 '20 at 08:21
  • You could use `df1 = pd.read_csv('test1.csv', delimiter=";", usecols= [...])` and `df2 = pd.read_csv('test1.csv', delimiter=";", usecols= [...])` so you only read the columns that you want to merge (also make sure you include `'id'` in both lists) – Ignacio Hernández Jul 01 '20 at 08:26