Which method is more efficient for comparing two large (8 GB and 5 GB) CSV files? The output should contain every id that is in file but not in file1.
The data is a single column of GUIDs.
Method 1:
import pandas as pd

df = pd.read_csv(file)
df1 = pd.read_csv(file1)
# Outer merge with an indicator column, then keep rows that appear only in df.
df = df.merge(df1, on='id', how='outer', indicator=True).query('_merge == "left_only"')
df['id'].to_csv(output_path, index=False)
Method 2:
with open(file1, 'r') as t1:
    file1_ids = set(t1)  # renamed so it no longer shadows the path variable `file`
with open(file, 'r') as t2, open(output_path, 'w') as outFile:
    for line in t2:
        if line not in file1_ids:
            outFile.write(line)