My input is a 20 GB .txt file, and the code below runs into performance problems when I test it: pd.read_csv alone takes more than 3 hours. I need to optimize the reading stage.
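One direction I'm considering for the read itself is dropping engine='python' for pandas' default C parser, giving it an explicit dtype, and skipping the unused last column. This is a sketch only (column names match my code below; I haven't benchmarked it on the full 20 GB file):

import pandas as pd

# Sketch: default C engine, fixed string dtype, and only the three
# columns the rest of the script actually uses.
df = pd.read_csv(
    'sample.txt',
    sep='|',
    names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
    usecols=['upc_cd', 'chr_typ', 'chr_vl'],
    dtype=str,
)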
Sample input file:
007064770000|SODIUM|95 MILLIGRAM|0
007064770000|MULTI|001|0
007064770000|PET STARCH FREE|NOT APPLICABLE|0
007064770000|GRAIN TYPE|FLOUR|0
003980010200|MULTI|001|0
003980010200|DEAL|NON-DEAL|0
003980010200|PRODUCT SIZE|1 COUNT|0
003980010200|BASE SIZE|1 COUNT|0
757582821517|HW APPLIANCES|001|0
757582821516|HW APPLIANCES|001|0
757582821517|PACKAGE GENERAL SHAPE|BOTTLE|0
757582821517|SYND FORM|CREAM|0
757582821517|FORM|CREAM|0
757582821517|TARGET SKIN CONDITION|DRY SKIN|0
003980010205|HW MEDICINE|NON-DEAL|0
003980010205|PRODUCT SIZE|1 COUNT|0
003980010205|BASE SIZE|1 COUNT|0
007064770054|SODIUM|95 MILLIGRAM|0
007064770054|HW SPORTS|001|0
007064770054|PET STARCH FREE|NOT APPLICABLE|0
007064770054|GRAIN TYPE|FLOUR|0
003980010312|HW DIAMETER|1 COUNT|0
003980010312|BASE SIZE|1 COUNT|0
Output file:

       UPC code  HW APPLIANCES  HW DIAMETER  HW MEDICINE  HW SPORTS
0    3980010205            NaN          NaN     NON-DEAL        NaN
1    3980010312            NaN      1 COUNT          NaN        NaN
2    7064770054            NaN          NaN          NaN        001
3  757582821516            001          NaN          NaN        NaN
4  757582821517            001          NaN          NaN        NaN
Existing code:

import pandas as pd

df = pd.read_csv('sample.txt', sep='|',
                 names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
                 engine='python')
df = df[df['chr_typ'].str.contains('HW ')]  # keep only the 'HW ' characteristics
df = df.sort_values('chr_typ')              # sort rows by characteristic type
df = (
    df.iloc[:, :-1]                         # drop the last column (chr_vl_typ)
    .pivot(index='upc_cd', columns='chr_typ')
    .droplevel(0, axis=1)                   # flatten the column MultiIndex
    .rename_axis('UPC code')
    .rename_axis(None, axis=1)
    .reset_index()
)
print(df)
df.to_csv('output.csv', sep=',', index=False, encoding='utf-8')
Please suggest modifications to the code to reduce the running time.
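For example, would a chunked read along these lines be the right direction, so that the 'HW ' filter runs during the read and only matching rows ever sit in memory? A sketch, not benchmarked on the full file (the chunk size of 1,000,000 rows is a guess):

import pandas as pd

# Stream the file in chunks with the default C engine and keep only
# the 'HW ' rows from each chunk.
reader = pd.read_csv(
    'sample.txt',
    sep='|',
    names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
    usecols=['upc_cd', 'chr_typ', 'chr_vl'],
    dtype=str,
    chunksize=1_000_000,
)
filtered = pd.concat(
    chunk[chunk['chr_typ'].str.contains('HW ', na=False)] for chunk in reader
)

# Pivot only the filtered subset, which should be far smaller than 20 GB.
df = (
    filtered.pivot(index='upc_cd', columns='chr_typ', values='chr_vl')
    .rename_axis('UPC code')
    .rename_axis(None, axis=1)
    .reset_index()
)
df.to_csv('output.csv', index=False, encoding='utf-8')

Or would switching the parser itself (e.g. engine='pyarrow', available in newer pandas versions) be a better fix?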