
I am trying to merge 30K CSV files with the same headers in a directory into a single file. With the code below I can merge them, but the header is repeated wherever a new file is appended, and I do not want the header repeated.

import pandas as pd
f = r'path/*.csv'
combined_csv = pd.concat([ pd.read_csv(f) for f in filenames ])

combined_csv.to_csv('output.csv', index=False, header=True)

Error:

Traceback (most recent call last):
  File "merg_csv.py", line 4, in <module>
    combined_csv = pd.concat([ pd.read_csv(f) for f in filenames ])
NameError: name 'filenames' is not defined

Edit: The solution provided in the answer below works, but after some time all the memory is used up and the program freezes, along with my screen.

import glob
import pandas as pd 

all_data = pd.DataFrame()

dfs = []

for f in glob.glob("*.csv"):
    df = pd.read_csv(f, error_bad_lines=False)

    dfs.append(df)

all_data = pd.concat(dfs, ignore_index=True)

all_data.to_csv("00_final.csv", index=None, header=True)

How can I merge and write to the output file at the same time, so that I do not run into the low-memory problem? The total size of the inputs is about 1.5 GB and there are more than 60K files.
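Something like the sketch below is roughly what I have in mind (untested, and the output file name is just an example): read one file at a time and append it to the output, writing the header only for the first file, so the whole data set never has to sit in memory at once.

import glob
import pandas as pd

output_file = "00_final.csv"  # example output name
write_header = True           # write the header only for the first file

for f in glob.glob("*.csv"):
    df = pd.read_csv(f, error_bad_lines=False)
    # append this file to the output; the header goes in only the first time
    df.to_csv(output_file, mode="a", index=False, header=write_header)
    write_header = False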

Thanks in advance!

Sitz Blogz
  • What is the problem you're running into? – pvg May 24 '17 at 21:10
  • @pvg Updated the question with the changed code and the error – Sitz Blogz May 24 '17 at 21:18
  • That doesn't really have anything to do with pandas or headers. Seems like you want to glob that pattern and then iterate over the filenames it generates. You should look up how to do that since the way you're trying it is very much not it. – pvg May 24 '17 at 21:21
  • See https://stackoverflow.com/questions/3964681/find-all-files-in-a-directory-with-extension-txt-in-python and many other similar answers. – pvg May 24 '17 at 21:22

1 Answer


Your issue is that `filenames` is never defined; you need to build the list of file paths first, for example with `glob`.

Try this:

from glob import glob
import pandas as pd

all_df = []
for f in glob('path/*.csv'):
    temp_df = pd.read_csv(f)
    all_df.append(temp_df)
final_df = pd.concat(all_df)
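You can then write the combined frame to disk in one go (the output name here is just an example):

final_df.to_csv('merged.csv', index=False)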
Spandan Brahmbhatt
  • When concat is used, all the headers are also taken. I need the header only once in the csv after merging – Sitz Blogz May 24 '17 at 21:45
  • `pd.concat` would not give you multiple headers. My understanding is that each file has its own header. Correct me if I am wrong. If the files have no header, let me know so I can modify the code accordingly. – Spandan Brahmbhatt May 24 '17 at 21:47
  • Each file has the same header, and I want to merge all those files into one big file for future processing. – Sitz Blogz May 24 '17 at 21:48
  • This should work. You will have only one row (the top row) as the header; all the remaining rows will be your data. – Spandan Brahmbhatt May 24 '17 at 21:50
  • Looks like I have some bad rows and that gives me an error when merging. I used `error_bad_line=False` but that also gives an error. – Sitz Blogz May 24 '17 at 22:13
  • The number of files is 76,600. They have n columns, but due to noise in the data some have n+1 or n+2 columns, and when that happens the merging stops with an error. Usually in pandas we can use `error_bad_lines=False`, but in this case I cannot use that, and hence I am not able to get the complete merge done. – Sitz Blogz May 25 '17 at 19:44