
I am trying to merge multiple CSV files that share the same header into one output file, like this:

import glob

interesting_files = glob.glob("/home/tcs/PYTHONMAP/test1/*.csv") 

header_saved = False
with open('/home/tcs/PYTHONMAP/output.csv','wb') as fout:
    for filename in interesting_files:
        with open(filename) as fin:
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                fout.write(line)

and getting this error:

File "/home/tcs/.config/spyder-py3/temp.py", line 11, in <module>
    fout.write(header)

TypeError: a bytes-like object is required, not 'str'

I don't know much about Python, so please help. I also want to know how to split one big CSV into multiple CSVs that each keep the same header (a sketch for this follows the comments below).

Shubham Chauhan
  • Take a look at Pandas. https://pandas.pydata.org and https://stackoverflow.com/questions/2512386/how-to-merge-200-csv-files-in-python – Aki003 Jul 20 '17 at 10:51
  • You are opening the `fout` file as binary by specifying `'wb'`. I think it should work if you specify `'w'` instead, for writing strings. You might also want to take a look at [the `csv` module](https://docs.python.org/3/library/csv.html). –  Jul 20 '17 at 10:52
  • Thanks a lot, I did it with a system command as well. Can you describe how to split a large CSV file into smaller files so that the header is present in each split? Thanks in advance – Shubham Chauhan Jul 20 '17 at 11:10
  • `sed 2d *.csv > /a2.csv`: this command is able to concatenate the files without Python. – Shubham Chauhan Jul 20 '17 at 11:11
  • @ShubhamChauhan that can't work; it will repeat the header. – wyx May 31 '18 at 10:08
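
For the follow-up question about splitting one big CSV into several smaller files that each repeat the header, here is a minimal sketch. The input path, the part_N output naming, and the 10,000 rows-per-file figure are assumptions; adjust them to your data:

import csv

rows_per_file = 10000  # assumed chunk size

with open('/home/tcs/PYTHONMAP/output.csv', newline='') as fin:
    reader = csv.reader(fin)
    header = next(reader)                  # remember the header row
    part, count, fout, writer = 0, 0, None, None
    for row in reader:
        if writer is None or count >= rows_per_file:
            if fout:
                fout.close()
            part += 1
            fout = open('/home/tcs/PYTHONMAP/part_%d.csv' % part, 'w', newline='')
            writer = csv.writer(fout)
            writer.writerow(header)        # every split file starts with the header
            count = 0
        writer.writerow(row)
        count += 1
    if fout:
        fout.close()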

2 Answers


Using pandas:

import glob
import pandas as pd

interesting_files = glob.glob("/home/tcs/PYTHONMAP/test1/*.csv")
df = pd.concat((pd.read_csv(f, header=0) for f in interesting_files))
df.to_csv("output.csv", index=False)  # index=False keeps the row index out of the output

To get rid of duplicate rows as well:

import glob
import pandas as pd

interesting_files = glob.glob("/home/tcs/PYTHONMAP/test1/*.csv")
df = pd.concat((pd.read_csv(f, header=0) for f in interesting_files))
df_deduplicated = df.drop_duplicates()
df_deduplicated.to_csv("output.csv", index=False)

Note that duplicates are not removed while the dataframe is being built, but afterwards: first a single dataframe is created by concatenating all of the files, then it is de-duplicated, and finally the result is saved to CSV.

RHSmith159
  • is there any way I can also remove duplicates simultaneously? – Rahul Jul 20 '17 at 11:10
  • @Rahul do you mean duplicate rows? I've updated my answer to include a way to remove duplicate rows, hope this helps! :) – RHSmith159 Jul 20 '17 at 11:47
  • how can this be done though if the size of your data won't fit into memory? with this approach, `df` can become greater in size than the machine's RAM is capable of holding. – alex Aug 27 '19 at 21:33
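
Regarding the last comment: if the combined data is bigger than RAM, one option is to stream each file in chunks with pandas and append to the output rather than building one big dataframe. A minimal sketch, assuming the same paths as above and an arbitrary chunk size (note that with streaming, drop_duplicates can no longer be applied across the whole data set):

import glob
import pandas as pd

interesting_files = glob.glob("/home/tcs/PYTHONMAP/test1/*.csv")

first = True
for f in interesting_files:
    # read 10,000 rows at a time so only one chunk is in memory
    for chunk in pd.read_csv(f, chunksize=10000):
        chunk.to_csv("output.csv", mode="w" if first else "a",
                     header=first, index=False)
        first = False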
Using the csv module:

import glob
import csv

interesting_files = glob.glob("/home/tcs/PYTHONMAP/test1/*.csv")

header_saved = False
with open('/home/tcs/PYTHONMAP/output.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    for filename in interesting_files:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            header = next(reader)          # first row of each file
            if not header_saved:
                writer.writerow(header)    # write the header only once
                header_saved = True
            writer.writerows(reader)       # copy the remaining data rows
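
Because this copies rows file by file instead of loading everything into one dataframe, memory use stays roughly constant however large the combined data is, which also answers the RAM concern raised under the other answer. Opening the files with newline='' follows the csv module's documented recommendation and avoids blank lines appearing between rows on Windows.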
Rahul