
I have thousands of CSV files named as follows: file_x_x.csv, where x is a number between 1 and 10000, all in the same folder. Each file includes a header and one row of data:

file_1_1.csv

Name Surname Age Address
Michael O'Donnel 22 George St.

file_2_2.csv

Name Surname Age Address
Mary Jane 34 Camden St.

and so on.

I am looking to create one single file containing all of these rows:

final_file.csv

Name Surname Age Address
Michael O'Donnel 22 George St.
Mary Jane 34 Camden St.

...

My approach:

import pandas as pd
import glob

path = # add path
all_files = glob.glob(path + ".csv") # look for all the csv files in that folder. Probably this is not the right code for looking at them

file_list = []

for filename in all_files:
    df = pd.read_csv(filename)
    file_list.append(df)  # collect each one-row DataFrame

I do not know how to create one single file at the end. Can you have a look at the code above and tell me how to get the desired output, and whether I missed something?

LdM
  • What you described isn't actually a CSV file. CSV means "Comma Separated Values", so the values should be separated by `,`, not a space, and pandas won't read them correctly. – Thomas Q Oct 03 '21 at 23:29
  • @ThomasQ - there are several conventions for CSV - commas, tabs, vertical bars, spaces, semicolons - and pandas lets you set the separator. There may be issues with spaces inside a cell that should be escaped, but generally it works. – tdelaney Oct 03 '21 at 23:32
  • @tdelaney I know there are many conventions, but even using `sep=" " `, pandas wouldn't have loaded the csv in the example correctly, since the addresses are `George St.` and `Camden St.` – Thomas Q Oct 03 '21 at 23:42
  • @ThomasQ - Right. CSV is a term for several similar formats, and these examples aren't conformant with any of them. Best not to try to parse them and just copy the lines if you can. – tdelaney Oct 03 '21 at 23:46
  • Possibly you can use the glob library and a for loop to find all the files with the `.csv` extension, read each one, and append it to a list of DataFrames. If you have a list of DataFrames generated from the for loop, use `pd.concat([df1, df2])` and save the result outside the loop, as sketched below. – Palash Mondal Apr 27 '23 at 10:11
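
A minimal sketch of what that comment describes: glob the folder, read each file into a DataFrame inside the loop, and call pd.concat once outside it. The path below is a placeholder, and the plain pd.read_csv call assumes the files really are comma-separated; if they are space-separated you would need a sep argument, and the comments above note that multi-word fields such as George St. will not split cleanly on spaces.

import glob
import pandas as pd

path = "/my_path/"  # placeholder: folder holding the file_x_x.csv files

frames = []
for filename in glob.glob(path + "*.csv"):  # every .csv file in the folder
    frames.append(pd.read_csv(filename))    # assumes comma-separated fields

combined = pd.concat(frames, ignore_index=True)  # one DataFrame with a fresh 0..N-1 index
combined.to_csv("final_file.csv", index=False)   # write the single combined file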

2 Answers


You don't need to do anything complicated here. You know the header line, and you know that the final file should contain that header once, followed by the data row from every file. Just open each file, write the header only from the first one, skip it in the rest, and copy the data row across. This is far more efficient than holding a bunch of DataFrames in memory.

import glob

path = "/my_path/"  # folder containing the CSV files (note the trailing slash)

with open("final_file.csv", "w") as outfile:
    for count, filename in enumerate(glob.glob(path + "*.csv")):
        with open(filename) as infile:
            header = next(infile)        # the first line of every file is the header
            if count == 0:
                outfile.write(header)    # keep the header from the first file only
            line = next(infile)          # the single data row
            if not line.endswith("\n"):  # make sure the row ends with a newline
                line = line + "\n"
            outfile.write(line)
tdelaney
  • is there a way to not manually write/select the header, but just keep the one from the first file? There might be more columns in the future. Thanks – LdM Oct 03 '21 at 23:46

I'd suggest using pd.concat to combine the DataFrames into one large DataFrame, which you can then save to a different file if you wish.

Before you concatenate the DataFrames, you may have to modify the call to pd.read_csv to ensure that the data is being processed correctly. If the example data in your question matches the contents of the CSV files verbatim, then the code snippet would look something like this:

import pandas as pd
import glob

path = "/my_path" # set this to the folder containing CSVs
names = glob.glob(path + "*.csv") # get names of all CSV files under path

# If your CSV files use commas to split fields, then the sep 
# argument can be ommitted or set to ","
file_list = pd.concat([pd.read_csv(filename, sep=" ") for filename in names])

#save the DataFrame to a file
file_list.to_csv("combined_data.csv")

Note that each row in the combined DataFrame will still be indexed by its row number in its source file, creating duplicate row indices. To change that, call pd.DataFrame.reset_index():

file_list = file_list.reset_index(drop=True)
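
Alternatively, a small variant of the snippet above: pd.concat itself accepts an ignore_index argument, which assigns a fresh 0-to-N index while concatenating and makes the separate reset_index call unnecessary.

file_list = pd.concat(
    [pd.read_csv(filename, sep=" ") for filename in names],
    ignore_index=True,  # build a fresh 0..N-1 index instead of keeping each file's row index
)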
dwrodri
  • You'll need `sep=" "` on the `read_csv`. – tdelaney Oct 03 '21 at 23:36
  • @tdelaney I originally assumed that the example data in OP's question was just a printed DataFrame, rather than lines from the actual CSV. I'll clarify that further in my response. Thank you for pointing that out. – dwrodri Oct 03 '21 at 23:39
  • I received the message ValueError: No objects to concatenate from `names = glob.glob(path + ".csv") # get names of all CSV files under path`. I guess something is missing here (for example something to select all the csv files, regardless of the name) – LdM Oct 03 '21 at 23:47
  • @LdM apologies, it should be `path + "*.csv"` instead of `path + ".csv"`. I'll edit my response – dwrodri Oct 04 '21 at 00:25