9

I have 18 csv files, each approximately 1.6 GB and each containing approximately 12 million rows. Each file represents one year's worth of data. I need to combine all of these files, extract data for certain geographies, and then analyse the time series. What is the best way to do this?

I have tried using pd.read_csv but I hit a memory limit. I have tried including a chunksize argument, but this gives me a TextFileReader object and I don't know how to combine these into a dataframe. I have also tried pd.concat, but this does not work either.
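
Roughly what I have been trying looks like this (a simplified sketch with made-up file names; the real paths and columns are different):

import pandas as pd

files = ["data_2001.csv", "data_2002.csv"]  # ...18 files in total, one per year

# this runs out of memory:
# df = pd.concat(pd.read_csv(f) for f in files)

# with a chunksize I get TextFileReader objects instead of dataframes:
readers = [pd.read_csv(f, chunksize=100000) for f in files]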

ChrisB
  • 103
  • 1
  • 1
  • 5
  • Does it need to be with pandas? Is the csv data format the same across all the files? If they are, you could just look into reading / writing the source / destination files line-by-line, avoiding the memory issue. – martyn Jun 07 '19 at 12:18
  • 1
    You can try using [dask](https://dask.org/), as it is better suited to manage such large files in memory. – AlCorreia Jun 07 '19 at 12:20
  • Possible duplicate of [Reading a huge .csv file](https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file) – PV8 Jun 07 '19 at 12:24
  • there are several discussions about this topic: https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file – PV8 Jun 07 '19 at 12:24
  • @martyn It doesn't need to be with pandas, but as a beginner I don't know what else I can use. – ChrisB Jun 07 '19 at 12:33
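
A minimal sketch of the dask approach suggested in the comments above (the file pattern, the `geography` column name and the example value are placeholders for your actual data):

import dask.dataframe as dd

# read all yearly files lazily as one logical dataframe
ddf = dd.read_csv("data_*.csv")

# filter to the geographies of interest, then materialise the (much smaller) result
subset = ddf[ddf["geography"].isin(["USA"])].compute()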

3 Answers

21

Here is an elegant way of using pandas to combine very large csv files. The technique is to load a fixed number of rows (defined as CHUNK_SIZE) into memory per iteration until the files are fully processed. These rows are appended to the output file in "append" mode.

import pandas as pd

CHUNK_SIZE = 50000
csv_file_list = ["file1.csv", "file2.csv", "file3.csv"]
output_file = "./result_merge/output.csv"

for csv_file_name in csv_file_list:
    chunk_container = pd.read_csv(csv_file_name, chunksize=CHUNK_SIZE)
    for chunk in chunk_container:
        # append each chunk of the current file to the output file
        chunk.to_csv(output_file, mode="a", index=False)

But if your files contain header rows, it makes sense to keep the header from the first file only and skip the header row of every subsequent file, since a repeated header in the output is unexpected. In that case the solution is the following:

import pandas as pd

CHUNK_SIZE = 50000
csv_file_list = ["file1.csv", "file2.csv", "file3.csv"]
output_file = "./result_merge/output.csv"

first_one = True
for csv_file_name in csv_file_list:

    if not first_one:
        # skip the header row (row 0) of every file after the first one,
        # and read it header-less so the first data row is not promoted to a header
        skip_row = [0]
        header_row = None
    else:
        skip_row = []
        header_row = 0

    chunk_container = pd.read_csv(csv_file_name, chunksize=CHUNK_SIZE,
                                  skiprows=skip_row, header=header_row)
    for chunk in chunk_container:
        # write the column header only once, with the very first chunk
        chunk.to_csv(output_file, mode="a", index=False, header=first_one)
        first_one = False
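
Since you only need certain geographies, you can also filter each chunk before appending it, so the merged file only ever contains the rows you care about. A sketch of the same loop with a filter added (the `Country` column and the `"USA"` value are placeholders for your actual geography column and codes):

import pandas as pd

CHUNK_SIZE = 50000
csv_file_list = ["file1.csv", "file2.csv", "file3.csv"]
output_file = "./result_merge/output.csv"

first_one = True
for csv_file_name in csv_file_list:
    for chunk in pd.read_csv(csv_file_name, chunksize=CHUNK_SIZE):
        # keep only the rows for the geographies you need
        subset = chunk[chunk["Country"] == "USA"]
        subset.to_csv(output_file, mode="a", index=False, header=first_one)
        first_one = False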
hafiz031
  • 2,236
  • 3
  • 26
  • 48
Nguyen Van Duc
  • 1,019
  • 10
  • 9
  • 2
    You should add header=False to to_csv(), otherwise every time you write a chunk a header will be written. In my case, my input data did not have a header, so read_csv() interpreted the first line as header and to_csv() inserted the first line when writing every chunk. If you need the first lines from the input files, then add header=None to read_csv(). – codingmonkey87 Jul 30 '20 at 08:48
3

The memory limit is hit because you are trying to load the whole csv into memory. An easy solution would be to read the files line by line (assuming they all have the same structure), check each line, and then write it to the target file:

filenames = ["file1.csv", "file2.csv", "file3.csv"]
sep = ";"

def check_data(data):
    # ... your tests
    return True # << True if data should be written into target file, else False

with open("/path/to/dir/result.csv", "a+") as targetfile:
    for filename in filenames:
        with open("/path/to/dir/"+filename, "r") as f:
            next(f) # << only if the first line contains headers
            for line in f:
                data = line.split(sep)
                if check_data(data):
                    targetfile.write(line)

Update: An example of the check_data method, following your comments:

def check_data(data):
    return data[n] == 'USA' # < where n is the column holding the country
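
If the separator character can also appear inside (quoted) fields, a plain line.split(sep) will mis-split those lines. In that case the csv module from the standard library handles the quoting for you; here is a sketch of the same loop using csv.reader and csv.writer (reusing the check_data function above):

import csv

filenames = ["file1.csv", "file2.csv", "file3.csv"]
sep = ";"

with open("/path/to/dir/result.csv", "a+", newline="") as targetfile:
    writer = csv.writer(targetfile, delimiter=sep)
    for filename in filenames:
        with open("/path/to/dir/" + filename, "r", newline="") as f:
            reader = csv.reader(f, delimiter=sep)
            next(reader)  # << only if the first line contains headers
            for row in reader:  # row is already a list of fields
                if check_data(row):
                    writer.writerow(row)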
olinox14
  • 6,177
  • 2
  • 22
  • 39
  • 1
    Note that this will fail/behave weirdly if your separator character also appears inside the fields. You might need more sophisticated parsing for the line data in that case. – Tom Dalton Jun 07 '19 at 12:37
  • So does this create a csv file of the data that i want, which I then re-import this and do my analysis from that? – ChrisB Jun 07 '19 at 13:00
  • No, this will read all of your csv files line by line, and write each line to the target file only if it passes the `check_data` method. (No memory was harmed while using this solution) – olinox14 Jun 07 '19 at 13:03
  • So if in the check_data function I want to only take rows with 'USA' in the 'Country' column for each of the 18 files, how would this be written? Sorry for the simple question. – ChrisB Jun 09 '19 at 09:38
1

The TextFileReader returned by pd.read_csv(..., chunksize=...) is an iterator: each iteration yields an ordinary DataFrame. You can then use pd.concat to concatenate the individual dataframes.
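
A minimal sketch of that pattern (note that concatenating all of the chunks still builds the full result in memory):

import pandas as pd

# each item yielded by the TextFileReader is a regular DataFrame
reader = pd.read_csv("file1.csv", chunksize=50000)
df = pd.concat(reader, ignore_index=True)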

Vishnu Dasu
  • 533
  • 5
  • 18