
Assume I'm dealing with a very large CSV file, so I can only read the data into memory chunk by chunk. The expected flow of events is as follows:

1) Read a chunk (e.g. 10 rows) of data from the CSV using pandas.

2) Reverse the order of the data.

3) Copy each row to the new CSV file in reverse, so each chunk (10 rows) is written to the CSV in reversed order.

In the end, the CSV file should be in reversed order, and this should be done without loading the entire file into memory, on Windows.

I am doing time-series forecasting, so I need the data ordered from oldest to latest (first row = oldest entry). I can't load the entire file into memory, so I'm looking for a way to do it one chunk at a time, if possible.

The dataset I tried this on is train.csv from the Rossmann dataset on Kaggle. You can get it from this GitHub repo.

My attempt does not copy the rows into the new csv file properly.

Shown below is my code:

import pandas as pd
import csv

def reverse():
    fields = ["Store", "DayOfWeek", "Date", "Sales", "Customers", "Open", "Promo",
              "StateHoliday", "SchoolHoliday"]
    with open('processed_train.csv', mode='a') as stock_file:
        writer = csv.writer(stock_file, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        writer.writerow(fields)

    for chunk in pd.read_csv("train.csv", chunksize=10):
        store_data = chunk.reindex(index=chunk.index[::-1])
        append_data_csv(store_data)

def append_data_csv(store_data):
    with open('processed_train.csv', mode='a') as store_file:
        writer = csv.writer(store_file, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        for index, row in store_data.iterrows():
            print(row)
            writer.writerow([row['Store'], row['DayOfWeek'], row['Date'], row['Sales'],
                             row['Customers'], row['Open'], row['Promo'],
                             row['StateHoliday'], row['SchoolHoliday']])

reverse()

Thank you in advance.

Sai Kumar
Suleka_28
  • You want the entire data to be time-sorted, but you're sorting & writing 10 rows at a time FROM THE BEGINNING. That's why you aren't getting results. Why don't you read & reverse train.csv from the end? See https://stackoverflow.com/a/10933932/2895956 – Sujay Kumar Oct 29 '18 at 07:03
  • I tried the code given in that post earlier, but it did not give a proper result for me. – Suleka_28 Oct 29 '18 at 07:21
  • Can you elaborate? What exactly are you getting? And, if possible, upload the file to Dropbox; the process to download that file is too long (a person has to log in, register, then fill out information...) just to get it. – Sujay Kumar Oct 29 '18 at 07:29
  • Note: not a perfect solution, but you can simply reverse the file from the command line with tail -r train.csv > reverse.csv, then manually remove the last line and add it to the first line of reverse.csv. – Sujay Kumar Oct 29 '18 at 09:20

5 Answers


Using bash, you can tail the whole file except the first line, then reverse it and store it with:

tail -n +2 train.csv  | tac > train_rev.csv

If you want to keep the header in the reversed file, write it first and then append the reversed content:

head -1 train.csv > train_rev.csv; tail -n +2 train.csv  | tac >> train_rev.csv
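To see the recipe end to end, here is the same pair of commands exercised on a tiny stand-in file (`sample.csv` and the three rows below are made up just for the demo; substitute your real train.csv):

```shell
# tiny stand-in file, newest row first (mirrors the train.csv layout)
printf 'Date,Sales\n2015-01-03,3\n2015-01-02,2\n2015-01-01,1\n' > sample.csv

# same recipe as above: keep the header, append the data rows reversed
head -1 sample.csv > sample_rev.csv
tail -n +2 sample.csv | tac >> sample_rev.csv

cat sample_rev.csv
# Date,Sales
# 2015-01-01,1
# 2015-01-02,2
# 2015-01-03,3
```

Note that `tac` is a GNU coreutils tool; on Windows it is available through Git Bash or WSL.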
gustavovelascoh
  • Unless you have a hard requirement to also run this on a different OS, this is probably going to be the fastest/best answer around. – Julian Nov 20 '18 at 21:24

If you have sufficient hard disk space, you can read in chunks, reverse and store. Then pick up the stored chunks in reverse order and write to a new csv file.

Below is an example with Pandas, which also uses pickle (for performance) and gzip (for storage efficiency).

import pandas as pd, numpy as np

# create a dataframe for demonstration purposes
df = pd.DataFrame(np.arange(5*9).reshape((-1, 5)))
df.to_csv('file.csv', index=False)

# number of rows we want to chunk by
n = 3

# iterate chunks, output to pickle files
for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=n)):
    chunk.iloc[::-1].to_pickle(f'file_pkl_{idx:03}.pkl.gzip', compression='gzip')

# open output file in append mode and write chunks in reverse order
# idx holds the index of the last pickle file written
with open('out.csv', 'a') as fout:
    for i in range(idx, -1, -1):
        chunk_pkl = pd.read_pickle(f'file_pkl_{i:03}.pkl.gzip', compression='gzip')
        chunk_pkl.to_csv(fout, index=False, header=False if i!=idx else True)

# read new file to check results
df_new = pd.read_csv('out.csv')

print(df_new)

    0   1   2   3   4
0  40  41  42  43  44
1  35  36  37  38  39
2  30  31  32  33  34
3  25  26  27  28  29
4  20  21  22  23  24
5  15  16  17  18  19
6  10  11  12  13  14
7   5   6   7   8   9
8   0   1   2   3   4
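One housekeeping detail: the intermediate pickle files are left on disk after the reversal, so once out.csv is written they can be removed (the glob pattern below matches the file names used in the example above):

```python
import glob
import os

# remove the intermediate pickle chunks created during the reversal
for tmp in glob.glob('file_pkl_*.pkl.gzip'):
    os.remove(tmp)
```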
jpp
  • Actually, each file chunk should be reversed, and then each reversed chunk of rows should be prepended rather than appended to the csv file. That way the entire file is reversed at the end, chunk by chunk, without loading the entire file into memory. – Sai Kumar Nov 16 '18 at 23:56
  • @SaiKumar, Got it, have updated with a solution which performs a complete reversal. – jpp Nov 18 '18 at 18:40

I would not recommend using pandas for parsing or streaming files here, as you would only introduce additional overhead. The best way is to read the file from the bottom up. A big part of this code actually comes from here, where it takes a file and returns its lines in reverse order as a generator, which I believe is what you want.

I just tested it with your train.csv from the provided link and wrote the results to a new file.

import os

def reverse_readline(filename, buf_size=8192):
    """a generator that returns the lines of a file in reverse order"""
    with open(filename) as fh:
        segment = None
        offset = 0
        fh.seek(0, os.SEEK_END)
        file_size = remaining_size = fh.tell()
        while remaining_size > 0:
            offset = min(file_size, offset + buf_size)
            fh.seek(file_size - offset)
            buffer = fh.read(min(remaining_size, buf_size))
            remaining_size -= buf_size
            lines = buffer.split('\n')
            # the first line of the buffer is probably not a complete line so
            # we'll save it and append it to the last line of the next buffer
            # we read
            if segment is not None:
                # if the previous chunk starts right from the beginning of line
                # do not concat the segment onto the last line of the new chunk;
                # instead, yield the segment first 
                if buffer[-1] != '\n':
                    lines[-1] += segment
                else:
                    yield segment
            segment = lines[0]
            for index in range(len(lines) - 1, 0, -1):
                if lines[index]:
                    yield lines[index]
        # Don't yield None if the file was empty
        if segment is not None:
            yield segment

reverse_gen = reverse_readline('train.csv')

with open('rev_train.csv','w') as f:
    for row in reverse_gen:
        f.write('{}\n'.format(row))

It basically reads the file backwards in buffered blocks, splitting on newlines, and yields the lines from bottom to top. A pretty interesting way of doing it.

BernardL
  • This actually will result in overlapping and missing chunks of text if UTF-8 is used. You can't just seek in a variable-width encoded file! Also, if the CSV file has *embedded* newlines in cell values, reversing can't just split on newlines arbitrarily. – Martijn Pieters Nov 18 '18 at 14:24
  • Next, an RFC-compliant CSV file will use `\r\n` line separators; your code assumes that `\n` is always used and that there are no embedded newlines in the cell values. – Martijn Pieters Nov 19 '18 at 08:57

This does exactly what you are requesting, but without Pandas. It reads intest.csv line by line (as opposed to reading the whole file into RAM). It does most of the processing on the file system, using a series of chunk files that are aggregated at the end into the outtest.csv file. By changing maxLines, you can tune the number of chunk files produced versus the RAM consumed (higher values consume more RAM but produce fewer chunk files). If you want to keep the CSV header on the first line, set keepHeader to True; if it is False, the entire file is reversed, including the first line.

For kicks, I ran this on an old Raspberry Pi using a 128GB flash drive and a 6MB CSV test file, and I thought something had gone wrong because it returned almost immediately; so it's fast even on slow hardware. It imports only one standard Python library function (remove), so it's very portable. One advantage of this code is that it does not reposition any file pointers. One limitation is that it will not work on CSV files that have newlines in the data; for that use case, pandas would be the better way to read the chunks.

from os import remove

def writechunk(fileCounter, reverseString):
    outFile = 'tmpfile' + str(fileCounter) + '.csv'
    with open(outFile, 'w') as outfp:
        outfp.write(reverseString)
    return

def main():
    inFile = 'intest.csv'
    outFile = 'outtest.csv'
    # This is our chunk expressed in lines
    maxLines = 10
    # Is there a header line we want to keep at the top of the output file?
    keepHeader = True

    fileCounter = 0
    lineCounter = 0
    with open(inFile) as infp:
        reverseString = ''
        line = infp.readline()
        if (line and keepHeader):
            headerLine = line
            line = infp.readline()
        while (line):
            lineCounter += 1
            reverseString = line + reverseString
            if (lineCounter == maxLines):
                fileCounter += 1
                lineCounter = 0
                writechunk(fileCounter, reverseString)
                reverseString = ''
            line = infp.readline()
    # Write any leftovers to a chunk file
    if (lineCounter != 0):
        fileCounter += 1
        writechunk(fileCounter,reverseString)
    # Read the chunk files backwards and append each to the outFile
    with open(outFile, 'w') as outfp:
        if (keepHeader):
            outfp.write(headerLine)
        while (fileCounter > 0):
            chunkFile = 'tmpfile' + str(fileCounter) + '.csv'
            with open(chunkFile, 'r') as infp:
                outfp.write(infp.read())
            remove(chunkFile)
            fileCounter -= 1

if __name__ == '__main__':
    main()
Mark Warburton
  • This does work, but assumes that there are no newlines in the CSV cell values. – Martijn Pieters Nov 18 '18 at 14:32
  • Thanks, @MartijnPieters. You are correct; however, this was the use case expressed in the original question. I will update the answer to point out this limitation. This was designed to have the smallest footprint possible, but we could potentially use pandas' built-in chunk function to read bits of the file, which would overcome that limitation at the expense of efficiency (and a modest increase in memory requirements). – Mark Warburton Nov 18 '18 at 23:00
  • It absolutely is in the use case expressed, which asks to be able to handle *any* giant CSV file. – Martijn Pieters Nov 19 '18 at 08:58

You have repeated code blocks and you are not taking advantage of pandas at all.

What @SujayKumar pointed out is correct; I would read that more closely.

The file isn't big at all. I use OHLCV tick data that runs into GBs without issues. If you use pandas.read_csv() you don't have to do chunked transfer. Sure, it will take time, but it will work fine, unless you are going into terabytes; I have not tested that.

When you read_csv() you don't specify any index. If you did, you could call sort_index(), with or without ascending=False depending on the order you want.

Pandas can write CSV too; please use that instead. I am pasting some example code for you to put together.

df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date", usecols=["Date", "Adj Close"], na_values=["nan"])

Sorting a Series:

s = pd.Series(list('abcde'), index=[0,3,2,5,4])
s.sort_index()

Note: If you stick to Pandas and its functions, you will be running already-optimized code which doesn't require entire files to be loaded into memory. It's so easy that it's almost like cheating :)
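Putting those pieces together, a minimal sketch of the read → sort_index → write approach (the three-row frame here is generated inline for illustration, with column names borrowed from the Rossmann train.csv):

```python
import pandas as pd

# build a small frame standing in for train.csv, newest row first
df = pd.DataFrame({
    'Date': ['2015-01-03', '2015-01-02', '2015-01-01'],
    'Sales': [3, 2, 1],
})
df.to_csv('train.csv', index=False)

# read with Date as the index, then sort oldest-to-newest
df_temp = pd.read_csv('train.csv', parse_dates=True, index_col='Date')
df_sorted = df_temp.sort_index()  # ascending by default: oldest first
df_sorted.to_csv('processed_train.csv')

print(df_sorted['Sales'].tolist())  # [1, 2, 3]
```

Keep in mind this reads the whole file at once, which is the trade-off this answer argues for; if sorting by date rather than simply reversing row order is acceptable, it is by far the least code.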

Abhishek Dujari