I have a 14-million-row CSV file with a date column (not the first column) that I want to filter and split the data by.

Currently, I am loading it into a pandas DataFrame to do this:

df = pd.read_csv(filepath, dtype=str) 

for date in df['dates'].unique():
    subset = df[df['dates'] == date]
    subset.to_csv(date + dest_path)

Is there a faster way to do this?

Filter out rows from CSV before loading to pandas dataframe gives an interesting solution, but unfortunately the column I want to split by is not the first column.

EDIT:

I simply need to split the CSV file into one file per date. The resulting CSV files are passed on to another team. I need all the columns, I do not want to change any data, and I do not need to do any groupby.

  • This is not 'splitting', you're trying to group-by date during the read operation, there's zero reason you can't simply do the group-by after the read. Your iterative append `subset.to_csv(date + dest_path)` inside a for-loop is lethal for performance. It'll be O(N²), and iterative-append will blow out memory anyway. Also, pd.read_csv reads datetimes natively, so don't do `pd.read_csv(... dtype=str)`; use [`parse_dates`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). The one-line answer to this question is **Don't do this** on a 1000-row dataset, let alone 14M... – smci Dec 09 '18 at 01:28
  • @smci Not O(n^2) but O(np), where n is the number of lines and p is the number of unique dates. And it certainly qualifies as splitting: send each subset of the dataset to a different file. Where do you see iterative append? –  Dec 09 '18 at 01:34
  • @Jean-ClaudeArbaut: Yes my mistake, you're right that at least OP isn't doing iterative row-by-row append, in which case it would have been O(n²) = O(np) with a first-order assumption that the number of unique dates was proportionate to the number of rows: p = kn, hence O(np) = O(kn²) = O(n²). But anyway keeping rows for dates we don't want will still blow out memory, better to filter them out at read-time (or even a crude grep/awk/perl preprocess pass on the input dataset). – smci Dec 09 '18 at 01:39
  • @smci The best thing to do highly depends on several factors: what's the purpose of splitting the file? What's the file size, and how does it compare to RAM size? What's the size of the different parts? I can't answer any of these questions (and admittedly the OP should clarify). At least my suggestion allows dealing with the worst case (the data don't fit or barely fit in memory), but it might not be the best. As an example, I used it on a file that I couldn't import directly in Stata (though the concatenated dataset - after importing the parts - did actually fit) –  Dec 09 '18 at 01:43
  • You need to tell us your memory limit (1Gb? 4Gb?), since you're guaranteed to blow out memory if you try to read and filter/store/process all 14M rows. Also, `pd.read_csv(... dtype=str)` is absolutely terrible for memory use; you should specify date for the date-column, and integer/float/categorical/Boolean elsewhere as appropriate. As to the filtering, if you only want a specific date/date-range, filter them out at read-time (or even a crude grep/awk/perl preprocess pass on the input dataset), e.g. `awk/egrep "(Jan|Apr|Jul|Oct)" ...` on the date column – smci Dec 09 '18 at 01:43
  • @Jean-ClaudeArbaut: I know, and I already asked for the missing specifics. Generally people don't have > 4Gb, and if we do any non-in-place operations, we can only use <2Gb at most. Depending on how many columns there are, `dtype=str` can cause a single row to occupy Kb, so 14M rows will take Gb. – smci Dec 09 '18 at 01:46
  • **OP: please show us the column names, sizes, dtypes: read in say 1000 rows, then show us the result of `df.info()`?** Also, use `pd.read_csv(dtype=..., usecols=...)` to specify which columns you can drop completely, and the most efficient dtype for the subset you need to keep. (If you only want the date column plus one other, then only specify them in `dtype`/`usecols`.) Avoid `dtype=str` like the plague - especially on raw, non-cleaned/duplicated strings. – smci Dec 09 '18 at 01:47
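
To make the read-time suggestions in the comments above concrete, here is a minimal sketch. The file name and the "some_flag" column/dtype are assumptions for illustration; "dates" is the column name from the question:

import pandas as pd

# Inspect a small sample first, as the commenters suggest, to see column names,
# dtypes and per-column memory use.
sample = pd.read_csv("input.csv", nrows=1000)
sample.info(memory_usage="deep")

# Then read only what is needed: let pandas parse the date column instead of
# forcing dtype=str everywhere, and choose compact dtypes for other columns.
df = pd.read_csv(
    "input.csv",
    parse_dates=["dates"],            # parse the date column natively
    dtype={"some_flag": "category"},  # hypothetical column and dtype choice
)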

1 Answer

The main problem is reading the whole dataset into memory. Typically, with a very large file, I have to process it line by line because it does not fit in memory. So I split the file first, and only then can I work with the parts (whether with Python/pandas, R or Stata, which all load the whole dataset into memory).

To split the file, I would write a CSV reader from scratch. This is not too difficult if you don't have to handle separators inside quoted strings (otherwise it's not much harder with a regex). It could probably be done with less code using the builtin csv module (I confess I have never used it).
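
For reference, a rough sketch of that csv-module variant might look like the following (untested; it handles separators inside quoted fields, though the writer may normalise quoting, and it takes the same command-line arguments as the script further below):

import csv
import sys

def split_csv(input_name, output_prefix, split_var):
    outfiles = {}  # maps each key value to a (file handle, csv writer) pair
    with open(input_name, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        ikey = header.index(split_var)  # position of the column to split on
        for row in reader:
            key = row[ikey]
            if key not in outfiles:
                handle = open("%s_%s.csv" % (output_prefix, key), "w", newline="")
                writer = csv.writer(handle)
                writer.writerow(header)
                outfiles[key] = (handle, writer)
            outfiles[key][1].writerow(row)
    for handle, _writer in outfiles.values():
        handle.close()

split_csv(sys.argv[1], sys.argv[2], sys.argv[3])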

The hand-rolled version below splits the file according to the values of a variable. It's also possible to adapt the code to split at a fixed number of lines, to filter rows, or to add or delete variables...

import sys

def main(arg):
    if len(arg) != 3:
        print("csvsplit infile outfile variable")
        return

    input_name = arg[0]
    output_name = arg[1]
    split_var = arg[2]

    sep = ","
    outfiles = {}

    with open(input_name) as f:
        # Read the header line and find the position of the column to split on.
        var = f.readline().rstrip("\r\n").split(sep)
        ivar = {name: i for i, name in enumerate(var)}
        ikey = ivar[split_var]

        # Stream the rest of the file line by line; open one output file per
        # key value and write the header to it the first time that value is seen.
        for line in f:
            v = line.rstrip("\r\n").split(sep)
            key = v[ikey]
            if key not in outfiles:
                outfiles[key] = open("%s_%s.csv" % (output_name, key), "wt")
                outfiles[key].write(sep.join(var) + "\n")
            outfiles[key].write(line)

    for key, outfile in outfiles.items():
        outfile.close()

main(sys.argv[1:])
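
For example, assuming the script is saved as csvsplit.py and the date column is named dates as in the question (file names here are illustrative):

python csvsplit.py input.csv daily dates

If the date values look like 2018-12-09, this writes files such as daily_2018-12-09.csv, each starting with the original header row.
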
  • No. The question is misconceived: there's zero reason to read-then-split; better to only keep rows for a specific date/range. pandas has a native date parser, so there's no reason to use dtype=str. – smci Dec 09 '18 at 01:34
  • @smci There can be many reasons to read then split (it depends on what the OP wants to do with his files, which neither you nor I know). However, there are also reasons not to read the whole file even if it fits in memory (in some extreme cases near RAM size it can slow down the reading). And there is no reason to tell Python to convert to dates if the goal is to send the data back to files. Your two comments (to the question and this one) are pure nonsense. –  Dec 09 '18 at 01:37