Split large csv file into several based on condition

Question

I have one big csv file, which I would like to divide into several files using a condition.

This is the excerpt of the file:

Well    Zones   Top          Bottom
E18       A     4273714.58  412435.8397
E18       B     99966532.9  96167746.15
E18       C     48737298.05 9038801.616
E18       D     42762897.31 23612634.16
E22       A     706299.7065 0
E22       B     17639585.97 8763276.956
E22       C     7535182.443 756105.4564
E22       D     7858392.171 4359022.937
E22       A     202221.3273 0
E23       B     1876003.29  100344.7098
E23       C     39002827.89 32876455.8
E23       D     10856279.73 944615.8366

What I need to do is to create several files with, each having only one type of the "Zones" as following:

one csv file:

Well    Zones   Top         Bottom
E18      A      4273714.58  412435.8397
E22      A      706299.7065 0
E22      A      202221.3273 0

another csv file:

Well    Zones   Top         Bottom
E18     B       99966532.9  96167746.15
E22     B       17639585.97 8763276.956
E23     B       1876003.29  100344.7098

Surfing through the web I have only found splitting large files based on the number of rows/lines into different chunks... such as the code below:

  for i,chunk in enumerate(pd.read_csv(file, chunksize=3)):

Since you're using pandas, have you tried [`DataFrame.groupby()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) — jedwards, Jan 20 '19 at 15:35

score 2 · Answer 1 · answered Jan 20 '19 at 19:15

With Miller (http://johnkerl.org/miller/doc), a great command line utility, is very easy

mlr --tsv --from input.csv put -q 'tee > "Zone_".$Zones.".csv", $*'

Starting from (tab separated)

Well    Zones   Top     Bottom
E18     A       4273714.58      412435.8397
E18     B       99966532.9      96167746.15
E18     C       48737298.05     9038801.616
E18     D       42762897.31     23612634.16
E22     A       706299.7065     0
E22     B       17639585.97     8763276.956
E22     C       7535182.443     756105.4564
E22     D       7858392.171     4359022.937
E22     A       202221.3273     0
E23     B       1876003.29      100344.7098
E23     C       39002827.89     32876455.8
E23     D       10856279.73     944615.8366

You have

$ cat Zone_A.csv
Well    Zones   Top Bottom
E18 A   4273714.58  412435.8397
E22 A   706299.7065 0
E22 A   202221.3273 0

$ cat Zone_B.csv
Well    Zones   Top Bottom
E18 B   99966532.9  96167746.15
E22 B   17639585.97 8763276.956
E23 B   1876003.29  100344.7098

$ cat Zone_C.csv
Well    Zones   Top Bottom
E18 C   48737298.05 9038801.616
E22 C   7535182.443 756105.4564
E23 C   39002827.89 32876455.8

$ cat Zone_D.csv
Well    Zones   Top Bottom
E18 D   42762897.31 23612634.16
E22 D   7858392.171 4359022.937
E23 D   10856279.73 944615.8366

score 1 · Accepted Answer · answered Jan 20 '19 at 15:38

1

If you want to do it using pandas, then thread Select rows from a DataFrame based on values in a column in pandas should be helpful for you, especially unutbu answer, which explain usage of .loc method of pandas.DataFrame.

answered Jan 20 '19 at 15:38

Daweo

31,313
3
12
25

Thanks for the link, **unubtu's'** answer is really helpful – akkab Jan 21 '19 at 04:35

score 0 · Answer 3 · answered Jan 20 '19 at 15:45

You can use a collections.defaultdict to group the rows by Zones:

from collections import defaultdict
import csv
result = defaultdict(list)
_header, *data = csv.reader(open('filename.csv'))
for a, b, *c in data:
  result[b].append([a, b, *c])

for a, b in result.items():
   with open(f'zone_{a}.csv', 'w') as f:
     write = csv.writer(f)
     write.writerows([_header, *b])

Split large csv file into several based on condition

3 Answers3