Split CSV file into multiple (non-same-sized) files, keeping the headers

Question

I have a large csv (test.csv) with the following header columns id; type; name

and the following values:

1; A; ASW23
2; C; SDF92
3; D; SDI22
4; D; ASD00
5; C; WPE03
6; D; PPO30
7; A; WER34
8; C; FHH88
9; C; FGE45
10; A; DFQ12
11; G; WWQ89
12; C; YDT63
13; D; QTT21

The file is not ordered and I want a CSV file split every time it finds type A, keeping the same header. For example:

test_1.csv

id; type; name
1; A; ASW23
2; C; SDF92
3; D; SDI22
4; D; ASD00
5; C; WPE03
6; D; PPO30

test_2.csv

id; type; name
7; A; WER34
8; C; FHH88
9; C; FGE45

test_3.csv

id; type; name
10; A; DFQ12
11; G; WWQ89
12; C; YDT63
13; D; QTT21

I am struggling to make a python script for that, but I am failing.

I was thinking of something like that https://stackoverflow.com/a/36446203/1971010 by changing the row limit (1000) with a sort of a type 'A' counter — taleporos, Jun 18 '19 at 02:27

Ajax1234 · Accepted Answer · 2019-06-18T03:19:35.660

2

You can use itertools.groupby:

import itertools, csv
data = list(csv.reader(open('test.csv'), delimiter=';'))[1:]
new_d = [[a, list(b)] for a, b in itertools.groupby(data, key=lambda x:x[1]==' A')]
new_groups = [new_d[i][-1]+new_d[i+1][-1] for i in range(0, len(new_d), 2)]
for i, a in enumerate(new_groups, 1):
  with open('test_{}.csv'.format(i), 'w') as f:
    write = csv.writer(f, delimiter=';')
    write.writerows([['id', 'type', 'name']]+a)

test_1.csv:

id;type;name
1; A; ASW23
2; C; SDF92
3; D; SDI22
4; D; ASD00
5; C; WPE03
6; D; PPO30

test_2.csv:

id;type;name
7; A; WER34
8; C; FHH88
9; C; FGE45

test_3.csv:

id;type;name
10; A; DFQ12
11; G; WWQ89
12; C; YDT63
13; D; QTT21

edited Jun 18 '19 at 03:19

answered Jun 18 '19 at 02:27

Ajax1234

69,937
8
61
102

thank you very much, but I am trying to understand it, not just use it. For example if i wanted to do exactly the same in the column name for the value 'ASW23', what should I change, except of the x[1]=='ASW23' ?? – taleporos Jun 18 '19 at 02:37
1

@taleporos That is correct. You just need to change the value of the string in the comparison i.e `x[1]==' ASW23'`. Please note the space in front of the string; that is because the delimiter in the `csv` `read` function is itself being split upon, not including the space following it. – Ajax1234 Jun 18 '19 at 02:41
btw i am getting an error on your code: File "csv2.py", line 4 new_groups = [[new_d[i][-1][0], *new_d[i+1][-1]] for i in range(0, len(new_d), 2)] ^ SyntaxError: invalid syntax. It shows the `*new_d[i+1][-1]]` – taleporos Jun 18 '19 at 02:46
1

@taleporos Ah, `*` is unpacking syntax available in Python3 only. Please see my recent edit, as I added a solution to make the code compatible for Python2/Python3. Regarding a value change, you indeed only have to change the string in the body of the `key` `lambda`. `range(0, 1, len(new_d))` iterates only once, with a step value equal to the length of the structure. – Ajax1234 Jun 18 '19 at 02:53
edit. When I run it with python3 i am getting an other error `File "csv2.py", line 6 with open(f'test_{i}.csv', 'w') as f: ^ SyntaxError: invalid syntax` pointing the .csv' part – taleporos Jun 18 '19 at 02:55
1

@taleporos Opps, that is an `f-string` which is also not compatible with Python versions < 3.7. I changed it to simple string formatting with `format`. – Ajax1234 Jun 18 '19 at 02:57
`Traceback (most recent call last): File "csv2.py", line 4, in new_groups = [new_d[i][-1]+new_d[i+1][-1] for i in range(0, len(new_d), 2)] IndexError: list index out of range` I am so confused now :P – taleporos Jun 18 '19 at 03:00
@taleporos Is the file you are testing now the same as the one with the original data in your post? – Ajax1234 Jun 18 '19 at 03:16
@ Ajax1234 yeah, but it has also the header and more lines – taleporos Jun 18 '19 at 03:18
1

@taleporos Ok. I added `[1:]` to my solution above to remove the header from the `data` read in from the original test file. – Ajax1234 Jun 18 '19 at 03:20
Perfect!! Thank you very much! – taleporos Jun 18 '19 at 03:23
@taleporos Glad to help! – Ajax1234 Jun 18 '19 at 03:26
one last question to understand the mechanism, even if i have more columns lets say 10 and i want to follow the same technique looking for the word `X` in column number 8 for example, what should I change? Only `x:x[7]=='X'`?? I am just trying to understand that line: `new_groups = [new_d[i][-1]+new_d[i+1][-1] for i in range(0, len(new_d), 2)]` – taleporos Jun 18 '19 at 03:31
1

@taleporos Right, you only have to change the string comparison to `x:x[7]==' X'`. `new_groups = [new_d[i][-1]+new_d[i+1][-1] for i in range(0, len(new_d), 2)]` uses indexing to combine the elements of `new_d` into groups of two, the list containing the matched string i.e `'A'`, and all the subsequent entries which do not contain `'A'`? – Ajax1234 Jun 18 '19 at 03:43

score 1 · Answer 2 · answered Jun 18 '19 at 03:12

Approach using pandas.

>>> df = pd.read_csv('test.csv', sep=';')
>>> df.columns = [col.strip() for col in df.columns]
>>> df['cutter'] = pd.np.where(df['type'].str.strip() == 'A', 1, 0).cumsum()
>>> df
    id type    name  cutter
0    1    A   ASW23       1
1    2    C   SDF92       1
2    3    D   SDI22       1
3    4    D   ASD00       1
4    5    C   WPE03       1
5    6    D   PPO30       1
6    7    A   WER34       2
7    8    C   FHH88       2
8    9    C   FGE45       2
9   10    A   DFQ12       3
10  11    G   WWQ89       3
11  12    C   YDT63       3
12  13    D   QTT21       3

>>> gb = df.groupby('cutter')
>>> for i, x in enumerate(gb.groups):
...     gb.get_group(x).to_csv(f'test_{i}.csv', index=False)
...

RESULT

test_1.csv

   id type    name  cutter
0   1    A   ASW23       1
1   2    C   SDF92       1
2   3    D   SDI22       1
3   4    D   ASD00       1
4   5    C   WPE03       1
5   6    D   PPO30       1

test_2.csv

   id type    name  cutter
0   7    A   WER34       2
1   8    C   FHH88       2
2   9    C   FGE45       2

test_3.csv

   id type    name  cutter
0  10    A   DFQ12       3
1  11    G   WWQ89       3
2  12    C   YDT63       3
3  13    D   QTT21       3

Split CSV file into multiple (non-same-sized) files, keeping the headers

2 Answers2