
I have a CSV like this:

sku, col1, col2, test, col3, col4, col5, col6
123,456,99,A,,,,
234,786,99,,,,,
345,678,99,A,,,,
123,,,A,18-123,9999,,
234,,,,18-786,9999,,
345,,,A,12-678,9999,,
123,,,,,,18-123,9999
234,,,A,,,18-786,9999
345,,,,,,12-678,9999


It's the result of combining 3 CSVs (they have different headers, and the same sku can appear in the 3 files with different columns filled in).

My goal is to "group by" the sku column in order to get something like this:

(Expected output: one row per sku, with the non-empty values from the three source rows merged together.)

I'm pretty new to Python, so please excuse me if this is a trivial question, but I'm lost.

Also, I don't know my column names in advance, except for the "sku" column, which is the "primary" key.

Thanks

Mitchum

5 Answers


Using pandas groupby and first(), which keeps the first non-null value in each column per group:

df.groupby(['sku'],as_index=False).first()
    sku col1    col2    test    col3    col4    col5    col6
0   123 456.0   99.0    A   18-123  9999.0  18-123  9999.0
1   234 786.0   99.0    A   18-786  9999.0  18-786  9999.0
2   345 678.0   99.0    A   12-678  9999.0  12-678  9999.0

To remove the trailing .0 and convert to the int data type, use astype:

df.groupby(['sku'],as_index=False).first().astype(int,errors='ignore')
    sku col1 col2   test    col3    col4    col5    col6
0   123 456    99   A     18-123    9999    18-123  9999
1   234 786    99   A     18-786    9999    18-786  9999
2   345 678    99   A     12-678    9999    12-678  9999
Space Impact
  • Thanks, it's a nice (and short) solution, but it's adding .0 to my values. I tried to force them to "object" type but it's still there. Thanks – Mitchum Aug 30 '18 at 07:16
  • @Mitchum Check the update and [`accept the solution`](https://stackoverflow.com/help/someone-answers) if it helped! – Space Impact Aug 30 '18 at 07:35
  • Thanks, but that's hard to do because my "real" files contain text and other stuff; if I cast everything to int it doesn't work, and the same goes for string, and object changes nothing. But I'll keep it for another filter I have to do on int-only columns. Thanks! – Mitchum Aug 30 '18 at 07:41
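One hedged way to handle that mixed-type case, assuming df holds the combined data: run pd.to_numeric column by column with errors='ignore', so text columns such as test or col3 are left untouched, while purely numeric columns can be downcast and lose the trailing .0.

import pandas as pd

grouped = df.groupby('sku', as_index=False).first()

# try to turn each column back into numbers; columns containing text such as
# 'A' or '18-123' fail the conversion and are returned unchanged by errors='ignore'
for col in grouped.columns:
    grouped[col] = pd.to_numeric(grouped[col], errors='ignore', downcast='integer')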

A simple solution preserving order, using only the csv module:

import csv
from collections import OrderedDict

result = OrderedDict()
with open('data.csv', 'r', newline='') as csvfile, open('out.csv', 'w', newline='') as outfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    header = False
    size = 0
    for row in csvreader:
        if not header:
            header = True
            size = len(row)
        # the header row is keyed on the literal string 'sku', so it stays first
        sku = row[0]
        existing_val = result.setdefault(sku, [''] * size)
        for idx, each_val in enumerate(row):
            # keep the first non-empty value seen for each column
            existing_val[idx] = existing_val[idx] or each_val

    csvwriter = csv.writer(outfile, delimiter=',', quotechar='"')
    for each_sku, val_list in result.items():
        csvwriter.writerow(val_list)
Arun Kumar Nagarajan

If you can influence the initial processing of the CSV files, I think what you want to look for is a join: just read all the CSV files and join them together on the "sku" column. It depends on the library or backend you want to use, though. If you are using something like Spark or pandas you can use joins directly. If you use plain Python you would need a library, or you could write the join yourself, as described here: Stackoverflow Article for Joining
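A minimal sketch of that join idea with pandas, assuming the three source files are named file1.csv, file2.csv and file3.csv (hypothetical names), each contains a sku column, and the other column names do not overlap between files (otherwise pandas would add _x/_y suffixes):

import pandas as pd
from functools import reduce

# read the three source files (hypothetical file names)
frames = [pd.read_csv(name) for name in ('file1.csv', 'file2.csv', 'file3.csv')]

# outer-join them one after another on the shared 'sku' key, so a sku that
# is missing from one file still keeps the columns from the other files
merged = reduce(lambda left, right: pd.merge(left, right, on='sku', how='outer'), frames)
merged.to_csv('out.csv', index=False)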

gaw
  • Thanks, I put the whole thing here: https://gist.github.com/julienanquetil/c0072b47a609e3fd92b8270af8afba13 (in the first part I collect all the headers from all my CSVs, write them to a CSV, and then write the content). I tried with pandas to do it in one shot, without any luck. Thanks – Mitchum Aug 30 '18 at 07:19
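For the "one shot" version mentioned in that comment, a rough sketch under the same assumptions (hypothetical file names, each file containing a sku column): stack the files with concat, then collapse each sku with the groupby/first approach from the first answer above.

import pandas as pd

# hypothetical file names; each file must at least contain a 'sku' column
files = ['file1.csv', 'file2.csv', 'file3.csv']

# stack all rows, then take the first non-null value per column for each sku
combined = pd.concat([pd.read_csv(f) for f in files], ignore_index=True, sort=False)
result = combined.groupby('sku', as_index=False).first()
result.to_csv('out.csv', index=False)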

Using the built-in csv and collections modules:

Ex:

import csv
from collections import defaultdict

# one merged dict per sku; each row only contributes the fields that have a value
d = defaultdict(dict)
with open(filename, "r", newline="") as infile:
    reader = csv.DictReader(infile)
    for line in reader:
        d[line["sku"]].update({k.strip(): v for k, v in line.items() if v})

with open(filename1, "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=["sku", "col1", "col2", "test", "col3", "col4", "col5", "col6"])
    writer.writeheader()
    for k, v in d.items():
        writer.writerow(v)
Rakesh

A workable way, assuming df already holds the combined data and every non-sku column has exactly one non-empty value per sku, in the same order as the skus:

newdf = pd.DataFrame()
newdf['sku'] = df['sku'][:3]   # the three distinct skus
for i in df.columns.tolist():
    if i != 'sku':
        # take the non-null values of each column, in order, one per sku
        newdf[i] = df[i].dropna().tolist()
print(newdf)
U13-Forward