How to merge and combine rows with same id (index) in python?

Question

I am new to Python and I am working with a CSV file with over 10,000 rows. Many rows in the file share the same id, and I would like to merge them into one row and combine their information as well.

For instance, data.csv looks like this (id and info are the column names):

id| info

1112| storage is full and needs extra space

1112| there is many problems with space 

1113| pickup cars come and take the garbage

1113| payment requires for the garbage

and I want to get the output as:

id| info

1112| storage is full and needs extra space there is many problems with space

1113| pickup cars come and take the garbage payment requires for the garbage

I already looked at a few posts such as 1 2 3 but none of them helped me to answer my question.

It would be great if you could include Python code in your answer so that I can run it and learn from it on my side.

Thank you

Guaz · Answer 1 · 2019-05-27T18:58:56.200

2

I think there is a simpler way:

some_dict = {}
# `line` is a placeholder: use your own reader that yields (id, info) pairs.
for idt, txt in line:
    some_dict[idt] = some_dict.get(idt, "") + txt

This should build the structure you want without any imports, and I hope it is also an efficient way. To explain: the second argument of get is the value returned when the key is not found in the dict. So for a new id you start from an empty string and add the text; if the id was already seen, the text is appended to what is already stored.
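To make the role of the second argument of `get` concrete, here is a minimal sketch. The id and the two text fragments are sample values chosen for illustration, not part of the original answer.

```python
some_dict = {}

# The key is missing, so get() returns the fallback "" instead of raising KeyError.
first = some_dict.get("1112", "")
some_dict["1112"] = first + "storage is full"

# The key now exists, so get() returns the stored text and the new text is appended.
second = some_dict.get("1112", "")
some_dict["1112"] = second + " and needs extra space"

print(some_dict)
```

Note that plain `+` adds no separator by itself, which is why the second fragment here starts with a space.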

@Edit:

Here is a complete example with a reader :). It shows what to put in place of the placeholder from the first snippet; adjust the file name to your own :)

import csv
import pandas as pd

some_dict = {}
with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    for idt, info in reader:
        temp = some_dict.get(idt, "")
        # Add a separating space only when there is earlier text for this id.
        some_dict[idt] = temp + " " + info if temp else info
print(some_dict)
df = pd.Series(some_dict).to_frame("Title of your column")

This is a full program that should work for you. However, it won't work if the file has more than 2 columns; in that case you can replace idt, info with row and use indexes for the first and second elements.

@Next Edit:

For more than 2 columns:

import csv
import pandas as pd

some_dict = {}
with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        # row[0] is the id, row[1] is the text to merge.
        temp = some_dict.get(row[0], "")
        some_dict[row[0]] = temp + " " + row[1] if temp else row[1]
        #~ Here you can do something with the other columns if you want.
        #~ Example: another_dict[row[2]] = another_dict.get(row[2], "") + row[3]
print(some_dict)
df = pd.Series(some_dict).to_frame("Title of your column")
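If pandas is not required, the merged dictionary could also be written back out with the standard `csv` module alone. The following is a sketch under assumptions, not the answer's own method: it treats the data as pipe-delimited with a header row, as in the question's sample, uses in-memory `io.StringIO` buffers in place of real files, and `merge_rows` is a hypothetical helper name.

```python
import csv
import io


def merge_rows(source, sink, delimiter="|", separator=" "):
    """Merge the second column of rows that share the same first column."""
    reader = csv.reader(source, delimiter=delimiter, skipinitialspace=True)
    header = next(reader)  # keep the header row out of the merged data
    merged = {}
    for row in reader:
        if not row:  # csv.reader yields [] for blank lines
            continue
        idt, info = row[0].strip(), row[1].strip()
        previous = merged.get(idt, "")
        merged[idt] = previous + separator + info if previous else info
    writer = csv.writer(sink, delimiter=delimiter, lineterminator="\n")
    writer.writerow([name.strip() for name in header])
    writer.writerows(merged.items())  # one (id, merged text) row per id
    return merged


# Sample data copied from the question; a real file object works the same way.
sample = io.StringIO(
    "id| info\n"
    "1112| storage is full and needs extra space\n"
    "1112| there is many problems with space \n"
    "1113| pickup cars come and take the garbage\n"
    "1113| payment requires for the garbage\n"
)
output = io.StringIO()
result = merge_rows(sample, output)
print(output.getvalue())
```

Passing `separator="\t"` would put a tab between the merged strings instead of a space.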

edited May 27 '19 at 18:58

answered May 26 '19 at 22:00

Guaz


Thank you for your comment. I tried your suggestion and I am getting a type error: 'builtin_function_or_method' object is not iterable. I added your suggestion to my post; please let me know if you have any suggestion to fix this. Thank you – Bilgin May 26 '19 at 22:24
Thank you for your code. I tried it and now it looks like: `some_dict = {} with open('test.csv') as f: reader = csv.reader(f) for idt, info in reader: some_dict[idt] = some_dict.get(idt, "") + info print(some_dict) ` which gives me a dict, and when I want to save it to CSV with ` df = pd.DataFrame(data=some_dict) some_dict.to_csv("someCSV.csv", index=True, mode = 'a') ` it gives an error: `ValueError: If using all scalar values, you must pass an index`. Do you have any suggestion for that? Thanks for your time. – Bilgin May 27 '19 at 18:19
Try changing it this way: `df = pd.Series(some_dict).to_frame()`, then you should get your concatenated strings with the original indexes :). If it doesn't work, let me know :) – Guaz May 27 '19 at 18:36
Thank you so much. Now it saves to the CSV file, but it adds one row at the top (in that first row, the second column has the value zero). Also, is there any way to put one tab between the concatenated strings? Right now, for example, the result of "hello" + "world" is "helloworld". And as you said, this method works only for 2-column data; is there any suggestion to expand this to 4 columns as well? Thanks – Bilgin May 27 '19 at 18:50
`df = pd.Series(some_dict).to_frame("Title of your column")` - this argument names your column. `some_dict[idt] = " ".join([some_dict.get(idt, ""), txt])` will join the strings with spaces, but it also adds a space at the first position. In an edit I'll show another, longer way to do it correctly. I'll also show you how to do it with more columns, and you can decide which one you want to use. – Guaz May 27 '19 at 18:57
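For the four-column case raised in the comments, one possible approach is to keep a list of accumulated values per id, one slot per extra column. This is only a sketch: the column layout (id first, then three text columns), the sample rows, and the helper name `merge_columns` are all assumptions made for illustration.

```python
def merge_columns(rows, separator=" "):
    """Merge every non-id column of rows that share the same first element."""
    merged = {}
    for row in rows:
        idt, values = row[0], row[1:]
        if idt not in merged:
            # First time this id is seen: store its columns as they are.
            merged[idt] = list(values)
        else:
            # Append column by column, adding the separator only
            # when both the stored and the new value are non-empty.
            merged[idt] = [
                old + separator + new if old and new else old or new
                for old, new in zip(merged[idt], values)
            ]
    return merged


# Hypothetical four-column rows: id plus three text columns.
rows = [
    ("1112", "full", "a", "x"),
    ("1112", "space", "b", "y"),
    ("1113", "garbage", "c", "z"),
]
result = merge_columns(rows, separator="\t")
print(result)
```

Using `separator="\t"` here gives the tab between concatenated strings that the comment asks for; the default is a single space.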

Max Malysh · Answer 2 · 2019-05-26T22:23:09.163

1

Just make a dictionary where the ids are the keys:

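from collections import defaultdict

# Each new key starts with an empty list.
by_id = defaultdict(list)

# `your_list` is a placeholder: fill it with your own (id, info) pairs.
for row_id, info in your_list:
    by_id[row_id].append(info)

for key, value in by_id.items():
    print(key, value)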
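The lists collected this way still need to be joined to get the single string per id that the question asks for. A minimal sketch, assuming `your_list` holds `(id, info)` pairs; the sample values are taken from the question, and the final `merged` dictionary is an addition for illustration:

```python
from collections import defaultdict

# Sample (id, info) pairs from the question, standing in for your own data.
your_list = [
    ("1112", "storage is full and needs extra space"),
    ("1112", "there is many problems with space"),
    ("1113", "pickup cars come and take the garbage"),
    ("1113", "payment requires for the garbage"),
]

by_id = defaultdict(list)
for row_id, info in your_list:
    by_id[row_id].append(info)

# Join each list once at the end, so no leading separator appears.
merged = {row_id: " ".join(infos) for row_id, infos in by_id.items()}

for key, value in merged.items():
    print(key, value)
```

Collecting into lists and joining once at the end avoids the leading-space problem that comes up when concatenating onto an empty string.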

edited May 26 '19 at 22:23

answered May 26 '19 at 21:56

Max Malysh


@Max thank you for your comment. I tried your suggestion, but I am new to Python and I couldn't get the output. I got the error: invalid syntax for the key loop. I attached the implementation to this post. – Bilgin May 26 '19 at 22:12
There was no colon after `by_id.items()`. I've added it. Still, you have to adapt the answer to your needs. You need to fill `your_list` yourself. – Max Malysh May 26 '19 at 22:24
