0

OK major example update needed.

I have exactly this: Joining all rows of a CSV file that have the same 1st column value in Python (first I must appologize for not getting how to just raise the tread again...)

I have a x.CSV file that goes like this: INT;INT2;STRING;STRING;STRING;STRING;STRING;STRING;STRING;STRING;STRING;and_so_on......

0;0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;-1.0;0;0;-1;-1.0

0;0;;;;;;;;;;;;;30.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

0;0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;160.0;;;;;;;;;;;;;;;;;;;;;;

0;0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;0.0;;;;;;;;;;;;;;;;;;;;

0;0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;0.0;;;;;;;;;;;;;;;;;;;

0;0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;1668.0;;;;;;;;;;;;;;;;;;;;;;;;;

0;0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;1672.0;;;;;;;;;;;;;;;;;;;;;;;;

0;0;;;;;;;;;;;;;;;;;;0.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

0;0;;;;;;;;;;;;;;;;;;;74.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

1;1;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;0.0;;;;;;;;;;;;

1;1;0.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

3;3;4.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

3;3;;;;;;;;;;;;;;;;;;;;;;;;;75.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

5;5;0.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

5;5;;;;;;;;;;;;;;;;;;;;;;;;;85.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

Now, I need a way to join all of the rows that have the same 1st column name into one column, for example:

INT;INT2;STRING;STRING;STRING;STRING;STRING;STRING;STRING;STRING;STRING;and_so_on......

0;0;;;;;;;;;;;;;30.0;;;;;0.0;74.0;;;;;;;;;;;;;;;;1668.0;1672.0;;;160.0;;;;;0.0;0.0;;;;;;;;;;;;;;-1.0;0;0;-1;-1.0

1;1;0.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;0.0;;;;;;;;;;;;

3;3;4.0;;;;;;;;;;;;;;;;;;;;;;;;75.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

5;5;0.0;;;;;;;;;;;;;;;;;;;;;;;;85.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

It is the same number of colums (even though "resultrow" 0;0.... might have too many ";" since i faked it ;-) numbers only turn up once in the same indexrow, so it's kind of a "merge the same indexrow" I need something that reads the >20k (it can be >1million as well) rows long file. Compresses the list And writes the file back to disk again.

Helpers only tip in what direction to code and I'm not able to :-/ (yet... :-) With the greatest respect - I don't code blush - I can read it but not design it. Me posting code would be - lets say amusing :-)

I would very much appriciate the help. br //

Community
  • 1
  • 1
  • Just realized I got a header in text, the other rows are integers or text. – user3310565 Feb 14 '14 at 15:02
  • 2
    What have you tried? It will be much easier to help you if you post the code that you have. – kylieCatt Feb 14 '14 at 15:03
  • You might need to explain your notion of "joining two rows". To many, that would mean appending one after the other, but you seem to mean something different... Something more along the lines of "for each field, pick the 'best' value from the two alternatives", but what is that "best" value? What do you want to do with conflicts? – twalberg Feb 14 '14 at 15:40
  • Made a major example update in order to try to clarify. I do not want to append. I wish as you state @twalberg pick values from different rows of the same index row and make a single, unique, index row with all values form that index. Conflicts "should not" be there however an average will do just fine :-) – user3310565 Feb 25 '14 at 10:39

1 Answers1

0
import csv
from itertools import izip_longest

def merge_rows(a, b):
    return [x or y for x,y in izip_longest(a, b, fillvalue='')]

def main():
    data = {}

    with open("infile.csv", "rb") as inf:
        incsv = csv.reader(inf, delimiter=";")
        header = next(incsv, [])
        for row in incsv:
            label = row[0]
            try:
                data[label] = merge_rows(data[label], row)
            except KeyError:
                data[label] = row

    # write data in sorted order by label
    keys = sorted(data, key=lambda k: int(k))    # Python 2
    # keys = sorted(data.keys(), key=lambda k: int(k))    # Python 3

    with open("outfile.csv", "wb") as outf:
        outcsv = csv.writer(outf, delimiter=";")
        outcsv.writerow(header)
        outcsv.writerows(data[key] for key in keys)

if __name__=="__main__":
    main()

Edit: I made a few mods based on your sample data:

  1. added a delimiter=";" argument to the csv reader and writer

  2. added code to read and write the header

  3. added a key clause so sort order is numeric, not lexicographic

How it works:

for row in incsv: For each row in the data file, we get a list - something like ["0", "0", "", "", "", "", "", "", "", "", "", "", "-1.0", "0", "0", "-1", "0"]. Then label = row[0] gives label a value of "0" - your desired first-column value - and we look for data[label], a combined row from all preexisting rows having that label.

If that combined row already exists, we merge the new row into it (stored_row = merge_rows(stored_row, new_row); otherwise it is created with the new row value (["0", "0", "", "", "", "", "", "", etc). So effectively merge_rows is called for every occurrence of each label except the first time it appears.

merge_rows takes a pair of lists and combines them - izip_longest returns corresponding entries, ie izip_longest([0, 1, 2], ["a", "b", "c"]) gives (0, "a"), (1, "b"), (2, "c"). If one list is shorter than the other, it pads it with fillvalue to match the length of the longest list it received. x and y get assigned the corresponding value from each list, and we or them together because... well, because or combines them the way you want ('' or '1' == '1', '1' or '' == '1', '' or '' == ''). It then takes all the resulting values and returns them as a list - the resulting combined row.

Hope that helps.

Hugh Bothwell
  • 55,315
  • 8
  • 84
  • 99
  • This answer just sorts the rows not merging them as if it went for keyerror all the time. it alsom puts the header at the bottom. Thanx for post though. // – user3310565 Feb 25 '14 at 10:37
  • I don't get this miricle row return [x or y for x,y in izip_longest(a, b, fillvalue='')] – user3310565 Feb 26 '14 at 17:34
  • I don't get this miricle row return [x or y for x,y in izip_longest(a, b, fillvalue='')] /sorry I just can't make sence of it. And apperently it's never used. – user3310565 Feb 26 '14 at 17:42