
There are many questions about rearranging CSV files with Python, but I could not find an answer to the following problem:

I want to "merge" CSV files on a specific column in such a way that the data is joined. This reminds me of an SQL join, by the way. Let's say:

There is a number of CSV files which all look the same. For simplicity, let's assume there are only four items per row: three data items and a category. So one file looks like:

a,b,c,category1
e,f,g,category2
a,c,c,category3

Another one like:

0,0,0,category1
1,1,1,category1
5,5,5,category2
6,6,6,category2
9,9,9,category4

Now, I want to merge these files to have the following result:

a,b,c,0,0,0,category1
e,f,g,5,5,5,category2

So, the rows need to be joined in such a way that the items of every line are combined, with the "category" item as the joining key, and data may be dropped if there are not enough matching rows. Best would be a variable number of files, so that not only two but X files can be merged (or joined) this way.

Is there a pythonic way to do this?

coroner
  • In your output example, for `category1` you keep `0,0,0` and discard `1,1,1`. Does this imply that you want to keep only the first matching record? --- Later, you introduce the topic of multiple files: do you want a merge based on the first file's content only, or...? Your question is not really a clear question; you should try to be more precise. – gboffi Apr 22 '15 at 10:42
  • I am sorry if this stayed unclear. First: yes, the first matching line is fine. That is why I do not want to use the Unix `join` method. Second: the merging would be based on the minimum number of entries for a category found across all files. If category X has 20 instances in file1, 10 instances in file2 and 12 instances in file3, there should be 10 instances in the combined file. – coroner Apr 22 '15 at 13:13

2 Answers


A dictionary of dictionaries should do the job:
1. Read the data using readlines()
2. Populate a dictionary using category as key

dictionary = {}
for line in lines:
    entries = line.strip().split(',')  # strip() so the category key has no trailing newline
    dictionary[entries[3]] = {}
    dictionary[entries[3]][0] = entries[0]
    dictionary[entries[3]][1] = entries[1]
    dictionary[entries[3]][2] = entries[2]
3. Read the other file and, if the dictionary contains the category, update the values.

for line in lines:
    entries = line.strip().split(',')
    if entries[3] in dictionary:
        # populate accordingly
        dictionary[entries[3]][3] = entries[0]
        dictionary[entries[3]][4] = entries[1]
        dictionary[entries[3]][5] = entries[2]
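Putting the two steps together, here is a runnable sketch; the two file contents are hypothetical stand-ins for the question's examples. Note that because there is one inner dict per category, a later row with the same category overwrites an earlier one, so only one row per category survives (the last one, as coroner's comment below points out):

```python
# Hypothetical input data standing in for the two files in the question.
file1_lines = ["a,b,c,category1", "e,f,g,category2", "a,c,c,category3"]
file2_lines = ["0,0,0,category1", "1,1,1,category1", "5,5,5,category2"]

# Step 2: populate a dictionary keyed by category from the first file.
dictionary = {}
for line in file1_lines:
    entries = line.strip().split(',')
    dictionary[entries[3]] = {0: entries[0], 1: entries[1], 2: entries[2]}

# Step 3: read the second file and update matching categories.
for line in file2_lines:
    entries = line.strip().split(',')
    if entries[3] in dictionary:
        dictionary[entries[3]][3] = entries[0]
        dictionary[entries[3]][4] = entries[1]
        dictionary[entries[3]][5] = entries[2]

# dictionary['category1'] ends up as {0:'a',1:'b',2:'c',3:'1',4:'1',5:'1'}
# -- the 1,1,1 row overwrote 0,0,0.
```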

Martin Boyanov
  • Maybe I misunderstand, but the code of point 2. will result in only one dict of values for every category, so for the example above you would get `{'category1': {0: '1', 1: '1', 2: '1'}, 'category2': {0: '6', 1: '6', 2: '6'}, 'category4': {0: '9', 1: '9', 2: '9'}}` So, if merged with other files, there will be only one instance for every category. – coroner Apr 22 '15 at 13:29

Thank you for your inspiring thoughts!

I figured out a rather dirty way:

  1. Read the data in such a way that there is a dict with the categories as keys and another dict as the value. This "inner dict" has the filenames as keys and the rows as values.

    data = {'category1' : {'file1' : [[a,b,c]], 'file2' : [[0,0,0],[1,1,1]] } }

In another dict I store the minimum number of instances for every category (let's say 2 for category 1):

`limits = {"category1":2*len(infiles)*featuresinfiles,"category2": . . .}`
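The `limits` computation above can be sketched like this; `infiles` and `featuresinfiles` are the answer's own names, while the per-file counting via `data` is my assumption about how the minimum is found:

```python
# Hypothetical data dict in the shape described above (string values for illustration).
data = {
    'category1': {'file1': [['a', 'b', 'c']],
                  'file2': [['0', '0', '0'], ['1', '1', '1']]},
}
infiles = ['file1', 'file2']   # the files being merged
featuresinfiles = 3            # three data items per row

# For each category: minimum row count across files, times files, times
# features per row -- i.e. the length the combined flat list should reach.
limits = {}
for category, per_file in data.items():
    counts = [len(per_file.get(f, [])) for f in infiles]  # missing file -> 0 rows
    limits[category] = min(counts) * len(infiles) * featuresinfiles

# category1: min(1, 2) * 2 files * 3 features = 6 items
```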

Now I create a dictionary "comb" containing the combination:

comb = {}
for c in data.keys():
    if c not in comb:
        comb[c] = []
    while len(comb[c]) < limits[c]:
        for f in data[c]:
            comb[c] += data[c][f].pop(0)

This finally gives a dict with the categories as keys and their combined values as values; in the example:

{'category1' : [a,b,c,0,0,0], ... }

I'm sure there is a better way to do this, but as a workaround, this does what I want.
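For reference, the whole approach can also be written as one self-contained sketch using the `csv` module. The file names and contents below are made up, and instead of the `pop(0)` loop it keeps the first *n* rows per file (where *n* is the minimum count across files) and drops categories missing from any file, matching the semantics clarified in the comments:

```python
import csv
from collections import defaultdict

# Hypothetical file contents standing in for the question's two examples.
files = {
    'file1.csv': "a,b,c,category1\ne,f,g,category2\na,c,c,category3\n",
    'file2.csv': "0,0,0,category1\n1,1,1,category1\n5,5,5,category2\n"
                 "6,6,6,category2\n9,9,9,category4\n",
}

# data[category][filename] -> list of rows (data items only, category removed)
data = defaultdict(dict)
for name, text in files.items():
    for row in csv.reader(text.splitlines()):
        *items, category = row
        data[category].setdefault(name, []).append(items)

# Keep a category only if it appears in every file; take as many rows per
# file as the least-represented file provides, concatenating them in file order.
merged = {}
for category, per_file in data.items():
    if len(per_file) < len(files):
        continue  # category missing from some file -> dropped
    n = min(len(rows) for rows in per_file.values())
    for i in range(n):
        combined = []
        for name in files:  # dict order preserves file order (Python 3.7+)
            combined += per_file[name][i]
        merged.setdefault(category, []).append(combined)

# merged -> {'category1': [['a','b','c','0','0','0']],
#            'category2': [['e','f','g','5','5','5']]}
```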

coroner