0

I've currently split my data into CSV files that need all rows of the column "sequence" combined into one string.

Each CSV looks something like this:

1773.csv
ID      Order    Sequence
1773     1        'AAGG'
1773     2        'TTGG'
1773     3        'GGAA'

1775.csv
ID      Order    Sequence
1775     1         'GGTT'
1775     2         'AAGT'
1775     3         'TGAA'

1331.csv
ID      Order    Sequence
1331     1         'CCGT'
1331     2         'CATT'
1331     3         'GTTA'

I need each CSV to merge each sequence row into one value like this:

ID       Sequence
1773   'AAGGTTGGGGAA'

Then make a master CSV of all the combined sequences from each CSV file.

Something like this:

ID        Sequence
1773    'AAGGTTGGGGAA'
1775    'GGTTAAGTTGAA'
1331    'CCGTCATTGTTA'

I wouldn't worry too much about the order column since the rows are already in order. +Each CSV in the folder is just it's ID

I've found this but it seems to combine all data from all csv files into a single cell/value:

def return_contents(file_name):
    with open(file_name) as infile:
        reader = csv.reader(infile)
        return list(reader)

all_files = os.listdir('C:\\Users\\CAAVR\\Desktop\\res_csv')

combined_output = []

for file in all_files:
    data = return_contents('C:\\Users\\CAAVR\\Desktop\\res_csv\\{}'.format(file))
    for row in data:
        combined_output.extend(row)

with open('csv_out.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(combined_output)

Thanks ahead of time and let me know if you need more info.

  • 1
    Well, that details what you _need_ but says nothing of what you _tried_ and the issues you faced. It also doesn't give a [mcve] because `"GGTT"` isn't in the input sequences – roganjosh Mar 01 '20 at 22:00
  • I've added some code that I tried. I'm not 100% on your second comment though. If you are referring to the master csv that would be another combined sequence – Scott Valentine Mar 01 '20 at 22:09
  • Thank you for the edit. The second part of my comment is that examples should be self-consistent. How could you expect `'GGTTAAGTTGAA'` from what's listed in the first code block? `"GGTT"` happened to be a sequence that stood out, but there are other examples. – roganjosh Mar 01 '20 at 22:12
  • I've added the inputs. – Scott Valentine Mar 01 '20 at 22:55

2 Answers2

0

I would use pandas for this task

pandas.read_csv("input.dat").groupby("id")['Sequence'].agg('sum').to_csv('output.dat')

I think would do what you ask

Joran Beasley
  • 110,522
  • 12
  • 160
  • 179
0

I've found this but it seems to combine all data into a single row:

[...]
combined_output.extend(row)

extend adds all items from row individually to combined_output, losing the information that they once belonged together and creating a single flat list.

Instead, you should use append. It adds row to combined_output while keeping the items of row in their own list. The result is a list of lists, where each sublist represents one row in the output CSV.

See What is the difference between Python's list methods append and extend?.

mkrieger1
  • 19,194
  • 5
  • 54
  • 65
  • This seems to give me an output csv of all files combined into one with each row combined into a single column. Instead of combining the rows of each individual csv into one row. Does that make sense? – Scott Valentine Mar 01 '20 at 23:16
  • "Instead of combining the rows of each individual csv into one row." I thought that was what the code you've shown was doing, which was what you did *not* want. – mkrieger1 Mar 01 '20 at 23:20
  • sorry for the confusion. From each split csv i need to combine each "sequence" value into one row. Then add that row to the list of all of them as an individual row. The code I had in the question would combine all rows from all csv files into one value. The append version still seems to combine the enire row into one value. Hope this clears it up. – Scott Valentine Mar 01 '20 at 23:31