If I understand correctly, you have multiple files, each corresponding to a given "sensor", with the identity of the sensor in the filename. You want to read the files, then write them out in to separate files again, this time divided by "source", with the data from the different sensors combined into several final rows.
Here's what I think you want to do:
- Read the data in, and build a nested dictionary data structure, as follows:
- The top level key would be the source (e.g.
'a'
).
- The second level will be keyed by a
(date, time)
tuple.
- The inner most level will be keyed by sensor, taken from the filename, and have the actual sensor readings as values.
- You'd also want to keep track of all the sensors that have been seen.
- To write the data out, you'd loop over the items of the outermost dictionary, creating a new output file for each one.
- The rows of each file would be determined by sorting the keys of the next dictionary.
- The last value of each row would be formed by concatenating the values of the innermost dict, filling in an empty string for any missing values.
Here's some code:
from collections import defaultdict
from datetime import datetime
import csv
import glob
import os
# data structure is data[source][date, time][sensor] = value, with "" as default value
data = defaultdict(lambda: defaultdict(lambda: defaultdict(str)))
sensors = []
filelist = glob.glob("*.csv")
# read old files
for fn in filelist:
sensor = os.path.splitext(fn)[0]
sensors.append(sensor)
with open(fn, 'rb') as f:
reader = csv.DictReader(f, delimiter=";")
for row in reader:
date = datetime.strptime(row['date'], '%m/%d/%y')
data[row['source']][date, row['hour']][sensor] = row['values']
sensors.sort() # note, this may not give the best sort order
header = ['date', 'hour', 'source'] + sensors
for source, source_data in data.iteritems():
fn = "{}.csv".format(source)
with open(fn, 'wb') as f:
writer = csv.writer(f, delimiter=";")
writer.writerow(header)
for (date, time), hour_data in sorted(source_data.items()):
values = [hour_data[sensor] for sensor in sensors]
writer.writerow([date.strftime('%m/%d/%y'), time, source] + values)
I only convert the date field to an internal type because otherwise sorting based on dates won't work correctly (dates in January 2013 would appear before those in February 2012). In the future, consider using ISO 8601 style date formating, YYYY-MM-DD
, which can be safely sorted as a string. The rest of the values are handled only as strings with no interpretation.
The code assumes that the sensor
values can be ordered lexicographically. This is likely if you only have a few of them, e.g. s1
and s2
. However, if you have a s10
, it will be sorted ahead of s2
. To solve this you'll need a "natural" sort, which is more complicated than I can solve here (but see this recent question for more info).
One final warning: This solution may do bad things if you run it mutliple times in the same folder. That's because the output files, e.g. a.csv
will be seen by glob.glob('*.csv')
as input files when you run again.