I have a CSV file with a variable number of columns per row, where each row is a date, a region, and a set of IDs that were observed at that date/region.
The data look like this:
12/01/2017,Region1,BMW_123,TESLA_332,TESLA_2002
11/07/2017,Region2,TESLA_332,BMW_123,TESLA_2002,TESLA_99812
11/19/2017,Region2,BMW_123,TESLA_31
10/23/2017,Region1,BMW_4,TESLA_3,TESLA_90
11/02/2017,Region2,TESLA_28,BMW_56,TESLA_22,TESLA_821
10/14/2017,Region2,BMW_1,BMW_8,BMW_2,TESLA_3,TESLA_4,TESLA_99,TESLA_81,TESLA_56
I am trying to:
(1) Split each ID and keep just the brand, so “TESLA_12345” becomes just “TESLA”. This part is no problem.
(2) Aggregate by month-region, so that we know the count of failures for each brand (e.g. Tesla or BMW) for each month-region tuple.
The output should look something like this (format is not super important – it just has to be legible):
month  region   BMW  TESLA
10     Region1  1    2
12     Region1  1    2
10     Region2  3    5
11     Region2  3    7
It seems like a natural way to do this would be to read the CSV line by line and update counts for each brand in nested dictionaries, i.e. have a regions dict containing the two regions, each of which contains month dicts that keep counts such as {BMW: 2, TESLA: 1}. However, I’m struggling with updating the nested dicts (a sketch of what I mean is below), and I wonder if there is a much simpler solution (or if Pandas can do this effortlessly, etc.)
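For concreteness, here is a minimal sketch of the nested-dict approach I have in mind, using collections.defaultdict so the nested counts update without explicit key checks (failures.csv is a placeholder filename):

import csv
from collections import defaultdict
from datetime import datetime

# month -> region -> brand -> count; the innermost defaultdict(int)
# lets us increment a brand we have not seen yet without a key check
counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

with open("failures.csv", newline="") as f:  # placeholder filename
    for row in csv.reader(f):
        date, region, *ids = row
        month = datetime.strptime(date, "%m/%d/%Y").strftime("%m")
        for id_ in ids:
            brand = id_.split("_")[0]  # "TESLA_12345" -> "TESLA"
            counts[month][region][brand] += 1

for month in sorted(counts):
    for region in sorted(counts[month]):
        brands = counts[month][region]
        # only BMW and TESLA appear in the sample data
        print(month, region, brands.get("BMW", 0), brands.get("TESLA", 0))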
(Note: obviously, the month can be extracted from the date with
datetime.strptime(mydate, "%m/%d/%Y").strftime("%m")
after from datetime import datetime, as in the sketch above.)
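If Pandas really is the simpler route, I imagine something like the following, flattening each row into (month, region, brand) tuples first, since the rows have a variable number of fields (again, failures.csv is a placeholder):

import pandas as pd

rows = []
with open("failures.csv") as f:  # placeholder filename
    for line in f:
        date, region, *ids = line.strip().split(",")
        month = pd.to_datetime(date, format="%m/%d/%Y").month
        rows.extend((month, region, id_.split("_")[0]) for id_ in ids)

df = pd.DataFrame(rows, columns=["month", "region", "brand"])
# count rows per (month, region, brand), then pivot brands into columns
out = df.groupby(["month", "region", "brand"]).size().unstack(fill_value=0)
print(out)

The groupby/size/unstack at the end should produce one row per month-region pair with a column per brand, which is roughly the output format shown above.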