
I have a CSV file with a variable number of columns, where each row is a date, a region, and a set of IDs that were observed at that date/region.

The data look like this:

12/01/2017,Region1,BMW_123,TESLA_332,TESLA_2002
11/07/2017,Region2,TESLA_332,BMW_123,TESLA_2002,TESLA_99812
11/19/2017,Region2,BMW_123,TESLA_31
10/23/2017,Region1,BMW_4,TESLA_3,TESLA_90
11/02/2017,Region2,TESLA_28,BMW_56,TESLA_22,TESLA_821
10/14/2017,Region2,BMW_1,BMW_8,BMW_2,TESLA_3,TESLA_4,TESLA_99,TESLA_81,TESLA_56

I am trying to:

(1) Split all of the IDs and keep just the brand, so “TESLA_12345” would become just “TESLA”. This part is no problem.

(2) Aggregate by month-region, so that we know the count of failures for each brand (e.g. Tesla or BMW) for each month-region tuple.

The output should look something like this (format is not super important – it just has to be legible):

month region BMW TESLA
10 Region1   1     2
12 Region1   1     2
10 Region2   3     5
11 Region2   3     7

It seems like a natural way to do this would be to read the CSV line by line and update counts for each brand in nested dictionaries (i.e. have a regions dict containing 2 regions, each of which contains month dicts that keep counts such as {BMW: 2, TESLA: 1}). However, I'm struggling with updating the nested dicts, and I wonder if there is a much simpler solution (or if Pandas can do this effortlessly, etc.).

(note: obviously, the month can be extracted from the date with:

datetime.strptime(mydate, "%m/%d/%Y").strftime("%m")

)
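
For reference, a minimal sketch of the nested-dict approach described above, using collections.defaultdict and Counter; the file name test.csv is an assumption:

    from collections import Counter, defaultdict
    from datetime import datetime

    # counts[region][month] maps brand -> number of observations
    counts = defaultdict(lambda: defaultdict(Counter))

    with open("test.csv") as f:   # file name is an assumption
        for line in f:
            if not line.strip():
                continue
            date, region, *ids = line.strip().split(",")
            month = datetime.strptime(date, "%m/%d/%Y").strftime("%m")
            for car_id in ids:
                counts[region][month][car_id.split("_")[0]] += 1

    for region in sorted(counts):
        for month in sorted(counts[region]):
            print(month, region, dict(counts[region][month]))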

cataclysmic
  • Possibly related https://stackoverflow.com/questions/15242746/handling-variable-number-of-columns-with-pandas-python – maxymoo Dec 19 '17 at 04:29
  • Maybe you could do something with stack, e.g. `df_stacked = df.set_index(['date','region']).stack()` – maxymoo Dec 19 '17 at 04:41
  • Thanks for the reply! I found a few posts on pandas with uneven columns, e.g. https://stackoverflow.com/questions/15242746/handling-variable-number-of-columns-with-pandas-python/15252012 and https://stackoverflow.com/questions/40880724/pandas-failing-with-variable-columns but my Python is not strong enough to know whether using a pandas dataframe is a 'good' way to handle this case, or whether line-by-line is more suited, esp. since it's sort of a nontraditional aggregate (we need to get the within-row counts before aggregating across rows). – cataclysmic Dec 19 '17 at 04:42
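
For completeness, the ragged-read technique from the posts linked above (supplying enough column names up front so pandas can parse rows of differing lengths) looks roughly like this; the column count of 10 comes from the widest sample row, and the file name is an assumption:

    import pandas as pd

    # supply as many column names as the widest row (10 in the sample data);
    # shorter rows are padded with NaN
    df = pd.read_csv("test.csv", header=None, names=range(10))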

1 Answer


Here's one approach. I wouldn't call it pretty, but it gets the job done.

  1. The first problem is the different number of fields in each row. You can read the file in one line at a time and store the rows in a list, data, trimming each ID down to just its brand as you go:

    import pandas as pd
    
    # assuming the CSV is named test.csv
    data = []
    with open("test.csv") as f:
        for line in f:
            fields = line.strip().split(",")
            # keep only the brand: "TESLA_2002" -> "TESLA"
            # (date and region contain no "_", so they pass through unchanged)
            just_brand = [x.split("_")[0] for x in fields]
            data.append(just_brand)
    
  2. Now that the file is read into a Python data structure, we can reorder the rows so that the entry with the greatest number of fields is at the top. This suits Pandas, which handles missing trailing fields much better than extra ones: if we start with the widest row, shorter rows that follow are padded gracefully with None.

    # widest row first; shorter rows are padded with None
    df = pd.DataFrame(sorted(data, key=len, reverse=True))
    
    df
                0        1      2      3      4      5      6      7      8      9
    0  10/14/2017  Region2    BMW    BMW    BMW  TESLA  TESLA  TESLA  TESLA  TESLA
    1  11/07/2017  Region2  TESLA    BMW  TESLA  TESLA   None   None   None   None
    2  11/02/2017  Region2  TESLA    BMW  TESLA  TESLA   None   None   None   None
    3  12/01/2017  Region1    BMW  TESLA  TESLA   None   None   None   None   None
    4  10/23/2017  Region1    BMW  TESLA  TESLA   None   None   None   None   None
    5  11/19/2017  Region2    BMW  TESLA   None   None   None   None   None   None
    
  3. From here, it's just a matter of getting things named, organized, and formatted.

    # reshape to long format: one row per (date, region, make) observation
    df = (df.set_index([0, 1])                 # index by date and region
            .stack()                           # one make per row; Nones are dropped
            .reset_index(level=1)              # region becomes a column again
            .rename(columns={1: "region", 0: "make"})
            .reset_index(level=1, drop=True))  # drop the stack-position level
    
    # count makes per (month, region), then pivot the makes into columns
    df = (df.groupby([pd.to_datetime(df.index).month, "region", "make"])
            .make.count()
            .unstack()
            .reset_index()
            .rename(columns={0: "month"}))
    df.columns.name = ""
    
    df
        region  BMW  TESLA  month
    0  Region1    1      2     10
    1  Region2    3      5     10
    2  Region2    3      7     11
    3  Region1    1      2     12
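    
     If the column order matters, one more selection reorders the result to match the requested layout:
    
    df[["month", "region", "BMW", "TESLA"]]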
    
  4. An alternate, less gymnastic approach:

    # get per-row TESLA, BMW counts
    # (note: this uses the DataFrame from step 2, before step 3 reassigns df)
    cts = df.iloc[:, 2:].apply(lambda x: x.value_counts(), axis=1)
    # merge the counts back with date and region
    df2 = pd.concat([df.iloc[:, :2], cts], axis=1)
    # group by month and region, then sum the per-row counts
    (df2.drop(columns=0)  # the raw date strings aren't needed after extracting the month
        .groupby([pd.to_datetime(df2[0]).dt.month, 1])
        .sum()
        .reset_index()
        .rename(columns={0: "month", 1: "region"}))
    
andrew_reece