0

I have a large data set which the head of it looks like this:

          Time  Company 1  Company 2  ...  Company 12  Company 13  Company 14
0    1/01/2015         49         19  ...          80          56          61
1    2/01/2015          2         92  ...          28          98          72
2    3/01/2015         39         69  ...          93          10          66
3    4/01/2015         31         54  ...          78           9          84
4    5/01/2015         37         81  ...          59          32          99
5    6/01/2015         83         69  ...          67          12          69
6    7/01/2015         38          7  ...          37          50          22
7    8/01/2015         75         90  ...          29          99          90
8    9/01/2015          2          7  ...           4          84          59
9   10/01/2015         14         66  ...          40          51          78
10  11/01/2015         20         87  ...           7          26          52
11  12/01/2015         33         84  ...          80          81          69
12  13/01/2015         93         90  ...           4          55          65
13  14/01/2015         95         47  ...          16          43          75
14  15/01/2015         71         66  ...          53          36           0
15  16/01/2015         98         26  ...          41          48          31
16  17/01/2015         29         29  ...           3          13           9
17  18/01/2015         46         96  ...          99          52          71
18  19/01/2015         27         91  ...          50          65          21
19  20/01/2015         45         64  ...          24          97          87

[20 rows x 15 columns]

This is in a csv file, I then have a separate csv file that says which company is in which class, like so:

     Company1  B
0    Company2  B
1    Company3  C
2    Company4  A
3    Company5  A
4    Company6  A
5    Company7  C
6    Company8  B
7    Company9  A
8   Company10  A
9   Company11  B
10  Company12  C
11  Company13  A
12  Company14  C

I want to get rid of the columns company1,...,company14 and instead just have the columns A,B,C. I know how to do this in excel with a simple VLOOKUP function, however my data set is very large and thus using VLOOKUP in excel is not viable.

Is there any way I can achieve my desired result using python?

Here is what my desired output looks like:

          Time    A    B    C
0    1/01/2015  283  228  230
1    2/01/2015  303  158  224
2    3/01/2015  243  163  277
3    4/01/2015  306  259  250
4    5/01/2015  257  232  242
5    6/01/2015  258  160  288
6    7/01/2015  406  136  170
7    8/01/2015  407  180  239
8    9/01/2015  283   98  127
9   10/01/2015  321  144  248
10  11/01/2015  344  296  212
11  12/01/2015  372  175  243
12  13/01/2015  133  249  181
13  14/01/2015  251  245  214
14  15/01/2015  227  263  226
15  16/01/2015  344  154  180
16  17/01/2015  361  146  124
17  18/01/2015  337  290  245
18  19/01/2015  252  247  212
19  20/01/2015  315  169  219

EDIT1:

       Company1  B
0      Company2  B
1      Company3  C
2      Company4  A
3      Company5  A
4      Company6  A
5      Company7  C
6      Company8  B
7      Company9  A
8     Company10  A
9     Company11  B
10    Company12  C
11    Company13  A
12    Company14  C
13    Company15  B
14    Company16  B
15    Company17  C
16    Company18  A
17    Company19  A
18    Company20  A
19    Company21  C
20    Company22  B
21    Company23  A
22    Company24  A
23    Company25  B
24    Company26  C
25    Company27  A
26    Company28  C
27    Company29  B
28    Company30  B
29    Company31  C
..          ... ..
301  Company303  A
302  Company304  A
303  Company305  B
304  Company306  C
305  Company307  A
306  Company308  C
307  Company309  B
308  Company310  B
309  Company311  C
310  Company312  A
311  Company313  A
312  Company314  A
313  Company315  C
314  Company316  B
315  Company317  A
316  Company318  A
317  Company319  B
318  Company320  C
319  Company321  A
320  Company322  C
321  Company323  B
322  Company324  B
323  Company325  C
324  Company326  A
325  Company327  A
326  Company328  A
327  Company329  C
328  Company330  B
329  Company331  A
330  Company332  A

[331 rows x 2 columns]
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
user11015000
  • 151
  • 1
  • 15

2 Answers2

1

You need to do four distinct operations to get to your desired state. It looks like you are using pandas, so I'll outline a solution using pandas.

  1. melt the large dataframe. This lets you turn columns into rows in a similar way that you could use a pivot table in excel.

Pandas Melt Function

  1. merge the large dataframe (make this the left) with the small dataframe (make this the right). This is a more powerful version of a vlookup. You want to use a left join that is on both time and class.

Pandas Merging 101

  1. aggregate the result from your merge and group by time. I assume you want to sum, but you can do many other aggregate functions like count, min, max. Check the pandas documentation for more details.

Pandas group-by and sum

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html

  1. pivot the data into your desired result shape. A pivot is the opposite of melt. With the melt we transposed columns into rows. With a pivot, we are turning rows into columns,

How to pivot a dataframe

Here's a code snippet that is written in logical steps, not for optimal performance. The hope is that you can walk through each step for better understanding. Note that I put headers in the csv files, so you may need to declare column names in your dataframe depending on how you are importing the data.

largeCsv Headers: Time,Company1,Company2,Company3,Company4,Company5

smallCsv Headers: name, class

import pandas as pd

dfSmall = pd.read_csv('c:/temp/smallCsv.csv')

dfLarge = pd.read_csv('c:/temp/largeCsv.csv')

dfMeltedLarge = pd.melt(dfLarge, id_vars=['Time'], var_name='name')

dfMerged = dfMeltedLarge.merge(dfSmall, on='name', how='left')[['Time', 'class', 'value']]

dfSum = dfMerged.groupby(['Time', 'class']).sum()

dfResult = dfSum.pivot_table(index='Time', columns='class', values='value')
1

Below is the reproducible dataset you can run the code with.

The code:

import pandas as pd
pd.options.display.max_columns = 10

headers = [
    'Time',
    'Company1',
    'Company2',
    'Company3',
    'Company4',
    'Company5',
    'Company6',
]
records = [
    ('1/01/2015', 10, 20, 30, 40, 50, 60),
    ('2/01/2015', 100, 200, 300, 400, 500, 600),
    ('3/01/2015', 1000, 2000, 3000, 4000, 5000, 6000),
]

data = pd.DataFrame.from_records(records, columns=headers)

mapper = pd.DataFrame(
    [
        ('Company1', 'A'),
        ('Company2', 'A'),
        ('Company3', 'B'),
        ('Company4', 'B'),
        ('Company5', 'C'),
        ('Company6', 'C'),
    ],
    columns=['Company', 'Class'],
)
for col_name in ['A', 'B', 'C']:
    class_companies_group = mapper[mapper['Class'] == col_name]['Company'].values
    data[col_name] = data[class_companies_group].sum(axis=1)

result = data[['Time', 'A', 'B', 'C']]
print(result)

The output result:

        Time     A     B      C
0  1/01/2015    30    70    110
1  2/01/2015   300   700   1100
2  3/01/2015  3000  7000  11000
Alex Tereshenkov
  • 3,340
  • 8
  • 36
  • 61