0

I have a CSV which contains data as below:

date,datetime,year,month,date,value,name
20170430,2017-04-30 18:30:00,2017,04,30,NaN,A1
20170501,2017-05-01 18:30:00,2017,05,01,121.2,A1
20170430,2018-02-07 18:30:00,2018,02,07,1.23,B1
20170501,2017-07-10 18:30:00,2017,07,10,42.2,C1
20170430,2017-04-30 18:30:00,2017,04,30,32.1,C1

I need to have result as below, i.e. A1, B1, C1 values correspoding to date should be segregated as a seperate column:

date,datetime,year,month,date,A1,B1,C1
20170430,2017-04-30 18:30:00,2017,04,30,NaN,1.23,32.1
20170501,2017-05-01 18:30:00,2017,05,01,121.2,NaN,42.2

I tried to use python pandas pivot method with index as date and columns as name but am getting error as below which is expected because there are multiple entries for A1 and C1

ValueError: Index contains duplicate entries, cannot reshape

import pandas as pd

df = pd.read_csv("D:/datagenicAPI/finalCSV.csv")
print(df)
df1 = df.pivot(index="date", columns="name")
df1.to_csv("d:/datagenicAPI/test1.csv", sep=",")

I need to segregate as seperate columns, can I please know how to achieve the same with python pandas

Shankar Guru
  • 1,071
  • 2
  • 26
  • 46

2 Answers2

1

Loading your example df:

import io
import pandas as pd

s = """
date,datetime,year,month,date,value,name
20170430,2017-04-30 18:30:00,2017,04,30,NaN,A1
20170501,2017-05-01 18:30:00,2017,05,01,121.2,A1
20170430,2018-02-07 18:30:00,2018,02,07,1.23,B1
20170501,2017-07-10 18:30:00,2017,07,10,42.2,C1
20170430,2017-04-30 18:30:00,2017,04,30,32.1,C1
"""
df = pd.read_csv(io.StringIO(s))

Using pivot_table and reset_index you get:

df.pivot_table('value', ['date', 'datetime', 'year', 'month', 'date.1'], 'name').reset_index()

name      date             datetime  year  month  date.1     A1    B1    C1
0     20170430  2017-04-30 18:30:00  2017      4      30    NaN   NaN  32.1
1     20170430  2018-02-07 18:30:00  2018      2       7    NaN  1.23   NaN
2     20170501  2017-05-01 18:30:00  2017      5       1  121.2   NaN   NaN
3     20170501  2017-07-10 18:30:00  2017      7      10    NaN   NaN  42.2

Observe that df contains a column named by pandas as 'date.1', becase in your example there are two columns named 'date'.

abevieiramota
  • 2,481
  • 1
  • 11
  • 3
  • Thanks, this helps !! But, a question is some of the decimals in my csv's are adding lot more decimal places. For example value like 17.388 is modified to 17.387999999999998, please let me know for any idea ? – Shankar Guru May 31 '18 at 08:39
  • @ArunShankar I think it's a problem of float imprecision. To get an explanation you can have a look at Python's docs: https://docs.python.org/3/tutorial/floatingpoint.html – abevieiramota Jun 01 '18 at 09:54
  • Thank you for comment, is it because the decimals are added because the aggfunc is by default mean ? I tried using sum, still no luck. Can I please know on how to handle this directly in pivot_table command ? – Shankar Guru Jun 04 '18 at 08:32
0

I think need two steps , drop_duplicates + unstack, then concat the result

s=df.drop_duplicates('date').iloc[:,:4]
pd.concat([s.set_index('date'),df.set_index(['date','name']).value.unstack()],axis=1)
Out[339]: 
                     datetime  year  month     A1    B1    C1
date                                                         
20170430  2017-04-30 18:30:00  2017      4    NaN  1.23  32.1
20170501  2017-05-01 18:30:00  2017      5  121.2   NaN  42.2
BENY
  • 317,841
  • 20
  • 164
  • 234