2

I am new to pandas. I need to read a xlsx file and convert first column to key of a dict and second column to values of a dict using pandas. I also need to skip / exclude first row which are headers.

The answer here is for pymysql and here is for csv. I need to user pandas.

Here is a sample excel data

dict_key    dict_value  
key1        str_value1  
key2        str_value2  
key3         None  
key4         int_value3  

My code so far is as below.

import pandas as pd

excel_file = "file.xlsx"
xls = pd.ExcelFile(excel_file)
df = xls.parse(xls.sheet_names[0], skiprows=1, index_col=None, na_values=['None'])
data_dict = df.to_dict()

However, it gives me dict where keys are column numbers and values are both column1 data as well as column2 data.

>>> data_dict
{u'Chg_Parms': {0: u'  key1 ', 1: u'   key2 ', 2: u'   key3 ', 3: u'   key4 ', 4: u'   str_value1 ', 
                5: u'   str_value2 ', 6: u'   Nan ', 6: u'   int_value3 '}}

what I would like to have is column1 data as key and column two data as values and also NaN replaced with None

data_dict = {'key1': 'str_value1', 'key2': 'str_value2', 'key3': None, 'key4': int_value3}

Thanks for your help.

Community
  • 1
  • 1
Anil_M
  • 10,893
  • 6
  • 47
  • 74

2 Answers2

3

You can use pandas read_excel method to read the excel file more conveniently. You can pass a index_col argument where you can define which column of your xlsx is the index.

How to change NaN to None is explained in this question.

Given an xlsx file called example.xlsx which is build like you wrote above, the following code should give your expected results:

import pandas as pd

df = pd.read_excel("example.xlsx", index_col=0)
df = df.where(pd.notnull(df), None)

print df.to_dict()["dict_value"]
Community
  • 1
  • 1
johannesmik
  • 731
  • 1
  • 8
  • 18
1

You can use a collections.OrderedDict to keep the keys in order. You'll note that pd.read_excel loads the first sheet by default. Edit: then you say you want to encode the items in the dictionary, and evaluate 'None' as None...

import collections as co
import pandas as pd

df = pd.read_excel('file.xlsx')
df = df.where(pd.notnull(df), None)
od = co.OrderedDict((k.strip().encode('utf8'),v.strip().encode('utf8')) 
                    for (k,v) in df.values)

Result:

>>> od
OrderedDict([(u'key1', u'str_value1'), (u'key2', u'str_value2'), (u'key3', u'None'), (u'key4', u'int_value3')])

General note: you should keep strings as Unicode within your Python program.

mechanical_meat
  • 163,903
  • 24
  • 228
  • 223
  • @ bernie Thanks for the ans. This is definitely towards what I need. However, how do I convert each key value to non unicode representation, strip white space and also maintain its type. for eg. str(u' 1') results in '1' and str(u'None') results in 'None'. I need `int` and `boolean` values as it is. – Anil_M Apr 21 '17 at 17:34
  • @Anil_M: you're very welcome. Please see edited answer. – mechanical_meat Apr 21 '17 at 17:50
  • I added .strip() next to encode('utf8') to take care of white space. I belive that answers my questions. thanks. – Anil_M Apr 21 '17 at 17:58
  • @Anil_M: anytime! Happy coding to you. – mechanical_meat Apr 21 '17 at 17:58