0

Working with pandas and PyTables. Begin by importing a table from excel containing columns of integers and floats, as well as other columns containing strings and even tuples. There are a limited number of options on the excel import and unfortunately, unlike the csv import process, datatypes must be converted from their inferred types after the import and cannot be specified in the process.

That being said, all non-numeric is apparently imported as unicode text which is incompatible with a later export to HDF5. Is there a simple way to convert all unicode columns (as well as all column headings) to an HDF5 compatible string format?

MORE DETAILS:

>>> metaFrame.head()
                               ProjectName Company ContactName  \
LocationID                                                       
935          PCS Petaluma High School Site  Testco   Test Name   
937            PCS Casa Grande High School  Testco   Test Name   
3465               FUSD Fowler High School  Testco   Test Name   
3466             FUSD Sutter Middle School  Testco   Test Name   
3467        FUSD Fremont Elementary School  Testco   Test Name   

                      Contactemail  \
LocationID                           
935         test.address@email.com   
937         test.address@email.com   
3465        test.address@email.com   
3466        test.address@email.com   
3467        test.address@email.com   

                                                         Link  Systemsize(kW)  \
LocationID                                                                      
935         https://internal.testco.com/locations/935/syst...             NaN   
937         https://internal.testco.com/locations/937/syst...          675.39   
3465        https://internal.testco.com/locations/3465/sys...          384.30   
3466        https://internal.testco.com/locations/3466/sys...          198.90   
3467        https://internal.testco.com/locations/3467/sys...           35.10   

           SystemCheckStartdate SystemCheckActive  \
LocationID                                          
935         2013-10-01 00:00:00              True   
937         2013-10-01 00:00:00              True   
3465        2013-10-01 00:00:00              True   
3466        2013-10-01 00:00:00              True   
3467        2013-10-01 00:00:00              True   

            YTDProductionPriortostartdate  NumberofInverters/cktsmonitored  \
LocationID                                                                   
935                                   NaN                              NaN   
937                                   NaN                              NaN   
3465                                  NaN                              NaN   
3466                                  NaN                              NaN   
3467                                  NaN                              NaN   

                                                  InverterMfg InverterModel  \
LocationID                                                                    
935                                     PV Powered : PVP260KW           NaN   
937                                     PV Powered : PVP260KW           NaN   
3465        Advanced Energy Industries : Solaron 333kW (31...           NaN   
3466                                    PV Powered : PVP260KW           NaN   
3467                                 PV Powered : PVP35KW-480           NaN   

            InverterCECefficiency ModuleMfg Modulemodel  \
LocationID                                                
935                          97.0       NaN         NaN   
937                          97.0       NaN         NaN   
3465                         97.5       NaN         NaN   
3466                         97.0       NaN         NaN   
3467                         96.0       NaN         NaN   

            Moduleirradiancefactor  Moduleirradiancefactorslope  \
LocationID                                                        
935                            NaN                          NaN   
937                            NaN                          NaN   
3465                           NaN                          NaN   
3466                           NaN                          NaN   
3467                           NaN                          NaN   

            StraightLineIntercept  ModuleTemp-PwrDerate MeterDK      
LocationID                                                           
935                           NaN                 0.005    3291 ...  
937                           NaN                 0.005   11548 ...  
3465                          NaN                 0.005   19248 ...  
3466                          NaN                 0.005   15846 ...  
3467                          NaN                 0.005   15847 ...  

[5 rows x 27 columns]

>>> metaFrame.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43 entries, 935 to 5844
Data columns (total 27 columns):
ProjectName                        43  non-null values
Company                            43  non-null values
ContactName                        43  non-null values
Contactemail                       43  non-null values
Link                               43  non-null values
Systemsize(kW)                     42  non-null values
SystemCheckStartdate               37  non-null values
SystemCheckActive                  43  non-null values
YTDProductionPriortostartdate      0  non-null values
NumberofInverters/cktsmonitored    2  non-null values
InverterMfg                        42  non-null values
InverterModel                      8  non-null values
InverterCECefficiency              33  non-null values
ModuleMfg                          0  non-null values
Modulemodel                        0  non-null values
Moduleirradiancefactor             0  non-null values
Moduleirradiancefactorslope        0  non-null values
StraightLineIntercept              0  non-null values
ModuleTemp-PwrDerate               43  non-null values
MeterDK                            43  non-null values
Genfieldname                       43  non-null values
WSDK                               34  non-null values
WSirradianceField                  43  non-null values
WSCellTempField                    43  non-null values
MiscDerate                         1  non-null values
InverterDKs                        37  non-null values
Invertergenfields                  37  non-null values
dtypes: bool(1), datetime64[ns](1), float64(9), object(16)
riddley_w
  • 249
  • 1
  • 4
  • 7
  • pytables 3.0.0 (and pandas 0.12) support unicode in stores, on python3 you can use unicode pretty much anywhere. what os, pandas, pytables versions? – Jeff Jan 23 '14 at 22:14
  • mac os mavericks, python 2.7.6, pytables 3.0. Didn't want to upgrade to python 3.0 as i was afraid it'd be an assache. – riddley_w Jan 23 '14 at 22:16
  • can you show your excel reading code? what version of pandas? – Jeff Jan 23 '14 at 22:22
  • version 0.13 of pandas. `metaFrame = pd.read_excel('path', 'MetaData', na_values=['NA'])` The column that I would like to use as index gets interpreted as a float, so I import the data, convert that column to an int and then index on that column: `metaFrame[['LocationID']]=metaFrame[['LocationID']].astype(int)` `metaFrame.index = metaFrame['LocationID']` – riddley_w Jan 23 '14 at 23:06
  • @Jeff perhaps related http://stackoverflow.com/questions/21296571/parse-string-into-columns-python-pandas-xa0-in-stead-of-white-space (I thought it was reading latin-1 as if it were unicode.). Maybe the other excel engine fixes some encoding issues... ? :S – Andy Hayden Jan 23 '14 at 23:11
  • @riddley_w you should use single brackets to reference columns: `metaFrame['LocationID']=metaFrame['LocationID'].astype(int)`. and also `metaFrame.set_index('LocationID')` – Andy Hayden Jan 23 '14 at 23:11
  • can u post df.head() and df.info() for your data? – Jeff Jan 24 '14 at 02:05

0 Answers0