3

Requirement :

One particular column in a DataFrame is 'Mixed' Type. It can have values like "123456" or "ABC12345".

This dataframe is being written into an Excel using xlsxwriter .

For values like "123456", down the line Pandas converting it into 123456.0 ( Making it look like a float)

We need to put it into xlsx as 123456 (i.e as +integer) in case value is FULLY numeric.

Effort :

Code Snippet shown below

import pandas as pd
import numpy as np
import xlsxwriter
import os
import datetime
import sys
excel_name = str(input("Please Enter Spreadsheet Name :\n").strip())

print("excel entered :   "   , excel_name)
df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias',
         'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription',
         'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description',
         'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID',
         'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV',
          'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID',
          'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage']
first_pass_drop_duplicate = df_m_d.drop_duplicates(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType',
                                   'LicenseRightsDescription','FormatProfile','Start','End','PriceType','PriceValue','ContentID','ProductID',
                                   'AltID','ReleaseHistoryPhysicalHV','RatingSystem','RatingValue','CaptionIncluded'], keep=False) 
# We need to keep integer AltID  as is

first_pass_drop_duplicate.loc[first_pass_drop_duplicate['AltID']] =   first_pass_drop_duplicate['AltID'].apply(lambda x : str(int(x)) if str(x).isdigit() == True else x)

I have tried :

1. using `dataframe.astype(int).astype(str)` # works as long as value is not alphanumeric
2.importing re and using pure python `re.compile()` and `replace()` -- does not work
3.reading DF row by row in a for loop !!! Kills the machine as dataframe can have 300k+ records

Each time, error I get:

raise KeyError('%s not in index' % objarr[mask])
KeyError: '[ 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 2124. 2124. 2124. 2124. 2124. 2124.\n 2124. 2124. 6643. 6643. 6643. 6643. 6643. 6643.\n 6643. 6643. 6643. 6643. 6643. 6643. 6643. 6643.\n 6643. 6643. 6643. 6643. 6643. 6643. 6643. 6643.\n 6643. 6643. 6643. 6643. 6643. 6643. 6643. 6643.] not in index'

I am newbie in python/pandas , any help, solution is much appreciated.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
SanBan
  • 635
  • 5
  • 12
  • So you need convert only numeric values to `float` and non numeric not? – jezrael Sep 21 '16 at 05:44
  • I need to make sure it treats a + integer as a TEXT/STRING and does not add a .0 (decimal point) in the end which actually shows up in excel. – SanBan Sep 21 '16 at 05:50
  • So you need convert all values to `type` `string`? and problem is `Excel` parse `int` values converted to `string` as `float` ? – jezrael Sep 21 '16 at 05:53
  • exactly, any value in AltID should be treated as string in Pandas – SanBan Sep 21 '16 at 05:55
  • I tried your solution `first_pass_drop_duplicate.ix[first_pass_drop_duplicate.AltID.str.isdigit(), 'AltID'] = pd.to_numeric(first_pass_drop_duplicate.AltID, errors='coerce') – SanBan Sep 21 '16 at 05:56
  • Got following error : `raise AttributeError("Can only use .str accessor with string " AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas` – SanBan Sep 21 '16 at 05:57
  • Hmmm, try convert to `string` first: `first_pass_drop_duplicate.ix[first_pass_drop_duplicate.AltI‌​D.astype(str).str.isdigit(), 'AltID'] = pd.to_numeric(first_pass_drop_duplicate.AltID, errors='coerce')` – jezrael Sep 21 '16 at 05:58
  • Or use [`piRSquared`](http://stackoverflow.com/questions/39608282/pandas-error-trying-to-convert-string-into-integer#comment66524836_39608394) idea - `fillna(False)` – jezrael Sep 21 '16 at 06:02

3 Answers3

2

I think you need to_numeric:

df = pd.DataFrame({'AltID':['123456','ABC12345','123456'],
                   'B':[4,5,6]})

print (df)
      AltID  B
0    123456  4
1  ABC12345  5
2    123456  6

df.ix[df.AltID.str.isdigit(), 'AltID']  = pd.to_numeric(df.AltID, errors='coerce')

print (df)
      AltID  B
0    123456  4
1  ABC12345  5
2    123456  6

print (df['AltID'].apply(type))
0    <class 'float'>
1      <class 'str'>
2    <class 'float'>
Name: AltID, dtype: object
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • That is brilliant! It didn't work for my series because the 4th element was already an `int`. `pd.Series([1], dtype=object).str.isdigit()` returns `NaN`. I had to do this: `s.ix[s.str.isdigit().fillna(False)] = pd.to_numeric(s, errors='coerce')` and it worked perfect. – piRSquared Sep 21 '16 at 05:52
  • And! This is almost certainly faster. – piRSquared Sep 21 '16 at 05:54
  • @piRSquared - thank you. Another solution is `df.ix[df.AltID.astype(str).str.isdigit(), 'AltID'] = pd.to_numeric(df.AltID, errors='coerce')` – jezrael Sep 21 '16 at 05:59
1

Use apply and pd.to_numeric with parameter errors='ignore'

consider the pd.Series s

s = pd.Series(['12345', 'abc12', '456', '65hg', 54, '12-31-2001'])

s.apply(pd.to_numeric, errors='ignore')

0         12345
1         abc12
2           456
3          65hg
4            54
5    12-31-2001
dtype: object

Notice the types

s.apply(pd.to_numeric, errors='ignore').apply(type)

0    <type 'numpy.int64'>
1            <type 'str'>
2    <type 'numpy.int64'>
3            <type 'str'>
4            <type 'int'>
5            <type 'str'>
dtype: object
piRSquared
  • 285,575
  • 57
  • 475
  • 624
1

Finally it worked by using 'converters' option in pandas read_excel format as

df_w02 = pd.read_excel(excel_name, names = df_header,converters = {'AltID':str,'RatingReason' : str}).fillna("")

converters can 'cast' a type as defined by my function/value and keeps intefer stored as string without adding decimal point.

SanBan
  • 635
  • 5
  • 12