
What can I do to prevent pandas from converting my string values to float? The columns Billing Doc. and Sales Order contain 10-11 digit numbers that are to be stored in a MySQL table in columns with a datatype of CHAR(15). When I execute the following script, I see .0 at the end of each number; I want to treat them as string/char in our DB. The Billing Doc. field contains numbers like 3206790137, 3209056079, 3209763880, 3209763885, 3206790137, which end up stored in the DB as 3206790137.0, 3209056079.0, 3209763880.0, 3209763885.0, 3206790137.0. The column data type for Billing Doc. in the database is CHAR(15).

import datetime as DT

import pandas as pd


def insert_billing(df):
    # Replace NaN/NaT with None so they can be mapped to SQL NULL below.
    df = df.where(pd.notnull(df), None)
    for row in df.to_dict(orient="records"):
        bill_item = row['Bill.Item']
        bill_qty = row['Billed Qty']
        bill_doc_date = row['Billi.Doc.Date']
        bill_doc = row['Billing Doc.']
        bill_net_value = row['Billi.Net Value']
        sales_order = row['Sales Order']
        import_date = DT.datetime.now().strftime('%Y-%m-%d')

        query = "INSERT INTO sap_billing(" \
                "bill_item, " \
                "bill_qty, " \
                "bill_doc_date, " \
                "bill_doc, " \
                "bill_net_value, " \
                "sales_order, " \
                "import_date" \
                ") VALUES (" \
                "\"{}\", \"{}\", \"{}\", \"{}\"," \
                "\"{}\", \"{}\", \"{}\"" \
                ") ON DUPLICATE KEY UPDATE " \
                "bill_qty = VALUES(bill_qty), " \
                "bill_doc_date = VALUES(bill_doc_date), " \
                "bill_net_value = VALUES(bill_net_value), " \
                "import_date = VALUES(import_date)".format(
                    bill_item,
                    bill_qty,
                    bill_doc_date,
                    bill_doc,
                    bill_net_value,
                    sales_order,
                    import_date
                )
        # Missing values were set to None above; rewrite them as SQL NULL.
        query = query.replace('\"None\"', 'NULL')
        query = query.replace('(None', '(NULL')
        query = query.replace('\"NaT\"', 'NULL')
        query = query.replace('(NaT', '(NULL')

        try:
            gesdb_connection.execute(query)
        except Exception as e:
            print(bill_item, bill_doc, sales_order, e)


if __name__ == "__main__":
    engine_str = 'mysql+mysqlconnector://root:abc123@localhost/mydb'

    file_name = "tmp/dataload/so_tracking.XLSX"
    df = pd.read_excel(file_name)
    if df.shape[1] == 35 and compare_columns(list(df.columns.values)) == 1:
        insert_billing(df)
    else:
        print("Incorrect column count, column order or column headers.\n")

When I create a simple df and print it, the problem does not show up.

import pandas as pd
df = pd.DataFrame({'Sales Order': [1217252835, 1217988754, 1219068439], 
                   'Billing Doc.': [3222102723, 3209781889, 3214305818]})
>>> df
   Billing Doc.  Sales Order
0    3222102723   1217252835
1    3209781889   1217988754
2    3214305818   1219068439

However, when I read it from Excel and then print it, the column is read as float64.

file_name = "tmp/dataload/so_tracking.XLSX"
df = pd.read_excel(file_name)
print(df['Billing Doc.'])

680    3.252170e+09
681    3.252170e+09
682    3.252170e+09
683    3.252170e+09
684    3.252170e+09
685    3.252170e+09
686    3.252170e+09
687    3.252170e+09
688    3.252170e+09
689    3.252170e+09
690    3.252170e+09
.
.
.
694    3.251601e+09
695    3.251631e+09
696    3.252013e+09
697             NaN
698    3.252272e+09
699    3.252360e+09
700    3.252474e+09
.
.
Name: Billing Doc., dtype: float64
– nomad
  • Could you distill this down to a reproducible example? No one else has access to your database or spreadsheet. So any attempt to help will just be speculation. – Paul H Mar 01 '17 at 22:29
  • Pandas purists may not like this quick fix, but I use `pd.read_csv('file.csv', dtype = object)` and it keeps pandas from converting numbers to floats. I'm fairly certain you can replace `read_csv()` with other DataFrame creating functions. – elPastor Mar 01 '17 at 23:41
  • @PaulH I added a sample. – nomad Mar 06 '17 at 15:31
  • @pshep123. Thanks for your input. It is a .xlsx file and can't be converted into csv as I get it from somewhere else. – nomad Mar 06 '17 at 15:32
  • No, I'm suggesting you use the `dtype = object` as a `read_excel() ` param. I haven't tested, but might be worth a shot. – elPastor Mar 06 '17 at 15:34
  • @nomad: read through the answers for the question below and try again. this question needs radical revisions on several levels. http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Paul H Mar 06 '17 at 15:52
  • @PaulH I tried showing the problem the way you mentioned, but that way the issue is not there. When I read the Excel file using read_excel, the column is read as dtype: float64. – nomad Mar 06 '17 at 16:13
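
For reference, a minimal sketch of the dtype suggestion from the comments (assuming a pandas version in which read_excel accepts the dtype argument, i.e. 0.20 or later; the file path is the one from the question):

import pandas as pd

file_name = "tmp/dataload/so_tracking.XLSX"

# Read every column as plain objects so pandas does no numeric inference at all.
df_raw = pd.read_excel(file_name, dtype=object)

# Or force only the ID-like columns to string and leave the rest numeric.
df = pd.read_excel(file_name, dtype={'Billing Doc.': str, 'Sales Order': str})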

3 Answers


I found the solution myself, posting here to document it.

df = pd.read_excel(file_name, converters={'Billing Doc.' : str})
print(df['Billing Doc.'])

695    3251631331
696    3252012614
697           NaN
698    3252272451
699    3252359504
700    3252473894
701           NaN
702           NaN
703           NaN
704    3252652940
705           NaN
706           NaN
707           NaN
708           NaN
Name: Billing Doc., dtype: object
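
If the Sales Order column needs the same treatment, the same converters mapping can cover both columns; the empty cells still come back as NaN, which the df.where(pd.notnull(df), None) call in insert_billing() then turns into None (and ultimately SQL NULL):

df = pd.read_excel(file_name,
                   converters={'Billing Doc.': str, 'Sales Order': str})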
– nomad

Something similar happened to me because the index of the new column did not match the index of the original dataframe, which resulted in NaNs and caused an automatic upcast to float (see the short example after this list). Therefore, check whether:

  • the index of the original dataframe and the new column match
  • the new column contains NaNs.
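
A minimal illustration of that effect, with made-up column names:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})            # default index 0, 1, 2
s = pd.Series([10, 20, 30], index=[1, 2, 3])   # index shifted by one

df['b'] = s            # aligns on the index, so row 0 gets NaN
print(df['b'].dtype)   # float64 -- the NaN forces the upcast from int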
– meliksahturker

try this:

df = df.astype(str)

note that this is VERY inefficient

or convert each value to int before inserting it into the query, as in the sketch below
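
For example, inside the insert loop from the question (a sketch; bill_doc here is the float value read from the DataFrame):

bill_doc = row['Billing Doc.']
# Go through int to drop the trailing .0; keep missing values as None (-> NULL).
bill_doc = str(int(bill_doc)) if pd.notnull(bill_doc) else None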

– Alexander Reshytko