0

In work, we use oracle sql database, in times on times(rarely, but happens), the database is feed with data in wrong format, like this:

Sales Price
s1 10.00
s2 10,00
s3 10

All lines has same price, but in different formats, how can i standardize price column in same format using python?

Follow the code used:

import pandas as pd
import cx_Oracle
import numpy as np

cx_Oracle.init_oracle_client(path to oracle client)

def connect(user, password, host):
    connection = cx_Oracle.connect(user=user, password = password, dsn = host)
    cursor = connection.cursor()
    return cursor

def sql(query,cursor):
    cursor.execute(query)
    result = cursor.fetchall()
    cols = [i[0] for i in cursor.description]
    df = pd.DataFrame(result, columns=[cols])
    return df

query = """
querie
"""

df = sql(query,cursor)
df.columns = df.columns.get_level_values(0)
  • Maybe you can try https://stackoverflow.com/questions/6633523/how-can-i-convert-a-string-with-dot-and-comma-into-a-float-in-python. – Ynjxsjmh Apr 14 '22 at 16:21
  • Is it safe to assume that you are looking to keep decimal places, and that instances of integers should be padded with `.00`? i.e. `10` becomes `10.00` and `10,00` becomes `10.00` – Simon Apr 14 '22 at 16:34
  • @Simon exactly! – Rafael Pereira Apr 14 '22 at 17:35
  • @Ynjxsjmh I believe it doesn't apply here, because the data is all interpreted as a localized number. The biggest problem is that part of the data uses the decimal separator "," and others ".", if they were all using just one separator it would just be a localization. – Rafael Pereira Apr 14 '22 at 17:39
  • `REGEXP_REPLACE(Price, '[^0-9]+', '.')` could be an oracle answer, simply replaces all non-numeric characters with `.` so that things are more consistent. – BeRT2me Apr 14 '22 at 17:42

3 Answers3

1

Looking at your code, the problem is that python is recognizing commas as decimal separator.

Therefore, you can change the comma from the cursor.fetchall() response and then construct the dataframe.

import pandas as pd
import cx_Oracle
import numpy as np

cx_Oracle.init_oracle_client(path to oracle client)

def connect(user, password, host):
    connection = cx_Oracle.connect(user=user, password = password, dsn = host)
    cursor = connection.cursor()
    return cursor

def sql(query,cursor):
    cursor.execute(query)
    result = cursor.fetchall()
    new_result = [[str(i).replace(',', '.') for i in r] for r in result]
    cols = [i[0] for i in cursor.description]
    df = pd.DataFrame(result, columns=[cols])
    return df

query = """
querie
"""

df = sql(query,cursor)
df.columns = df.columns.get_level_values(0)

if is still recognize Price column as string, you can convert using:

df['Price'] = df['Price'].astype(float)

Hope it helps!

  • what happens is that he doesn't fail to transform the numbers and they end up being transformed wrong, using the data above what i receive is: s1 --- 10.00 s2 --- 1000 s3 --- 10.00 – Rafael Pereira Apr 14 '22 at 16:04
  • @RafaelPereira Can you explain how `s2 10,00` becomes `1000`? His code simply replaces a comma with a period. – Simon Apr 14 '22 at 16:41
  • @RafaelPereira Try to explicitly the data type of the column to be read as a string. If you are reading a csv file, you can send the param dtype={ 'Price': str} and then convert to float replacing ',' to '.'. – Gabriel Doretto Apr 14 '22 at 17:07
  • @Simon I updated the question with the code, but I just download the base data and this conversion happens, I don't know why it happens – Rafael Pereira Apr 14 '22 at 17:33
  • @GabrielDoretto Data are saved on an Oracle database – Rafael Pereira Apr 14 '22 at 17:34
  • 1
    @RafaelPereira After seeing your code, one alternative is to change the value before transforming to dataframe, I will edit my answer with the code I've tested – Gabriel Doretto Apr 14 '22 at 17:59
1
# import
import pandas as pd

# test values
df = pd.DataFrame({'Sales': ['s1', 's2', 's3', 's4'], 'Price': ['10.00', '10,00', 10, 9]})

# convert all to string/object type for consistency
# can comment this out if all values are already string/object type
df['Price'] = df['Price'].astype(str)
# replace comma with period
df['Price'] = df['Price'].str.replace(',', '.')

# get index of values that do not have decimal places (period)
index = df[df['Price'].str.contains('\.') == False].index

# pad decimal to values that do not have decimal places
df.loc[index, 'Price'] = df.loc[index, 'Price'] + '.00'

As a last step you can optionally convert the values back to float/decimal if needed

Simon
  • 1,201
  • 9
  • 18
1

Easiest would be to inject a type handler. The following swaps commas and periods, but you can adjust it as needed. In a pure cx_Oracle example:

def type_handler(cursor, name, default_type, size, precision, scale):
    if default_type == oracledb.DB_TYPE_NUMBER:
        return cursor.var(oracledb.DB_TYPE_VARCHAR, arraysize=cursor.arraysize,
                outconverter=lambda v: v.replace('.', ','))

conn.outputtypehandler = type_handler
cursor.execute("select 2.5 from dual")
Christopher Jones
  • 9,449
  • 3
  • 24
  • 48