I have a dataframe df
. Two of its columns ('neighborhood'
and 'price'
) contain strings. Each string in these two columns contain a number.
My goal is to create two lists containing only the numbers in the strings, and then overwrite the old columns in df
with the new lists, in such a way that pandas' .corr() will be able to recognize and operate on them.
Here is my current code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import data
df = pd.read_csv('data.csv',sep='|')
# Clean data (throw away room_id, remove html from neighborhood, make price float )
df = df.drop('room_id', 1)
neighborhood = np.array([float(n[3:-4]) for n in df['neighborhood']]).astype(np.float64)
price = np.array([float(p[7:-1]) for p in df['price']]).astype(np.float64)
df_updates = pd.DataFrame({'neighborhood' : neighborhood, 'price' : price})
df.update(df_updates)
# Print first row of dataframe and the output of df.corr()
print(df.iloc[0])
print(df.corr())
# Print types
print(type(neighborhood[0]))
print(type(price[0]))
print(type(df['neighborhood'][0]))
print(type(df['price'][0]))
As seen below, .corr() doesn't recognize the new 'neighborhood'
and 'price'
columns as things it can operate on.
Out []:
room_type Entire home/apt
neighborhood 5
reviews 0
satisfaction 0
acc. 6
bedrooms 3
price 80
Name: 0, dtype: object
reviews satisfaction acc. bedrooms
reviews 1.000000 0.520951 -0.037194 -0.064366
overall_satisfaction 0.520951 1.000000 -0.019771 -0.052900
accommodates -0.037194 -0.019771 1.000000 0.720229
bedrooms -0.064366 -0.052900 0.720229 1.000000
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'float'>
<class 'float'>
<class 'numpy.int64'>
<class 'numpy.float64'>
I suspect the above happens because the entries of 'neighborhood'
and 'price'
are simple floats (instead of np.float64's) in the dataframe, even though the corresponding ndarrays contained np.float64's when being passed through .update.
Q: Why does this happen, and how can I fix it?