0

I have a dataframe df. Two of its columns ('neighborhood' and 'price') contain strings. Each string in these two columns contain a number.

My goal is to create two lists containing only the numbers in the strings, and then overwrite the old columns in df with the new lists, in such a way that pandas' .corr() will be able to recognize and operate on them.

Here is my current code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import data
df = pd.read_csv('data.csv',sep='|')

# Clean data (throw away room_id, remove html from neighborhood, make price float )
df = df.drop('room_id', 1)
neighborhood = np.array([float(n[3:-4]) for n in df['neighborhood']]).astype(np.float64)
price = np.array([float(p[7:-1]) for p in df['price']]).astype(np.float64)
df_updates = pd.DataFrame({'neighborhood' : neighborhood, 'price' : price})
df.update(df_updates)

# Print first row of dataframe and the output of df.corr()
print(df.iloc[0])
print(df.corr())

# Print types
print(type(neighborhood[0]))
print(type(price[0]))
print(type(df['neighborhood'][0]))
print(type(df['price'][0]))

As seen below, .corr() doesn't recognize the new 'neighborhood' and 'price' columns as things it can operate on.

Out []: 

room_type               Entire home/apt
neighborhood                          5
reviews                               0
satisfaction                          0
acc.                                  6
bedrooms                              3
price                                80
Name: 0, dtype: object

                      reviews               satisfaction  acc.      bedrooms
reviews               1.000000              0.520951     -0.037194 -0.064366
overall_satisfaction  0.520951              1.000000     -0.019771 -0.052900
accommodates         -0.037194             -0.019771      1.000000  0.720229
bedrooms             -0.064366             -0.052900      0.720229  1.000000

<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'float'>
<class 'float'>
<class 'numpy.int64'>
<class 'numpy.float64'>

I suspect the above happens because the entries of 'neighborhood' and 'price' are simple floats (instead of np.float64's) in the dataframe, even though the corresponding ndarrays contained np.float64's when being passed through .update.

Q: Why does this happen, and how can I fix it?

Bobson Dugnutt
  • 333
  • 1
  • 12
  • what are the original dtypes of `df`? can you try `print(df.dtypes)`. There may be a "funky" value in the column throwing it off – MattR Jan 17 '18 at 20:12
  • @MattR You're right! Both `price` and `neighborhood` are `object`! What to do though? – Bobson Dugnutt Jan 17 '18 at 20:14
  • the simple answer: find the funky values, haha. You can try using [`select_dtypes`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html) to filter out what might be causing it or some other way. – MattR Jan 17 '18 at 20:16
  • @MattR I figured it out, thanks! – Bobson Dugnutt Jan 17 '18 at 20:21

1 Answers1

0

Thanks to @MattR for giving the right hint! As seen with df.dtypes, the two problematic columns where objects, and .corr() can't handle that. The solution (as found in this answer) is to do the following after defining a list prob_cols containing the names of the problematic columns:

df[prob_cols] = df[prob_cols].apply(pd.to_numeric)
Bobson Dugnutt
  • 333
  • 1
  • 12