Python - Cannot convert string to float due to headers being captured in the conversion

Question

I'm trying to convert specific columns from string to float in Python but I'll always end up with the error:

cannot convert string to float: 'Tour Delay Minutes'

Tour Delay Minutes is the name of a specific column and contains values such as 6.31 or integers (in case the result is an integer) like 9,10 etc. My code is:

import pandas as pd
import numpy as np

data = pd.read_csv('H:\\testing.csv',thousands = ',')
data.drop([0], axis=1) #Removes the header? based on another post
cols=['Tour Delay Minutes','Passenger Delay Minutes','Driver Delay Minutes','Engine Failures','Vehicle Failures'] #Columns containing ints and floats 
for col in cols: #Loop to transform all column strings to floats by default
    data[col]= data[col].astype(dtype=np.float64)
data.info()

The dtypes specified at point of loading are:

Unnamed: 0                         int64
Time Period                       object #contains day,midday,early afternoon
Tour Number                       object #contains integers
Tour Delay Minutes                object #contains float numbers
Passenger Delay Minutes           object #contains float numbers
Driver Delay Minutes              object #contains float numbers
Engine Failures                   object #contains integer numbers
Vehicle Failures                  object #contains integer numbers

I suppose the error will also apply for all other columns marked as object (as shown above) and that's because Python tries to convert the header as well (row 1). Any workarounds on this please? I've also tried the below code, however it didn't work:

data['Tour Delay Minutes'].astype(str).astype(float)

Edit: Sample dataset added to assist in finding the solution - PLease see link:

https://i.stack.imgur.com/o4zcX.png

 Unnamed: 0 (index) Time Period Tour Number Tour Delay Minutes  Passenger Delay Minutes Driver Delay Minutes    Engine Failures Vehicle Failures
0   2018/19-P08 261803  11  6   5   2   0
1   2018/19-P08 325429  16  12  4   0   0
2   2018/19-P08 359343  14  5   9   0   0
3   2018/19-P08 366609  18  10  8   0   0
4   2018/19-P08 370697  63  37  26  2   0
5   2018/19-P08 392535  1474    140 1334    37.1194012  0.022591857
6   2018/19-P09 394752  0   0   0   0   0
7   2018/19-P09 408713  31  13  18  1.25    0
8   2018/19-P09 433763  62  49  13  4.766666667 1
9   2018/19-P09 440100  0   0   0   1   1
10  2018/19-P09 440258  17  14  3   1   0
11  2018/19-P10 440280  46  46  0   2.933333333 2
12  2018/19-P10 440929  22  7   15  1   0
13  2018/19-P10 441110  26  13  13  0   0
14  2018/19-P10 441585  4   0   4   0   0
15  2018/19-P10 442092  39  12  27  1.923076923 0
16  2018/19-P11 442105  0   0   0   0   0
17  2018/19-P11 442173  3   0   3   0   0
18  2018/19-P11 443580  4   2   2   0.428571429 0
19  2018/19-P11 443594  3   2   1   0.285714286 0
20  2018/19-P12 443599  2   1   1   0.285714286 0
21  2018/19-P12 443709  5   0   5   0   0
22  2018/19-P12 443885  3   0   3   0   0
23  2018/19-P12 444040  15  9   6   0.857142857 0
24  2018/19-P12 445021  3   0   3   0   0

Edit 2: Added actual sample dataset - Image link still available

Can you please add the actual sample dataset and not the screenshot image of the dataset so that people can work on it? thank you — Jithin P James, Mar 19 '19 at 08:47

makman · Accepted Answer · 2019-03-21T10:17:46.343

After a couple of days, testing the file and running different scripts I think I resolved my issue.

The issue was that duplicate headers were being added from another CSV combine script and that caused problems when trying to convert the columns of my master file.

My question now is, how can I eliminate the duplicate headers from the master CSV file (approx. 17 million rows)?

Edit: Duplicate headers issue resolved

I followed the advice provided by jezrael and was able to successfully eliminate/remove all duplicate rows containing the headers. Link attached for those interested. A big thank you to the users who contributed to resolving this issue.

 y = x[~x['Time Period'].str.contains('Time Period')]
#The above helped me remove all applicable rows that contained the string "Time Period

    y.to_csv("H:\\modded.csv")
data1 = pd.read_csv("H:\\modded.csv")
data1.dtypes
#I then save "y" as new CSV file, load the new dataset and voila the columns containing numbers have changed dtypes to float64.

How to drop rows from pandas data frame that contains a particular string in a particular column?

If you encountered similar issue and this answer has resolved it, please don't forget to upvote both questions and answers.

If you are still wondering about an answer I recommend asking another separate question and then link it here in the comments. — Progress, Apr 09 '22 at 15:10
Hi there, please read my answer above including the attached question. I was able to resolve both issues by referring to the attached question and accepted solution. :) — makman, Oct 24 '22 at 15:16

Python - Cannot convert string to float due to headers being captured in the conversion

1 Answers1