1

Recently I have been developing some code to read a csv file and store key data columns in a dataframe. Afterwards I plan to have some mathematical functions performed on certain columns in the dataframe.

I've been fairly successful in storing the correct columns in the dataframe. I have been able to have it do whatever maths is necessary such as summations, additions of dataframe columns, averaging etc.

My problem lies in accessing specific columns once they are stored in the dataframe. I was working with a test file to get everything working and managed this no problem. The problems arise when I open a different csv file, it will store the data in the dataframe, but the accessing the column I want no longer works and it stops at the calculation part.

From what I can tell the problem lies with how it reads the column name. The column names are all numbers. For example, df['300'], df['301'] etc. When accessing the column df['300'] works fine in the testfile, while the next file requires df['300.0']. If I switch to a different file it may require df['300'] again. All the data was obtained in the same way so I am not certain why some are read as 300 and the others 300.0.

Short of constantly changing the column labels each time I open a different file, is there anyway to have it automatically distinguish between '300' and '300.0' when opening the file, or force '300.0' = '300'?

Thanks

Si M
  • 19
  • 1
  • Could you not just do df[str(int(300.0))] ? – SuperStew Nov 13 '17 at 21:49
  • "is there anyway to have it automatically distinguish between `'300'` and `'300.0'` when opening the file, or force `'300.0'` = `'300'`?" *You* need to decide which one is appropriate for your problem. I'd bet $10 that they would give you different results. – jpmc26 Nov 14 '17 at 01:47

3 Answers3

1

In your dataframe df, one way to keep consistency may be to convert to similar types of columns. You can update all the column name to string value of integer from float i.e. '300.0' to '300' using .columns as below. Then, I think using integer value of string should work i.e. df['300] or any other columns other than 300.

df.columns = [str(int(float(column))) for column in df.columns]

Or, if integer value is not required,extra int conversion can be removed and float string value can be used:

df.columns = [str(float(column)) for column in df.columns]

Then, df['300.0'] can be used instead of df['300'].

If string type is not required then, I think converting them float would work as well.

df.columns = [float(column) for column in df.columns]

Then, df[300.0] would work as well.

Other alternative to change column names may be using map:

Changing to float value for all columns, then as mentioned above use df[300.0]:

df.columns = map(float, df.columns)

Changing to string value of float, then df['300.0']:

df.columns = map(str, map(float, df.columns))

Changing to string value of int, then df['300']:

df.columns = map(str, map(int, map(float, df.columns)))

niraj
  • 17,498
  • 4
  • 33
  • 48
  • From what I have read, writing to df.columns doesn't work because of some kind of internal state that pandas keeps separately, and you need to use the 'rename' function (my answer has an example). I haven't downvoted though, I'm not sure if my information is out of date. – mbrig Nov 13 '17 at 22:32
  • @mbrig Not sure, but it worked with test data I used, using pandas version `0.21.0` – niraj Nov 13 '17 at 22:33
  • 1
    It might just be that changing individual column names (like `df.columns[5] = 'new'`) doesn't work, I can't tell. I'll leave my answer up as an alternative but yours looks simpler so +1. – mbrig Nov 13 '17 at 22:37
0

Some solutions:

  1. Go through all the files, change the columns names, then save the result in a new folder. Now when you read a file, you can go to the new folder and read it from there.

  2. Wrap the normal file read function in another function that automatically changes the column names, and call that new function when you read a file.

  3. Wrap column selection in a function. Use a try/except block to have the function try to access the given column, and if it fails, use the other form.

Acccumulation
  • 3,491
  • 1
  • 8
  • 12
0

This answer assumes you want only the integer part to remain in the column name. It takes the column names and does a float->int->string conversion to strip the decimal places.

Be careful, if you have numbers like '300.5' as a column name, this will turn them into '300'.

cols = df.columns.tolist()
new_columns = dict([(c,str(int(float(c)))) for c in cols])
df = df.rename(columns = new_columns)

For clarity, most of the 'magic' is happening on the middle line. I iterate over the currently existing columns, and turn them into tuples of the form (old_name, new_name). df.rename takes that dictionary and then does the renaming for you.

My thanks to user Nipun Batra for this answer that explained df.rename.

mbrig
  • 929
  • 12
  • 16