I'm trying to create a new column in my pandas dataframe which will be the difference of two other columns, but the new column has values that are significantly different what what the differences between the values of the columns are. I have heard that 'float' values often don't subtract precisely, so I have tried to convert the decimal values here to integers by changing the columns' dtypes to 'int64' (as suggested here Pandas Subtract Two Columns Not Working Correctly) and then multiplying each value by 100000:
# Read in data
data = pd.read_csv('/Users/aaron/Downloads/treatment_vs_control.csv')
# Cleaning and preprocessing
data.rename({"names": "Gene"}, axis=1, inplace=True)
columns = data.columns
garbage_filter = columns.str.startswith('names') | columns.str.startswith('0') | columns.str.startswith('1') | \
columns.str.startswith('Gene')
data = data.loc[:,garbage_filter]
scores_filter = columns.str.endswith('scores')
columns = data.columns
scores_filter = columns.str.endswith('scores')
data = data.iloc[:,~scores_filter]
## To create Diff columns correctly, change logFC columns to integer dtype
data = data.astype({'1_logfoldchanges': 'int64', '0_logfoldchanges': 'int64'})
data['1_logfoldchanges'] = data['1_logfoldchanges'] * 100000
data['0_logfoldchanges'] = data['0_logfoldchanges'] * 100000
data["diff_logfoldchanges0"] = data['0_logfoldchanges'] - data['1_logfoldchanges']
data["diff_logfoldchanges1"] = data['1_logfoldchanges'] - data['0_logfoldchanges']
data['1_logfoldchanges'] = data['1_logfoldchanges'] / 100000
data['0_logfoldchanges'] = data['0_logfoldchanges'] / 100000
data['diff_logfoldchanges0'] = data['diff_logfoldchanges0'] / 100000
data['diff_logfoldchanges1'] = data['diff_logfoldchanges1'] / 100000
data = data.astype({'1_logfoldchanges': 'float64', '0_logfoldchanges': 'float64'})
data.sort_values('diff_logfoldchanges0', ascending=False, inplace=True)
The values in the new column still do not equal the difference in the two original columns and I haven't been able to find any questions on this site or others that have been able to help me resolve this. Could someone point out how I could fix this? I would be extremely grateful for any help.
For reference, here is a snapshot of my data with the incorrect difference-column values:
EDIT: Here is a a bit of my CSV data too:
names,0_scores,0_logfoldchanges,0_pvals,0_pvals_adj,1_scores,1_logfoldchanges,1_pvals,1_pvals_adj,2_scores,2_logfoldchanges,2_pvals,2_pvals_adj,3_scores,3_logfoldchanges,3_pvals,3_pvals_adj,4_scores,4_logfoldchanges,4_pvals,4_pvals_adj,5_scores,5_logfoldchanges,5_pvals,5_pvals_adj,6_scores,6_logfoldchanges,6_pvals,6_pvals_adj,7_scores,7_logfoldchanges,7_pvals,7_pvals_adj,8_scores,8_logfoldchanges,8_pvals,8_pvals_adj
0610005C13Rik,-0.06806567,-1.3434665,0.9457333570044608,0.9996994148075796,-0.06571575,-2.952315,0.9476041278614572,0.9998906553041256,0.17985639,1.9209933,0.8572653106998014,0.9994124851941415,-0.0023527155,0.85980946,0.9981228063933416,0.9993920957240323,0.0021153346,0.08053488,0.9983122084427253,0.9993417421686092,0.07239167,2.6473796,0.9422902189641795,0.9998255096296015,-0.029918168,-18.44805,0.9761323166853361,0.998901292435457,-0.021452557,-18.417543,0.9828846479876278,0.9994515175269552,-0.011279659,-18.393742,0.9910003250967939,0.9994694916208285
0610006L08Rik,-0.015597747,-15.159286,0.9875553033428832,0.9996994148075796,-0.015243248,-15.13933,0.9878381189626457,0.9998906553041256,-0.008116434,-14.795435,0.9935240935555751,0.9994124851941415,-0.0073064035,-14.765995,0.9941703851753109,0.9993920957240323,-0.0068988753,-14.752146,0.9944955375479378,0.9993417421686092,0.100005075,18.888618,0.9203402935001026,0.9998255096296015,-0.004986361,-14.696446,0.9960214758176429,0.998901292435457,-0.0035754263,-14.665947,0.9971472286106732,0.9994515175269552,-0.0018799432,-14.64215,0.9985000232597367,0.9994694916208285
0610009B22Rik,0.7292792,-0.015067068,0.46583086269639506,0.9070814087688549,0.42489842,0.18173021,0.67091072915639,0.9998906553041256,17.370018,1.0877438,1.3918130408174961e-67,6.801929840389262e-67,-6.5684495,-1.237505,5.084194721546539e-11,3.930798968247645e-10,-5.6669636,-0.42557448,1.4535041077956595e-08,5.6533712043729706e-08,-3.5668032,-0.5939982,0.0003613625764821466,0.001766427013499565,-7.15373,-1.7427195,8.445118618740649e-13,4.689924532441606e-12,-2.6011736,-0.66274893,0.009290541915973735,0.05767076032846401,1.7334439,1.2316034,0.08301681426158236,0.3860271115408991
Ideally, I'd like to create a 'diff_logfoldchanges0' column that is equal to the values from the '0_logfoldchanges' column minus the values from the '1_logfoldchanges' column. In the CSV data below, I believe that might be "-1.3434665 - -2.952315", "-15.159286 - -15.13933", and "-0.015067068 - 0.18173021".