
I am reading a .tsv in pandas with the following commands:

    import pandas as pd

    Gene_Data = pd.read_csv(
        genes_input_path,
        sep='\t+',
        header=None,
        engine='python',
        names=col_names,
        usecols=col_to_use,
        comment='#')

    Gene_Data.Stop = Gene_Data.Stop.astype('float32')

The input data looks something like:

    chr7    HAVANA  gene    117287120       117715971       .       +       .       ID=ENSG00000001626.16;gene_id=ENSG00000001626.16;gene_type=protein_coding;gene_name=CFTR;level=1;hgnc_id=HGN

The Stop column is the column at index 4 (0-based). When I convert it to float32 with astype, the value changes to 117715968 instead of staying at the actual value, 117715971.

When I disable the type conversion, the column stays int64 and the value is correct. I don't understand why the conversion changes the underlying value. Does anyone have any thoughts?

  • Because float32 doesn't have enough precision to store that number exactly. Why not `float64` or `int64`? – ALollz Mar 19 '21 at 19:53
  • I didn't think I was running up against the limit for float32 usage. Is that what is happening here? – DjNeckbrace Mar 19 '21 at 20:00
  • @DjNeckbrace It's not about the _limit_ of float32, it's about the precision, or loss of precision in this case. Read [more about that here](https://stackoverflow.com/q/588004/1431750). – aneroid Mar 19 '21 at 20:45
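The precision loss described in the comments can be reproduced without the file. The sketch below uses NumPy directly and assumes only the Stop value from the sample row; the variable names (`stop`, `as_f32`, `gap`) are illustrative and not part of the original code. A float32 has a 24-bit significand, so it represents every integer exactly only up to 2**24 = 16777216. Above that, neighbouring float32 values are more than 1 apart, and the integer is rounded to the nearest one.

```python
import numpy as np

# Stop value taken from the sample row in the question.
stop = 117715971

# Converting to float32 rounds to the nearest representable value.
as_f32 = np.float32(stop)
print(int(as_f32))  # 117715968

# Distance between adjacent float32 values at this magnitude.
# 117715971 lies between 2**26 and 2**27, so the step is 2**(26 - 23) = 8.
gap = np.spacing(as_f32)
print(float(gap))  # 8.0

# float64 has a 53-bit significand, so this value survives unchanged.
print(int(np.float64(stop)))  # 117715971
```

So the conversion is not hitting a range limit; the value simply falls between two representable float32 numbers that are 8 apart. As suggested in the comments, leaving the column as int64, or using `astype('float64')` if a float is needed, keeps 117715971 intact.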
