
I'm having an issue in pandas where all column values (except the first) are returned as NaN when reading a CSV file AND ignoring header comments.

import pandas as pd

start_of_file = [
['# Accession: urn:mavedb:00000040-a-4'],
['# Downloaded (UTC): 2021-11-30 14:12:18.531917'],
['# Licence: CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike)'],
['# Licence URL: https://creativecommons.org/licenses/by-nc-sa/4.0/'],
['accession', 'hgvs_nt', 'hgvs_splice', 'hgvs_pro', 'score'],
['urn:mavedb:00000040-a-4#1', 'NA', 'NA', 'p.Glu9Phe', '0.007373838825271998'],
]

# Export data frame...
pd.DataFrame(start_of_file).to_csv('test.csv', index=False)

# ... then read data frame while ignoring comments
pd.read_csv('test.csv', comment="#")
                         0        1            2         3      4
0                accession  hgvs_nt  hgvs_splice  hgvs_pro  score
1  urn:mavedb:00000040-a-4      NaN          NaN       NaN    NaN
  • The problem is in the "urn:mavedb:00000040-a-4#1" field, where there is a "#" near the end, which indicates the rest of the line is a comment. You may also note the "#1" part is missing in your result. Because the rest of the line is ignored, Pandas doesn't find any values for the remaining four columns, resulting in NaNs. – 9769953 Nov 30 '21 at 16:04
  • An easy solution might be to use another comment character, that isn't used anywhere in a data field. Perhaps a "%", for example. – 9769953 Nov 30 '21 at 16:04
  • Great eye! I did not notice this... Hmm, I'm fetching this information from an API, so I don't think I can manipulate the header comments prior to reading the CSV. I wonder if there's a way to treat a line as a comment only if it starts with '#'. – irahorecka Nov 30 '21 at 16:06
  • If you know how many lines the header is, you can also skip the header with the `skiprows` option; then you don't need to set the `comment` parameter. – 9769953 Nov 30 '21 at 16:06
  • What part of the above code is under your control? In your question, it appears you can actually write the data as well. – 9769953 Nov 30 '21 at 16:07
  • Ah yes, this was just to mock up the data I'm using. Usually I'll receive this file as a CSV filepath. The only part that is under my control is `pd.read_csv('test.csv', comment="#")`. – irahorecka Nov 30 '21 at 16:09
  • You can also just read the file line by line the old-fashioned way. Ignore the lines starting with a "#", put the other lines in a list (as you have in your example; don't forget to strip off any possible newlines), and directly create a dataframe from the list of lists. – 9769953 Nov 30 '21 at 16:10
  • If you're getting the data from an API, I would think you fetch it through Python as well. In which case there would be no need for a CSV file: all data can be kept and handled in memory, and the API results can immediately be turned into a DataFrame. – 9769953 Nov 30 '21 at 16:11
  • Ok sounds good - I tried the old-fashioned way (create dataframe from list of lists), but it appears `pd.read_csv` does some coercion of types that `pd.DataFrame` does not. I may be wrong about this. – irahorecka Nov 30 '21 at 16:13
  • With the given example data, all your columns are of string type, so any coercion doesn't play a role there. – 9769953 Nov 30 '21 at 16:26
  • Does this answer your question? [Get pandas.read\_csv to read empty values as empty string instead of nan](https://stackoverflow.com/questions/10867028/get-pandas-read-csv-to-read-empty-values-as-empty-string-instead-of-nan) – dank8 Mar 03 '23 at 02:34
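
To make the two suggestions from the comments concrete, here is a minimal sketch (not from the original thread). It assumes the file has the shape described in the question: a block of '#' comment lines followed by the real header row. The filename 'scores.csv' and the comment-line count of 4 are assumptions based on the example data.

import io
import pandas as pd

# 'scores.csv' stands in for the CSV path received from the API (assumed name).

# Option 1: if the number of leading comment lines is known, skip them outright
# and let read_csv pick up the real header row that follows (4 is assumed here).
df = pd.read_csv('scores.csv', skiprows=4)

# Option 2: if the count is unknown, drop every line starting with '#' by hand,
# then feed the remaining text back to read_csv so its usual type inference
# (the coercion mentioned in the comments) still runs.
with open('scores.csv') as fh:
    kept = [line for line in fh if not line.startswith('#')]
df = pd.read_csv(io.StringIO(''.join(kept)))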

1 Answer

df = pd.read_csv('test.csv')
# keep only the rows whose first column does not contain a comment marker
df = df.iloc[[index for index in range(len(df)) if '# ' not in df['0'][index]]]
display(df)

    0   1   2   3   4
4   accession   hgvs_nt hgvs_splice hgvs_pro    score
5   urn:mavedb:00000040-a-4#1   NaN NaN p.Glu9Phe   0.007373838825271998

or

df = pd.read_csv('test.csv')
# keep only the rows whose first column does not start with '#'
df.iloc[[index for index in range(len(df)) if '#' != df['0'][index][0]]]
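
Note that with either variant the real column names ('accession', 'hgvs_nt', ...) are still sitting in the first kept row, and the columns are still labelled 0–4. A hedged alternative sketch (not part of the original answer) that filters with a vectorized mask and then promotes that row to the header might look like this:

df = pd.read_csv('test.csv')
# drop the comment rows without a Python-level loop
df = df[~df['0'].str.startswith('#')]
# the first remaining row holds the real header; promote it, then drop it from the data
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)

The score column is still a string at this point; pd.to_numeric(df['score']) can convert it afterwards if numeric values are needed.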