
[Image: sample file with four leading # rows, followed by the ID SID AID... header row and the data]

My files come in two formats: some have lines starting with # at the beginning and some don't. I want to use read_csv to load the matrix above into a pandas DataFrame, ignoring the rows that start with #. My headers should be ID, SID, AID and so on. I think I can read a file by skipping the first 4 rows, and I know how to do that. The problem is that some files do not have the 4 leading # rows and start directly with the ID SID AID... header row.

When I read such a file into a DataFrame, I guess it assigns #PI as the column name.

kaya3
RnD
  • Possible duplicate of [How to drop rows from pandas data frame that contains a particular string in a particular column?](https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p) – Nazim Kerimbekov Feb 04 '19 at 21:48
  • This is not a duplicate question because the link that you are mentioning is after the file has been read into a dataframe and the column has a header name – RnD Feb 04 '19 at 21:54

2 Answers


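The pandas read_csv function lets you specify a comment character via comment='#'. Lines that begin with # are then skipped entirely, so the header is taken from the first remaining line whether or not the file starts with # rows. Note that the rest of any line is also dropped from the first # onward, so this only works if # never appears inside the data itself.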
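As a rough illustration of the comment='#' approach, here is a minimal sketch. The file contents are made up (the real file was only shown as an image), and a comma delimiter is an assumption; pass sep=... if the files are tab- or space-separated.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the two file formats.
# Assumption: comma-separated values; the real delimiter is unknown.
with_comments = (
    "#PI,x\n"
    "#run,1\n"
    "#date,2\n"
    "#note,3\n"
    "ID,SID,AID\n"
    "1,10,100\n"
    "2,20,200\n"
)
without_comments = "ID,SID,AID\n1,10,100\n2,20,200\n"


def load(text):
    # Fully commented lines are skipped before the header row is chosen,
    # so the same call handles both formats.
    return pd.read_csv(io.StringIO(text), comment="#")


df_a = load(with_comments)
df_b = load(without_comments)
print(df_a)
```

Both calls should yield the same frame with ID, SID and AID as column labels. For real files, replace io.StringIO(text) with the file path.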

Tom Johnson

Why not just read in all rows using read_csv and then filter out lines with # using .loc?

Something like

df.loc[~df['col'].str.startswith('#')]
Jab
piedpiper
  • Whatever column 0 is. You did not show the column labels in your picture so he used `'col'` – Jab Feb 04 '19 at 22:14
  • correct, thank you Jaba! just replace 'col' with whatever the column heading for column 0 is – piedpiper Feb 04 '19 at 22:15
  • Ohh, I think I understand your question now. In the picture you linked, you want the dataframe to be everything from the 5th row onwards, with the 5th row being the column headings? In that case, just specify the header argument with the index of the row you want to be the header, i.e. `pd.read_csv('file.csv', header=4)`. But I'm not sure how to exclude only the # rows. – piedpiper Feb 04 '19 at 22:33
  • Yes, you're correct, but sometimes I have the first 4 rows with the # in them and sometimes it directly starts with the ID SID AID... headers – RnD Feb 04 '19 at 22:38
  • I see. I think it's easiest to read in the csv as is, specifying `header=None`, then use the statement in my original answer with the column index instead of the column name to filter out rows starting with '#', and then set the column names to whatever you want. But I'm guessing someone might suggest a more elegant solution. – piedpiper Feb 04 '19 at 22:46
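
The `header=None` approach from the last comment could look roughly like the sketch below. The sample data is invented, a comma delimiter is assumed, and it also assumes the # lines have the same number of fields as the data rows (otherwise read_csv may fail to parse the file). Reading everything as strings keeps `.str.startswith` safe on column 0.

```python
import io

import pandas as pd

# Hypothetical file contents; every line has three comma-separated fields.
raw_text = (
    "#PI,x,y\n"
    "#run,1,2\n"
    "#date,3,4\n"
    "#note,5,6\n"
    "ID,SID,AID\n"
    "1,10,100\n"
    "2,20,200\n"
)

# Read every line as data, with integer column labels 0, 1, 2.
raw = pd.read_csv(io.StringIO(raw_text), header=None, dtype=str)

# Drop rows whose first field starts with '#'.
kept = raw.loc[~raw[0].str.startswith("#")]

# The first remaining row holds the real column names.
df = kept.iloc[1:].reset_index(drop=True)
df.columns = kept.iloc[0].tolist()
print(df)
```

Because of dtype=str, the values stay as strings; convert them afterwards (for example with pd.to_numeric) if numeric columns are needed. The same code works unchanged on files that have no # lines.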