Pandas ignore first few rows before reading CSV

Question

I am using Pandas library and Python.

I have an Excel file that has some heading information on the top of an Excel sheet which I do not need for data extraction.

But, the heading information could take longer rows, so it is unpredictable how long it could be.

So, my data extraction should start from where it says "ID"... For this particular case, it starts from row 5, but it could change.

The image is shown on the bottom (I grayed out after row 5 for sensitive info).

How do I put this in logic (to skip heading and jump to row 5)? The pattern should be, row heading starts from "ID, EMP_ID" etc.

with open('File.xls') as fp:
    skip = next(filter(
        lambda x: x.startswith('ID'),
        enumerate(fp)
    ))[0]

df = pd.read_excel('File.xls', usercols=['ID', 'EMP_ID'], skiprows=skip)
print df

score 11 · Answer 1 · answered Jul 20 '17 at 18:09

11

You can use pd.read_csv and specify skiprows=4:

df = pd.read_csv('test.csv', skiprows=4)

answered Jul 20 '17 at 18:09

cs95

379,657
97
704
746

For this very specific example this code works, but I prefer the answer from a_guest, as it really identify where the table starts based on string match – BCArg Dec 21 '18 at 13:14
Well this is an amazing answer! Voted up! – Ice Bear Oct 04 '21 at 15:57

a_guest · Accepted Answer · 2017-07-21T22:45:06.723

10

You could manually check for the header line and then use read_csvs keyword argument skiprows.

with open('data.csv') as fp:
    skip = next(filter(
        lambda x: x[1].startswith('ID'),
        enumerate(fp)
    ))[0]

Then skip the rows:

df = pandas.read_csv('data.csv', skiprows=skip)

Like that you can support pre-header sections of arbitrary length.

For Python 2:

import itertools as it

with open('data.csv') as fp:
    skip = next(it.ifilter(
        lambda x: x[1].startswith('ID'),
        enumerate(fp)
    ))[0]

edited Jul 21 '17 at 22:45

answered Jul 20 '17 at 18:13

a_guest

34,165
12
64
118

I am getting this error. AttributeError: 'tuple' object has no attribute 'startswith' I have my code displayed on the top. – Java Jul 20 '17 at 20:17
@KingJava You're right, I fixed that. `enumerate` yields tuples, so we need to choose the second item. – a_guest Jul 20 '17 at 20:57
I am getting this error. TypeError: list object is not an iterator. – Java Jul 20 '17 at 21:22
Then I guess you're using Python 2. My answer is for Python 3, but there are multiple ways around it. You could do `from itertools import ifilter as filter` (or similarly use `ifilter`) or, instead of using `next`, you can subscript another time: `filter(...)[0][0]`. Or you can cast the result of `filter` to an iterator: `next(iter(filter(...)))[0]`. – a_guest Jul 20 '17 at 22:22
I just installed Python 3.4.3. Now getting UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte – Java Jul 21 '17 at 00:18
@KingJava Well it looks like your data file is not `utf-8` encoded. It's up to you to find out what encoding you used when generating it (you're probably on Windows). [This answer](https://stackoverflow.com/a/19706723/3767239) might help you out (they succeeded by using `ISO-8859-1` encoding). But you'll find plenty of other resources too when searching for that error. Anyway this is not really related to your original question. – a_guest Jul 21 '17 at 08:47
I am having so much difficulty using Python 3.4.3 at PyCharm.I am pretty much stuck after [this my question today](https://stackoverflow.com/questions/45244238/importerror-no-module-named-xlrd). I think I need to go back to 2.7. How do I re-write the code that you wrote in 2.7. I have hard time understanding your comment on the top. Appreciated. – Java Jul 21 '17 at 20:12
1

I got this mesage "StopIteration" on the line where it has "enumerate(fp)" – Java Jul 21 '17 at 23:58
@KingJava Well that is because apparently none of the lines in the data file starts with `"ID"`. You have to make sure the format is correct, or check for the appropriate header line prefix. – a_guest Jul 22 '17 at 09:03
My Excel file has a column that starts with "ID" on line 5 (as shown on the top image). Would it matter if I change from read_csv to read_excel ? But, I do not see difference of error. It still says "StopIteration" – Java Jul 22 '17 at 20:35
You don't need an Excel file for the above to work, but a csv file. So go to your Excel and hit the "Export CSV" (or similar) button and make a CSV file out of the Excel file. The part which determines the line number scans through a plain text file (such as csv) and doesn't work on whatever format Excel uses. Changing from `read_csv` to `read_excel` doesn't help, obviously, because the error occurs somewhere else (which you can see from the line number in the Traceback). – a_guest Jul 22 '17 at 20:56
By changing the file extension from xls to csv works! Thanks! – Java Jul 22 '17 at 21:08
@KingJava Well I actually meant that there should be an option in Excel to export the file in CSV format but if it worked - great! Could you please accept the answer, so the question is off the list? – a_guest Jul 23 '17 at 00:47
Where is "accept the answer" option? Regards to previous comment, so, I cannot use your code if the file format is .xls, correct? It only works if the file is .csv file, correct? – Java Jul 23 '17 at 01:02
@KingJava It's the tick below the up- and down-vote buttons. You need to click it. You are right, the code above works with files in csv format. – a_guest Jul 23 '17 at 01:50

Pandas ignore first few rows before reading CSV

2 Answers2

Linked

Related