
I have a table formatted as follows:

foo - bar - 10 2e-5 0.0 some information
quz - baz - 4 1e-2 1 some other description in here

When I open it with pandas using:

a = pd.read_table("file", header=None, sep=" ")

I get this error:

CParserError: Error tokenizing data. C error: Expected 9 fields in line 2, saw 12

What I'd like is something similar to the skiprows option, which would let me write something like:

a = pd.read_table("file", header=None, sep=" ", skipcolumns=[8:])

I'm aware that I could reformat this table with awk, but I'd like to know whether a pandas solution exists.

Thanks.

jrjc
  • If you want to be able to use column names, see answer in duplicate here: https://stackoverflow.com/questions/49677313/skip-specific-set-of-columns-when-reading-excel-frame-pandas/56252452#56252452 – MarMat May 22 '19 at 08:30

2 Answers

The usecols parameter lets you select which columns to read. To accept rows with irregular column counts, you also need engine='python':

a = pd.read_table("file", header=None, sep=" ", usecols=range(8), engine="python")
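As a minimal sketch of this approach, the snippet below reads the two sample rows from the question through an in-memory buffer (an assumption made only to keep the example self-contained; a real file path should behave the same way). It uses usecols=range(7) rather than range(8), on the assumption that only the seven fixed fields are wanted and the free-text description should be dropped entirely.

```python
import io

import pandas as pd

# Sample rows from the question; the trailing description contains a
# variable number of space-separated words, so the rows are ragged.
SAMPLE = (
    "foo - bar - 10 2e-5 0.0 some information\n"
    "quz - baz - 4 1e-2 1 some other description in here\n"
)

# engine="python" is used so the ragged rows are accepted, and usecols
# keeps only the first seven fields (positions 0 to 6).
df = pd.read_table(
    io.StringIO(SAMPLE),
    header=None,
    sep=" ",
    usecols=range(7),
    engine="python",
)

print(df)
```

As noted in the comments, the Python engine is generally slower than the C engine, so for large files it may be worth comparing this against trimming the file beforehand.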

otus
  • No, that doesn't work; the error still persists. The issue here is to coerce the parser to use only the desired columns without raising an error due to incorrect formatting – EdChum Jun 23 '14 at 13:07
  • @EdChum, ah, OK, in that case this is probably a duplicate: http://stackoverflow.com/questions/15242746/handling-variable-number-of-columns-with-pandas-python?rq=1 – otus Jun 23 '14 at 13:08
  • @otus Yes, changing the engine from the default to python works; you should post that as an edit – EdChum Jun 23 '14 at 13:13
  • Did you manage to make it work with `engine="python"`? It works without `usecols`, but then the DataFrame is weird – jrjc Jun 23 '14 at 13:40
  • And isn't the python engine slower than the C one? My files are rather big. – jrjc Jun 23 '14 at 13:46
  • @jeanrjc Yes, I believe the python engine is slower, so you have to choose between ease of use and speed, with a little data wrangling here – EdChum Jun 23 '14 at 13:49

If you are using Linux/OS X/Windows Cygwin, you should be able to prepare the file as follows:

cut -d' ' -f1-7 your_file > out.file

Then in Python:

a = pd.read_table("out.file", header=None, sep=" ")

Example:

Input:

foo - bar - 10 2e-5 0.0 some information
quz - baz - 4 1e-2 1 some other description in here

Output:

foo - bar - 10 2e-5 0.0
quz - baz - 4 1e-2 1

You can run this command manually on the command-line, or simply call it from within Python using the subprocess module.
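Calling cut through the subprocess module might look roughly like the sketch below. The helper name cut_fields is hypothetical, and the code assumes a POSIX-style cut executable is available on the PATH. It passes the text on standard input instead of using a file, purely to keep the example self-contained.

```python
import subprocess


def cut_fields(text: str, last_field: int = 7) -> str:
    """Return `text` with only fields 1..last_field of each line kept.

    Hypothetical helper: it shells out to `cut`, which must be on the PATH.
    """
    result = subprocess.run(
        ["cut", "-d", " ", "-f", f"1-{last_field}"],
        input=text,           # send the table on stdin
        capture_output=True,  # collect the output of cut
        text=True,            # work with str instead of bytes
        check=True,           # raise CalledProcessError if cut fails
    )
    return result.stdout


sample = (
    "foo - bar - 10 2e-5 0.0 some information\n"
    "quz - baz - 4 1e-2 1 some other description in here\n"
)

trimmed = cut_fields(sample)
print(trimmed, end="")
```

The trimmed text can then be wrapped in io.StringIO and passed to pd.read_table in place of a file name, or written to out.file as in the shell version.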

Martin Konecny