I am trying to read a delimited text file into a DataFrame in Python. The delimiter is not identified when I use pd.read_table. If I explicitly set sep = ' ', I get an error: Error tokenizing data. C error. Notably, the defaults work when I use np.loadtxt().
Example:
import numpy as np
import pandas as pd

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment = '%',
              header = None)
0
0 1850 1 -0.777 0.412 NaN NaN...
1 1850 2 -0.239 0.458 NaN NaN...
2 1850 3 -0.426 0.447 NaN NaN...
3 1850 4 -0.680 0.367 NaN NaN...
4 1850 5 -0.687 0.298 NaN NaN...
If I set sep = ' ', I get another error:
pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
comment = '%',
header = None,
sep = ' ')
ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58
Looking up this error, people suggest using header = None (already done) and setting sep = explicitly, but setting sep is what causes the problem here: Python Pandas Error tokenizing data. I looked at line 78 and can't see any problems. If I set error_bad_lines=False, I get an empty df, suggesting there is a problem with every entry.
Notably, this works when I use np.loadtxt():
pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
comments = '%'))
0 1 2 3 4 5 6 7 8 9 10 11
0 1850.0 1.0 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850.0 2.0 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850.0 3.0 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
3 1850.0 4.0 -0.680 0.367 NaN NaN NaN NaN NaN NaN NaN NaN
4 1850.0 5.0 -0.687 0.298 NaN NaN NaN NaN NaN NaN NaN NaN
This suggests to me that nothing is wrong with the file, but rather with how I am calling pd.read_table(). I looked through the documentation for np.loadtxt() in the hope of setting sep to the same value, but that just shows delimiter=None (https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).
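As far as I can tell from that page, delimiter=None means np.loadtxt splits on any run of whitespace instead of a single space. Below is a minimal sketch of what the matching pandas call might look like, assuming the file is separated by a varying number of spaces. The sample text and the column names are made up for illustration and are not taken from the real file, and I use pd.read_csv here, which accepts the same sep argument.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample in a similar layout: a '%' comment line, then columns
# separated by a varying number of spaces (not the real file contents).
sample = """% Year, Month, Anomaly, Unc.
  1850     1    -0.777     0.412       NaN
  1850     2    -0.239     0.458       NaN
"""

# np.loadtxt with the default delimiter=None splits on any run of whitespace.
arr = np.loadtxt(io.StringIO(sample), comments="%")

# Assumed pandas equivalent: a regex separator matching one or more
# whitespace characters. The column names are hypothetical.
df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",
    comment="%",
    header=None,
    names=["year", "month", "anomaly", "uncertainty", "extra"],
)

print(arr.shape, df.shape)
```

If this reading is right, sep = ' ' would fail because every single space is treated as its own field boundary, so rows with different amounts of padding end up with different field counts.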
I'd prefer to be able to import this directly as a pd.DataFrame, setting the names, rather than having to import it as a matrix and then convert it to a pd.DataFrame.
What am I getting wrong?