1

I am trying to read a deliminated text file into a dataframe in python. The deliminator is not being identified when I use pd.read_table. If I explicitly set sep = ' ', I get an error: Error tokenizing data. C error. Notably the defaults work when I use np.loadtxt().

Example:

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment = '%',
              header = None)
    0
0   1850 1 -0.777 0.412 NaN NaN...
1   1850 2 -0.239 0.458 NaN NaN...
2   1850 3 -0.426 0.447 NaN NaN...
3   1850 4 -0.680 0.367 NaN NaN...
4   1850 5 -0.687 0.298 NaN NaN...

If I set sep = ' ', I get another error:

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment = '%',
              header = None,
              sep = ' ')

ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58

Looking up this error, people suggest using header = None (already done) and setting sep = explicitly, but that is causing the problem: Python Pandas Error tokenizing data. I looked up line 78 and can't see any problems. If I set error_bad_lines=False i get an empty df suggesting there is a problem with every entry.

Notably this works when I use np.loadtxt():

pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt', 
                        comments = '%'))
    0   1   2   3   4   5   6   7   8   9   10  11
0   1850.0  1.0     -0.777  0.412   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
1   1850.0  2.0     -0.239  0.458   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   1850.0  3.0     -0.426  0.447   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
3   1850.0  4.0     -0.680  0.367   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
4   1850.0  5.0     -0.687  0.298   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

This suggests to me that there isn't something wrong with the file, but rather with how I am calling pd.read_table(). I looked through the documentation for np.loadtxt() in the hope of setting the sep to the same value, but that just shows: delimiter=None (https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).

I'd prefer to be able to import this as a pd.DataFrame, setting the names, rather than having to import as a matrix and then convert to pd.DataFrame.

What am I getting wrong?

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
MorrisseyJ
  • 1,191
  • 12
  • 19

2 Answers2

2

This one is quite tricky. Please try out the snippet code below:

import pandas as pd
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
df = pd.read_csv(url,
                 sep='\s+',
                 comment='%',
                 usecols=(0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11),
                 names=('Year', 'Month', 'M.Anomaly', 'M.Unc.', 'A.Anomaly',                                           
                        'A.Unc.','5y.Anomaly', '5y.Unc.' ,'10y.Anomaly', '10y.Unc.',
                        '20y.Anomaly', '20y.Unc.'))
tagoma
  • 3,896
  • 4
  • 38
  • 57
  • 1
    Thanks so much. That works. Given `"\s+"` i assume the problem was that there was more than one space for some records. Also, I don't think you need `usecols` if you specify `names`. – MorrisseyJ Sep 30 '20 at 22:15
2
  • The issue is the file has 77 rows of commented text, for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Air Temperatures'
    • Two of the rows are headers
  • There's a bunch of data, then there are two more headers, and a new set of data for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Water Temperatures'
  • This solution separates the two tables in the file into separate dataframes.
  • This is not as nice as the other answer, but the data is properly separated into different dataframes.
  • The headers were a pain, it would probably be easier to manually create a custom header, and skip the lines of code for separating the headers from the text.
  • The important point separating air and ice data.
import requests
import pandas as pd
import math

# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text

# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]

# specify the data from the ranges in the file
air_header1 = data[74].split()  # not used
air_header2 = [v.strip() for v in data[75].split(',')]

# combine the 2 parts of the header into a single header
air_header = air_header2[:2] + [f'{air_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(air_header2[2:])]
air_data = [v.split() for v in data[77:2125]]

h2o_header1 = data[2129].split()  # not used
h2o_header2 = [v.strip() for v in data[2130].split(',')]

# combine the 2 parts of the header into a single header
h2o_header = h2o_header2[:2] + [f'{h2o_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(h2o_header2[2:])]
h2o_data = [v.split() for v in data[2132:4180]]

# create the dataframes
air = pd.DataFrame(air_data, columns=air_header)
h2o = pd.DataFrame(h2o_data, columns=h2o_header)

Without the header code

  • Simplify the code, by using a manual header list.
import pandas as pd
import requests

# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text

# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]

# manually created header
headers = ['Year', 'Month', 'Monthly_Anomaly', 'Monthly_Unc.',
           'Annual_Anomaly', 'Annual_Unc.',
           'Five-year_Anomaly', 'Five-year_Unc.',
           'Ten-year_Anomaly', 'Ten-year_Unc.',
           'Twenty-year_Anomaly', 'Twenty-year_Unc.']

# separate the air and h2o data
air_data = [v.split() for v in data[77:2125]]
h2o_data = [v.split() for v in data[2132:4180]]

# create the dataframes
air = pd.DataFrame(air_data, columns=headers)
h2o = pd.DataFrame(h2o_data, columns=headers)

air

   Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0  1850     1          -0.777        0.412            NaN         NaN               NaN            NaN              NaN           NaN                 NaN              NaN
1  1850     2          -0.239        0.458            NaN         NaN               NaN            NaN              NaN           NaN                 NaN              NaN
2  1850     3          -0.426        0.447            NaN         NaN               NaN            NaN              NaN           NaN                 NaN              NaN

h2o

   Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0  1850     1          -0.724        0.370            NaN         NaN               NaN            NaN              NaN           NaN                 NaN              NaN
1  1850     2          -0.221        0.430            NaN         NaN               NaN            NaN              NaN           NaN                 NaN              NaN
2  1850     3          -0.443        0.419            NaN         NaN               NaN            NaN              NaN           NaN                 NaN              NaN
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
  • Oh wow, i missed this. Thanks very much. I am not sure how to handle this answer. The question i had was resolved through the use of `sep = '\s+'`, but that included a failure to fully comprehend this file. In that respect, this answer is the correct way to read this file. – MorrisseyJ Sep 30 '20 at 23:05
  • @MorrisseyJ it's up to you to accept and unaccept an answer. You may choose whichever answer is best for you. I've updated the answer, so now there are unique headers, combining both levels from the file. – Trenton McKinney Sep 30 '20 at 23:36