I am trying to parse a large (1.6 GB) .txt file with pandas. You can download the file here (it is a GeoNames database dump of all countries and settlements).

To load and parse the file in pandas, I consulted the answers here and here, and this is the code I have so far:

import pandas as pd

for chunk in pd.read_csv(
    "allCountries.txt",
    header=None,
    engine="python",
    sep=r"\s{1,}",
    names=[
        "geonameid",
        "name",
        "asciiname",
        "alternatenames",
        "latitude",
        "longitude",
        "feature class",
        "feature code",
        "country code",
        "cc2",
        "admin1 code",
        "admin2 code",
        "admin3 code",
        "admin4 code",
        "population",
        "elevation",
        "dem",
        "timezone",
        "modification date",
    ],
    chunksize=1000,
):
    print(chunk[0])  # just printing out the first row

If I run the code above, I get the following error:

ParserError: Expected 20 fields in line 1, saw 25. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

Can someone tell me what is going wrong here and how to fix it?

AKX

2 Answers

Your delimiter is wrong: the file is tab-separated, but several columns (for example the name columns) contain spaces:

2986043 Pic de Font Blanca Pic de Font Blanca Pic de Font Blanca,Pic du Port 42.64991 1.53335 T PK AD 00 0 2860 Europe/Andorra 2014-11-05

Splitting on any whitespace breaks those names into several fields, so the line is parsed incorrectly.
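To see where the extra fields come from, the sample line can be split both ways. This is only a sketch: tabs are not visible in the excerpt above, so the tab positions in `line` are reconstructed by hand from the 19 column names in the question and should be treated as an assumption.

```python
import re

# Sample record with tabs restored by hand (assumed layout, 19 columns).
# Empty columns such as cc2 and elevation appear as consecutive tabs.
line = (
    "2986043\tPic de Font Blanca\tPic de Font Blanca\t"
    "Pic de Font Blanca,Pic du Port\t42.64991\t1.53335\tT\tPK\tAD\t\t00"
    "\t\t\t\t0\t\t2860\tEurope/Andorra\t2014-11-05"
)

by_whitespace = re.split(r"\s{1,}", line)  # the separator used in the question
by_tab = line.split("\t")                  # one field per tab-delimited column

print(len(by_whitespace))  # 25
print(len(by_tab))         # 19
```

If this record is the first line of the file, the 25 whitespace-separated pieces are consistent with the "saw 25" in the error message, while splitting on single tabs gives one field per column name.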

This code works for me:

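import csv
import pandas as pd

for chunk in pd.read_csv(
    "allCountries.txt",
    header=None,
    engine="python",
    sep="\t",  # exactly one tab per delimiter; r"\t+" would merge the tabs around empty columns
    quoting=csv.QUOTE_NONE,  # treat ' and " inside names as plain characters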
    names=[
        "geonameid",
        "name",
        "asciiname",
        "alternatenames",
        "latitude",
        "longitude",
        "feature class",
        "feature code",
        "country code",
        "cc2",
        "admin1 code",
        "admin2 code",
        "admin3 code",
        "admin4 code",
        "population",
        "elevation",
        "dem",
        "timezone",
        "modification date",
    ],
    chunksize=1000,
):
    print(chunk)
Mathis Germa

Opening the first 10 lines of the file in LibreOffice with tab as the delimiter worked fine, so the file is tab-separated.
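The same check can be done without a spreadsheet program. A minimal sketch, assuming the file is UTF-8 encoded; `count_tabs` is a hypothetical helper name used here for illustration, not part of pandas:

```python
from itertools import islice


def count_tabs(path, n_lines=10):
    """Return the number of tab characters in each of the first n_lines lines."""
    with open(path, encoding="utf-8", newline="") as fh:
        # islice stops after n_lines, so the large file is never read in full
        return [line.count("\t") for line in islice(fh, n_lines)]
```

If `count_tabs("allCountries.txt")` returned 18 for every line, that would confirm 19 tab-separated columns, one per entry in the `names` list. The pandas call below then uses tab as the delimiter.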

import csv
import pandas as pd

for chunk in pd.read_csv(
    'allCountries.txt',
    header=None,
    engine="python",
    sep="\t",
    names=[
        "geonameid",
        "name",
        "asciiname",
        "alternatenames",
        "latitude",
        "longitude",
        "feature class",
        "feature code",
        "country code",
        "cc2",
        "admin1 code",
        "admin2 code",
        "admin3 code",
        "admin4 code",
        "population",
        "elevation",
        "dem",
        "timezone",
        "modification date",
    ],
    quoting=csv.QUOTE_NONE,
    chunksize=1000
):
    print(chunk.iloc[0])  # just printing out the first row

The file also contains the characters ' and ". By default pandas treats " as a quote character, which caused errors. Setting quoting to csv.QUOTE_NONE makes pandas read quote characters as ordinary text, and that fixed it.
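The effect of the quoting setting can be reproduced on a tiny in-memory sample with the standard-library `csv` module, whose `QUOTE_NONE` constant is the one passed to `read_csv` above. This is a sketch with made-up rows (not taken from the GeoNames file), and pandas' own handling of the malformed case may differ in detail, for example by raising an error instead.

```python
import csv
import io

# Two made-up tab-separated rows; the first name starts with an unmatched quote.
sample = '1\t"Big Rock\t45.0\n2\tSmall Rock\t46.0\n'

# Default dialect: '"' opens a quoted field, so the tab and newline that
# follow are swallowed into that field instead of ending it.
default_rows = list(csv.reader(io.StringIO(sample, newline=""), delimiter="\t"))

# QUOTE_NONE: quote characters are ordinary data, every tab is a delimiter.
literal_rows = list(
    csv.reader(io.StringIO(sample, newline=""), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(len(default_rows), len(literal_rows))
print(literal_rows[0])
```

With `QUOTE_NONE` both rows come back with three fields each and the stray quote is kept as part of the name, whereas the default dialect returns fewer rows because the unmatched quote absorbs the rest of the input.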

Toivo Mattila