0

This is my very first question on stackoverflow, so I must beg your patience.

I believe there is something wrong with the format of a csv file I need to load into Python. I'm using a Jupyter Notebook. The link to the file is here. It is from the World Inequality Database data portal.

I'm pretty sure the delimiter is a semi-colon ( sep=";" ) because the bottom half of the data renders neatly when I specify this argument. However the first half of the text in the file seems to make no sense. I have no idea how to tell the pd.read_csv() function how to read it. I suspect the first half of the data simply has terrible formatting. I've also tried header=None and sep="|" to no avail.

Any ideas or suggestions would be very helpful. Thank you very much!

dbuckin
  • 1
  • 2
  • 1
    How about pasting the csv itself into this question so that it remains on stackoverflow forever? – tdelaney Mar 02 '20 at 17:51
  • @tdelaney At least a relevant part of the file. The whole thing may be too large. – Michael Butscher Mar 02 '20 at 17:55
  • 2
    it's fairly small. It may also be the most inconsistently formatted garbage I have ever seen. My advice on reading this would be "don't", but if you have to it's likely going to require some pre-processing before pandas will be able to make sense of it in a single call to `read_csv()`. This may also be of use: https://stackoverflow.com/questions/15242746/handling-variable-number-of-columns-with-pandas-python – KBriggs Mar 02 '20 at 17:59
  • as @KBriggs said, you will need to preprocess the data first before do anything else with pandas. – Y4RD13 Mar 02 '20 at 18:10
  • @KBriggs is right. But in the hopes of helping in general, it might be worth looking at passing `sep=None` to `read_csv()`. It takes longer but it tries to auto-detect separators. – Jake Stevens-Haas Mar 02 '20 at 18:10

1 Answers1

2

This is common with speadsheets. You have have some commentary, tables may be inserted all over the place. It looks great to the content creator, but the CSV is a mess. You need to preprocess the CSV to create clean content for your analysis. In this case, its easy. The content starts at canned header and you can split the file there. If that header changes, you'll get an error and now its just one more sleepless night figuring out what they've done.

import itertools

canned_header_line = "Variable Code;country;year;perc;agdpro999i;"\
    "npopul999i;mgdpro999i;inyixx999i;xlceux999i;xlcusx999i;xlcyux999i"

def scrub_WID_file(in_csv_filename, out_csv_filename):
    with open(in_csv_filename) as in_file,\
            open(out_csv_filename, 'w') as out_file:
        out_file.writelines(itertools.dropwhile(
            lambda line: line.strip() != canned_header_line,
            in_fp))
    if not os.stat.st_size:
        raise ValueError("No recognized header in " + in_csv_filename)
tdelaney
  • 73,364
  • 6
  • 83
  • 116
  • I noticed the "canned_header_line" too, but I couldn't read more than that. I can now read the metadata much better now! I think the easiest thing for me though is to just use `skiprows=8` and then use `cols = ['my', 'column', 'names', 'here',.....]` along with `names=cols` Thank you! – dbuckin Mar 03 '20 at 02:10