Handling empty space in reading table from text file python

Question

I need to parse the file at this link: http://bit.ly/1x6yzoX

I wrote the following method to parse this file, but it cannot read the incomplete data for the latest year (2014), where the table in the text file has empty spaces. For now I am skipping the lines that I am unable to read.

How can I handle this problem?

import collections
import csv
from itertools import islice

LINES_TO_IGNORE = 7

def parse_file(data_file):
    result_dict = collections.OrderedDict()
    if not data_file:
        return result_dict

    with open(data_file) as f:
        reader = csv.reader(f, delimiter="\t")
        data = islice(reader, LINES_TO_IGNORE, None, None)
        if not data:
            return result_dict
        # Get file headers
        headers = data.next()
        headers = headers[0].split()
        keys = headers[1:]

        for row in data:
            values = row[0].split()
            if len(headers) == len(values):
                year = parse_to_int(values[0])
                data_list = [parse_to_float(x) for x in values[1:]]
                # Each line becomes a dict (column_header->value)
                data_dict = collections.OrderedDict(zip(keys, data_list))
                # result_dict is dict of dict (year->data_dict)
                result_dict[year] = data_dict
            else:
                print "Skipping"
    return result_dict

Similar questions: http://stackoverflow.com/questions/848537/writing-parsing-a-fixed-width-file-using-python and http://stackoverflow.com/questions/10686657/reading-data-from-text-file-with-missing-values — user2314737, Nov 05 '14 at 09:06

John Zwinck · Answer 1 · 2014-11-05T07:59:32.967

You can do it easily with Pandas:

import pandas as pd
data = pd.read_fwf('UK.txt', skiprows=7, delimiter=' ')

Print the last few rows with print data[-3:]:

    Year    JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT  \
102  2012    1.8    1.2    3.4    2.5    6.0    8.8...
103  2013    1.0   -0.1   -0.7    2.2    5.2    8.6...
104  2014    2.1    2.5    2.9    5.3    7.3    9.9...

     NOV    DEC     WIN    SPR    SUM    AUT   ANN  Unnamed: 3  Unnamed: 4  \
102  2.8    1.1    1.73   4.00  10.19   5.23  5.21         NaN         NaN
103  2.4    2.8    0.68   2.26  10.66   6.56  5.21         NaN         NaN
104                       2.48   5.17  10.46   NaN         NaN         NaN

     Unnamed: 5  Unnamed: 6  Unnamed: 7
102         NaN         NaN         NaN
103         NaN         NaN         NaN
104         NaN         NaN         NaN

I think this is not quite 100% right yet, but it's close; hopefully you can take it the rest of the way. No need to write so much code by hand if you use Pandas.

Are you getting this output? Please give me a full idea of how to get it working. — Sreedhar, Nov 05 '14 at 09:25

user2314737 · Accepted Answer · 2014-11-05T09:02:40.970

You can use the genfromtxt function from numpy:

import numpy as np
data = np.genfromtxt('UK.txt', skip_header=8, delimiter=(4,7,7,7,7,7,7,7,7,7,7,7,7,8,7,7,7,8))

This will automatically fill the missing values, but you still need to find a way of identifying the sizes of the columns and the number of lines to skip.
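
One way to avoid hard-coding the number of lines to skip is to search for the header row itself. The sketch below is only an illustration: `find_header_index` is a hypothetical helper, the sample lines are made up, and it assumes the header is the first line whose first token is `Year`, as in the table from the question.

```python
def find_header_index(lines, first_column="Year"):
    """Return the 0-based index of the first line whose first token is `first_column`.

    Assumption: no preamble line starts with that token.
    """
    for index, line in enumerate(lines):
        tokens = line.split()
        if tokens and tokens[0] == first_column:
            return index
    raise ValueError("no header line starting with %r" % first_column)


# Made-up sample: two preamble lines, then the header and one data row.
sample = [
    "Some description of the data set",
    "",
    "Year    JAN    FEB",
    "2013    1.0   -0.1",
]
header_index = find_header_index(sample)
```

Here `header_index` is the number of lines that precede the header, so `header_index + 1` is the number of lines to skip in order to start at the first data row.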

Here is how to get the column sizes from the header:

import re
header="Year    JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT    NOV    DEC     WIN    SPR    SUM    AUT     ANN"
cols = re.findall(r"\s*\S+", header)
delimiter = tuple(len(c) for c in cols)
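
To make the role of these widths concrete, here is a minimal pure-Python sketch that applies them by hand: each line is cut into slices of those widths, and a blank slice becomes NaN instead of being dropped. `split_fixed_width` and the shortened header and row are made up for illustration; the real file may need extra handling.

```python
import re


def split_fixed_width(line, widths):
    """Cut `line` into consecutive fields of the given widths, as floats.

    A field that is blank (or lies beyond the end of a short line)
    is returned as NaN rather than being skipped.
    """
    values = []
    start = 0
    for width in widths:
        field = line[start:start + width].strip()
        values.append(float(field) if field else float("nan"))
        start += width
    return values


# Made-up header and row in the same layout as the file; the FEB cell is blank.
header = "Year" + "    JAN" + "    FEB" + "     WIN"
row = "2014" + "    2.1" + " " * 7 + "    2.48"

# Each column is its leading spaces plus its name, so its length is the width.
widths = [len(col) for col in re.findall(r"\s*\S+", header)]
values = split_fixed_width(row, widths)
```

Because the row is sliced by position rather than split on whitespace, the value after the blank cell stays in its own column, which is exactly what a plain `split()` loses.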

edited Nov 05 '14 at 09:02

answered Nov 05 '14 at 08:05

user2314737


Can you please explain how the value of delimiter is decided in your code? – Sreedhar Nov 06 '14 at 05:58
By the way, it is handling the missing data from the file. I just need to know how we decide the delimiter. – Sreedhar Nov 06 '14 at 06:31
I showed how to get the delimiters tuple from the headers line in the second part of the answer. – user2314737 Nov 06 '14 at 10:14

score 0 · Answer 3 · edited Nov 05 '14 at 08:57

import collections
import re


def parse_file(data_file):
    result_dict = collections.OrderedDict()
    if not data_file:
        return result_dict

    with open(data_file) as f:
        headers = []
        for counter, line in enumerate(f, 1):
            line = line.strip()
            if counter == 1:
                # The first line holds the column names
                headers = re.findall(r'\w+', line)
                # The first column is the year, so it is not a data key
                keys = headers[1:]
            else:
                # Capture each number, or a run of 3-4 spaces standing in for an empty cell
                values = re.findall(r'([\d\-\.]+|(?:\s){3,4})(?:(?:\s){3,4})?', line)
                year = parse_to_int(values[0])

                # Pad short rows with 'NaN' so that every header gets a value
                if len(headers) != len(values):
                    diff_list = ['NaN' for i in range(len(headers) - len(values))]
                    values.extend(diff_list)
                data_list = [parse_to_float(x) for x in values[1:]]
                data_dict = collections.OrderedDict(zip(keys, data_list))
                result_dict[year] = data_dict

    return result_dict

Welcome to SO! Please indent all code to highlight/format it accordingly. Answers are more likely to receive upvotes if you provide an explanation. — cfi, Nov 05 '14 at 08:57
