0

I need to parse the file at the link below: http://bit.ly/1x6yzoX

I wrote the following method to parse this file, but it cannot read the incomplete data for the latest year (2014), which has empty cells in the text file's table. For now I am skipping the lines that I am unable to read.

How should I handle this problem?

import collections
import csv
from itertools import islice

LINES_TO_IGNORE = 7

def parse_file(data_file):
    result_dict = collections.OrderedDict()
    if not data_file:
        return result_dict

    with open(data_file) as f:
        reader = csv.reader(f, delimiter="\t")
        data = islice(reader, LINES_TO_IGNORE, None)
        # Get file headers (None if nothing is left after the skipped lines)
        header_row = next(data, None)
        if not header_row:
            return result_dict
        headers = header_row[0].split()
        keys = headers[1:]

        for row in data:
            if not row:
                continue  # blank line
            values = row[0].split()
            if len(headers) == len(values):
                # parse_to_int / parse_to_float are helpers not shown here
                year = parse_to_int(values[0])
                data_list = [parse_to_float(x) for x in values[1:]]
                # Each line becomes a dict (column_header->value)
                data_dict = collections.OrderedDict(zip(keys, data_list))
                # result_dict is dict of dict (year->data_dict)
                result_dict[year] = data_dict
            else:
                # Rows with empty cells (e.g. 2014) split into fewer fields
                print("Skipping")
    return result_dict
conrad
Sreedhar
  • Similar questions: http://stackoverflow.com/questions/848537/writing-parsing-a-fixed-width-file-using-python and http://stackoverflow.com/questions/10686657/reading-data-from-text-file-with-missing-values – user2314737 Nov 05 '14 at 09:06

3 Answers

1

You can do it easily with Pandas:

import pandas as pd
data = pd.read_fwf('UK.txt', skiprows=7, delimiter=' ')

Print the last few rows with print(data[-3:]):

    Year    JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT  \
102  2012    1.8    1.2    3.4    2.5    6.0    8.8...
103  2013    1.0   -0.1   -0.7    2.2    5.2    8.6...
104  2014    2.1    2.5    2.9    5.3    7.3    9.9...

     NOV    DEC     WIN    SPR    SUM    AUT   ANN  Unnamed: 3  Unnamed: 4  \
102  2.8    1.1    1.73   4.00  10.19   5.23  5.21         NaN         NaN
103  2.4    2.8    0.68   2.26  10.66   6.56  5.21         NaN         NaN
104                       2.48   5.17  10.46   NaN         NaN         NaN

     Unnamed: 5  Unnamed: 6  Unnamed: 7
102         NaN         NaN         NaN
103         NaN         NaN         NaN
104         NaN         NaN         NaN

I think this is not 100% right quite yet, but it's close...hopefully you can take it the rest of the way. No need to write so much code by hand if you use Pandas.
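One possible way to take it the rest of the way, sketched under assumptions: give `read_fwf` explicit column widths through its `widths` parameter, so a blank cell stays in its own column instead of shifting the values after it. The sample text, the `ROW` template, and the widths below are hypothetical stand-ins, not the real file.

```python
import io

import pandas as pd

# Hypothetical sample: right-aligned fields of width 4, 7, 7, 7.
# It is built with a format template so the alignment is exact.
ROW = "{:>4}{:>7}{:>7}{:>7}\n"
SAMPLE = (
    ROW.format("Year", "JAN", "FEB", "ANN")
    + ROW.format("2013", "1.0", "-0.1", "5.21")
    + ROW.format("2014", "2.1", "", "5.17")  # FEB deliberately left blank
)

# Explicit widths: each field is sliced by position, blank slices become NaN.
df = pd.read_fwf(io.StringIO(SAMPLE), widths=[4, 7, 7, 7])
df = df.set_index("Year")
print(df)
```

For the real file the widths would have to match its actual layout, and `skiprows=7` would still be needed to skip the preamble.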

John Zwinck
0

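You can use the genfromtxt function from numpy (recent NumPy versions name the skip argument skip_header; the older skiprows keyword has been removed):

import numpy as np
data = np.genfromtxt('UK.txt',skip_header=8,delimiter=(4,7,7,7,7,7,7,7,7,7,7,7,7,8,7,7,7,8))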

This will automatically fill the missing values, but you still need to find a way of identifying the sizes of the columns and the number of lines to skip.

Here is how to get the column sizes from the header:

import re
header="Year    JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT    NOV    DEC     WIN    SPR    SUM    AUT     ANN"
# Each match is a header label together with the blanks in front of it
cols=re.findall(r"\s*\S+",header)
delimiter=tuple(len(c) for c in cols)
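Putting the two steps together, here is a minimal sketch on an in-memory sample rather than the real file. The sample text and the `ROW` template are hypothetical; `skip_header` and a sequence of integer widths for `delimiter` are documented `genfromtxt` parameters.

```python
import io
import re

import numpy as np

# Hypothetical sample: right-aligned fields of width 4, 7, 7, 7.
ROW = "{:>4}{:>7}{:>7}{:>7}\n"
SAMPLE = (
    ROW.format("Year", "JAN", "FEB", "ANN")
    + ROW.format("2013", "1.0", "-0.1", "5.21")
    + ROW.format("2014", "2.1", "", "5.17")  # FEB deliberately left blank
)

# Column widths: each header label plus the blanks in front of it.
header = SAMPLE.splitlines()[0]
widths = [len(c) for c in re.findall(r"\s*\S+", header)]

# skip_header=1 drops the header row; blank fixed-width fields become nan.
data = np.genfromtxt(io.StringIO(SAMPLE), skip_header=1, delimiter=widths)
print(widths)
print(data)
```

This relies on the values being right-aligned under their header labels, which is an assumption about the file layout.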
user2314737
0
import collections
import re

def parse_file(data_file):
    result_dict = collections.OrderedDict()
    if not data_file:
        return result_dict

    with open(data_file) as f:
        headers = []
        # Assumes the first line of the file is the header row
        for counter, line in enumerate(f, start=1):
            line = line.strip()
            if counter == 1:
                headers = re.findall(r'\w+', line)
                # Drop the year column so the keys line up with values[1:]
                keys = headers[1:]
            else:
                # A field is either a number or a run of 3-4 blanks (empty cell)
                values = re.findall(r'([\d\-\.]+|(?:\s){3,4})(?:(?:\s){3,4})?', line)
                # parse_to_int / parse_to_float are the asker's helpers
                year = parse_to_int(values[0])

                if len(headers) != len(values):
                    # Pad missing trailing cells with 'NaN'
                    diff_list = ['NaN' for i in range(len(headers) - len(values))]
                    values.extend(diff_list)
                data_list = [parse_to_float(x) for x in values[1:]]
                data_dict = collections.OrderedDict(zip(keys, data_list))
                result_dict[year] = data_dict

    return result_dict
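As an alternative sketch that avoids a regex on the data rows, the column positions can be read from the header and each row sliced at those positions. This assumes values are right-aligned under their header labels. The function name `parse_fixed_width`, the `ROW` template, and the sample text are hypothetical, and blank cells are returned as `None` here rather than `'NaN'`.

```python
import collections
import re


def parse_fixed_width(lines):
    """Parse right-aligned fixed-width rows into year -> {column: value}."""
    lines = iter(lines)
    header = next(lines).rstrip("\n")
    # Each column spans from the end of the previous label to the end of its own.
    spans = []
    start = 0
    for match in re.finditer(r"\S+", header):
        spans.append((match.group(), start, match.end()))
        start = match.end()

    result = collections.OrderedDict()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        cells = [line[a:b].strip() for _, a, b in spans]
        year = int(cells[0])
        # Blank cells become None instead of shifting later values left.
        result[year] = collections.OrderedDict(
            (name, float(cell) if cell else None)
            for (name, _, _), cell in zip(spans[1:], cells[1:])
        )
    return result


# Hypothetical sample: right-aligned fields of width 4, 7, 7, 7.
ROW = "{:>4}{:>7}{:>7}{:>7}\n"
SAMPLE = (
    ROW.format("Year", "JAN", "FEB", "ANN")
    + ROW.format("2013", "1.0", "-0.1", "5.21")
    + ROW.format("2014", "2.1", "", "5.17")  # FEB deliberately left blank
)
parsed = parse_fixed_width(SAMPLE.splitlines())
print(parsed[2014])
```

Slicing by position keeps a missing value in the middle of a row from being confused with the separator blanks, which is what breaks whitespace splitting.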
cfi
  • Welcome to SO! Please indent all code to highlight/format it accordingly. Answers are more likely to receive upvotes if you provide an explanation. – cfi Nov 05 '14 at 08:57