
There's lots of good information out there on how to read space-delimited data with missing values if the data is fixed-width.

I'm currently trying to read Japan's Meteorological Agency typhoon history data, which is supposed to have this format but actually doesn't:

# Header rows:
    5    10   15   20   25   30   35   40   45   50   55   60   65   70   75   80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAA BBBB  CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH              IIIIIIII

# Data rows:
    5    10   15   20   25   30   35   40   45   50   55   60   65   70   75   80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAAAAA BBB C DDD EEEE FFFF     GGG     HIIII JJJJ KLLLL MMMM         P

It is very similar to NOAA's hurricane best track data, except that NOAA's is comma-delimited and missing values are given as -999 or NaN, which simplified reading that data. Additionally, Japan's data doesn't actually follow its advertised format. For example, column FFFF in the data rows doesn't always have width 4; sometimes it has width 3.

I must say that I'm at a complete loss as to how to process this data into a dataframe. I've investigated the pd.read_fwf method, and it initially looked promising until I discovered the malformed columns and the two different row types.
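For illustration, the naive approach would be something like this (letting read_fwf infer the column breaks from the first rows), and it falls apart on exactly those two problems:

import pandas as pd

# Naive attempt: infer fixed-width column breaks from the first rows.
# This fails here because header rows and data rows have different
# layouts, and columns like FFFF don't keep a consistent width.
df = pd.read_fwf('bst_all.txt', colspecs='infer', header=None)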

My question:

How can I approach cleaning this data and getting it into a dataframe? I'd just find a different dataset, but honestly I can't find any comprehensive typhoon data anywhere else.

rocksNwaves

3 Answers


I went a little deep for you here, because I'm assuming you're doing this in the name of science, and if I can help someone trying to understand climate change then it's a good cause.

After looking the data over, I've noticed the issue relates to the data being stored in a de-normalized structure. There are two ways you can approach this off the top of my head. Rewriting the file to another file to load into pandas or dask is what I'll show, since that's probably the easiest way to think about it (but certainly not the most efficient, for those that will inevitably roast me in the comments).

Think of this like it's two separate tables with a 1-to-many relationship: one table for typhoons, and another for the track records belonging to a given typhoon.

A decent, but not really efficient, way would be to rewrite it into a better nested structure, like JSON, then load the data in using that. Note the two distinct column layouts.

Step 1: Map the data out

There are really two tables in one file here. Each typhoon shows up as a header row that looks like this:
66666 9119 150 0045 9119 0 6 MIRREILE 19920701

While the records for that typhoon follow that row (think of these as rows in a separate table):
20080100 002 3 178 1107 994 035 00000 0000 30600 0200

Load the file in, reading it as raw lines. By using the .readlines() method, we can read each individual line in as an item in a list.

# load the file as raw input
with open('./test.txt') as f:
    lines = f.readlines()

Now that we have that read in, we need some logic to separate the two kinds of lines. It appears that every time there is a typhoon header record, the line starts with '66666', so let's key off that. Given that we look at each individual line in a horribly inefficient loop, we can write some if/else logic to have a look:

if row[:5] == '66666':
    # this is a typhoon header row
    ...
else:
    # this is a track-record row
    ...

That's going to be a pretty solid way to separate the logic for now, and will be useful to guide splitting things up. Now we need to write a loop that runs that check on each row:

from typing import Dict, List

# initialize a list of dicts to collect every typhoon
collection = []

def write_typhoon(row: str, collection: List[Dict]) -> List[Dict]:
    if row[:5] == '66666':
        # handle a typhoon header row
        ...
    else:
        # handle a track-record row
        ...

# read through the lines list from .readlines(), looping sequentially
for line in lines:
    write_typhoon(line, collection)

Lastly, we need to write the logic that actually extracts the data inside the if/else branches of the write_typhoon() function. I didn't care to do a whole lot of thinking here and opted for the simplest thing I could make: defining the fixed-width metadata myself, because "yolo":

def write_typhoon(row: str, collection: List[Dict]) -> List[Dict]:
    if row[:5] == '66666':
        typhoon = {
            "AA":row[:5],
            "BB":row[6:11],
            "CC":row[12:15],
            "DD":row[16:20],
            "EE":row[21:25],
            "FF":row[26:27],
            "GG":row[28:29],
            "HH":row[30:50],
            "II":row[51:],
            "data":[]
        }
        # clean that whitespace
        for key, value in typhoon.items():
            if key != 'data':
                typhoon[key] = value.strip()
        collection.append(typhoon)
    else:
        sub_data = {
            "A":row[:9],
            "B":row[9:12],
            "C":row[13:14],
            "D":row[15:18],
            "E":row[19:23],
            "F":row[24:32],
            "G":row[33:40],
            "H":row[41:42],
            "I":row[42:46],
            "J":row[47:51],
            "K":row[52:53],
            "L":row[54:57],
            "M":row[58:70],
            "P":row[71:]
        }
        # clean that whitespace
        for key, value in sub_data.items():
            sub_data[key] = value.strip()
        collection[-1]['data'].append(sub_data)
    return collection

Okay, that took me longer than I'm willing to admit, I won't lie. It gave me PTSD flashbacks from writing COBOL programs...

Anyway, now we have a nice, nested data structure in native python types. The fun can begin!
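For a concrete picture, here is roughly what collection holds once the loop has run, using the sample rows from earlier (values are illustrative; each inner dict actually carries every field A through P):

collection = [
    {
        "AA": "66666", "BB": "9119", "CC": "150", "DD": "0045",
        "EE": "9119", "FF": "0", "GG": "6", "HH": "MIRREILE",
        "II": "19920701",
        "data": [
            {"A": "20080100", "B": "002", "C": "3", "D": "178", "E": "1107"},
            # ... one dict per track line for this typhoon
        ],
    },
    # ... one dict per typhoon
]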

Step 2: Load this into a usable format

To analyze it, I'm assuming you'll want it in pandas (or maybe Dask if it's too big). Here is what I was able to come up with on that front:

import pandas as pd
df = pd.json_normalize(
    collection, 
    record_path='data', 
    meta=["AA","BB","CC","DD","EE","FF","GG","HH","II"] 
)

A great reference for that can be found in the answers to this question (particularly the second one, not the selected one).
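If json_normalize is new to you, here is a tiny self-contained illustration of what record_path and meta do (toy field names, not the real columns):

import pandas as pd

# Each record in `data` becomes a row, and the `meta` fields from the
# parent dict are repeated alongside it.
toy = [
    {"name": "MIRREILE", "id": "9119",
     "data": [{"date": "91092000", "pressure": "935"},
              {"date": "91092006", "pressure": "940"}]},
]
flat = pd.json_normalize(toy, record_path="data", meta=["name", "id"])
print(flat)
# which yields something like:
#        date pressure      name    id
# 0  91092000      935  MIRREILE  9119
# 1  91092006      940  MIRREILE  9119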

Put it all together now:

from typing import Dict, List
import pandas as pd

# load the file as raw input
with open('./test.txt') as f:
    lines = f.readlines()

# initialize list of dicts
collection = []

def write_typhoon(row: str, collection: List[Dict]) -> List[Dict]:
    if row[:5] == '66666':
        typhoon = {
            "AA":row[:5],
            "BB":row[6:11],
            "CC":row[12:15],
            "DD":row[16:20],
            "EE":row[21:25],
            "FF":row[26:27],
            "GG":row[28:29],
            "HH":row[30:50],
            "II":row[51:],
            "data":[]
        }
        for key, value in typhoon.items():
            if key != 'data':
                typhoon[key] = value.strip()
        collection.append(typhoon)
    else:
        sub_data = {
            "A":row[:9],
            "B":row[9:12],
            "C":row[13:14],
            "D":row[15:18],
            "E":row[19:23],
            "F":row[24:32],
            "G":row[33:40],
            "H":row[41:42],
            "I":row[42:46],
            "J":row[47:51],
            "K":row[52:53],
            "L":row[54:57],
            "M":row[58:70],
            "P":row[71:]
        }
        for key, value in sub_data.items():
            sub_data[key] = value.strip()
        collection[-1]['data'].append(sub_data)
    return collection
    

# read through file sequentially
for line in lines:
    write_typhoon(line, collection)

# load to pandas df using json_normalize
df = pd.json_normalize(
    collection, 
    record_path='data', 
    meta=["AA","BB","CC","DD","EE","FF","GG","HH","II"] 
)
print(df.head(20)) # let's see what we've got!
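One caveat: since every field was sliced out of a string, all the resulting columns are object dtype. A possible follow-up, using the lettered columns defined above and assuming fields like B, D, and E are numeric in the real spec:

# All fields arrive as strings; convert the (assumed) numeric columns.
# errors="coerce" turns blank or malformed entries into NaN.
numeric_cols = ["B", "D", "E", "F", "G"]  # assumption about which are numeric
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")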

ImTryingMyBest
  • Wow, that's a great answer. I'm not working on climate change, unfortunately. I'm making a video game where natural disasters are generated on a realistic time-scale :) Also unfortunately, I figured out how to do it myself by the time I saw your answer, though your efforts are much appreciated. I've added my own solution, but will accept yours since you went to such great lengths :) – rocksNwaves Mar 19 '21 at 16:39

There's someone who seems to have had the same problem and created a library for it; you can check it out here: https://github.com/miniufo/besttracks

It also includes a quickstart notebook that loads this same dataset.

Gilian
  • Excellent, I actually just found this myself. It is most impressive!!! If I didn't have so much work, I'd love to up my skills by recreating his library! – rocksNwaves Mar 18 '21 at 20:05

Here is how I ended up doing it. The key was realizing that there are two types of rows in the data, but that within each type the columns are fixed-width:

header_fmt = "AAAAA BBBB  CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH              IIIIIIII"
track_fmt  = "AAAAAAAA BBB C DDD EEEE FFFF     GGG     HIIII JJJJ KLLLL MMMM         P"

So, here's how it went. I wrote these two functions to help me reformat the text file into CSV format:

def get_idxs(string, char):
    """Return the indices where `char` immediately follows a letter --
    i.e. the first delimiter position after each column name."""
    idxs = []
    # start at 1 so string[i - 1] never wraps around to the last character
    for i in range(1, len(string)):
        if string[i - 1].isalpha() and string[i] == char:
            idxs.append(i)
    return idxs

def replace(string, idx, replacement):
    """Replace the character(s) at `idx` (an index or iterable of indices)."""
    string = list(string)
    try:
        for i in idx: string[i] = replacement
    except TypeError:
        # idx was a single integer, not an iterable
        string[idx] = replacement
    return ''.join(string)

# test it out
header_fmt = "AAAAA BBBB  CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH              IIIIIIII"
track_fmt  = "AAAAAAAA BBB C DDD EEEE FFFF     GGG     HIIII JJJJ KLLLL MMMM         P"

header_idxs = get_idxs(header_fmt, ' ')
track_idxs  = get_idxs(track_fmt, ' ')
print(replace(header_fmt, header_idxs, ','))
print(replace(track_fmt, track_idxs, ','))

Testing the functions on the format strings, we see commas were put in the appropriate places:

AAAAA,BBBB, CCC,DDDD,EEEE,F,G,HHHHHHHHHHHHHHHHHHHH,             IIIIIIII
AAAAAAAA,BBB,C,DDD,EEEE,FFFF,    GGG,    HIIII,JJJJ,KLLLL,MMMM,        P

Next, apply those functions to the .txt file and create a .csv file with the output (note the leftover runs of spaces in the result above; skipinitialspace=True in read_csv will absorb those later):

import pandas as pd
from contextlib import ExitStack
from tqdm.notebook import tqdm

with ExitStack() as stack:

    read_file  = stack.enter_context(open('data/bst_all.txt', 'r'))
    # note: output is opened in append mode, so remove any stale
    # bst_all_clean.txt before re-running
    write_file = stack.enter_context(open('data/bst_all_clean.txt', 'a'))
    
    for line in tqdm(read_file.readlines()):
        if ' ' in line[:8]: # line is header data
            write_file.write(replace(line, header_idxs, ',') + '\n')
        else: # line is track data
            write_file.write(replace(line, track_idxs, ',') + '\n')

The next task is to add the header data to ALL rows, so that all rows have the same format:

header_cols = ['indicator', 'international_id', 'n_tracks', 'cyclone_id', 'international_id_dup', 
               'final_flag', 'delta_t_fin', 'name', 'last_revision']

track_cols = ['date', 'indicator', 'grade', 'latitude', 'longitude', 'pressure', 'max_wind_speed', 
              'dir_long50', 'long50', 'short50', 'dir_long30', 'long30', 'short30', 'jp_landfall']


data = pd.read_csv('data/bst_all_clean.txt', names=track_cols, skipinitialspace=True)

data.date = data.date.astype('string')

# Get headers. Header rows have variable 'indicator' which is 5 characters long.
headers = data[data.date.apply(len) <= 5] 
data[['storm_id', 'records', 'name']] = headers.iloc[:, [1, 2, 7]]

# Rearrange columns; bring identifiers to the first three columns. 
cols = list(data.columns[-3:]) + list(data.columns[:-3])
data = data[cols]

# forward-fill the header values down into each typhoon's track rows
data[['storm_id', 'records', 'name']] = data[['storm_id', 'records', 'name']].fillna(method='pad')

# delete now extraneous header rows
data = data.drop(headers.index)

And that yields some nicely formatted data, like this:


    storm_id    records name    date        indicator   grade   latitude    longitude
15  5102.0      37.0    GEORGIA 51031900    2           2       67.0        1614
16  5102.0      37.0    GEORGIA 51031906    2           2       70.0        1625
17  5102.0      37.0    GEORGIA 51031912    2           2       73.0        1635
rocksNwaves