How to parse this text file?

Question

                                                    VENDOR
ITEM NUMBER           WH ITEM DESCRIPTION               PRODUCT NUMBER      PRICE    DISC %   MAIN-WH    ALT-WH  BIN#
--------------- ----- -- ------------------------------ --------------- ---------    ------ --------- --------- ------
                                                                             0.00 EA   0.00         0         0

10.5PLC/TLED/26V/27K     14.5W 4PIN CFL REPL 2700K VERT 458406              20.00 EA   0.00         0         0        I68    I68
                                                         (00029  )

10.5PLC/TLED/26V/30K     14.5W 4PIN CFL REPL 3000K VERT 458414              20.00 EA   0.00         3         0 PAYOFF I68    I68
                                                         (00029  )

10.5PLC/TLED/26V/35K     14.5W 4PIN CFL REPL 3500K VERT 458422              20.00 EA   0.00         0         0        I68    I68
                                                         (00029  )

10.5PLC/TLED/26V/40K     14.5W 4PIN CFL REPL 4000K VERT 458430              20.00 EA   0.00         0         0        I68    I68
                                                         (00029  )

I want to read each line item and get the item Number, description, Vendor product number and price.

I tried using this python code

def readInventoryFile():
    # dataFile = open("inventoryFiles/INV.txt","r")
    with open('inventoryFiles/INV.txt') as dataFile:
        for lineItem in dataFile:
            itemProperties = lineItem.split("   ")
            while("" in itemProperties) :
                itemProperties.remove("")
            print(itemProperties)
            try:
                itemNum = itemProperties[0]
                itemDesc = itemProperties[1]
                partNumb = itemProperties[2]
                price = itemProperties[3]

                itemSummry = {
                    "Name": itemDesc,
                    "Price": price,
                    "PN": partNumb,
                }

                print(lineItem, "\n ",itemProperties,"\n Summary ",itemSummry)
            except Exception as e:
                print(e)

The code partially works but it's hard to split the line by spaces or other factors because there are spaces that separated and within the content of each line. How could I get the desired product properties?

Short answer is that splitting on spaces won't work so well for that file. Suggestion: Try `pandas` and reading "fixed-width format"... https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html and even then, you'll have to parse out some data — OneCricketeer, Sep 11 '19 at 22:30
As the fields start from fixed position in each line, you could just use slices at the level. If you want to parse subfields within each field, you then use split. — yoonghm, Sep 11 '19 at 22:38
The file uploaded by you have wrong formatting. Please correct before other could help you. — yoonghm, Sep 11 '19 at 22:44
Is `VENDOR` part of `PRODUCT NUMBER`. Is `(00029 )` belonged to `BIN#`. The first line has missing fields, is it correct? The file has 10 connected dashed but it has less number of header fields. What i see may not be what you see though. Perhaps you could explain number of fields and positions a field starts and end ends. — yoonghm, Sep 11 '19 at 23:04
VENDOR is part of PRODUCT NUMBER. 00029 is on a separate line below and can be ignored — e.iluf, Sep 11 '19 at 23:28

martineau · Accepted Answer · 2019-09-13T15:51:37.537

I think my answer to the question How to efficiently parse fixed width files? can be adapted to do what you want.

The main modification to the code in that answer was to make it also strip any leading and trailing spaces in each field. Below is Python 3.x code illustrating this:

from __future__ import print_function
import struct


HEADER_LINES = 5

# Indices       0       1        2      3      4      5       6      7
fieldwidths = (20, -5, 37, -10, 12, -1, 6, -1, 9, -1, 9, -1, 10, -1, 7)

# Convert fieldwidths into a format compatible with struct module.
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                    for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
#print('fmtstring: {!r}, recsize: {} chars\n'.format(fmtstring, fieldstruct.size))

unpack_from = fieldstruct.unpack_from  # To optimize calls.


def parse(line):
    """ Return unpacked fields in string line, stripped of any leading and
        trailing whitespace.
    """
    return list(s.decode().strip() for s in unpack_from(line.encode()))


def readInventoryFile(filename):
    with open(filename) as invfile:
        for _ in range(HEADER_LINES):
            next(invfile)  # Skip header lines.

        for line in invfile:
            if len(line) < fieldstruct.size:  # Pad line if it's too short.
                line = line + (' ' * (fieldstruct.size-len(line)))
            fields = parse(line)
            if fields[0]:  # First field non-blank?
                print(fields)

readInventoryFile('inventoryFiles_INV.txt')

Results:

['10.5PLC/TLED/26V/27K', '14.5W 4PIN CFL REPL 2700K VERT 458406', '20.00 EA', '0.00', '0', '0', 'I68', 'I68']
['10.5PLC/TLED/26V/30K', '14.5W 4PIN CFL REPL 3000K VERT 458414', '20.00 EA', '0.00', '3', '0', 'PAYOFF I68', 'I68']
['10.5PLC/TLED/26V/35K', '14.5W 4PIN CFL REPL 3500K VERT 458422', '20.00 EA', '0.00', '0', '0', 'I68', 'I68']
['10.5PLC/TLED/26V/40K', '14.5W 4PIN CFL REPL 4000K VERT 458430', '20.00 EA', '0.00', '0', '0', 'I68', 'I68']
['1000PAR64/FFR', '1000W PAR64 HALOGEN GX16D BASE 56217', '50.00 EA', '0.00', '0', '0', 'I10', '']
['1000PAR64/WFL/S', '1000W PAR64 HALOGEN GX16D BASE S4673', '0.00 EA', '0.00', '0', '0', '', 'I105']
['100A/99', '100W A19 EXTENDED SERVICE      229781', '2.62 EA', '0.00', '0', '0', 'W6-2   I70', 'I11']
['100A/CL', '100W A19 130V CLEAR            375279', '0.99 EA', '0.00', '0', '0', 'A2-2   I70', 'I11']

How this works

In a nutshell, this code leverages Python's struct module ability to split or "unpack" a "buffer" full of data into fixed-with "fields" each containing a certain number of characters.

Although more commonly applied to binary data, it also works on character strings that have been converted into arrays of bytes (which isn't necessary in Python 2.x). Basically you give it a format string specifying the characteristics of each these fields (type and size), along with the data to parse (a line from the file in this case), which it then unpacks accordingly and returns the results as a list of values.

amazing! could you please explain what is going on in the code? — e.iluf, Sep 12 '19 at 02:20

score 0 · Answer 2 · answered Sep 12 '19 at 00:12

Since your file is formatted in an inconvenient way I can only see this trick as solution: You can look at the headers (feature names) and parse all the lines according to the indices at which they appear, like such

import numpy as np
with open('/Users/Copo1/Desktop/aaa.txt') as dataFile:
    lines = dataFile.readlines()
headers = ['ITEM NUMBER','WH ITEM DESCRIPTION', 'PRODUCT NUMBER', 'PRICE']
starts = [lines[1].find(h) for h in headers]
starts.append(len(lines[0]))
headers.append(' ')
items = [[line[starts[i]:starts[i+1]] for line in lines] for i,h in zip(range(len(starts)-1), headers[:-1]) ]

this produces the following output for the items list (pasting only the first element, corresponding to 'ITEM NUMBER', the other elements are correct as well you can check).

[['                      ',
  'ITEM NUMBER           ',
  '--------------- ----- ',
  '                      ',
  ' \n',
  '10.5PLC/TLED/26V/27K  ',
  '                      ',
  ' \n',
  '10.5PLC/TLED/26V/30K  ',
  '                      ',
  ' \n',
  '10.5PLC/TLED/26V/35K  ',
  '                      ',
  ' \n',
  '10.5PLC/TLED/26V/40K  ',
  '                      ',
  ' \n',
  '1000PAR64/FFR         ',
  '                      ',
  ' \n',
  '1000PAR64/WFL/S       ',
  '                      ',
  ' \n',
  '100A/99               ',
  '                      ',
  ' \n',
  '100A/CL               ',
  '                      ',
  ' '],

There might be some extra easy polishing left to do after this (like removing empty strings and '\n's) but I am sure you can figure out yourself.

How to parse this text file?

2 Answers2

How this works