0

I'm trying to convert text file to excel sheet in python. The txt file contains data in the below specified formart

Sample data

Column names: reg no, zip code, loc id, emp id, lastname, first name. Each record has one or more error numbers. Each record have their column names listed above the values. I would like to create an excel sheet containing reg no, firstname, lastname and errors listed in separate rows for each record.

How can I put the records in excel sheet ? Should I be using regular expressions ? And how can I insert error numbers in different rows for that corresponding record?

Expected output:

enter image description here

Here is the link to the input file: https://github.com/trEaSRE124/Text_Excel_python/blob/master/new.txt

Any code snippets or suggestions are kindly appreciated.

Katherine
  • 107
  • 1
  • 9
  • Take input from the file and create a data frame. Then follow: https://www.datacamp.com/community/tutorials/python-excel-tutorial – Ganesh Kathiresan Dec 20 '17 at 03:21
  • Hi @GaneshK Could you please be more specific about creating data frame ? Any code snippet will be helpful. – Katherine Dec 20 '17 at 03:27
  • You can convert input like: `entry1:[{'REG NO': 203}, {'ZIP CODE': 1234}...]`. This is basically a list with dicts of the column name to value. Then follow: http://pbpython.com/pandas-list-dict.html – Ganesh Kathiresan Dec 20 '17 at 03:34
  • PS: Just read first line, do `.split()` to get list of column names, then `.split()` of third line to get values. Create dicts of them. – Ganesh Kathiresan Dec 20 '17 at 03:35
  • I appreciate your effort @GaneshK. I have been trying to do the same but my problem is I'm able to create dictionary for 'reg no' but facing trouble to get error numbers as they are in different lines. – Katherine Dec 20 '17 at 04:07
  • One way would be to check if starting of the input line is `ERROR NUM` and append to `ERROR` columns list. Have a go, else upload the .txt [maybe as a drive link] and I'll convert and post the code as answer by tonight. – Ganesh Kathiresan Dec 20 '17 at 04:23
  • Is the data formatted consistently? For example... - Are all header lines identical? If not, how do they differ? - Do values ever extend past their columns? If so, do they impinge upon the next column? - Is there always a section containing ERRORs for each record? If so, is it always formatted the same way? – userNaN Dec 20 '17 at 04:30
  • In the input file, do valid, useful lines of data *always* begin (except possibly for one or more blanks) with an integer? – Bill Bell Dec 20 '17 at 04:32
  • Now I see, you want the errors as well. – Bill Bell Dec 20 '17 at 04:39
  • @GaneshK here is the link to the input file https://github.com/trEaSRE124/Text_Excel_python/blob/master/new.txt – Katherine Dec 20 '17 at 04:58
  • @BillBell Please look at the link for input file. Input data starts from 14 till line 37. All usefull lines start with integer followed by a blank and error details. After error lines, again starts heading line followed by data. – Katherine Dec 20 '17 at 05:06
  • @userNaN All rows are identical and there errors for every record but do not have a heading called (ERRORS) unlike other columns which have headings. – Katherine Dec 20 '17 at 05:07
  • You can do it by using the columns headers as splitting points, something like `document.read().split('REG NO ....... -----')`, it'll split the rows into elements of a list, now parse the list using RE or some string logic and finally get an multidimensional array and the write them as comma separated values and you're good to go! – Ubdus Samad Dec 20 '17 at 17:24

2 Answers2

2

Here is a draft code. Let me know if any changes needed:

# import pandas as pd
from collections import OrderedDict
from datetime import date
import csv

with open('in.txt') as f:
    with open('out.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        #Remove inital clutter
        while("INPUT DATA" not in f.readline()):
            continue 

        header = ["REG NO", "ZIP CODE", "LOC ID", "EMP ID", "LASTNAME", "FIRSTNAME", "ERROR"]; data = list(); errors = list()
        spamwriter.writerow(header)
        print header

        while(True):
            line = f.readline()
            errors = list()
            if("END" in line):
                exit()
            try:
                int(line.split()[0])
                data = line.strip().split()
                f.readline() # get rid of \n
                line = f.readline()
                while("ERROR" in line):
                    errors.append(line.strip())
                    line = f.readline()
                spamwriter.writerow(data + errors)
                spamwriter.flush() 
            except:
                continue


        # while(True):
            # line = f.readline()

Use python-2 to run. The errors are appended as subsequent columns. It's slightly complicated the way you want it. I can fix it if still needed

Output looks like: enter image description here

Ganesh Kathiresan
  • 2,068
  • 2
  • 22
  • 33
1

You can do this using the openpyxl library which is capable of depositing items directly into a spreadsheet. This code shows how to do that for your particular situation.

NEW_PERSON, ERROR_LINE = 1,2
def Line_items():
    with open('katherine.txt') as katherine:
        for line in katherine:
            line = line.strip()
            if not line:
                continue
            items = line.split()
            if items[0].isnumeric():
                yield NEW_PERSON, items
            elif items[:2] == ['ERROR', 'NUM']:
                yield ERROR_LINE, line
            else:
                continue

from openpyxl import Workbook
wb = Workbook()
ws = wb.active

ws['A2'] = 'REG NO'
ws['B2'] = 'LASTNAME'
ws['C2'] = 'FIRSTNAME'
ws['D2'] = 'ERROR'

row = 2
for kind, data in Line_items():
    if kind == NEW_PERSON:
        row += 2
        ws['A{:d}'.format(row)] = int(data[0])
        ws['B{:d}'.format(row)] = data[-2]
        ws['C{:d}'.format(row)] = data[-1]
        first = True
    else:
        if first:
            first = False
        else:
            row += 1
        ws['D{:d}'.format(row)] = data

wb.save(filename='katherine.xlsx')

This is a screen snapshot of the result.

spreadsheet

Bill Bell
  • 21,021
  • 5
  • 43
  • 58