Txt file to excel conversion in python

Question

I'm trying to convert text file to excel sheet in python. The txt file contains data in the below specified formart

Column names: reg no, zip code, loc id, emp id, lastname, first name. Each record has one or more error numbers. Each record have their column names listed above the values. I would like to create an excel sheet containing reg no, firstname, lastname and errors listed in separate rows for each record.

How can I put the records in excel sheet ? Should I be using regular expressions ? And how can I insert error numbers in different rows for that corresponding record?

Expected output:

Here is the link to the input file: https://github.com/trEaSRE124/Text_Excel_python/blob/master/new.txt

Any code snippets or suggestions are kindly appreciated.

Take input from the file and create a data frame. Then follow: https://www.datacamp.com/community/tutorials/python-excel-tutorial — Ganesh Kathiresan, Dec 20 '17 at 03:21
Hi @GaneshK Could you please be more specific about creating data frame ? Any code snippet will be helpful. — Katherine, Dec 20 '17 at 03:27
You can convert input like: `entry1:[{'REG NO': 203}, {'ZIP CODE': 1234}...]`. This is basically a list with dicts of the column name to value. Then follow: http://pbpython.com/pandas-list-dict.html — Ganesh Kathiresan, Dec 20 '17 at 03:34
PS: Just read first line, do `.split()` to get list of column names, then `.split()` of third line to get values. Create dicts of them. — Ganesh Kathiresan, Dec 20 '17 at 03:35
I appreciate your effort @GaneshK. I have been trying to do the same but my problem is I'm able to create dictionary for 'reg no' but facing trouble to get error numbers as they are in different lines. — Katherine, Dec 20 '17 at 04:07
One way would be to check if starting of the input line is `ERROR NUM` and append to `ERROR` columns list. Have a go, else upload the .txt [maybe as a drive link] and I'll convert and post the code as answer by tonight. — Ganesh Kathiresan, Dec 20 '17 at 04:23
Is the data formatted consistently? For example... - Are all header lines identical? If not, how do they differ? - Do values ever extend past their columns? If so, do they impinge upon the next column? - Is there always a section containing ERRORs for each record? If so, is it always formatted the same way? — userNaN, Dec 20 '17 at 04:30
In the input file, do valid, useful lines of data *always* begin (except possibly for one or more blanks) with an integer? — Bill Bell, Dec 20 '17 at 04:32
@GaneshK here is the link to the input file https://github.com/trEaSRE124/Text_Excel_python/blob/master/new.txt — Katherine, Dec 20 '17 at 04:58
@BillBell Please look at the link for input file. Input data starts from 14 till line 37. All usefull lines start with integer followed by a blank and error details. After error lines, again starts heading line followed by data. — Katherine, Dec 20 '17 at 05:06
@userNaN All rows are identical and there errors for every record but do not have a heading called (ERRORS) unlike other columns which have headings. — Katherine, Dec 20 '17 at 05:07
You can do it by using the columns headers as splitting points, something like `document.read().split('REG NO ....... -----')`, it'll split the rows into elements of a list, now parse the list using RE or some string logic and finally get an multidimensional array and the write them as comma separated values and you're good to go! — Ubdus Samad, Dec 20 '17 at 17:24

Ganesh Kathiresan · Answer 1 · 2017-12-20T17:08:51.233

2

Here is a draft code. Let me know if any changes needed:

# import pandas as pd
from collections import OrderedDict
from datetime import date
import csv

with open('in.txt') as f:
    with open('out.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        #Remove inital clutter
        while("INPUT DATA" not in f.readline()):
            continue 

        header = ["REG NO", "ZIP CODE", "LOC ID", "EMP ID", "LASTNAME", "FIRSTNAME", "ERROR"]; data = list(); errors = list()
        spamwriter.writerow(header)
        print header

        while(True):
            line = f.readline()
            errors = list()
            if("END" in line):
                exit()
            try:
                int(line.split()[0])
                data = line.strip().split()
                f.readline() # get rid of \n
                line = f.readline()
                while("ERROR" in line):
                    errors.append(line.strip())
                    line = f.readline()
                spamwriter.writerow(data + errors)
                spamwriter.flush() 
            except:
                continue


        # while(True):
            # line = f.readline()

Use python-2 to run. The errors are appended as subsequent columns. It's slightly complicated the way you want it. I can fix it if still needed

Output looks like:

edited Dec 20 '17 at 17:08

answered Dec 20 '17 at 14:12

Ganesh Kathiresan

2,068
2
22
33

Sorry for the delay. I had an exam – Ganesh Kathiresan Dec 20 '17 at 14:13
You can skip the initial parts. I assumed the dataset will be as you uploaded. – Ganesh Kathiresan Dec 20 '17 at 14:31
I tried running your program, but it is taking forever to run the program and doesnot write the data to output file. I am running it in Python 2 using jupyter notebook. – Katherine Dec 20 '17 at 15:12
Probably an infinite loop. Like is the format exactly same? Add a few print statements here and there. Else I'll modify it. – Ganesh Kathiresan Dec 20 '17 at 16:44
Yes I did that. I have added print statements like this `int(line.split()[0]) print int(line.split()[0]) data = line.strip().split() print data`. – Katherine Dec 20 '17 at 16:50
It seems to print out all the statements but doesnot write items into file. Even have included couple of print statements below writerow(). Prints them on console. – Katherine Dec 20 '17 at 17:00
1

Oh I think this is the cause: https://stackoverflow.com/questions/3976711/csvwriter-not-saving-data-to-file-why – Ganesh Kathiresan Dec 20 '17 at 17:06

score 1 · Accepted Answer · answered Dec 20 '17 at 17:04

You can do this using the openpyxl library which is capable of depositing items directly into a spreadsheet. This code shows how to do that for your particular situation.

NEW_PERSON, ERROR_LINE = 1,2
def Line_items():
    with open('katherine.txt') as katherine:
        for line in katherine:
            line = line.strip()
            if not line:
                continue
            items = line.split()
            if items[0].isnumeric():
                yield NEW_PERSON, items
            elif items[:2] == ['ERROR', 'NUM']:
                yield ERROR_LINE, line
            else:
                continue

from openpyxl import Workbook
wb = Workbook()
ws = wb.active

ws['A2'] = 'REG NO'
ws['B2'] = 'LASTNAME'
ws['C2'] = 'FIRSTNAME'
ws['D2'] = 'ERROR'

row = 2
for kind, data in Line_items():
    if kind == NEW_PERSON:
        row += 2
        ws['A{:d}'.format(row)] = int(data[0])
        ws['B{:d}'.format(row)] = data[-2]
        ws['C{:d}'.format(row)] = data[-1]
        first = True
    else:
        if first:
            first = False
        else:
            row += 1
        ws['D{:d}'.format(row)] = data

wb.save(filename='katherine.xlsx')

This is a screen snapshot of the result.

Txt file to excel conversion in python

2 Answers2