2

So far for my code to read from text files and export to Excel I have:

import glob

data = {}
for infile in glob.glob("*.txt"):
    with open(infile) as inf:
        data[infile] = [l[:-1] for l in inf] 

with open("summary.xls", "w") as outf:
    outf.write("\t".join(data.keys()) + "\n")
    for sublst in zip(*data.values()):
        outf.write("\t".join(sublst) + "\n")

The goal with this was to reach all of the text files in a specific folder.

However, when I run it, Excel gives me an error saying,

"File cannot be opened because: Invalid at the top level of the document. Line 1, Position 1. outputgooderr.txt outputbaderr.txt. fixed_inv.txt

Note: outputgooderr.txt, outputbaderr.txt.,fixed_inv.txt are the names of the text files I wish to export to Excel, one file per sheet.

When I only have one file for the program to read, it is able to extract the data. Unfortunately, this is not what I would like since I have multiple files.

Please let me know of any ways I can combat this. I am very much so a beginner in programming in general and would appreciate any advice! Thank you.

Ruth
  • 71
  • 2
  • 9

1 Answers1

1

If you're not opposed to having the outputted excel file as a .xlsx rather than .xls, I'd recommend making use of some of the features of Pandas. In particular pandas.read_csv() and DataFrame.to_excel()

I've provided a fully reproducible example of how you might go about doing this. Please note that I create 2 .txt files in the first 3 lines for the test.

import pandas as pd
import numpy as np
import glob

# Creating a dataframe and saving as test_1.txt/test_2.txt in current directory
# feel free to remove the next 3 lines if yo want to test in your directory
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
df.to_csv('test_1.txt', index=False)
df.to_csv('test_2.txt', index=False)

txt_list = [] # empty list
sheet_list = [] # empty list

# a for loop through filenames matching a specified pattern (.txt) in the current directory
for infile in glob.glob("*.txt"): 
    outfile = infile.replace('.txt', '') #removing '.txt' for excel sheet names
    sheet_list.append(outfile) #appending for excel sheet name to sheet_list
    txt_list.append(infile) #appending for '...txt' to txtt_list

writer = pd.ExcelWriter('summary.xlsx', engine='xlsxwriter')

# a for loop through all elements in txt_list
for i in range(0, len(txt_list)):
    df = pd.read_csv('%s' % (txt_list[i])) #reading element from txt_list at index = i 
    df.to_excel(writer, sheet_name='%s' % (sheet_list[i]), index=False) #reading element from sheet_list at index = i 

writer.save()

Output example:

Expected Output

patrickjlong1
  • 3,683
  • 1
  • 18
  • 32
  • 1
    Thank you! :) Since I have three files I would like to extract information from, would I use something like df = ('file1.txt', 'file2.txt', 'file3.txt') instead of the three lines you put? Hope this isn't a stupid question! – Ruth Feb 13 '18 at 19:42
  • 1
    There's no need to do that. If you remove those 3 lines and run the script from the directory containing the .txt files that you want to convert into one excel file, it should work as is. Alternatively if you want to run the python script from a different directory you could use the `os` module to change directory - `os.chdir(r"c:\some\folder")`. Whichever you choose to do, it might be helpful to print the two lists (txt_list and sheet_list) to convince yourself of how this is working. – patrickjlong1 Feb 13 '18 at 20:44
  • Okay! Tried that but I get this error for some reason: **pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 9 Exception Exception: Exception('Exception caught in workbook destructor. Explici t close() may be required for workbook.',) in > ignored.** Thanks again for all of your help, and let me know if you know anything about this! – Ruth Feb 13 '18 at 21:09
  • 1
    I’m pretty sure it’s having an issue reading one of the .txt files. It looks like it could be something to do with the delimiter. This answer - https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data has some good info on some of the solutions associated with this Error. Hope that is helpful. – patrickjlong1 Feb 13 '18 at 21:29
  • 2
    FWIW going from CSV to XLSX via Pandas is inefficient because it means going from rows to columns back to rows again. Both openpyxl and xlsxwriter support streaming and it makes more sense to use them directly, especially if the files are large. – Charlie Clark Feb 14 '18 at 12:37
  • @patrickjlong1 Even with fixing that error, many more still pop up including: **data, consumed = self.encode(object, self.errors) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 92: ordinal not in range(128).** Please let me know if you get errors too when removing the three lines! – Ruth Feb 14 '18 at 15:45
  • 1
    @Ruth - This is likely an issue reading one or several of the .txt files in your directory. It's likely something to do with the delimiter being used. Perhaps explore some of the optional parameters with pandas.read_csv(). – patrickjlong1 Feb 14 '18 at 15:53
  • @patrickjlong1 Oh okay, got it to work! I would greatly appreciate if you could take a minute or two to comment out an explanation for your example code so I can understand and hopefully replicate it on my own. Thanks so much! – Ruth Feb 14 '18 at 16:25
  • @patrickjlong1 Is the code written in such a way that the first row in Excel is bolded and centered? If so, how do I get rid of that? – Ruth Feb 14 '18 at 16:39
  • 1
    @Ruth - I think this answer ( https://stackoverflow.com/questions/36694313/pandas-xlsxwriter-format-header ) should point you in the right direction on the bold format header front. Bold header is the default. – patrickjlong1 Feb 14 '18 at 16:44
  • @patrickjlong1 Do you could take a minute or two to comment out an explanation for your example code so I can understand it? – Ruth Feb 16 '18 at 19:10
  • 1
    @Ruth - I've just added several comments throughout the code to help you understand it. – patrickjlong1 Feb 16 '18 at 20:21