I have multiple source Excel files that I need to combine into one. Each source file has a single column formatted as General, and the values in its rows look like this:
D0200000001104566 000153536 Q773 00000
D0200000001104566 000157902 Q322 00000
D0200000001104566 143253486 Q339 00000
More numbers and letters follow the zeroes, separated by a varying number of spaces, for a total of 700 characters in each row, but I didn't think it was necessary to show them here. The goal is an output Excel file with two columns, formatted like this:
A B
104566 000153536
104566 000157902
104566 143253486
Using the following code I found here, I was able to combine the Excel files into a dataframe and then publish the Excel file:
import os
import glob
import pandas as pd
import numpy as np
#setting variables
path = <file path>
allFiles = glob.glob(os.path.join(path,"*.xlsx"))
outfile = <file path>
#looping through excel files to merge into one file
np_array_list = []
for file_ in allFiles:
    df = pd.read_excel(file_, index_col=None, header=0)
    df = df[:-1]  # drop the last row of each file
    np_array_list.append(df.values)  # .values instead of the deprecated as_matrix()
comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array, columns=['A'])
#publishing dataframe as one excel file
print("Hang in there...should be done quick")
big_frame.to_excel(outfile,index=None)
I'm having trouble with 2 things.
1. Figuring out how to split the big_frame's one column into 2 so that the output Excel file will have the desired format. I have tried adding the following line right before publishing the Excel file,
big_frame = big_frame.A.str.split(" ", expand=True)
tried converting column A to a string before attempting the split (but this was highly inefficient!), and tried a few other suggestions, but none of them worked.
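2. Deleting the first 11 characters of each value, so that only the last 6 characters of the first field (e.g. 104566) end up in column A.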
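For reference, here is a minimal sketch of how both steps might be done with pandas string methods. It assumes the column is named 'A' and already holds text, and that the two wanted fields are always the first two whitespace-separated tokens; the sample frame is built from the three rows shown earlier, not from the real files.

```python
import pandas as pd

# Stand-in for big_frame, built from the three sample rows in the question.
big_frame = pd.DataFrame({"A": [
    "D0200000001104566 000153536 Q773 00000",
    "D0200000001104566 000157902 Q322 00000",
    "D0200000001104566 143253486 Q339 00000",
]})

# With no pattern given, str.split splits on runs of whitespace, so a
# varying number of spaces between fields does not matter. n=2 stops after
# two splits, so the rest of the 700-character row stays in one column.
parts = big_frame["A"].str.split(n=2, expand=True)

result = pd.DataFrame({
    "A": parts[0].str[11:],  # drop the first 11 characters of the first field
    "B": parts[1],           # second field, kept as text so leading zeros survive
})
print(result)
```

Splitting with a literal " " as the separator, as in the attempt above, produces an empty field for every extra space, which is one reason to let pandas split on whitespace runs instead.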
Should I be using a completely different approach than this script to even combine my Excel files into a dataframe or can this script be modified to accomplish my goal efficiently?
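As a point of comparison, one alternative for the combining step is to skip NumPy and stack the frames directly with pd.concat. This is only a sketch, assuming every workbook has exactly one column; read_single_column, stack_frames, and main are hypothetical helper names, and src_dir and outfile stand in for the real paths.

```python
import glob
import os

import pandas as pd


def read_single_column(path):
    """Read one workbook's only column as text and name it 'A'."""
    # dtype=str keeps the values as text; header=0 and dropping the
    # last row mirror what the original script does.
    df = pd.read_excel(path, header=0, dtype=str)
    df.columns = ["A"]  # assumes exactly one column per file
    return df[:-1]


def stack_frames(frames):
    """Stack the per-file frames into one frame with a fresh index."""
    return pd.concat(frames, ignore_index=True)


def main(src_dir, outfile):
    # src_dir and outfile are placeholders for the real paths.
    files = sorted(glob.glob(os.path.join(src_dir, "*.xlsx")))
    big_frame = stack_frames([read_single_column(f) for f in files])
    big_frame.to_excel(outfile, index=False)
```

Because the result is an ordinary DataFrame whose column A holds strings, the .str methods can be applied to it without a separate conversion step.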