
I have multiple source Excel files that I need to combine into one. All source files have a single column with data formatted as General, and the values in the rows look like this:

D0200000001104566   000153536                Q773                                     00000
D0200000001104566   000157902                Q322                                     00000
D0200000001104566   143253486                Q339                                     00000

There are more numbers and letters after the zeroes, with a varying number of spaces between them, for a total of 700 characters in each row, but I didn't think it was necessary to show all of that here. The goal is to have an output Excel file with two columns, formatted like this:

     A          B
104566  000153536
104566  000157902
104566  143253486

Using the following code I found here, I was able to combine the Excel files into a dataframe and then publish the Excel file:

import os
import glob
import pandas as pd
import numpy as np

#setting variables
path = <file path>
allFiles = glob.glob(os.path.join(path, "*.xlsx"))
outfile = <file path>

#looping through excel files to merge into one file    
np_array_list = []
for file_ in allFiles:
    df = pd.read_excel(file_, index_col=None, header=0)
    df = df[:-1]  #drop the last row of each file
    np_array_list.append(df.values)  #.as_matrix() is deprecated; .values returns the same NumPy array
comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array, columns=['A'])

#publishing dataframe as one excel file
print("Hang in there...should be done quick")
big_frame.to_excel(outfile,index=None)

I'm having trouble with 2 things.

1. Figuring out how to split big_frame's single column into two so that the output Excel file has the desired format. I have tried adding the following line right before publishing the Excel file,

big_frame = big_frame.A.str.split(" ", expand=True)

I also tried converting column A into a string before attempting to split (but this was highly inefficient!!!), and a few other suggestions, but have not been successful.
2. Deleting the first 11 characters of each value.

Should I be using a completely different approach to combine my Excel files into a dataframe, or can this script be modified to accomplish my goal efficiently?

  • Can you explain `tried converting column A into a string before attempting to split (but this was highly inefficient!!!)`? Because if column `A` contains values with spaces like your sample, it is a `string` column. Or not? Also, can you check whether my edit works with your data? – jezrael Apr 22 '18 at 07:01
  • When I checked the datatype of that column, it was object, that is why I think str.split wasn't working. Total size of my dataset is around 100k rows. Because each row has 700 characters with multiple text strings in there, converting the column into string and then splitting them by spaces results in 611 columns. With about 100k rows and so many columns, I believe that is why it's slowing down. I'm sure I can then extract only the desired columns, combine them, and then remove the 11 characters in the beginning but this process seems really slow. I'm checking your edit now. – techscolasticus Apr 22 '18 at 13:18
  • `object` obviously means `string`; you can check it by `print (df['A'].head().apply(type))`, so if it is a string, no conversion is necessary. – jezrael Apr 22 '18 at 13:20
  • This is strange. I used `big_frame = big_frame.A.str.split(" ", expand=True)` yesterday and Spyder wasn't even splitting the columns. Anyway, I tried it again and the column was split into 611 columns, but with just 21.5k rows, combining the files into one took 2 min 33 sec. I haven't even taken the steps to get the output in the desired format. There must be a faster way? – techscolasticus Apr 22 '18 at 13:45
  • I have no idea about faster way if all data are in one column in excel. – jezrael Apr 22 '18 at 13:47

1 Answer

I think you need to remove the first 11 characters with `str[]` first and then use `split` with no separator, because whitespace is the default separator:

df = big_frame['A'].str[11:].str.split(expand=True)[[0,1]]
df.columns = ['A','B']
print (df)
        A          B
0  104566  000153536
1  104566  000157902
2  104566  143253486
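
A side note on the performance issue raised in the comments: passing `n=2` to `str.split` caps the number of splits at two, so pandas creates at most three pieces per row instead of expanding each 700-character row into ~611 columns. A minimal sketch, using toy data standing in for big_frame:

import pandas as pd

#toy rows standing in for the real 700-character data
big_frame = pd.DataFrame({'A': [
    'D0200000001104566   000153536                Q773        00000',
    'D0200000001104566   000157902                Q322        00000',
]})

#n=2 stops after two splits, avoiding one column per whitespace-separated token
df = big_frame['A'].str[11:].str.split(n=2, expand=True)[[0, 1]]
df.columns = ['A', 'B']
print (df)
#        A          B
#0  104566  000153536
#1  104566  000157902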

If no NaNs are possible, use a list comprehension:

df = pd.DataFrame([x[11:].split()[:2] for x in big_frame['A']], columns=['A','B'])
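
The NaN caveat matters because a missing cell arrives as a float, which supports neither `[11:]` nor `.split()`. A guarded variant (a sketch, assuming missing rows should become empty cells):

import pandas as pd

#NaN is a float, so apply the comprehension only to real strings
raw = pd.Series(['D0200000001104566   000153536   Q773', float('nan')])
df = pd.DataFrame(
    [x[11:].split()[:2] if isinstance(x, str) else [None, None] for x in raw],
    columns=['A', 'B'],
)
print (df)
#        A          B
#0  104566  000153536
#1    None       None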

EDIT:

You can also use a list comprehension with `concat` to build the big DataFrame:

big_frame = pd.concat([pd.read_excel(f, index_col=None, header=0)[:-1] for f in allFiles])
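
Putting the pieces together, a sketch of the whole pipeline (assuming `path` and `outfile` are set as in the question, and each file holds the single column of 700-character strings):

import os
import glob
import pandas as pd

allFiles = glob.glob(os.path.join(path, "*.xlsx"))

#concat replaces the np.vstack round-trip from the question
big_frame = pd.concat(
    [pd.read_excel(f, index_col=None, header=0)[:-1] for f in allFiles],
    ignore_index=True,
)
big_frame.columns = ['A']

#strip the 11-character prefix, split on whitespace, keep the first two fields
out = big_frame['A'].str[11:].str.split(n=2, expand=True)[[0, 1]]
out.columns = ['A', 'B']
out.to_excel(outfile, index=None)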
– jezrael
  • Thanks jezrael. I didn't use your edit, the original answer was fast enough. Thanks again!!! – techscolasticus Apr 22 '18 at 13:49
  • @techscolasticus - Glad can help! – jezrael Apr 22 '18 at 13:50
  • Do you, @jezrael, know what aspect of my question would cause downvotes? Since asking, this has been downvoted twice. Not saying that you downvoted it, but in general, since you've been around for a while, how could I have made this question better so that it doesn't get downvoted? – techscolasticus May 14 '18 at 18:21
  • @techscolasticus In my opinion the problem is too much text. People want an [mcve](https://stackoverflow.com/help/mcve), and if not, they downvote too fast. How to create a good question with good sample data is explained [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – jezrael May 14 '18 at 18:52