0

I have a lot of excel files (around 30K), each of them has attributes of items(one excel having 3K rows each). There are some columns which exist in all the excel files. Also, there are more columns which might not exist in all of them. I want to merge them to a single data frame.

I tried using pandas.read_excel to read each data frame, and then merging them through pandas.append, this is not only very slow but also fails for some of the files.

Code used:

all_data = pd.DataFrame()
dfs = []
for f in glob.glob("sheets_*.xlsx"):
    temp = pd.read_excel(f, sheetname='ItemDetail', skiprows=[0, 2],index_col=0)
    temp = clean_data(temp) # Do some cleaning here.
    dfs.append(temp)

all_data = all_data.append(dfs,ignore_index=True)

Example:-

Excel 1

| Item Id   | Source  | country   | Item Name   |Item Weight   |Cost |
----------------------------------------------------------------------
|   1       |   x     | India     |   Pen       |     10       | 100 |
|   2       |   y     | Australia |   Pencil    |     15       | 50  | 
|   3       |   x     | Germany   |   Eraser    |      5       | 20  |
|   4       |   y     | India     |   Box       |     80       | 200 |
----------------------------------------------------------------------

Excel 2

| Item Id   | Source  | country   | Item Name   |Item Weight   |Length| Width |
|-----------------------------------------------------------------------------|
|   1       |   x     | Australia |   chair     |     100      | 20   |   26  |
|   2       |   y     | Australia |   cupboard  |     150      | 30   |   40  |
|   3       |   x     | Germany   |   Table     |      500     | 60   |   50  |
|   4       |   y     | Germany   |   Tool      |     360      | 20   |   80  |
|-----------------------------------------------------------------------------|

Final Merged Data:

| Item Id   | Source  | country   | Item Name   |Item Weight   |Length| Width | Cost |
|------------------------------------------------------------------------------------|
|   10      |   x     | Australia |   chair     |     100      | 20   |   26  |  NAN |
|   26      |   y     | Australia |   cupboard  |     150      | 30   |   40  |  NAN |
|   38      |   x     | Germany   |   Table     |     500      | 60   |   50  |  NAN |
|   41      |   y     | Germany   |   Tool      |     360      | 20   |   80  |  NAN |
|   1       |   x     | India     |   Pen       |      10      | NAN  |  NAN  |  100 |
|   2       |   y     | Australia |   Pencil    |      15      | NAN  |  NAN  |  50  |
|   3       |   x     | Germany   |   Eraser    |       5      | NAN  |  NAN  |  20  |
|   4       |   y     | India     |   Box       |      80      | NAN  |  NAN  |  200 |
|------------------------------------------------------------------------------------|

Note that in this example there are columns like Item Id , Source and country which is present in all of them, but ther columns might not be present in all of them.

Also number of columns in original data is around 150. Number of rows in each sheet is around 3000, and I have around 35K such sheets. So I am looking best way to load all these data into pandas.

sukhwant prafullit
  • 351
  • 1
  • 4
  • 12

0 Answers0