
I'm new to pandas and Python, so I've run into some trouble. I have one large Excel file which I need to split into multiple worksheets using a Python script, based on the IP addresses given in the data. I can't figure out how to do that and would appreciate some help and guidance. I haven't worked with Python or any of its libraries before. This is what I did, but it ended up creating a separate sheet for every single row.

import pandas as pd
df = pd.read_excel("D:/Users/Zakir/Desktop/MyNotebooks/Legacy.xls", sheet_name="Total", header=0, names=None, index_col=None, parse_cols=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True)

writer = pd.ExcelWriter('D:/Users/Zakir/Desktop/MyNotebooks/pandas_simple.xlsx', engine='xlsxwriter')
for index, row in df.iterrows():
    df1 = df.iloc[[index]]
    df1.set_index('Number',inplace=True)
    df1.to_excel(writer,  sheet_name=row['IPAddress'])
writer.save()

This is the kind of Excel file I have: over 5000 rows. There are 60 groups of IP addresses, and each group has to be split into its own worksheet.

Mahadzk
  • It is easier to use the library `openpyxl` to do so. https://openpyxl.readthedocs.io/en/stable/. Also, you should provide what you have done and the structure of your Excel file so that we aren't working on air here. Otherwise, I could simply say `for i in df['ip_address']: wb.create_sheet(i)` – ycx Mar 05 '19 at 06:17
  • @ycx Thank you for the reply. Yes, sorry, I've edited the question, though I'm not sure I did a much better job. My apologies, it's my first time using Stack Overflow so I'm still getting used to it. Your help would be greatly appreciated. I'm still not sure if I've explained the situation well enough. – Mahadzk Mar 05 '19 at 06:56
  • A manual method in Excel would be to sort, then copy and paste... Or duplicate the file and sort, then delete the rows you don't want, rinse and repeat... – Solar Mike Mar 05 '19 at 06:58
  • @SolarMike Thank you for your input! Yes, I've completed the task manually using Kutools, but the requirement is to write a script to perform it, so I'm stuck with this :) – Mahadzk Mar 05 '19 at 07:12
  • So consider VBA... you know a manual method, so get Excel VBA to repeat it... – Solar Mike Mar 05 '19 at 07:17
  • The above code's for-loop is reading row by row and writing to Excel, so that is what you are getting. Instead you would want to filter first, then save the data to multiple workbooks (a rough sketch of this idea is below the comments). If you provide sample input data (unfiltered) and sample output data (after filtering, with the different tables), then people can also help. Please keep the sample input and output to a few rows so it is easy to read. – Hara Mar 05 '19 at 07:30
  • Just FYI, I see mangle_dupe_cols=True in your settings; I think it's bugged – Frenchy Mar 05 '19 at 07:58
  • If you can read and put your data in a DataFrame without memory problems, it's really easy; do this test first – Frenchy Mar 05 '19 at 08:01
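
A rough sketch of what the comments suggest (filter by each unique IP first, then write every subset to its own sheet), assuming the column is named IPAddress as in the question and the xlsxwriter engine is installed, could look like this:

import pandas as pd

df = pd.read_excel("D:/Users/Zakir/Desktop/MyNotebooks/Legacy.xls", sheet_name="Total")

writer = pd.ExcelWriter("D:/Users/Zakir/Desktop/MyNotebooks/pandas_simple.xlsx", engine="xlsxwriter")
for ip in df["IPAddress"].unique():
    # keep only the rows for this IP and write them to one worksheet
    df[df["IPAddress"] == ip].to_excel(writer, sheet_name=str(ip), index=False)
writer.save()  # on pandas 2.0+ use writer.close() instead

With 60 distinct IP addresses this produces 60 worksheets in a single workbook, instead of one sheet per row.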

2 Answers


One solution, if you have enough memory:

import pandas as pd
from pandas import ExcelWriter

df = pd.read_excel('file', sheet_name="Total", header=0)  # other settings.....
writer = ExcelWriter('E:/output.xlsx',engine='xlsxwriter')
print(df)
def writesheet(g):
    a = g['IPAddress'].tolist()[0]
    g.to_excel(writer, sheet_name=str(a), index=False)# index = True if you want to keep index


df.groupby('IPAddress').apply(writesheet)
writer.save()
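
A side note that is not part of the original answer: on newer pandas releases (2.0 and later) ExcelWriter.save() has been removed, so the same groupby idea would need writer.close() or, more simply, a with block that closes the workbook for you:

with ExcelWriter('E:/output.xlsx', engine='xlsxwriter') as writer:
    df.groupby('IPAddress').apply(writesheet)  # writesheet picks up the global writer
# the workbook is written and closed automatically when the block exits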
Frenchy

This is how I implemented the code to check a folder, loop through all the Excel files, and split each file by the values of a column name, which can be passed as input (vColName; see the note after the code for one way to supply it), assuming one sheet per file:

import sys
import os, shutil
from os import listdir
from os.path import isfile, join
import pandas as pd
import urllib as ul
import datetime
import xlrd

#this method retrieves all the xlsx filenames from a folder
def find_excel_filenames( path_to_dir, suffix=".xlsx" ):
    filenames = listdir(path_to_dir)
    return [ filename for filename in filenames if filename.endswith( suffix ) ]

#this folder contains .xlsx files
filePath = "D:\\files\\sample\\"

#there is a subfolder in my solution to move the processed files to
#and another subfolder for the split output files
archivePath = os.path.join(filePath, "archive")
outPath = os.path.join(filePath, "output")

#get a list of filenames
fnames = find_excel_filenames(filePath)

#loop through each file
for fl in fnames:
    vFile = os.path.join(filePath, fl)
    #load the content of the file into a data frame;
    #I open the file twice: first to get the number of columns and
    #build the converter, then to read the file with every column as string,
    #which helps avoid leading zeros being trimmed

    df = pd.read_excel(vFile, header=None)

    column_list = []
    for i in df:
        column_list.append(i)

    converter = {col: str for col in column_list} 

    df1 = pd.read_excel(vFile, converters=converter)
    colValues=df1[vColName].unique().tolist()

    for v in colValues:
        filteredDF = df1.loc[df1[vColName]==v]
        vOutFile = os.path.join(outPath, fl + '_' + v.replace("/", " ") + '.xlsx')
        writer = pd.ExcelWriter(vOutFile, engine='xlsxwriter')
        # Convert the dataframe to an XlsxWriter Excel object.
        filteredDF.to_excel(writer, sheet_name='Sheet1')
        # Close the Pandas Excel writer and output the Excel file.
        writer.save()

    #move the processed file to an archive folder
    dst_file = os.path.join(archivePath, fl)
    if os.path.exists(dst_file):
        os.remove(dst_file)
    shutil.move(vFile, archivePath)
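
One thing the script leaves open is where vColName comes from; the answer only says it "can be passed as input". A minimal sketch of one way to supply it (my assumption, not shown in the original), reusing the sys import that is already there, would be to place this near the top of the script, before the main loop:

# hypothetical: take the column name from the command line,
# defaulting to "IPAddress" to match the question's data
vColName = sys.argv[1] if len(sys.argv) > 1 else "IPAddress"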