1

What is the easiest way to read the highlighted table in the screenshot below from Excel into a Pandas DataFrame? Suppose I have thousands of worksheets like this. The area I want to read has "Col4" at the top-left corner and doesn't have an entire blank row or column. "Col4" can appear at any (row, column) on the worksheet.

I suppose I can always go with the brutal force approach, where I read the entire sheet first, find the position of "Col4", and then extract the part I want. But I am wondering if there is any easier way to do it.

Also, I have only worked with Pandas so far. I know there are many other packages besides Pandas such as xlwings or xlrd. If you know any of these packages can be helpful please let me know and it will be very appreciated as well.

enter image description here

Note that this question is not a duplicate of pandas read_excel multiple tables on the same sheet, because the solution in that post only handles the case where the row offset is known beforehand.

The business problem behind this I am solving is to read many spreadsheets created by non-engineer crews (HR, accounting, etc.) in my company, and unfortunately they didn't create the spreadsheets in a consistent and programming-friendly way.

GoCurry
  • 899
  • 11
  • 31
  • Do the arrays all look the same? Can you just find the starting column header and run it to the end? – Hatt Aug 27 '18 at 19:04
  • Re: Hatt. Thank you. Please pardon my programming skills. I don't understand your question. Do you mind to be more specific? – GoCurry Aug 27 '18 at 19:24
  • Does every excel sheet have Col4, Col5, Col6, blank as the result you're looking for? – Hatt Aug 27 '18 at 19:51
  • No, only Col4 is on every sheet. The union of all columns on all sheets can be good guessed though, if that helps. Let's say the columns can only be Col4 - Col20 and many blanks. – GoCurry Aug 27 '18 at 23:42
  • Were you able to find a solution to the problem. I have a similar use case – Kuncheria Sep 04 '20 at 05:29

1 Answers1

0

Python is pretty powerful, but I don't think you're going to get the kind of flexibility that you are looking for using it With Excel. Maybe someone will post a solution, but if not, you can use VBA for this task, and when everything is aggregated into one single sheet, use Python to read from that single source.

Sub CopyRangeFromMultipleSheets()

'Declaring variables
Dim Source As Worksheet
Dim Destination As Worksheet
Dim SourceLastRow, DestLastRow As Long

Application.ScreenUpdating = False

'Looping through all sheets to check whether "Master" sheet exist
For Each Source In ThisWorkbook.Worksheets
    If Source.Name = "Master" Then
        MsgBox "Master sheet already exist"
        Exit Sub
    End If
Next

'Inserting a new sheet after the "Main" sheet
Set Destination = Worksheets.Add(after:=Sheets("Sheet1"))

Destination.Name = "Master"

'Looping through all the sheets in the workbook
For Each Source In ThisWorkbook.Worksheets

    'Preventing consolidation of data from "Main" and "Master" sheet
    If Source.Name <> "Master" Then

        SourceLastRow = Source.Range("A1").SpecialCells(xlLastCell).Row

        Source.Activate

        If Source.UsedRange.Count > 1 Then

            DestLastRow = Sheets("Master").Range("A1").SpecialCells(xlLastCell).Row

            If DestLastRow = 1 Then
                'copying data from the source sheet to destination sheet
                Source.Range("D1", Range("T1").SpecialCells(xlLastCell)).Copy Destination.Range("A" & DestLastRow)
            Else
                Source.Range("D1", Range("T1").SpecialCells(xlCellTypeLastCell)).Copy Destination.Range("A" & (DestLastRow + 1))
            End If

        End If
    End If
Next

Destination.Activate

Application.ScreenUpdating = True

End Sub
ASH
  • 20,759
  • 19
  • 87
  • 200