3

I have read several articles on machine learning.

An example is this https://medium.com/technology-invention-and-more/how-to-build-a-simple-neural-network-in-9-lines-of-python-code-cc8f23647ca1 and was wondering if it's possible to modify the code to do the use case below.

Let's say the machine is given an excel file (assuming .xlsx) that has only 1 sheet.

Inside the sheet there will be a table (defined as having a header row and at least 2 data rows and will be agreed as a table by common sense humans who read the file. Not a named table or object as defined by MS Excel)

The following is unknown before hand:

  • the position of the table
  • the number of rows and columns
  • the data type in the table

This table in the excel file is also NOT named or identified beforehand in any way. There are no objects in this excel file. No named tables.

If I were to attempt to use VBA to list objects, I will get this error message

enter image description here

There will be at least 1 other cell outside of this table that will contain some text or value.

How or what algorithm using what software library can I train a machine to identify where does the table start and end?

E.g.

The table first cell (identified as the top left most cell in the header row) is A4 and the table last cell (identified as the bottom right most cell) is G12

The machine is able to read the excel file and say that the table is at A4:G12 even if there are other cells outside the table that has data.

My various google search involving "excel and machine learning" or "identify tables in excel" tends to give me articles on how to feed data to machine learning software using excel files or how to find/name data tables that are named in the Excel. I want to emphasize the table data is NOT named.

Also come across answers talking about extracting such table data from PDFs. But my focus is on Excel.

I have read that machine learning can be used to read images. So I imagine trying to "read" an excel file which is a highly structured file just to find where a table is should be possible.

Here are some sample screenshots

enter image description here

enter image description here

Here are the links to the Excel files

https://www.dropbox.com/s/l3vjjsgunp0zu23/A4toG12.xlsx?dl=0

https://www.dropbox.com/s/nwzw0211ruhwvf0/G7toN19.xlsx?dl=0

Please note that these are dummy files to illustrate my point about identifying the table data.

  • The first one should be identified as A4 to G12
  • The second one should be identified as G7 to N19

The comment by JanLauGe about ctrl + a is interesting See Using machine learning to parse excel file and extract table data with no named tables involved? If so, how do I get started?

I have googled for the equivalent in VBA but to no avail.

What I have tried

I wrote a script as suggested by Scott Craner to find all the cells with values I get this.

Thanks to his comments, I got it to work.

Public Sub LookForCells()
    For Each block In ActiveSheet.Columns("A:Z").SpecialCells(xlCellTypeConstants, 23).Areas
       MsgBox block.Count
    Next block
End Sub

And you should see the following

enter image description here

Kim Stacks
  • 10,202
  • 35
  • 151
  • 282

3 Answers3

2

I realize this is an old question, but found it while considering this topic and going through the exercise of extracting tabular data from Excel files. We also concluded that ML wasn't the right answer, at least for now, but that a rules-based approach was better.

The rules that we landed on were to identify table "corners" that had either a blank cell or a sheet boundary above and to the left, that were part of at least a 2x2 matrix of densely or sparsely populated data.

The second thing that helped was to extract greedily, and rely on filtering to narrow down the data we were interested in.

If you are still interested in this topic, I released a Python tool on on GitHub and PyPI (called "eparse") that crawls a filesystem for Excel spreadsheets, parses tabular data from them, and injects the parsed tabular data into a database. I also wrote an article about it.

  • Thank you for writing the article and sharing your python library, Chris. I need sometime to try it out before I can respond in full – Kim Stacks Jun 14 '23 at 01:45
1

I think for your particular use case you will be much better off with more conventional approaches, e.g.:

  • rule-based approaches (largest area of connected cells)
  • VBA script as provided by @Scott Craner in the comment above

However, to answer your question: To apply the most conventional flavours of machine learning (supervised learning) to your problem you will need labelled training data. This would consist of a number of excel spreadsheets for which the table is known and saved, ideally as the index of cells that are part of the table.

An alternative would be to use a pre-trained model if you can find one, or more sophisticated methods such as reinforcement learning (but not sure how this would work in this case)

JanLauGe
  • 2,297
  • 2
  • 16
  • 40
  • 1
    >This would consist of a number of excel spreadsheets for which the table is known and saved, ideally as the index of cells that are part of the table. I can provide that. How do I then go about doing this? Also if u talk about rule-based approach, how do I use that to identify the largest area of connected cells? Also the vba script provided assumed that the table was named. It's not. – Kim Stacks Dec 01 '17 at 14:48
  • 2
    How many labelled tables do you have? You mentioned deep learning, that would require thousands training of examples. For rule-based approaches, I would try to write a script that starts from the top-left of the sheet and works its way across to the bottom-right. For every cell with content that is encountered it checks the number of connected cells (pretty much what is selected if you go to that cell and press `ctrl` + `a`). The largest connected area on the spreadsheet is likely to be your table. Happy to be more concrete if you can provide some example data. – JanLauGe Dec 01 '17 at 14:55
  • 1
    See my question with some dummy examples. The `ctrl` + `a` is interesting. I didn't realise that was possible. I have googled for its equivalent in VBA but to no avail. And actually any other cell when you do `ctrl` + `a` selects the whole sheet. Probably should be the 2nd largest area selected then. – Kim Stacks Dec 01 '17 at 15:15
  • 1
    Yes, you're right about the second largest, because any empty cell outside of the table will select the entire worksheet. About your data, do you have any/many `` or empty fields in the table? – JanLauGe Dec 01 '17 at 18:14
1

I concur with JanLauGe. This problem doesn't seem reasonably suitable for machine learning in my opinon.

I did stumble across this relevant answer so perhaps this may help you.

How to automatically detect tables in Excel

Tophat
  • 143
  • 5
  • 1
    That's really weird. The link you sent is basically asking the same question. Though the answer provided is also assuming that the table in question is named. It's not. I have also attempted a similar macro using `ListObjects` I get a Object required error message. See my question for updates. – Kim Stacks Dec 01 '17 at 14:55