0

I am currently trying to develop macros/programs to help me edit a big database in Excel.

Just recently I successfully wrote a custom macro in VBA, which stores two big arrays into memory, in memory it compares both arrays by only one column in each (for example by names), then the common items that reside in both arrays are copied into another temporary arrays TOGETHER with other entries in the same row of the array. So if row(11) name was "Tom", and it is common for both arrays, and next to Tom was his salary of 10,000 and his phone number, the entire row would be copied.

This was not easy, but I got to it somehow.

Now, this works like a charm for arrays as big as 10,000 rows x 5 columns + another array of the same size 10,000 rows x 5 columns. It compares and writes back to a new sheet in a few seconds. Great!

But now I tried a much bigger array with this method, say 200,000 rows x 10 columns + second array to be compared 10,000 rows x 10 columns...and it took a lot of time.

Problem is that Excel is only running at 25% CPU - I checked that online it is normal.

Thus, I am assuming that to get a better performance I would need to use another 'tool', in this case another programming language.

I heard that Python is great, Python is easy etc. but I am no programmer, I just learned a few dozen object names and I know some logic so I got around in VBA.

Is it Python? Or perhaps changing the programming language won't help? It is really important to me that the language is not too complicated - I've seen C++ and it stings my eyes, I literally have no idea what is going on in those codes.

If indeed python, what libraries should I start with? Perhaps learn some easy things first and then go into those arrays etc.? Thanks!

  • I like Python. You can use libs such as Pandas. It can easily handle your data. – Maeda Jun 15 '19 at 04:15
  • Like I said, I am not a programmer, in fact I never had any education in programming or anything. I heard a lot about Python being really versatile from many different sources. I know that Python is a library based language - there are many libraries, each has some characteristic and some specific use. I am just wondering if coding things like array manipulation will not be too hard for somebody who is not an IT person. –  Jun 15 '19 at 04:25
  • As @Maeda Suggested, If you want to go with Python, you can use [Pandas](https://www.youtube.com/watch?v=CmorAWRsCAw&list=PLeo1K3hjS3uuASpe-1LjfG5f14Bnozjwy) . Link will rovide you the tutorials on How to start with it. Other than Pandas I would suggest that you use SQL for this Purpose. – Mikku Jun 15 '19 at 04:36

2 Answers2

0

I have no intention of condescending but anything I say would sound like condescending, so so be it.

The operation you are doing is called join. It's a common operation in any kind of database. Unfortunately, Excel is not a database.

I suspect that you are doing NxM operation in Excel. 200,000 rows x 10,000 rows operation quickly explodes. Pick a key in N, search a row in M, and produce result. When you do this, regardless of computer language, the computation order becomes so large that there is no way to finish the task in reasonable amount of time.

In this case, 200,000 rows x 10,000 rows require about 5,000 lookup per every row on average in 200,000 rows. That's 1,000,000,000 times.

So, how do the real databases do this in reasonable amount of time? Use index. When you look into this 10,000 rows of table, what you are looking for is indexed so searching a row becomes log2(10,000). The total order of computation becomes N * log2(M) which is far more manageable. If you hash the key, the search cost is almost O(1) - meaning it's constant. So, the computation order becomes N.

What you are doing probably is, in real database term, full table scan. It is something to avoid for real database because it is slow.

If you use any real (SQL) database, or programming language that provides a key based search in dataset, your join will become really fast. It's nothing to do with any programming language. It is really a 101 of computer science.

I do not know anything about what Excel can do. If Excel provides some facility to lookup a row based on indexing or hashing, you may be able to speed it up drastically.

Naoyuki Tai
  • 403
  • 4
  • 7
  • Thank you! I will have to spend more time and try to learn something more 'database' related. Great answer, it actually enlightened me! And I am not being ironic, I think that I stated in the post that I am not a programmer and looking for help because I have no idea where to even start! Big thanks! –  Jun 15 '19 at 19:29
0

Ideally you want to design a database (there are many such as SQLite, PostgreSQL, MySQL etc.) and stick your data into it. SQL is the language of talking to a database (DML data manipulation language) or creating/editing the structure of the database (DDL data definition language).

Why a database? You’ll get data validation and the ability to query data with many relationships (such as One to Many, e.g. one author can have many books but you’ll have an Author table and a Book table and will need to join these).

Pandas works not just with databases but CSV and text files, Microsoft Excel, HDF5 and is great for reading and writing to these in memory structures as well as merging, joining, slicing the data. The quickest way to what you want is likely read the data you have into panda dataframes and then manipulate from there. This makes a database optional though recommended. See Pandas Merging 101 for an idea of what you can do with pandas.

Another python tool you could use is SQLAlchemy which is an ORM object relational mapper (converts say a row in an Author table to an Author class object in python). Whilst it’s important to know SQL and database principles you don’t need to use SQL statements directly when using SQLAlchemy.

Each of these areas are huge like the ocean. You can dip your toes into each but if you wade in too deep you’ll want to know how to swim. I have fist-sized books on each to give (that I’ve not finished) you a rough idea what I mean by this.

A possible roadmap may look like:

Database (optional but recommended):

  1. Learn about relational data
  2. Learn database design
  3. Learn SQL

Pandas (highly recommended):

  1. Learn to read and write data (to excel / database)
  2. Learn to merge, join, concatenate and update a DataFrame
Andrew Allen
  • 6,512
  • 5
  • 30
  • 73
  • Thank you! I guess I will have to look at SQL vs Pandas and try to decide what to choose basing on the difficulty I guess, as I am not a professional coder ! –  Jun 15 '19 at 19:30
  • I'm definitely not a professional coder (degree in math but never programmed). I sound like I know what I'm talking about as I've had more exposure but this stuff is mostly trial and error and but to get good requires occasionally reading documentation. [This](https://pbs.twimg.com/media/DvvUIpGVYAAhwGx.jpg) is my life. My email is on my profile if you ever just need to get unstuck. – Andrew Allen Jun 15 '19 at 20:03
  • Thanks, and I couldn't agree more regarding that picture! –  Jun 15 '19 at 20:33