Splitting up large data sets into smaller pieces for web viewing

Question

Background

I am developing a web platform for web applications. One of the key functions of this platform is user management. Therefore I have implemented a login system that enables registered users to log into the platform and the application. This is implemented using a MySQL database which contains login credentials, userids, contact information, etc...

The Problem

The issue that I am running in to is with the users table. This table can have the potential to be quite large. So if an admin wants to edit the information for a specific user, depending on where the user is in the list, the admin will have to scroll through potentially thousands of records to find the correct one. I am looking for is an algorithm that will split this up into groups of about size N and display those groups as ranges that the admin can click on: #0-B, C-F, G-L, M-R, etc....

Research

This is the research that I have conducted so far:

Database partition - Better done by PHP or MySQL?

1 very large table or 3 large table? MySQL Performance

With partitioning, there are many articles I found on the web to partition a database table, but nothing addresses what I'm trying to do.

What I have done thus far

I have written a program that runs offline from the web server that returns the counts for each bucket 0-9 and A-Z. For counts less than the threshold, it combines the buckets into a range and saves that to the database. So if total count of buckets 0-9 and A, B are less than the threshold, then the range becomes #0-B. I'm having problems trying to figure out how to get a range from a bucket if the size of the bucket is greater than the threshold.

I have considered the following solutions:

A recursive function that keeps drilling down the usernames until it's within the threshold range. This would be ideal if I could get it working.
Since this is running offline, another solution that I am entertaining is loading the entire bucket into memory and splitting it up from there. The drawback for this is the potential memory usage.
Another idea that I'm entertaining is a variation on #2 which is load parts of the bucket into memory and process it. This will reduce the memory usage to more or less a fixed amount, but it will require more time to process.

EDIT 5-11/2020:

Based on the one answer and the comments, I could have a search field that would use JSON to populate the list when the number of entries is narrowed down to something below the threshold value (That is a good idea). But, I do have code which I will share with you:

This code works, but with the test data, it creates ~17,500 entries in the partition table, and there's quite a few entries that have zero counts.

function part_recurse2($prefix)
{
    global $CONFIGVAR;
    global $charList;
    global $charCount;

    $list = dbCountScan($prefix);
    for ($i = 0; $i < count($list); $i++)
    {
        $char = substr($charList, $i, 1);
        if ($list[$i] >= $CONFIGVAR['user_partition_max_size']['value'])
        {
            part_recurse2($prefix . $char);
        }
        else
        {
            writeRange($prefix . $char, $prefix . $char, $list[$i],
                substr($prefix . $char, 0, 1));
        }
    }
}

And the results from the database...

mysql> select sum(`count`) from userpart;
+--------------+
| sum(`count`) |
+--------------+
|       100004 |
+--------------+
1 row in set (0.01 sec)

mysql> select count(*) from userpart where count = 0;
+----------+
| count(*) |
+----------+
|     1139 |
+----------+
1 row in set (0.01 sec)

This code partially works, but the counts do not add up, and there are zeros in it as well. The correct user count is 100,004 but the below function produces a total count that is 105,234 which is 5,230 more entries in the users table than there actually is. The nice thing about this version is that it will combine ranges until the threshold is met. This is what I would like to get working.

function part_recurse($prefix)
{
    global $charCount;
    global $charList;
    global $CONFIGVAR;

    $list = dbCountScan($prefix);
    $acc = 0;
    $start = substr($charList, 0, 1);
    for ($i = 0; $i < count($list); $i++)
    {
        if ($list[$i] == 0) continue;
        if ($list[$i] > $CONFIGVAR['user_partition_max_size']['value'])
        {
            // If the number of entries > threshold...
            if ($i == 0)
            {
                // Only for the first bucket.
                $start = substr($charList, 1, 1);
            }
            else
            {
                // All other entries.
                if ($i >= $charCount - 1)
                {
                    // Last entry
                    $end = substr($charList, $i - 1, 1);
                    $acc += $list[$i];
                    writeRange($prefix . $start, $prefix . $end, $acc,
                        substr($prefix . substr($charList, $i, 1), 0, 1));
                    $acc = 0;
                }
                else
                {
                    $end = substr($charList, $i - 1, 1);
                    writeRange($prefix . $start, $prefix . $end, $acc,
                        substr($prefix . substr($charList, $i + 1, 1), 0, 1));
                    $acc = 0;
                    $start = substr($charList, $i, 1);
                }
            }
            part_recurse($prefix . substr($charList, $i, 1));
        }
        else
        {
            if (($acc + $list[$i]) >= $CONFIGVAR['user_partition_max_size']['value'])
            {
                $end = substr($charList, $i - 1, 1);
                writeRange($prefix . $start, $prefix . $end, $acc,
                    substr($prefix . substr($charList, $i, 1), 0, 1));
                $start = substr($charList, $i, 1);
                $acc = $list[$i];
            }
            else
            {
                $acc += $list[$i];
            }
        }
    }

    // Write the final entry.
    if ($acc > 0)
    {
        $end = substr($charList, $charCount - 1, 1);
        $bucket = substr($prefix . substr($charList, $i, 1), 0, 1);
        writeRange($prefix . $start, $prefix . $end, $acc, $bucket);
    }
}

And the database results for it:

mysql> select sum(`count`) from userpart;
+--------------+
| sum(`count`) |
+--------------+
|       105234 |
+--------------+
1 row in set (0.00 sec)
mysql> select count(*) from userpart where count = 0;
+----------+
| count(*) |
+----------+
|      316 |
+----------+
1 row in set (0.00 sec)

The correct number of entries is 100,004 with no zero counts. I will keep refining the code, but if someone sees something that I am doing wrong, please enlighten me. The off book functions have the following properties:

dbCountScan($prefix): This function steps through the characters 0-9 and A-Z using SELECT COUNT(*) FROM USERS WHERE username like '?%'; where ? is a concatenation of $prefix and the current letter in the for loop.

WriteRange($start, $end, $count, $bucket): This function writes the range to the database. $start and $end is the start and end of the range. $count is the number of entries in the range. And $bucket is the top level bucket (0-9, A-Z) that this range belongs to.

And here's the database table userpart:

CREATE TABLE `userpart` (
  `rangest` varchar(16) NOT NULL COMMENT 'Database query starting range.',
  `rangeed` varchar(16) NOT NULL COMMENT 'Database query ending range.',
  `count` int unsigned NOT NULL COMMENT 'Count in range.',
  `bucket` varchar(5) DEFAULT NULL COMMENT 'Primary bucket.'
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Table that is used to partition usernames.'

"if an admin wants to edit the information for a specific user"...aren't you going to provide a search feature to enable that kind of thing in a user-friendly way? And as Strawberry says, the rest of what you've described does appear to just be regular pagination. I think you may have over-engineered your proposed solution. — ADyson, Apr 18 '20 at 12:27
@Strawberry If there are say 100,000 users on the system, then pagination really isn't going to work. — Daniel Rudy, May 11 '20 at 20:52
@ADyson I didn't think about including a search field. I have worked out how that will work. As for pagination, I'm not sure that would be feasable with say 100,000 accounts on the system. Even 10,000 wouldn't exactly be manageable. — Daniel Rudy, May 11 '20 at 20:55
With the volumes of users, I think any kind of list is probably going to be pretty futile, whether partitioned, paginated or whatever. A search feature (maybe relatively flexible so it can find users based on a combination of attributes, and partial matches on fields etc) seems like the best bet to me. — ADyson, May 11 '20 at 21:08

Rick James · Accepted Answer · 2020-05-12T21:23:54.860

scroll through potentially thousands of records

Write queries, possibly ad hoc, to filter down to fewer records. If the admin needs to scroll through more than a few dozen records, you (and the database) are failing him.

Pagination -- Don't use OFFSET; "remember where you left off": http://mysql.rjweb.org/doc.php/pagination

Partitioning -- No; you have not presented a case where it would benefit. You can simply say WHERE name LIKE 'J%' or WHERE name >= 'P' AND name < 'T' (for p,q,r,s) and have an INDEX starting with name; but I doubt if that really helps the admin.

"A recursive function that keeps drilling down" -- That's sort of what a BTree index already does for you. Simply do WHERE name > $leftoff ORDER BY name LIMIT 20. That is, use LIMIT for the bucket size; meanwhile don't bother predefining the bucket boundaries.

"potential memory usage" -- It is usually better to simply let the database have most of the available memory.

1 table vs 3 -- Please elaborate on what would be in those tables.

Searching

As others have said, using some search mechanism is likely to be the fastest way to get to the desired record(s). Do not make the admin scroll through thousands of rows.

Provide a form with one or more things for the admin to fill in

The full user name. (Problem: typos / not knowing the exact spelling)
Wild-card partial name. Example: 'Dan%'; then display all user names starting with Dan.
A massive list of hyperlinks, one per user. This is viable up to a few thousand; I would advise against it for more than that.
An attribute of the user -- starting date, no activity, whatever. Then search on that attribute.

For these partial situations, refuse to show more than 100 usernames; if more than that, demand that the admin provide more details. Don't bother the added complexity of pagination.

I have implemented a variety of those, and other, mechanisms. Thinking back, the only time I used pagination for more than a few dozen items was when I need to take action on all items. Example: picking which of a thousand pictures from a trip to move into an "album". This involved looking at each picture long enough to pick or reject each one. Also, I used AJAX to make a single click all the action needed.

I updated the question with more information. As for the linked question 1 table vs. 3. I'm not sure what would be in those tables either. Right now, I have two tables. The master users table which links the username to the userid, specifies if the account is active, login method, profile id, and the organization's id for that user. The other table is userpart which I have posted the schema. — Daniel Rudy, May 11 '20 at 20:49
@DanielRudy - I gotta go with the others on pushing some search mechanism. (I added to my Answer.) Meanwhile, I got lost in your discussion of counts. — Rick James, May 12 '20 at 21:25
The counts are somewhat of a red herring. They are used to keep track of how many users are in a bucket and are used for diagnostic purposes while I'm developing the platform. — Daniel Rudy, May 16 '20 at 04:34
I solved this awhile ago by storing a tree in the database. I'm using the FK to parent and a path to the node. If the data changes, the entire table is emptied and the data is recreated. Using this method, I now have a drill-down of the users list. The RDBMS method to pull it off is the first answer, method 3, located here: https://stackoverflow.com/questions/3362669/what-are-the-known-ways-to-store-a-tree-structure-in-a-relational-db — Daniel Rudy, Jun 05 '21 at 04:59

Splitting up large data sets into smaller pieces for web viewing

1 Answers1

Linked