
I would like to take a 10% copy of the data in a production database, keeping its referential integrity, and restore it into a new database.

Is there a method that allows this to be done in SQL Server? I have looked at creating an SSIS package that exports the database schema and data, then adding a Row Sampling task to reduce the amount of data flowing into the new database, but I wondered if there is a better way of doing this?

Nutty81
    _10% copy..._ Of what? tables or content of all tables? – B001ᛦ Sep 05 '18 at 12:14
  • Yes, all the tables and 10% of the data within them. Approximately, that is; due to dependencies this might be larger in some tables, etc. – Nutty81 Sep 05 '18 at 12:17
  • Asking for tool recommendations is off topic, and your question is too broad, so you're unlikely to get an answer. Try to do some research, attempt to solve the problem, and come back with specific questions about where you are having trouble; you'll get the help you need then. My advice would be to have an empty development database scripted in source control, with scripts to populate dummy data for development purposes. Check out [dbup](https://dbup.github.io/); I've used that in the past for what I've suggested. – Tanner Sep 05 '18 at 12:18
  • I have updated my question, as per your suggestion. – Nutty81 Sep 05 '18 at 12:25

2 Answers


This is a hard topic, and it does take some work. There are a few approaches here, and I'll add something in addition to what HoneyBadger noted.

First, I do know of Data Bee, which subsets databases, though it might not be sufficient for you. You can use a trial to see if it works.

Second, I generally recommend you get a curated data set for development areas (dev, test, UAT, etc.) that contains the cases of the problem domains you need to solve. There are two ways of doing this. One is data virtualization software, which is what Redgate uses in SQL Clone, Delphix uses in their products, and a few others offer as well. This essentially copies production once and then shares that copy out to all devs/QA/etc. It reduces much of the time and storage required for getting copies. That can help.

The other way is to build a dataset, and here's what I do for some clients.

If you are working in mortgages, as HoneyBadger notes, you might need to account for fixed loans, variable APRs, various terms, etc. Often the easiest way to do this is to learn what cases your customers are asking about for new features, or what the business analysts use, and then copy those items to a new database, as sketched below. These will likely be relatively few rows of transactional data plus all lookup-type data.
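As a rough sketch of that copy step (all database and table names here are hypothetical, not from the question, and identity-column and constraint-ordering details are ignored): lookup tables get copied wholesale, while transactional tables are restricted to the curated cases.

    -- Reference/lookup data is small: copy it in full.
    INSERT INTO DevDB.dbo.LoanType (LoanTypeId, Name)
    SELECT LoanTypeId, Name
    FROM   ProdDB.dbo.LoanType;

    -- Transactional data: copy only the curated cases the analysts identified.
    INSERT INTO DevDB.dbo.Loan (LoanId, LoanTypeId, Principal, Apr, TermMonths)
    SELECT L.LoanId, L.LoanTypeId, L.Principal, L.Apr, L.TermMonths
    FROM   ProdDB.dbo.Loan AS L
    JOIN   DevDB.dbo.CuratedLoanIds AS C  -- hand-picked list of interesting cases
      ON   C.LoanId = L.LoanId;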

This is also an ongoing process, as you will realize you've missed things. Save that database in VCS or a known location and use it as the source for your system. If your devs, your build system, and your QA all pull from here, you get a consistent set of data. You can augment this with random data, using something like Redgate Data Generator, to help fill in some values.

An additive approach is often much easier than a subtractive one. With that, also keep in mind that masking/obfuscating sensitive data is important. I would be especially careful here as the GDPR and other legislation come into force.

Disclosure: I work for Redgate Software.

way0utwest
  • Fantastic, thank you for this. I will definitely look at the Redgate tools for data generation, as I already use several of their other products. – Nutty81 Sep 13 '18 at 08:39

While off topic for SO, I did want to give some guidelines you can follow:

I do this fairly regularly, but it is mainly handwork. First you have to decide what the "main entity" of your database is: does it concern people, accounts, credit cards, or something else? I work mainly in the financial sector, so for me it is usually something like accounts/mortgages, etc. But it can be anything, really. You have to decide what everything in your database relates to: the "base entity", so to speak.

Once you have decided on your main entity, you can choose the 10% of your database. For example, if your main entity is Accounts, you can select 10% of the AccountIds from the Account table. You can put this 10% in a table of its own.
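A minimal sketch of that scoping step, assuming a hypothetical dbo.Account table with an AccountId key and a scoping table named dbo.AccountScope (the names are illustrative):

    -- Persist a random ~10% of account IDs as the scoping set.
    -- ORDER BY NEWID() gives a row-level random sample; TABLESAMPLE (10 PERCENT)
    -- would be faster, but it samples whole pages, so the row count can vary.
    SELECT TOP 10 PERCENT
           AccountId
    INTO   dbo.AccountScope   -- the "10% table" every other query joins to
    FROM   dbo.Account
    ORDER BY NEWID();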

Next comes the hard work: you'll have to write queries for each and every one of your tables, correlating the respective entities to your main entity. So, if your main entity is a person, you want all that person's addresses, all their accounts, all their phone numbers, and all the history related to them. These queries can get quite complicated, and you really need to understand your database well. You'll get queries like:

SELECT      Src.*
INTO        [dbo].[GTP_MSI_MORTGAGERELATION_MORTGAGE_RELATION]          -- subset copy
FROM        [ATV].[GTP_MSI_MORTGAGERELATION_MORTGAGE_RELATION] AS Src   -- full production data
INNER JOIN  [ATV].[GTP_MSI_MORTGAGEREQUEST_APPLICANT] AS APP
        ON  APP.MORTGAGE_RELATION_ID = Src.MORTGAGE_RELATION_ID
INNER JOIN  [ATV].[GTP_MSI_MORTGAGELOAN_LOAN] AS MLL
        ON  MLL.MORTGAGE_REQUEST_ID = APP.REQUEST_ID
INNER JOIN  [dbo].[DOOR] AS D                                           -- scoping table: mortgage IDs in scope
        ON  CONVERT(VARCHAR(255), D.NUMHYP) = MLL.LOAN_NUMBER

In this example the dbo.DOOR table contains the selection of mortgage IDs in scope (the query finds all relations between all persons/organizations associated with a mortgage).

What I most often do is keep production data (an extract from some point in time) in a separate schema, and use queries like the above to fill the dbo schema. So [ATV].[GTP_MSI_MORTGAGERELATION_MORTGAGE_RELATION] in the example above contains production data, while its dbo namesake contains a smaller set of that production data (the data related to the mortgages in scope). Once I have the dbo schema filled, I can use anonymization software (I tend to use Redgate Data Generator) to remove private/business-sensitive information. The anonymized data I can then extract and use as the source for my development database.

See also this (at the moment unfortunately only in Dutch, but if you push it through a translator it should still make a lot of sense).

HoneyBadger
  • Thank you for pointing me in this direction. I am not opposed to the hard work of writing the population scripts; I just wondered if there was a more efficient method to do this. Thank you once again. – Nutty81 Sep 05 '18 at 12:38
  • Some things you can automate to make this easier. I use dynamic SQL to write most of the scripts for me, as most relations are pretty straightforward: a table `address` that has a column `personId` can probably just be joined to a `person` table with a PK column of `Id` (as an example; see the sketch below). Usually I have to write about 25% of the queries completely by hand. – HoneyBadger Sep 05 '18 at 12:41
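A minimal sketch of the kind of generator that last comment describes, using SQL Server's catalog views. It assumes single-column foreign keys, the ATV (production copy) / dbo (subset) schema convention from the answer, and a hypothetical 'Account' root table; multi-hop relations like the mortgage query above would still be written by hand:

    -- Emit one "copy the related child rows" statement per single-column FK
    -- that references the scoped parent table directly.
    SELECT 'SELECT Src.* INTO dbo.' + QUOTENAME(ct.name) +
           ' FROM ATV.' + QUOTENAME(ct.name) + ' AS Src' +
           ' JOIN dbo.' + QUOTENAME(pt.name) + ' AS P' +
           ' ON P.' + QUOTENAME(pc.name) + ' = Src.' + QUOTENAME(cc.name) + ';'
    FROM   sys.foreign_keys        AS fk
    JOIN   sys.foreign_key_columns AS fkc ON fkc.constraint_object_id = fk.object_id
    JOIN   sys.tables  AS ct ON ct.object_id = fk.parent_object_id       -- referencing (child) table
    JOIN   sys.columns AS cc ON cc.object_id = fkc.parent_object_id
                            AND cc.column_id = fkc.parent_column_id
    JOIN   sys.tables  AS pt ON pt.object_id = fk.referenced_object_id   -- referenced (parent) table
    JOIN   sys.columns AS pc ON pc.object_id = fkc.referenced_object_id
                            AND pc.column_id = fkc.referenced_column_id
    WHERE  pt.name = 'Account'                                           -- hypothetical root entity
    AND    NOT EXISTS (SELECT 1 FROM sys.foreign_key_columns AS x        -- skip multi-column FKs
                       WHERE  x.constraint_object_id = fk.object_id
                       AND    x.constraint_column_id > 1);

Running this produces one SELECT ... INTO statement per direct child table, which you can then review and execute; the remaining queries still need hand-tuning.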