
I'm reading and parsing CSV files into a SQL Server 2008 database. This process uses a generic CSV parser for all files.

The CSV parser is placing the parsed fields into a generic field import table (F001 VARCHAR(MAX) NULL, F002 VARCHAR(MAX) NULL, Fnnn ...) which another process then moves into real tables using SQL code that knows which parsed field (Fnnn) goes to which field in the destination table. So once in the table, only the fields that are being copied are referenced. Some of the files can get quite large (a million rows).

The question is: does the number of fields in a table significantly affect performance or memory usage, even if most of the fields are not referenced? The only operations performed on the field import tables are an INSERT and then a SELECT to move the data into another table; there aren't any JOINs or WHERE clauses on the field data.

Currently, I have three field import tables: one with 20 fields, one with 50 fields and one with 100 fields (this being the max number of fields I've encountered so far). There is logic to use the smallest table that will fit the file.

I'd like to make this process more generic, and have a single table of 1000 fields (I'm aware of the 1024 columns limit). And yes, some of the planned files to be processed (from 3rd parties) will be in the 900-1000 field range.

For most files, there will be less than 50 fields.

At this point, dealing with the existing three field import tables (plus planned tables for more fields (200, 500, 1000?)) is becoming a logistical nightmare in the code, and dealing with a single table would resolve a lot of issues, provided I don't give up much performance.
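
To make the setup concrete, here is a rough sketch of the kind of field import table and move step involved (table names, field counts and destination columns here are only illustrative, not the actual schema):

    -- Generic field import table: every parsed CSV field lands in an Fnnn column as raw text.
    CREATE TABLE dbo.FieldImport020
    (
        RowID      INT IDENTITY(1,1) PRIMARY KEY,
        LoadFileID INT NOT NULL,
        F001 VARCHAR(MAX) NULL,
        F002 VARCHAR(MAX) NULL,
        -- ... F003 through F019 omitted here ...
        F020 VARCHAR(MAX) NULL
    );

    -- Destination table for one particular file layout (illustrative).
    CREATE TABLE dbo.Customer
    (
        FirstName VARCHAR(100) NULL,
        LastName  VARCHAR(100) NULL
    );

    -- A later, file-specific step knows which Fnnn goes to which destination column.
    DECLARE @LoadFileID INT = 1;

    INSERT INTO dbo.Customer (FirstName, LastName)
    SELECT F001, F002
    FROM   dbo.FieldImport020
    WHERE  LoadFileID = @LoadFileID;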

Evan
    "have a single table of 1000 fields " - bad design. – Mitch Wheat Jan 06 '15 at 03:25
  • How long does a single import take? What's your expectation in terms of performance gain? – qamar Jan 06 '15 at 03:31
  • It can take from a few seconds to minutes, depending on the file. The importing of the text into the database isn't the issue, the issue is the performance hit associated with a lot of unused fields (depending on the file width). I want to get rid of complexity in the code, without taking a performance hit. – Evan Jan 06 '15 at 03:35
  • @Mitch - Design depends on context. In the context of generic CSV parsing into a database, a table with 1000 raw fields is not necessarily bad design. The fact that SQL Server can't deal with a row with that many fields (with data in them) is a limitation. – Evan Jan 08 '15 at 02:03
  • IMO - it's a bad design and indicates there is probably a better way to approach your problem of "generic CSV parsing into a database". – Mitch Wheat Jan 08 '15 at 03:39
  • @Evan, even though you can mark only one answer as accepted, you can upvote several answers to indicate that they were useful. – Vladimir Baranov Jan 09 '15 at 00:59
  • @Vladimir Baranov - Sorry, I don't have enough rep to upvote yet... I lost my other account, so I've had to create a new one. – Evan Jan 17 '15 at 01:17

3 Answers


First, to answer the question as stated:

Does the number of fields in a table affect performance even if not referenced?

  • If the fields are fixed-length (*INT, *MONEY, DATE/TIME/DATETIME/etc, UNIQUEIDENTIFIER, etc) AND the field is not marked as SPARSE and Compression hasn't been enabled (both introduced in SQL Server 2008), then the full size of the field is taken up (even if NULL), and this does affect performance, even if the fields are not in the SELECT list.

  • If the fields are variable length and NULL (or empty), then they just take up a small amount of space in the row's metadata.

  • Regarding space in general: is this table a heap (no clustered index) or clustered? And how are you clearing the table out for each new import? If it is a heap and you are just doing a DELETE, then it might not be getting rid of all of the unused pages. You would see the problem as space still being taken up even with 0 rows when running sp_spaceused (a quick check is sketched just below). Suggestions 2 and 3 below would naturally not have such a problem.
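
For example, a quick check along these lines (the table name here is just a placeholder for your import table):

    -- Shows reserved vs. used space; a heap that was only ever DELETEd from can
    -- still show space reserved even with 0 rows.
    EXEC sp_spaceused N'dbo.FieldImport100';

    -- If the whole table is cleared between imports, TRUNCATE TABLE deallocates
    -- the pages, unlike a plain DELETE on a heap.
    TRUNCATE TABLE dbo.FieldImport100;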

Now, some ideas:

  1. Have you considered using SSIS to handle this dynamically?

  2. Since you seem to have a single-threaded process, why not create a global temporary table at the start of the process each time? Or drop and recreate a real table in tempdb? Either way, if you know the destination, you can even dynamically create this import table with the destination field names and datatypes. Even if the CSV importer doesn't know the destination, at the beginning of the process you can call a proc that does know it and can create the "temp" table; the importer can then still generically import into a standard table name with no fields specified, and it won't error as long as the fields in the table are NULLable and there are at least as many of them as there are columns in the file.

  3. Does the incoming CSV data have embedded returns, quotes, and/or delimiters? Do you manipulate the data between the staging table and destination table? It might be possible to dynamically import directly into the destination table, with proper datatypes, but no in-transit manipulation. Another option is doing this in SQLCLR. You can write a stored procedure to open a file and spit out the split fields while doing an INSERT INTO...EXEC. Or, if you don't want to write your own, take a look at the SQL# SQLCLR library, specifically the File_SplitIntoFields stored procedure. This proc is only available in the Full / paid-for version, and I am the creator of SQL#, but it does seem ideally suited to this situation.

  4. Given that:

    • all fields import as text
    • destination field names and types are known
    • number of fields differs between destination tables

    what about having a single XML field and importing each line as a single-level document with each field being <F001>, <F002>, etc? By doing this you wouldn't have to worry about number of fields or have any fields that are unused. And in fact, since the destination field names are known to the process, you could even use those names to name the elements in the XML document for each row. So the rows could look like:

    ID  LoadFileID  ImportLine
    1   1           <row><FirstName>Bob</FirstName><LastName>Villa</LastName></row>
    2   1           <row><Number>555-555-5555</Number><Type>Cell</Type></row>
    

    Yes, the data itself will take up more space than the current VARCHAR(MAX) fields, both due to XML being double-byte and the inherent bulkiness of the element tags to begin with. But then you aren't locked into any physical structure. And just looking at the data will be easier to identify issues since you will be looking at real field names instead of F001, F002, etc.

  5. In terms of at least speeding up the process of reading the file, splitting the fields, and inserting, you should use Table-Valued Parameters (TVPs) to stream the data into the import table. I have a few answers here that show various implementations of the method, differing mainly based on the source of the data (a file vs. a collection already in memory, etc.); a minimal sketch of the receiving T-SQL side is shown just below.
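
To illustrate the database side of suggestion 5, a minimal TVP sketch (the type, procedure, and column names below are assumptions about your import table, not an existing API; the client would stream rows into the @Rows parameter, e.g. via an IEnumerable<SqlDataRecord> in .NET, rather than building one INSERT per row):

    -- Table type mirroring (part of) the generic import table's layout
    CREATE TYPE dbo.CsvRowType AS TABLE
    (
        LoadFileID INT          NOT NULL,
        F001       VARCHAR(MAX) NULL,
        F002       VARCHAR(MAX) NULL
        -- ... additional Fnnn columns as needed
    );
    GO

    -- Procedure that receives the streamed rows and lands them in the import table
    CREATE PROCEDURE dbo.ImportCsvRows
        @Rows dbo.CsvRowType READONLY
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.FieldImport100 (LoadFileID, F001, F002)
        SELECT LoadFileID, F001, F002
        FROM   @Rows;
    END;
    GO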

Solomon Rutzky
  • All the fields are VARCHAR(MAX), other than some table management fields. There is an INT IDENTITY as Primary Key. Many file loads go into the same table (separated by load file IDs). Data is deleted (by file load) after a week or two (depending on how long it takes for the business to report problems). The database is re-organised regularly. – Evan Jan 08 '15 at 02:05
  • 1) SSIS is not an option. The functionality under question is a replacement for SSIS, which handles more types of things than SSIS can, and with more flexibility. 2) One of the main points behind the design is that the data is auditable and production supportable. So, everything goes into permanent tables, hangs around for a few weeks and is then deleted (after backup). – Evan Jan 08 '15 at 02:11
  • 3) There are a variety of different CSV formats I have to handle, some mutually exclusive of each other. The design is that there is a raw file data table, then the data gets moved from there by context. This helps with production support, when the business discovers a problem with something that was processed 8 days ago. I'm at a very conservative, very large international entity. I haven't gotten the go-ahead to use CLR yet. I'll look into SQL#, but the approval process for outside software, especially at the database server level, makes this a non-starter. – Evan Jan 08 '15 at 02:16
  • @Evan Is the PK clustered? It sounds like direct-to-table importing and using temp tables are not options if you want to keep the source data around for a week or two. Is there an index on `LoadFileID`? When moving the data from this table to the "real" tables, you reference by `LoadFileID`, right? Do you ever reference rows by the PK field? I would think there isn't much reason to. In either case, the `CLUSTERED UNIQUE INDEX` should be on `(LoadFileID, PKfield)` and the PK should be NonClustered. And regular reorg is good, but is it ever rebuilt? – Solomon Rutzky Jan 08 '15 at 02:18
  • @Evan Understood about SSIS and SQL#, though custom CLR should still be an option but might not be needed. I am adding another suggestion to my answer. I do see that you have already accepted one, but you might want to consider this ;-). – Solomon Rutzky Jan 08 '15 at 02:31
  • srutzky - The Identity PK is clustered. There is an index on LoadFileId, which is referenced for the move into the real tables. There is infrastructure for moving generic text formats in and out of the database. The PK field is rarely used, mostly just passed around for audit and prod support purposes. Due to the nature of the generic import tables (there are a variety of import types, not just csv) there really isn't much difference between (Identity,LoadFileID) and (LoadFileId,Identity), but I'll think about your suggestion. There usually isn't much interleaving of files in the tables. – Evan Jan 09 '15 at 00:03
  • srutzky - I'd love to use CLR. But the org isn't allowing it yet. I accepted the other answer because it answered my direct question better, but I've had more thought out of your answers. For some reason, Stackoverflow won't let me put a @ before your name. – Evan Jan 09 '15 at 00:08
  • 4) This framework is generally used for large files (it came about because Biztalk was taking 6 hours to process files). For the import, I really don't know the destination fields, the framework is geared to generically import and export a variety of text files, print streams, emails, etc. Once the raw data is in the database, then the process specific code takes over. At some point, there will probably be a metadata layer that will handle moving the data around in the database. – Evan Jan 09 '15 at 00:13
  • 4 continued) XML would significantly complicate the process. It's also a matter of timing, the load of the text files takes long enough as it is, significantly expanding the amount of data with xml tags would stretch out the process even more. Moving the data from the generic import tables to the 'real' tables is fairly easy. Basically, I can get any file into the database without knowing anything about the format. It's very handy to load file formats we haven't seen yet to use the database to investigate production issues in other systems. – Evan Jan 09 '15 at 00:16
  • @Evan You only need to specify a username if that person did not create the thing (Q or A) you are commenting on. 1) Understood about "not much interleaving of LoadFileIDs" but would be surprised if you didn't see some improvement from making the index changes I mentioned. 2) What other import types? What do you gain from splitting into fields as opposed to loading the line into a single VARCHAR(MAX) and then split into the real table? 3) I assumed your process could know of the end table since you have code in the DB that knows it. But that is a non-issue if not going the XML route. – Solomon Rutzky Jan 09 '15 at 02:52
  • @Evan 4) Not sure if I am misunderstanding, but in a comment on the Q you said "The importing of the text into the database isn't the issue" yet here you say "the load of the text files takes long enough as it is". If you're just inserting and not processing per row, 1 million rows shouldn't take long to split and upload. How are you doing the load? If not TVP I know how to greatly speed up the upload :). 5) If there are concerns about CLR, please see my article http://www.sqlservercentral.com/articles/SQLCLR/109905/ if not the whole series. 6) How big are the files? Size of the longest line? – Solomon Rutzky Jan 09 '15 at 03:10
  • In addition to the CSV, we are loading text files (as whole lines, into another generic import table with a single VARCHAR(MAX) field) and print streams (into the text file import table, but with line/page info, fixed format). At least for now. – Evan Jan 16 '15 at 05:38
  • Currently, we're generating INSERT statements. Now that we can use the .NET Framework from 3.5 onwards, I'll look into TVPs. How else do you speed up the upload? The files can be up to 5000 chars wide, and up to a million rows (or more). As to using CLR, not being allowed to use it is a corporate decision; I'd like to use it, but the powers-that-be say no (at least for now). – Evan Jan 16 '15 at 05:42
  • @Evan 1) If you have a generic import table for the whole-line text files, why not load the csv lines into there and then split? 2) Generating parameterized INSERT statements? 3) Why not call a stored proc? 4) Are you wrapping any number of the INSERTs, as a batch, around a transaction? 5) how can 5000 chars not fit into an 8060 char row? still not sure how the wide table wouldn't work given that none of the fields would overflow. 6) TVPs would be incredibly fast in comparison to what you are currently doing. See my answer here: http://stackoverflow.com/a/25773471/577765 – Solomon Rutzky Jan 16 '15 at 06:20
  • 6) Using TVP for the CSV file halved the load time! 1) Parsing the CSV fields in non-CLR SQL is slow. 4) The INSERTs are being batched. 5) Sorry, I meant 5000 fields. – Evan Jan 16 '15 at 07:04
  • @Evan Also, if one line can be up to 5000 fields, that is not all going into one table, correct? The destination is multiple tables? And what is the max (approximately) length of a single line, in characters? – Solomon Rutzky Jan 16 '15 at 21:41
  • Due to the limitations in SQL Server, the data from the line will eventually go into many tables. Unfortunately, due to the 3rd party nature of the files, there is literally no limit to the length of a single line. For practical reasons, I'm programming the generic handlers to not care how long a line is (other than the VARCHAR(MAX) limits, and if that gets breached, which I doubt, I'll deal with it then). – Evan Jan 17 '15 at 01:23

Yes. Large records take up more space on disk and in memory, which means loading them is slower than loading small records, and fewer of them fit in memory. Both effects will hurt performance.

Jasen
  • Regardless of the number of fields in the generic field import tables, the data is exactly the same size. So, a 20 field file, loaded into the 20 field table would consume all 20 fields. The same file, loading into a 1000 field table would consume 20 VARCHAR(MAX) fields, with the other 980 VARCHAR(MAX) fields being NULL, but other than these NULLS, the amount of data is exactly the same. – Evan Jan 06 '15 at 03:39
  • yeah, empty fields take no space, but full ones do, even if unnecessary to the current query. – Jasen Jan 06 '15 at 04:01
  • @Jasen: Not quite true. NULL values take some space too. One byte, IIRC. – James Jan 06 '15 at 04:40
  • and since NULLs take very little space, an extra 980 fields with NULL data should not increase storage requirements, and therefore should not impact performance from an I/O perspective. (not in response to your comment @James - I only just saw it) – Nick.Mc Jan 06 '15 at 04:40

As was correctly pointed out in the comments, even if your table has 1000 columns and most of them are NULL, it should not affect performance much, since NULLs will not waste a lot of space.

You mentioned that you may have real data with 900-1000 non-NULL columns. If you are planning to import such files, you may come across another limitation of SQL Server. Yes, the maximum number of columns in a table is 1024, but there is a limit of 8060 bytes per row. If your columns are varchar(max), then each such column whose data is pushed off-row will consume a 24-byte root out of those 8060 bytes in the actual row, with the rest of the data stored off-row:

SQL Server supports row-overflow storage which enables variable length columns to be pushed off-row. Only a 24-byte root is stored in the main record for variable length columns pushed out of row; because of this, the effective row limit is higher than in previous releases of SQL Server. For more information, see the "Row-Overflow Data Exceeding 8 KB" topic in SQL Server Books Online.

So, in practice you can have a table with only 8060 / 24 = 335 varchar(max) non-NULL columns. (Strictly speaking, even a bit fewer, since there are other headers as well.)

There are so-called wide tables that can have up to 30,000 columns, but the maximum size of the wide table row is 8,019 bytes. So, they will not really help you in this case.
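
If you want to see the limit for yourself, a rough sketch along these lines (table and column names are made up) builds a 400-column varchar(max) table and tries to fill every column with data large enough to be pushed off-row:

    DECLARE @cols NVARCHAR(MAX) = N'',
            @vals NVARCHAR(MAX) = N'',
            @sql  NVARCHAR(MAX),
            @i    INT = 1;

    -- Build 400 varchar(max) columns and a matching list of 8000-character values
    WHILE @i <= 400
    BEGIN
        SET @cols += N', F' + RIGHT(N'000' + CAST(@i AS NVARCHAR(3)), 3) + N' VARCHAR(MAX) NULL';
        SET @vals += N', REPLICATE(''x'', 8000)';
        SET @i += 1;
    END;

    SET @sql = N'CREATE TABLE dbo.WideTest (' + STUFF(@cols, 1, 2, N'') + N');';
    EXEC sys.sp_executesql @sql;

    -- Expected to fail: even with every value pushed off-row, 400 roots x 24 bytes
    -- = 9600 bytes, which is more than the 8060-byte row limit.
    SET @sql = N'INSERT INTO dbo.WideTest VALUES (' + STUFF(@vals, 1, 2, N'') + N');';
    EXEC sys.sp_executesql @sql;

    -- DROP TABLE dbo.WideTest;  -- clean up afterwards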

Vladimir Baranov
  • Thanks. It turns out I can have about 330 fields before problems ensue. I've set up the final table with 100 fields, plus a sequence indicator so it can handle any number of fields. The 100-field size, while smaller than what will fit in a row, makes using the fields in SQL code more human-friendly. – Evan Jan 08 '15 at 02:19
  • @Evan If I were in your place, I would seriously reconsider the design. For example, instead of having hundreds of columns that have same type I would have a table with __one__ column with nvarchar(max) type and add a second column that would hold something that identifies the order in which data should be processed. In your terms it would hold "F001", "F002", "Fnnn". Usually a simple number would be enough. Thus you can have any number of columns in your source data and processing of the imported data would most likely become easier. – Vladimir Baranov Jan 08 '15 at 02:57
  • @Evan, of course, as usual, it all depends on the nature of the data and how you intend to use it later. – Vladimir Baranov Jan 08 '15 at 03:00
  • @Evan and Vladimir: I don't see why a wide-table wouldn't work. I would assume that the import data is already most likely within the 8019 byte limit. The 24 bytes for overflow is only IF the data in fact over-flows. Data that doesn't is kept on the data page and an empty field (`NULL` or empty string) takes up 0 bytes (per row). Also, I just tested with 1000 `VARCHAR(MAX) NOT NULL` fields, all defaulted to a single space (not empty), plus `[RowID] INT IDENTITY PK, [LoadFileID] INT NOT NULL`, and had no issues inserting several rows. – Solomon Rutzky Jan 08 '15 at 03:02
  • Vladimir, the single-field approach is essentially what I just proposed, though I suggested XML, figuring the ability to parse it would outweigh the increase in data size. – Solomon Rutzky Jan 08 '15 at 03:03
  • @srutzky, If I understood your suggestion correctly, it is not what I suggest here. You suggest to have one row in the destination table to hold data from all source columns of that row in a single big XML. I suggest to have many rows in the destination table. Each destination row would contain a value from one source column. If the source data has one row with 1000 source columns, out of which 600 are not NULL there will be 600 rows in the destination table. – Vladimir Baranov Jan 08 '15 at 03:15
  • Ah, ok. I see now. The structure would be more like: `ID, LoadFileID, RowID, FieldID, FieldValue VARCHAR(MAX)`, right? – Solomon Rutzky Jan 08 '15 at 03:22
  • @srutzky That's right. It allows unlimited number of columns in the source data. You may reduce storage requirements by choosing not to store NULL values at all. Whether this structure is good or not depends on the nature of the data and how it will be used. – Vladimir Baranov Jan 08 '15 at 03:35
  • @Vladimir Baranov I've done this in the past, and it doesn't work in this situation. Generally the destination is a table that has the same layout as the imported file. Storing the fields vertically makes moving them to a horizontal table difficult. As things now stand, I sort of have this, but in 100-field chunks. As there is a sequence number in the fields, a 'line' can have any number of columns. There are some 3rd party files that will contain thousands of columns. – Evan Jan 09 '15 at 00:23
  • @srutzky A wide table has the same problem as normal tables: as soon as you get above about 330 fields with long data in them (say 25 chars), the row becomes too long because of the 24-byte pointers to the data in the overflow storage. I have files that will definitely overflow the table. – Evan Jan 09 '15 at 00:26
  • @Evan If you can't fit a row of the import file into a single row of the import table, then how does it fit into a single row of the destination table? Or are you saying that a single row from the source file can end up being spread across several tables? – Solomon Rutzky Jan 09 '15 at 00:38
  • Yes, the data will have to go into several tables. The really long CSV lines are a structured format, with many variable/optional parts. Like I said, 3rd party... – Evan Jan 16 '15 at 05:36