
I am trying to import a text file so that the result would be just the words, each in a separate row of one column. For example, the text:

'Hello Mom,

we meet again'

should give 5 records:

'Hello' 
'Mom,'
'we' 
'meet' 
'again'

I tried to accomplish this with BULK INSERT with ROWTERMINATOR = ' ', but the problem is that a new line is not treated as a terminator too, so I get 'Mom,we' in one of the results.
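
For reference, this is roughly what I tried (the table name and file path are placeholders):

CREATE TABLE dbo.Words (Word NVARCHAR(100));

BULK INSERT dbo.Words
FROM 'C:\path\to\file.txt'
WITH (ROWTERMINATOR = ' '); -- spaces start new rows, but newlines are left inside the data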

From what I know, there is no way to add a second ROWTERMINATOR to BULK INSERT (true?). What is the best way you know to achieve the result as specified above?

The file cannot be preprocessed outside of SQL Server, and the method should be useful for hundreds of files with thousands of lines of words, imported at different times, not just once.

    I think using `BULK INSERT` for this word parsing task is not the best option. – peter.petrov Jan 01 '15 at 12:13
  • There are many ways to "reformat" the file on the source machine so that bulk insert will work; use one of these tools to massage the data. Trying to do it with BULK INSERT is a mistake. – Hogan Jan 01 '15 at 15:21
  • Can you preprocess the file? A few lines of C# are enough for this job. – usr Jan 01 '15 at 15:41
  • Thanks for your comments. I've added some details to clarify. I know BULK INSERT is not the best option, the question is what is? ;) – user3853657 Jan 01 '15 at 19:57
  • Dear usr, I added details to the question earlier ;) "The file cannot be preprocessed outside of SQL Server" – user3853657 Jan 01 '15 at 20:55
  • I'm intrigued as to why using an external parser is not possible? – Martin Brown Jan 01 '15 at 21:17

1 Answer


Given:

The file cannot be preprocessed outside of SQL Server

Option 1

Why not use OPENROWSET(BULK...)? This would allow you to import/insert (which takes care of the row terminator) while at the same time splitting (which takes care of the field terminator). Depending on whether or not you can create a Format File, it should look something like one of the following:

Format File = split each row

INSERT INTO dbo.TableName (ColumnName)
  SELECT split.SplitVal
  FROM   OPENROWSET(BULK 'path\to\file.txt',
                    FORMATFILE='Path\to\FormatFile.XML') data(eachrow)
  CROSS APPLY SQL#.String_Split(data.eachrow, N' ', 2) split; -- 2 = remove empty entries

No Format File = split entire file as a single row

INSERT INTO dbo.TableName (ColumnName)
  SELECT split.SplitVal
  FROM   OPENROWSET(BULK 'path\to\file.txt', SINGLE_CLOB) data(allrows)
  CROSS APPLY SQL#.String_Split(
                                -- convert CR and LF to spaces so Windows
                                -- line endings split cleanly too
                                REPLACE(REPLACE(data.allrows, NCHAR(13), N' '),
                                        NCHAR(10), N' '),
                                N' ',
                                2 -- remove empty entries
                               ) split;
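
Side note: on SQL Server 2016 or newer (which postdates this answer), the built-in STRING_SPLIT function could stand in for the SQLCLR splitter. It lacks a "remove empty entries" option, so filter those out manually; a minimal sketch of the SINGLE_CLOB variant:

INSERT INTO dbo.TableName (ColumnName)
  SELECT split.[value]
  FROM   OPENROWSET(BULK 'path\to\file.txt', SINGLE_CLOB) data(allrows)
  CROSS APPLY STRING_SPLIT(
                           -- normalize CR and LF to spaces first
                           REPLACE(REPLACE(data.allrows, CHAR(13), ' '),
                                   CHAR(10), ' '),
                           ' ') split
  WHERE  split.[value] <> ''; -- drop empty entries from repeated spaces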

Notes:

  • For both methods you need a string splitter. SQLCLR-based splitters are the fastest, and in the examples above I used one from the SQL# library (which I created, though the String_Split function is available in the free version). You can also write your own. If you do write your own and are not using a Format File, it might be a good idea to allow for multiple split characters so you can pass in both " " and "\n" and get rid of the REPLACE().

  • If you can write your own SQLCLR string splitter, then it might be even better to just write a SQLCLR stored procedure that accepts an input parameter for @FilePath, reads the file, does the splitting, and spits out the words as many rows of a single column (the T-SQL wiring for such a procedure is sketched after these notes):

    INSERT INTO dbo.TableName(ColumnName)
      EXEC dbo.MySQLCLRproc(N'C:\path\to\file.txt');
    
  • If you are not using (or cannot use) a Format File, then be sure to use the proper "SINGLE_" option, as you can do either SINGLE_CLOB (returns VARCHAR(MAX) for a standard ASCII file) or SINGLE_NCLOB (returns NVARCHAR(MAX) for a Unicode file).

  • Even if you can create a Format File, it might be more efficient to pull in the entire file as a single string, depending on the size of the files: splitting a large string is a single function call that runs rather quickly, whereas a file of thousands of short lines means thousands of function calls. Each of those calls is fast, but the one large call is likely not 1000 times slower than a single short one, so it generally wins. But if the file is 1 MB or larger, then I would probably still opt for the Format File and processing many short lines.
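
As referenced in the second note above, the T-SQL wiring for such a SQLCLR stored procedure might look like the following sketch. The assembly, class, and method names (WordImport, FileSplitter, ImportWords) are hypothetical, and reading files from SQLCLR requires at least the EXTERNAL_ACCESS permission set:

CREATE ASSEMBLY [WordImport]
  FROM 'C:\path\to\WordImport.dll'
  WITH PERMISSION_SET = EXTERNAL_ACCESS; -- file I/O is not allowed under SAFE

CREATE PROCEDURE dbo.MySQLCLRproc
  @FilePath NVARCHAR(4000)
AS EXTERNAL NAME [WordImport].[WordImport.FileSplitter].[ImportWords];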

Option 2

If by "pre-processed" you mean altered, but that there is no restriction on simply reading them and inserting the data from something external to SQL Server, you should write a small .NET app which reads the rows, splits the lines, and inserts the data by calling a Stored Procedure that accepts a Table-Valued Parameter (TVP). I have detailed this approach in another answer here on S.O.:

How can I insert 10 million records in the shortest time possible?

This could be compiled as a Console App and used in batch (i.e. .CMD / .BAT) scripts and even scheduled as a Windows Task.
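
For the SQL Server side of that approach, a hedged sketch (the type and procedure names, dbo.WordList and dbo.ImportWords, are placeholders): the .NET app streams its parsed words into the @Words parameter, passed as a SqlDbType.Structured parameter.

-- Table type that carries the parsed words from the app
CREATE TYPE dbo.WordList AS TABLE (Word NVARCHAR(100) NOT NULL);
GO

CREATE PROCEDURE dbo.ImportWords
  @Words dbo.WordList READONLY -- populated by the .NET app (TVP)
AS
SET NOCOUNT ON;

INSERT INTO dbo.TableName (ColumnName)
  SELECT Word
  FROM   @Words;
GO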

– Solomon Rutzky