
I've got a file named MovieID_NameID_Roles.txt that is 1,767,605 KB (roughly 1.7 GB).

I need to loop through it to parse and then populate a Database Table.

Wanting to work with several small files rather than that one humongous one, I found this answer to a question on how to split large text files.

Based on the accepted answer, which says:

into files with 10000 lines each: split myLargeFile.txt -l 10000

...but the bottom of the second screenshot in that answer shows what appeared to me to be a "fancier" version of this command, with some niceties thrown in:

split MovieID_NameID_Roles.txt MySlice -1 10000 -a 5 -d

So, I downloaded and installed Git/Bash, and ran this in it:

split MovieID_NameID_Roles.txt MySlice -1 10000 -a 5 -d

But rather than splitting my very large file into files of 10,000 lines each, as I was expecting (or at least hoping), it generated files named 1000000000 through 1000099999, each only 1 KB in size, and then the splitting stopped with the error message "output file suffixes exhausted":


So what is the command I should use to split my file into smaller files of 10,000 lines each?

  • `with an error message that it had run out of extension numbers.` Please post the full error message. If it ran out of numbers, maybe give it a bigger `-a`? And what is `-1` to `split`? Did you mean `-l`? – KamilCuk Sep 18 '20 at 07:44
  • It's *faster* to import a single big CSV/flat file to a database in a single bulk operation than 1000 smaller ones. All major databases have bulk import commands that don't just load the data in a streaming fashion, they also use minimal logging and bulk processes that reduce the overhead of the operation. Which database are you using? Which command? – Panagiotis Kanavos Sep 18 '20 at 07:46
  • @PanagiotisKanavos: I haven't written the database code yet; I will be using MS SQL Server. I have to filter and sort the data myself first, not just dump it all into tables. IOW, there is nothing near a 1:1 relationship between what the files contain and what the tables will contain. – B. Clay Shannon-B. Crow Raven Sep 18 '20 at 07:48
  • I think you misread `-l` (lowercase `-L`) and typed `-1` (minus one) – LeGEC Sep 18 '20 at 07:49
  • It's also a **lot** faster to import into a staging table that no other client is locking than directly into a production table. It's faster to disable indexes during bulk insert operations and rebuild them later. It's also faster to insert the data in the order they'll be indexed, reducing the work needed to update or rebuild indexes. None of these optimizations will work with several small files – Panagiotis Kanavos Sep 18 '20 at 07:49
  • @B.ClayShannon SQL Server - don't write anything at all then. SQL Server has `bcp` to import flat files from the command line, `BULK INSERT` to do the same from T-SQL and SSIS to transform and import complex data from a multitude of sources in a streaming fashion. You can load everything into a table that matches the production table and just swap the staging and production tables without any delay, using [partition switching](https://www.sqlservercentral.com/articles/keeping-fact-tables-online-while-loading-via-partition-switching) – Panagiotis Kanavos Sep 18 '20 at 07:51 (see the `bcp` sketch after these comments)
  • For *really big tables* stored on Hadoop, Azure, Teradata, Kafka etc you can use [External Tables](https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15#examples) and Polybase to pull the data directly from the source, without exporting it to flat files first. With Hadoop etc you can even have Hadoop process the query – Panagiotis Kanavos Sep 18 '20 at 07:57
  • Where does the data come from? Is there a direct connection between it and SQL Server? – Panagiotis Kanavos Sep 18 '20 at 07:58
  • @PanagiotisKanavos: No worries, nobody is using these tables. They aren't even created yet, and when they are, I will be the only person using them. Others will only query (never update) them. – B. Clay Shannon-B. Crow Raven Sep 18 '20 at 08:38
  • @PanagiotisKanavos: The data comes as tab-delimited values; no SQL Server "connection" (no pun intended) – B. Clay Shannon-B. Crow Raven Sep 18 '20 at 08:40
  • @B.ClayShannon don't split then. Another thing that can seriously affect performance is the database file size. Importing 1B rows will cause a lot of resizing. It's better to increase the file size in advance than have the server increase it little by little as extra space is needed. Piecemeal increases cost not just in IO bandwidth, they also cause index and file fragmentation. You can also enable table compression on the table - IO causes most delays in a database while CPUs typically sit idle. Reducing IO can improve performance significantly – Panagiotis Kanavos Sep 18 '20 at 09:18
  • @B.ClayShannon at 1M rows, partitioning may be important too, from a data management perspective, not perf. You can load data for a specific partition in a staging table and move just that partition to a production table. The production table will remain online except for a slight delay during partition switching. That's just a metadata operation, it doesn't involve moving any data. – Panagiotis Kanavos Sep 18 '20 at 09:37
  • @B.ClayShannon yet another option that auto"magically" combines all other options is to use [clustered columnstore indexes](https://www.red-gate.com/simple-talk/sql/sql-development/what-are-columnstore-indexes/). This combines columnar storage, in-memory processing, compression and automatic partitioning. Buckets (partitions) are created automatically for every 1M rows, so your data size is on the low end. Analytical queries and aggregates become a *lot* faster (x100) due to the columnar format and in-mem processing – Panagiotis Kanavos Sep 18 '20 at 09:40
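
Expanding on the `bcp`/`BULK INSERT` suggestions in the comments above, here is a rough command-line sketch; the staging table MyDb.dbo.MovieRolesStaging and the server name localhost are placeholders rather than anything taken from the question, and the staging table would have to exist first:

$ bcp MyDb.dbo.MovieRolesStaging in MovieID_NameID_Roles.txt -S localhost -T -c -b 100000   # table/server names are placeholders

Here `-T` uses Windows authentication, `-c` selects character (text) mode, whose default field terminator is a tab (matching the tab-delimited source), and `-b 100000` commits the load in batches of 100,000 rows.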

1 Answer


It looks like you are using "-1" (the digit one) instead of "-l" (lowercase L), which makes split put a single line in each output file. At one line per file, the five-digit numeric suffixes allowed by -a 5 -d (at most 100,000 files) run out almost immediately, which is why you got "output file suffixes exhausted".

The command should be:

$ split MovieID_NameID_Roles.txt MySlice -l 10000 -a 5 -d
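
With -d and -a 5, GNU split (the version bundled with Git Bash) names the pieces MySlice00000, MySlice00001, MySlice00002, and so on, and 100,000 such suffixes at 10,000 lines apiece is far more capacity than this file needs. The slices can then be looped over in a later step; a minimal sketch:

$ for f in MySlice*; do echo "processing $f"; done   # echo stands in for your own parse-and-load step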