Processing huge UTF-8 files by splitting them into multiple files

Question

I am developing an importer program in C# for large UTF-8 text files (characters have varying byte lengths). Loading the whole 20 GB file into RAM is not feasible, so it seems better to split the file into multiple smaller files and process those. My problem is the splitting itself. My current solution reads the file line by line and starts a new output file whenever the line count reaches my chosen limit. But reading line by line just to split the file is not fast; the splitting time is high. Is there a faster algorithm for splitting large UTF-8 files into multiple files without reading them line by line?

No. There is no other way to split a file (at least on Windows) than to read the whole source and write all the destination files. You can make minor optimizations (one would need to see the code to recommend any), but you are limited by the fact that you need to transfer 2x the size of the file from/to disk. – Alexei Levenkov, Nov 06 '16 at 05:54
Thanks for your comment. I spent about 10 hours looking for an answer to my question without finding one, and I think your comment helps me make a decision. With UTF-8, I have no way to split the file without reading it line by line, and maybe splitting was not a good solution in the first place. – user2352554, Nov 06 '16 at 06:48

score 0 · Accepted Answer · answered Nov 06 '16 at 05:59

My suggestions for your problem are below. I came up with them with separation of concerns in mind: splitting the file and processing the file can be kept separate for easier maintenance.

Read the file as binary rather than as text.
Do not read line by line, since splitting does not require you to parse the file's lines.
Use seek. Refer link.
If the split files must contain only complete lines, then after seeking to a position, search for the next end-of-line character and split the file there.
Once the file is split, process the resulting files individually.
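
The steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested production splitter: the class name Utf8FileSplitter, the method names, the ".part" file naming, and the buffer sizes are all hypothetical choices. It assumes lines end with a '\n' byte. In valid UTF-8 the byte 0x0A never occurs inside a multi-byte character, so cutting right after a '\n' byte cannot split a character.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: splits a file into roughly equal chunks that end on line breaks.
static class Utf8FileSplitter
{
    // Returns the byte offset at which each chunk starts. Every offset after
    // the first sits just past a '\n' byte, so no line is cut in two.
    public static List<long> FindChunkStarts(Stream source, long approxChunkBytes)
    {
        if (approxChunkBytes <= 0) throw new ArgumentOutOfRangeException("approxChunkBytes");
        var starts = new List<long> { 0 };
        var buffer = new byte[64 * 1024];
        long target = approxChunkBytes;
        while (target < source.Length)
        {
            // Jump close to the desired split point, then scan forward for '\n'.
            source.Seek(target, SeekOrigin.Begin);
            long boundary = -1;
            int read;
            while (boundary < 0 && (read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                int i = Array.IndexOf(buffer, (byte)'\n', 0, read);
                if (i >= 0) boundary = source.Position - read + i + 1;
            }
            if (boundary < 0 || boundary >= source.Length) break; // nothing left to split off
            starts.Add(boundary);
            target = boundary + approxChunkBytes;
        }
        return starts;
    }

    // Copies each chunk to its own file as raw bytes (no text decoding).
    public static void Split(string sourcePath, string outputPrefix, long approxChunkBytes)
    {
        using (var source = File.OpenRead(sourcePath))
        {
            List<long> starts = FindChunkStarts(source, approxChunkBytes);
            var buffer = new byte[1024 * 1024];
            for (int n = 0; n < starts.Count; n++)
            {
                long end = n + 1 < starts.Count ? starts[n + 1] : source.Length;
                source.Seek(starts[n], SeekOrigin.Begin);
                using (var dest = File.Create(outputPrefix + n + ".part"))
                {
                    long remaining = end - starts[n];
                    while (remaining > 0)
                    {
                        int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                        if (read == 0) break;
                        dest.Write(buffer, 0, read);
                        remaining -= read;
                    }
                }
            }
        }
    }
}
```

Note that this still reads and writes every byte once, as the comment on the question points out; what it avoids is decoding the text and building a string for every line.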

Mukul Varshney


Thanks for your response, but my problem is choosing a byte position that does not split a character. UTF-8 is variable-length, so I don't know whether the 1000th byte is the end of a character or the middle of one, because in UTF-8 a character can take up to 4 bytes. – user2352554 Nov 06 '16 at 06:45
Yes, I missed the UTF-8 encoding. In that case, once you seek to an arbitrary position, you need to check whether it is in the middle of a character or not: characterStartDetector = (pos, data) => (data & 0x80) == 0 || (data & 0x40) != 0; See http://stackoverflow.com/questions/452902/how-to-read-a-text-file-reversely-with-iterator-in-c-sharp. Also see http://stackoverflow.com/questions/31008038/opening-inputstreamreader-in-the-middle-of-utf-8-stream. The second link is a Java example, but it is not hard to understand and use in C#. – Mukul Varshney Nov 06 '16 at 07:29
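
As a rough sketch of how the detector expression from the comment above might be used, the code below moves an arbitrary byte offset forward to the next character start. The names Utf8Boundary, IsCharacterStart and AlignToCharacterStart are hypothetical, and the code assumes the stream contains valid UTF-8, where a continuation byte has the form 10xxxxxx and at most three of them follow a lead byte.

```csharp
using System;
using System.IO;

// Hypothetical helper built around the characterStartDetector expression.
static class Utf8Boundary
{
    // True for single-byte characters (0xxxxxxx) and lead bytes (11xxxxxx);
    // false for continuation bytes (10xxxxxx).
    public static bool IsCharacterStart(byte data)
    {
        return (data & 0x80) == 0 || (data & 0x40) != 0;
    }

    // Returns the first offset at or after 'position' that starts a character,
    // or the stream length if no character starts before the end.
    public static long AlignToCharacterStart(Stream source, long position)
    {
        source.Seek(position, SeekOrigin.Begin);
        int b;
        while ((b = source.ReadByte()) != -1)
        {
            if (IsCharacterStart((byte)b)) return source.Position - 1;
        }
        return source.Length;
    }
}
```

For example, in the bytes of "a€b" (61 E2 82 AC 62), offsets 2 and 3 fall inside the three-byte euro sign, so both would be moved forward to offset 4.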
