
I read the file line by line and insert the data into a database using Entity Framework. The reading is very slow. The file has almost 6 million rows and I need to increase the performance of reading it. The file is a dictionary of words, and I need to insert these words into a database table. Below are several lines of that file.

390201
ТАТАМИ  NOUN,inan,neut,Fixd sing,nomn
ТАТАМИ  NOUN,inan,neut,Fixd sing,gent
ТАТАМИ  NOUN,inan,neut,Fixd sing,datv
ТАТАМИ  NOUN,inan,neut,Fixd sing,accs
ТАТАМИ  NOUN,inan,neut,Fixd sing,ablt
ТАТАМИ  NOUN,inan,neut,Fixd sing,loct
ТАТАМИ  NOUN,inan,neut,Fixd plur,nomn
ТАТАМИ  NOUN,inan,neut,Fixd plur,gent
ТАТАМИ  NOUN,inan,neut,Fixd plur,datv
ТАТАМИ  NOUN,inan,neut,Fixd plur,accs
ТАТАМИ  NOUN,inan,neut,Fixd plur,ablt
ТАТАМИ  NOUN,inan,neut,Fixd plur,loct

390202
ОРИГАМИ NOUN,inan,neut,Fixd sing,nomn
ОРИГАМИ NOUN,inan,neut,Fixd sing,gent
ОРИГАМИ NOUN,inan,neut,Fixd sing,datv
ОРИГАМИ NOUN,inan,neut,Fixd sing,accs
ОРИГАМИ NOUN,inan,neut,Fixd sing,ablt
ОРИГАМИ NOUN,inan,neut,Fixd sing,loct
ОРИГАМИ NOUN,inan,neut,Fixd plur,nomn
ОРИГАМИ NOUN,inan,neut,Fixd plur,gent
ОРИГАМИ NOUN,inan,neut,Fixd plur,datv
ОРИГАМИ NOUN,inan,neut,Fixd plur,accs

My code for parsing that file is below:

public static void parseFileFromToSegment(int beginId, int endId)
{
    using (var db = new Context())
    {
        string theWordFromFile;
        string wordData;
        int wordIdFromFile = 1;
        int tempWordId;

        IEnumerable<string> allFileLines = File.ReadLines(fileName);
        // skip lines until the line holding the beginning id is reached
        allFileLines = allFileLines.SkipWhile(n => n != beginId.ToString());
        foreach (string line in allFileLines)
        {
            if (string.IsNullOrEmpty(line))
                continue;
            // a purely numeric line marks the start of a new word id
            if (!string.IsNullOrEmpty(line) && Int32.TryParse(line, out tempWordId))
            {
                if (tempWordId < beginId)
                {
                    continue;
                }
                if (tempWordId > endId)
                    break;

                wordIdFromFile = tempWordId;
                if (wordIdFromFile % 100 == 0)
                    Console.WriteLine("Current id - " + wordIdFromFile);
                continue;
            }

            // word line: the text before the tab is the word, the rest is its grammatical data
            theWordFromFile = line.Substring(0, line.IndexOf('\t'));
            wordData = line.Substring(line.IndexOf('\t')).Trim();
            TheWord theWord = new TheWord { WordFormId = wordIdFromFile, word = theWordFromFile, word_form_data = wordData };

            db.TheWords.Add(theWord);
        }
        db.SaveChanges();
        Console.WriteLine("saved");
    }
}

So the reading is very slow. What can I do to improve performance? Thank you

Alexey Rumin
  • Are you sure the slow performance doesn't come from adding 6 million records to EF in a single `SaveChanges()`? – Jcl Mar 11 '16 at 09:01
  • @SimonKarlsson He is already doing it. `File.ReadLines` returns an `IEnumerable` and underneath uses `StreamReader.ReadLine()` – xanatos Mar 11 '16 at 09:01
  • This piece `!string.IsNullOrEmpty(line) &&` is useless. You already checked the opposite the line before. But it won't change the speed – xanatos Mar 11 '16 at 09:02
  • Try calling `SaveChanges()` inside the loop - but not for every iteration, of course. You can save each 1000 rows – Denis Itskovich Mar 11 '16 at 09:03
  • @DenisItskovich although that'd help, if performance is what I'm looking after, I'd just not use EF at all for a 6 million row bulk-insert operation – Jcl Mar 11 '16 at 09:04
  • Why do you think the *reading* is the bottleneck? Did you profile the code? What does "too slow" mean anyway? Does it take 200ms? 10 seconds? 10 minutes? How does the table you're inserting into look? What DB engine are you using? Also, you can't "seek" in a text file, so finding the `beginId` always means reading every single character preceding that `beginId` - depending on your usage of the method, this can result in a fairly typical Schlemiel. – Luaan Mar 11 '16 at 09:04
  • save changes is not a problem. It works fast. Reading the file takes many hours. – Alexey Rumin Mar 11 '16 at 09:10
  • @AlexeyRumin Maybe `db.TheWords.Add(theWord);` is causing the problem, everytime you call that, EF calls `DetectChanges`. Can you try adding `db.Configuration.AutoDetectChangesEnabled = false; ` right after you create the context? – Alexander Derck Mar 11 '16 at 09:14
  • @Alexander Derck You are right, thank you. I created a List of "theWord" objects and used the AddRange(list of theWord objects) method - the problem is solved - it was in EF's change detection (see the sketch after these comments). – Alexey Rumin Mar 11 '16 at 09:29
  • @AlexeyRumin You're welcome! – Alexander Derck Mar 11 '16 at 09:33
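
For reference, a minimal sketch of the fix described in the comments above, assuming Entity Framework 6 and the `Context`/`TheWord` types from the question (the `SaveWords` helper name is made up for illustration):

using System.Collections.Generic;

// Minimal sketch (EF6-style API): add the parsed rows in one batch
// with automatic change detection turned off.
public static void SaveWords(List<TheWord> words)
{
    using (var db = new Context())
    {
        // Stops EF from running DetectChanges on every single Add call.
        db.Configuration.AutoDetectChangesEnabled = false;

        // One AddRange for the whole list instead of millions of Add calls.
        db.TheWords.AddRange(words);
        db.SaveChanges();
    }
}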

2 Answers


It's not the file reads that are slow. It's the DB inserts.

You could use pure ADO.NET with a DataAdapter to insert the rows (using batching), or the SqlBulkCopy class (example).
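
For illustration, a rough sketch of the SqlBulkCopy route, assuming SQL Server, a destination table named TheWords whose columns match the question's entity, and a placeholder connection string:

using System.Data;
using System.Data.SqlClient;

// Rough sketch: bulk-insert the parsed rows with SqlBulkCopy instead of EF.
// The table name, column layout and connection string are assumptions.
public static void BulkInsert(DataTable rows, string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "TheWords";
            bulkCopy.BatchSize = 10000;    // send rows to the server in batches
            bulkCopy.WriteToServer(rows);  // rows: a DataTable shaped like the target table
        }
    }
}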

jgauffin
  • SaveChanges is not the problem - the saving will be done in this method after 20,000 lines are read. But the reading of them is very, very slow. And when they are read EF works fine - saves them very fast – Alexey Rumin Mar 11 '16 at 09:09
  • It's easy to test that. Simply comment out all Entity Framework code and run the sample again. Is it fast? – jgauffin Mar 11 '16 at 09:13
  • I commented it out - it still works very slowly. It will take many hours (20-30) to read the entire file – Alexey Rumin Mar 11 '16 at 09:16
  • Was that without the database code? I don't care about `SaveChanges()`. Did you test it without ALL database code? Because it's unlikely that a DB insert is faster than a file read unless the file is read over a very slow network share. – jgauffin Mar 11 '16 at 09:23
  • you were right - the problem was in EF. As Alexander Derck said in a comment - "Maybe db.TheWords.Add(theWord); is causing the problem, everytime you call that, EF calls DetectChanges." I added a list for the "theWord" objects and used the AddRange method for adding the objects. After that I saved changes. Works perfectly. Thank you for your help! – Alexey Rumin Mar 11 '16 at 09:31
  • @AlexeyRumin that's also not your only problem, it's absurd to open the file and read all previous lines just to skip ahead... open the file outside your function and don't reopen it on every batch iteration – Jcl Mar 11 '16 at 09:33
  • @Jcl Before this I tried to solve the problem using multithreading - each thread had to open the file and parse from a definite line. So I just didn't delete it. Before that I opened the file only once. Thank you for writing about this problem in the code – Alexey Rumin Mar 11 '16 at 09:40
  • @AlexeyRumin as a future advice: I/O-bound processes on the same file (actually, on the same physical disk) *never* get better with multithreading. CPU-bound processes might. – Jcl Mar 11 '16 at 09:43
  • @Jcl Thank you for advice! – Alexey Rumin Mar 11 '16 at 09:50

Reading all the comments, it seems you are calling parseFileFromToSegment every 20,000 "ids", which (judging from your sample text) probably means many lines for each id.

So you are calling your parseFileFromToSegment and doing this:

IEnumerable<string> allFileLines = File.ReadLines(fileName);
allFileLines = allFileLines.SkipWhile(n => n != beginId.ToString());

On every call this reads from the start of the file, skipping over potentially millions of already-read lines each time you call it.

Try making that call just once and see if it's faster; if you want to batch-save every 'n' records, then do so, but don't open and read [potentially] the whole file on every single iteration of 'n'.
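
A sketch of that idea, assuming the same file format and EF context as the question; the batch size of 20,000 and the method name are illustrative:

using System;
using System.Collections.Generic;
using System.IO;

// Sketch: read the file once, flush every N parsed rows, and never re-open the file.
public static void ParseWholeFile(string fileName)
{
    const int batchSize = 20000;   // illustrative batch size
    int wordIdFromFile = 1;
    int tempWordId;
    var batch = new List<TheWord>();

    using (var db = new Context())
    {
        db.Configuration.AutoDetectChangesEnabled = false;

        foreach (string line in File.ReadLines(fileName))   // single pass over the file
        {
            if (string.IsNullOrEmpty(line))
                continue;

            if (Int32.TryParse(line, out tempWordId))        // a numeric line starts a new word id
            {
                wordIdFromFile = tempWordId;
                continue;
            }

            int tab = line.IndexOf('\t');
            batch.Add(new TheWord
            {
                WordFormId = wordIdFromFile,
                word = line.Substring(0, tab),
                word_form_data = line.Substring(tab).Trim()
            });

            if (batch.Count >= batchSize)                    // periodic flush
            {
                db.TheWords.AddRange(batch);
                db.SaveChanges();
                batch.Clear();
            }
        }

        db.TheWords.AddRange(batch);                         // flush the remainder
        db.SaveChanges();
    }
}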

Jcl