6

Conclusion
Problem closed, I think.
Looks like the problem had nothing to do with the methodology, but that the XCode did not clean the project correctly in between builds.
It looks like after all those tests, the sqlite file that was being used was still the very first one that wasn't indexed......
Beware of XCode 4.3.2, I have nothing but problems with Clean not cleaning, or adding files to project not automatically being added to the bundle resources...
Thanks for the different answers..

Update 3
Since I invite anybody to just try the same steps to see if they get the same results, let me detail what I did:
I start with blank project
I defined a datamodel with one Entity, 3 attributes (2 strings, 1 float)
The first string is indexed
enter image description here

In did finishLaunchingWithOptions, I am calling:

[self performSelectorInBackground:@selector(populateDB) withObject:nil];

The code for populateDb is below:

-(void)populateDB{
NSLog(@"start");
NSPersistentStoreCoordinator *coordinator = [self persistentStoreCoordinator];
NSManagedObjectContext *context;
if (coordinator != nil) {
    context = [[NSManagedObjectContext alloc] init];
    [context setPersistentStoreCoordinator:coordinator];
}

NSString *filePath = [[NSBundle mainBundle] pathForResource:@"input" ofType:@"txt"];  
if (filePath) {  
    NSString * myText = [[NSString alloc]
                               initWithContentsOfFile:filePath
                               encoding:NSUTF8StringEncoding
                               error:nil];
    if (myText) {
        __block int count = 0;


        [myText enumerateLinesUsingBlock:^(NSString * line, BOOL * stop) {
            line=[line stringByReplacingOccurrencesOfString:@"\t" withString:@" "];
            NSArray *lineComponents=[line componentsSeparatedByString:@" "];
            if(lineComponents){
                if([lineComponents count]==3){
                    float f=[[lineComponents objectAtIndex:0] floatValue];
                    NSNumber *number=[NSNumber numberWithFloat:f];
                    NSString *string1=[lineComponents objectAtIndex:1];
                    NSString *string2=[lineComponents objectAtIndex:2];
                    NSManagedObject *object=[NSEntityDescription insertNewObjectForEntityForName:@"Bigram" inManagedObjectContext:context];
                    [object setValue:number forKey:@"number"];
                    [object setValue:string1 forKey:@"string1"];
                    [object setValue:string2 forKey:@"string2"];
                    NSError *error;
                    count++;
                    if(count>=1000){
                        if (![context save:&error]) {
                            NSLog(@"Whoops, couldn't save: %@", [error localizedDescription]);
                        }
                        count=0;

                    }
                }
            }



        }];
        NSLog(@"done importing");
        NSError *error;
        if (![context save:&error]) {
            NSLog(@"Whoops, couldn't save: %@", [error localizedDescription]);
        }

    }  
}
NSLog(@"end");
}

Everything else is default core data code, nothing added.
I run that in the simulator.
I go to ~/Library/Application Support/iPhone Simulator/5.1/Applications//Documents
There is the sqlite file that is generated

I take that and I copy it in my bundle

I comment out the call to populateDb

I edit persistentStoreCoordinator to copy the sqlite file from bundle to documents at first run

- (NSPersistentStoreCoordinator *)persistentStoreCoordinator 
{
@synchronized (self)
{
    if (__persistentStoreCoordinator != nil)
        return __persistentStoreCoordinator;

    NSString *defaultStorePath = [[NSBundle mainBundle] pathForResource:@"myProject" ofType:@"sqlite"];
    NSString *storePath = [[[self applicationDocumentsDirectory] path] stringByAppendingPathComponent: @"myProject.sqlite"];

    NSError *error;
    if (![[NSFileManager defaultManager] fileExistsAtPath:storePath]) 
    {
        if ([[NSFileManager defaultManager] copyItemAtPath:defaultStorePath toPath:storePath error:&error])
            NSLog(@"Copied starting data to %@", storePath);
        else 
            NSLog(@"Error copying default DB to %@ (%@)", storePath, error);
    }

    NSURL *storeURL = [NSURL fileURLWithPath:storePath];

    __persistentStoreCoordinator = [[NSPersistentStoreCoordinator alloc] initWithManagedObjectModel:[self managedObjectModel]];

    NSDictionary *options = [NSDictionary dictionaryWithObjectsAndKeys:
                             [NSNumber numberWithBool:YES], NSMigratePersistentStoresAutomaticallyOption,
                             [NSNumber numberWithBool:YES], NSInferMappingModelAutomaticallyOption, nil];

    if (![__persistentStoreCoordinator addPersistentStoreWithType:NSSQLiteStoreType configuration:nil URL:storeURL options:options error:&error]) 
    {

        NSLog(@"Unresolved error %@, %@", error, [error userInfo]);
        abort();
    }    

    return __persistentStoreCoordinator;
}    
}


I remove the app from the simulator, I check that ~/Library/Application Support/iPhone Simulator/5.1/Applications/ is now removed
I rebuild and launch again
As expected, the sqlite file is copied over to ~/Library/Application Support/iPhone Simulator/5.1/Applications//Documents

However the size of the file is smaller than in the bundle, significantly! Also, doing a simple query with a predicate like this predicate = [NSPredicate predicateWithFormat:@"string1 == %@", string1]; clearly shows that string1 is not indexed anymore

Following that, I create a new version of the datamodel, with a meaningless update, just to force a lightweight migration
If run on the simulator, the migration takes a few seconds, the database doubles in size and the same query now takes less than a second to return instead of minutes.
This would solve my problem, force a migration, but that same migration takes 3 minutes on the iPad and happens in the foreground.
So hat's where I am at right now, the best solution for me would still be to prevent the indexes to be removed, any other importing solution at launch time just takes too much time.
Let me know if you need more clarifications...

Update 2
So the best result I have had so far is to seed the core data database with the sqlite file produced from a quick tool with similar data model, but without the indexes set when producing the sqlite file. Then, I import this sqlite file in the core data app with the indexes set, and allowing for a lightweight migration. For 2 millions record on the new iPad, this migration stills take 3 minutes. The final app should have 5 times this number of records, so we're still looking at a long long processing time. If I go that route, the new question would be: can a lightweight migration be performed in the background?

Update
My question is NOT how to create a tool to populate a Core Data database, and then import the sqlite file into my app.
I know how to do this, I have done it countless times.
But until now, I had not realized that such method could have some side effect: in my case, an indexed attribute in the resulting database clearly got 'unindexed' when importing the sqlite file that way.
If you were able to verify that any indexed data is still indexed after such transfer, I am interested to know how you proceed, or otherwise what would be the best strategy to seed such database efficiently.

Original

I have a large CSV file (millions of lines) with 4 columns, strings and floats. This is for an iOS app.

I need this to be loaded into core data the first time the app is loaded.

The app is pretty much non functional until the data is available, so loading time matters, as a first time user obviously does not want the app to take 20 minutes to load before being able to run it.

Right now, my current code takes 20 min on the new iPad to process a 2 millions line csv file.

I am using a background context to not lock the UI, and save the context every 1,000 records

The first idea I had was to generate the database on the simulator, then to copy/paste it in the document folder at first launch, as this is the common non official way of seeding a large database. Unfortunately, the indexes don't seem to survive such a transfer, and although the database was available after just a few seconds, performance is terrible because my indexes were lost. I posted a question about the indexes already, but there doesn't seem to be a good answer to that.

So what I am looking for, either:

  • a way to improve performance on loading millions of records in core data
  • if the database is pre-loaded and moved at first startup, a way to keep my indexes
  • best practices for handling this kind of scenario. I don't remember using any app that requires me to wait for x minutes before first use (but maybe The Daily, and that was a terrible experience).
  • Any creative way to make the user wait without him realizing it: background import while going through tutorial, etc...
  • Not Using Core Data?
  • ...
RLH
  • 15,230
  • 22
  • 98
  • 182
JP Hribovsek
  • 6,707
  • 2
  • 21
  • 26
  • So how did you end up "cleaning" the project so that it worked correctly? – lnafziger May 05 '12 at 06:10
  • Clean did not work, but Rebooting laptop, manually cleaning all references to the file, etc, seems to have 'solved' the problem. weird... although I also did have to remove the lightweight migration lines to force not to migrate (as this would take many minutes). Overall, this isn't the clean implementation I would hope for, but this works... until a version 2 needs a data model upgrade, then I am in trouble – JP Hribovsek May 05 '12 at 06:17

2 Answers2

6

Pre-generate your database using an offline application (say, a command-line utility) written in Cocoa, that runs on OS X, and uses the same Core Data framework that iOS uses. You don't need to worry about "indexes surviving" or anything -- the output is a Core Data-generated .sqlite database file, directly and immediately usable by an iOS app.

As long as you can do the DB generation off-line, it's the best solution by far. I have successfully used this technique to pre-generated databases for iOS deployment myself. Check my previous questions/answers for a bit more detail.

Shaggy Frog
  • 27,575
  • 16
  • 91
  • 128
  • What do you mean, I don't need to worry about my indexes; as I stated in my question, I did this exact method, the output was a sqlite database file (200Mb), and when used in my app with exact same model, the file went down to 110Mb and clearly performance suggested that my indexes were not working. So I DO worry about my indexes, this is the whole point! – JP Hribovsek May 04 '12 at 07:22
  • @nafziger, do you mean you had indexes in your core data model, and that you know for sure that those indexes are still working as they should once you reuse that sqlite file? If so, what was your methodology to make sure your indexes were still working? – JP Hribovsek May 04 '12 at 07:24
  • @JP Hribovsek there is a difference between a plain-old SQLite database file generated by SQLite (which is generally not directly usable by Core Data), and a SQLite database file generated by Core Data. I have used the system I suggested, using the exact same Core Data data model, in both an iOS application and a OS X Cocoa command line utility program, with no problems. I have also pre-generated a SQLite DB, using SQLite, for use in an iOS application using SQLite, without problems -- but that was before Core Data arrived on iOS, which simplified this problems greatly. – Shaggy Frog May 04 '12 at 07:32
  • I may not have described this correctly, but let me clarify what I did: I used my CSV loader into core data code on the simulator. In the same app (same core data model), I remove the sqlite file from the simulator data, moved it in the bundle, and edited the code to not do a CSV import anymore, but just reuse the sqlite file from the bundle. I don't say that it does not work in appearance, it DOES "work", as in I can query and get my results. But the sqlite that was originally 200Mb, and still that size when moved to bundle, became a 120Mb file, and performance shows my indexes are gone – JP Hribovsek May 04 '12 at 07:38
  • @JP Hribovsek why are you using the Simulator? Create a bonafide OS X application to pre-generate the database. – Shaggy Frog May 04 '12 at 08:09
  • Why not.. I am making an iOS app, I am using iOS simulator, sqlite file is available right there, I am reusing it in the SAME app, if anything I would expect that to work better than doing a OSX app to move the sqlite file in iOS. Again, I want to make sure there is no confusion, I do know how to generate a sqlite file through Core Data and reuse it, I have done it countless times. But never before did I need to have an indexed attribute and millions of lines in the initial DB. Have you yourself been able to verify that the indexing is not 'removed' from the seed in this kind of case? – JP Hribovsek May 04 '12 at 08:17
  • @ShaggyFrog So I have done multiple tests, including building a separate OS X tool with same data model as my target app. What it did for me is that it forced a lightweight migration when the sqlite was copied over at first launch. The migration rebuilds the indexes. However, on a 2 million records table (one column indexed), this still takes 3+ min on the new iPad, and isn't performed in the background. The final app will have 10+millions records, so I am almost back to square one.. – JP Hribovsek May 04 '12 at 19:21
  • @JP Hribovsek there should be no migration, no rebuilding, no loss of information at all. Perhaps I am not clear on what you are doing. I don't think I can help much more. – Shaggy Frog May 04 '12 at 21:10
  • @ShaggyFrog thanks for trying anyway. I have the feeling though that I keep getting answers about how to do what I've been doing time and time again, which we all know is "working". But nobody actually checking the indexes. If anybody has too much time on their hands, I invite them to use the usual method on a 10 millions record csv file, index one of the attribute when doing that. Notice the generated sqlite file. Use that in their bundle, and observe the resulting sqlite file & performance while query against the indexed attribute. – JP Hribovsek May 04 '12 at 22:50
  • Also if somebody is trying that, make sure to turn lightweight migration off and use same model version not to run into any issues. With migration on and a different model version, this would look like it works (after a few seconds loading on simulator), but lightweight migration on this kind of data on the actual device takes minutes. – JP Hribovsek May 04 '12 at 22:53
  • @ShaggyFrog One last thing, I edited my question with I think a very detailed description of the problem, if that helps... – JP Hribovsek May 04 '12 at 23:17
0

I'm just starting out with SQLite and I need to integrate a DB into one of my apps that will have a lot of indexed data in a SQLite database. I was hoping I could do some method where I could bulk insert my information into a SQLite file and add that file to my project. After discovering and reading through your question, the provided answer and the numerous comments, I decided to check out the SQLite source to see if I could make heads or tails of this issue.

My initial thought was that the iOS implementation of SQLite is, in fact, throwing out your indices. The reason is because you initially create your DB index on x86/x64 system. The iOS is an ARM processor, and numbers are handled differently. If you want your indexes to be fast, you should generate them in such a way that they are optimized for the processor in which they will be searched.

Since SQLite is for multiple platforms, it would make since to drop any indices that have been created in another architecture and rebuild them. However, since no one wants to wait for an index to rebuild the first time it is accessed, the SQLite devs most likely decided to just drop the index.

After digging into the SQLite code, I've come to the conclusion that this is most likely happening. If not for the processor architecture reason, I did find code (see analyze.c and other meta-information in sqliteint.h) where indices were being deleted if they were generated under an unexpected context. My hunch is that the context that drives this process is how the underlying b-tree data structure was constructed for the existing key. If the current instance of SQLite can't consume the key, it deletes it.

It is worth mentioning that the iOS Simulator is just that-- a simulator. It is not an emulator of the, hardware. As such, your app is running in a pseudo-iOS device, running on an x86/x64 processor.

When your app and SQLite DB are loaded to your iOS device, an ARM-compiled variant is loaded, which also links to the ARM compiled libraries within iOS. I couldn't find ARM specific code associated with SQLite, so I imagine Apple had to modify it to their suit. The could also be part of the problem. This may not be an issue with the root-SQLite code, it could be an issue with the Apple/ARM compiled variant.

The only reasonable solution that I can come up with is that you can create a generator application that you run on your iOS machine. Run the application, build the keys, and then rip the SQLite file from the device. I'd imagine such a file would work across all devices, since all ARM processors used by iOS are 32-bit.

Again, this answer is a bit of an educated guess. I'm going to re-tag your question as SQLite. Hopefully a guru may find this and be able to weigh in on this issue. I'd really like to know the truth for my own benefit.

RLH
  • 15,230
  • 22
  • 98
  • 182