1

I need some advice regarding updating a large number of records in a MySql database. I am currently storing a large number of words and their suffixes (a suffix array) into a database which results in a row count of approximately 4.3 million. Each record contains the primary key id, the actual word word, the document the word is in document, the offset of the word within the document 'offset', a flag which determines whether the record is a whole word or not flag and a link to the next record with the same value for word. Each record is initialized with a link value of -1.

This is my current code for updating the links in the database:

public void Link(object c)
    {
        DBConnection conn = (DBConnection)c;

        rowcount = conn.GetRowCount();
        string word;
        int link;
        List<Record> recordsList = new List<Record>();
        List<Record> recordsMatched = new List<Record>();

        for (int i = 0; i < rowcount; i++)
        {
            recordsList.AddRange(conn.ReadQuery("SELECT * FROM csa2018.words WHERE id = " + i));
            word = recordsList[0].Word;
            link = recordsList[0].Link;

            recordsMatched = conn.ReadQuery("SELECT * FROM csa2018.words WHERE word = '" + word + "'");

            for(int j = 0; j < recordsMatched.Count-1; j++)
            {
                if (recordsMatched[j].Link == -1)
                {
                    conn.WriteQuery("UPDATE csa2018.words SET link = " + recordsMatched[j + 1].Id + " WHERE id = " + recordsMatched[j].Id);
                }
                else
                {
                    break;
                }
                linkedRecords++;
            }
            linkedRecords++;

            recordsMatched.Clear();
            recordsList.Clear();
        }
        Form1.linkingFinished = true;
    }

Overall, it has good performance when it finds words which are repeated frequently; however at around 60% the performance deteriorates because most of the remaining words are unique. My guess is that this query:

recordsMatched = conn.ReadQuery(
"SELECT * FROM csa2018.words WHERE word = '" + word + "'");

shouldn't be like this because it is being called once for every row. Are there any better approaches like using stored procedures maybe?

P.S.: the ReadQuery method reads rows using the query supplied and constructs a Record object and adds each record to a List<Record>.

This is what my database looks like :

CREATE TABLE words ( id int(11) NOT NULL, word varchar(45) NOT NULL,
document varchar(45) NOT NULL, offset int(11) NOT NULL, flag int(11) NOT NULL,
link int(11) DEFAULT NULL, PRIMARY KEY (id) ) ENGINE=InnoDB DEFAULT CHARSET=utf8
Jurgen Camilleri
  • 3,559
  • 20
  • 45
  • Please post the table creation details. We need this to see what indexes you defined on the table, and what column data types you used. In the mysql command line you can get this information conveniently by running: `SHOW CREATE TABLE csa2018.words` – Roland Bouman Mar 02 '12 at 12:56
  • First, add an index to word. Second, you should use parameters in your queries. See http://stackoverflow.com/questions/652978/parameterized-query-for-mysql-with-c-sharp – jle Mar 02 '12 at 12:58
  • The combination of large row-count and a nested loop is bound to cause problems; some kind of dictionary would be better - but IMO the biggest problem here is so many SQL queries; bulk operations are best done in bulk, via set-based SQL. Not row-based. – Marc Gravell Mar 02 '12 at 13:09
  • So how could I go about getting all instances of a particular word and setting their links at one go? I don't have much experience in SQL. – Jurgen Camilleri Mar 02 '12 at 13:24
  • you will improve the performance if you use parameters in SQL. Every SQL-Statement should be build outside of the loops. goto: http://dev.mysql.com/doc/refman/5.0/es/connector-net-examples-mysqlcommand.html and search for '(?val)' – user1027167 Mar 03 '12 at 15:55

1 Answers1

0

if I understand your code correctly than this single sql-statement should do the job:

UPDATE  csa2018.words as w1 

    left join 

    (select w2.id as id, min(w3.id) as linked_to 
    from    csa2018.words w2, csa2018.words w3 
    where   w2.word = w3.word and 
            w3.id > w2.id limit 1) w4

    on (w1.id = w4.id)

SET     w1.link = IFNULL(w4.linked_to, -1)

The inner select-statement gives the mapping from one dataset to the linked dataset. You should watch the result of the select-statement to see if everthing is fine.

user1027167
  • 4,320
  • 6
  • 33
  • 40
  • unfortunately this code seems to work very slow, it didn't even get to 1% in 10 minutes. Maybe because I changed the database engine? I added an index on the text field and switched from InnoDB to MyISAM and it's working much faster now. Thanks anyway! – Jurgen Camilleri Mar 03 '12 at 10:30
  • you could try: "select w2.id as id, w3.id as linked_to from csa2018.words w2, csa2018.words w3 where w2.word = w3.word and w3.id > w2.id order by w3.id limit 1" as the inner select-statement (which I think should not be better) – user1027167 Mar 03 '12 at 15:35
  • to optimize InnoDB you could try to disable keys and constraints during the update. Search for "ALTER TABLE [tablename] DISABLE KEYS;" and "SET FOREIGN_KEY_CHECKS=0;". But I think also this should not give a better performance, so just try it. – user1027167 Mar 03 '12 at 15:38