0

We have a problem that at the moment we are not allowed to use ElasticSearch, so we need to implement a search function with MySQL. One desired feature is a prefixed, tokenized search, so a sentence like

"The quick brown fox jumped over the lazy dog" could be findable when you search for "jump". I think I would need to define a rule like (pseudocode):

(*)(beginning OR whitespace)(prefix)(*)

I assume it is possible to do that with JPA (Criteria API)? But what if we have two terms? All of them have to be combined by AND, e.g. the above rule should result in TRUE for both terms in at least one column. That means "jump fox" would result in a hit, but "jump rabbit" would not. Is that also possible with Criteria API?

Or do you know a better solution than Criteria API? I heard Hibernate can do LIKE queries more elegantly (with less code) but unfortunately we use EclipseLink.

Based on the answer below here is my full solution. It's all in one method to keep it simple here ("simple JPA criteria API" is an oxymoron though). If anyone wants to use it, consider some refactoring

public List<Customer> findMatching(String searchPhrase) {
    List<String> searchTokens = TextService.splitPhraseIntoNonEmptyTokens(searchPhrase);
    if (searchTokens.size() < 1 || searchTokens.size() > 5) { // early out and denial of service attack prevention
        return new ArrayList<>();
    }

    CriteriaBuilder criteriaBuilder = entityManager.getCriteriaBuilder();
    CriteriaQuery<Customer> criteriaQuery = criteriaBuilder.createQuery(Customer.class);

    Root<Customer> rootEntity = criteriaQuery.from(Customer.class);

    Predicate[] orClausesArr = new Predicate[searchTokens.size()];

    for (int i = 0; i < searchTokens.size() ; i++) {
        // same normalization methods are used to create the indexed searchable data
        String assumingKeyword = TextService.normalizeKeyword(searchTokens.get(i));
        String assumingText = TextService.normalizeText(searchTokens.get(i));
        String assumingPhoneNumber = TextService.normalizePhoneNumber(searchTokens.get(i));

        String assumingKeywordInFirstToken = assumingKeyword + '%';
        String assumingTextInFirstToken = assumingText + '%';
        String assumingPhoneInFirstToken = assumingPhoneNumber + '%';
        String assumingTextInConsecutiveToken = "% " + assumingText + '%';

        Predicate query = criteriaBuilder.or(
            criteriaBuilder.like(rootEntity.get("normalizedCustomerNumber"), assumingKeywordInFirstToken),
            criteriaBuilder.like(rootEntity.get("normalizedPhone"), assumingPhoneInFirstToken),
            criteriaBuilder.like(rootEntity.get("normalizedFullName"), assumingTextInFirstToken),
            // looking for a prefix after a whitespace:
            criteriaBuilder.like(rootEntity.get("normalizedFullName"), assumingTextInConsecutiveToken)
        );
        orClausesArr[i] = query;
    }

    criteriaQuery = criteriaQuery
            .select(rootEntity) // you can also select only the display columns and ignore the normalized/search columns
            .where(criteriaBuilder.and(orClausesArr))
            .orderBy(
                    criteriaBuilder.desc(rootEntity.get("customerUpdated")),
                    criteriaBuilder.desc(rootEntity.get("customerCreated"))
            );
    try {
        return entityManager
                .createQuery(criteriaQuery)
                .setMaxResults(50)
                .getResultList();
    } catch (NoResultException nre) {
        return new ArrayList<>();
    }
}
Phil
  • 7,065
  • 8
  • 49
  • 91

1 Answers1

0

The Criteria API is certainly not intended for this but it can be used to create LIKE predicates.

So for each search term and each column you want to search you would create something like the following:

column like :term + '%'
or column like ' ' + :term + '%'
or column like ',' + :term + '%'
// repeat for all other punctuation marks and forms of whitespace you want to support.

This will create horribly inefficient queries!

I see the following alternatives:

  1. Use database specific features. Some databases have some text search capabilities. If you can limit your application to one or few databases that might work.

  2. Create your own index: Use a proper tokenizer to analyze the columns you want to search and put the resulting tokens in a separate table with backreferences to the original table. Now search that for the terms you are looking for. As long as you do only prefixed searches database indexes should be able to keep this reasonable efficient and it is easier to maintain and more flexible than what you can obtain by using the Criteria API on its own.

Jens Schauder
  • 77,657
  • 34
  • 181
  • 348
  • Thank you very much, I will try this in the next days and report back if it worked. As for Improvement 1: is not possible for us because we will move from MySQL to Aurora soon and I don't want to make the migration harder. And I already took care of Improvement 2. I am iterating over a table with both search and display columns. The search columns have normalized (like lowercase and German Umlaut replacements) and tokenized (separators converted to whitespace) data. – Phil Dec 06 '18 at 10:28
  • The only thing I am missing is an array data type for better tokenization, but that ticket has not been assigned since 2007: https://dev.mysql.com/worklog/?sc=&sd=&k=WL%232081&p=&s= – Phil Dec 06 '18 at 10:29
  • Yes, when I refactor it, I might work with foreign keys instead of duplicate data. It is the way it is because the first approach was to load all the data in one DB-request and do the filtering in Java. And with foreign keys this resulted in the N+1 problem. – Phil Dec 06 '18 at 10:33