0

In my database, I have a table for companies. This table has a field called tags which would contain something like:

Furniture Retail E-commerce B2C Home & Furniture Consumer Discretionary Furniture British manufacturer retailer contemporary vintage furniture products Designs

What I want to be able to do is query these tags and return companies based on whether or not this field contains any of the keywords or phrases a user might enter.

For example, if a user wants to find a company whose tags contain the word Retail, it will generate a query like this:

SELECT
    company.domain,
    company.company_name,
    CONCAT_WS(
        ',',
        company.business_sector,
        company.tags
    ) AS 'tags',
    GROUP_CONCAT(
        employee.employee_id SEPARATOR ','
    ) AS 'employee_ids',
    COUNT(employee.employee_id) AS 'employees'
FROM
    company
INNER JOIN employee ON company.domain = employee.domain
WHERE
    company.tags REGEXP '^Retail| Retail |Retail$'
OR company.business_sector LIKE '%Retail%'
AND company.domain NOT IN (
    '@hotmail.com',
    '@gmail.com',
    '@aol.com'
)
GROUP BY
    company.domain

This exact query returns 11424 results, which is great.

Now the part where it falls over, is when a user enters a keyword that SHOULD NOT be inside this field.

So let's say we don't want any Apparel in there, it would generate this query:

SELECT
    company.domain,
    company.company_name,
    CONCAT_WS(
        ',',
        company.business_sector,
        company.tags
    ) AS 'tags',
    GROUP_CONCAT(
        employee.employee_id SEPARATOR ','
    ) AS 'employee_ids',
    COUNT(employee.employee_id) AS 'employees'
FROM
    company
INNER JOIN employee ON company.domain = employee.domain
WHERE
    company.tags REGEXP '^Retail| Retail |Retail$'
OR company.business_sector LIKE '%Retail%'
AND (
    company.tags NOT REGEXP '^Apparel| Apparel |Apparel$'
    AND company.business_sector NOT LIKE '%Apparel%'
)
AND company.domain NOT IN (
    '@hotmail.com',
    '@gmail.com',
    '@aol.com'
)
GROUP BY
    company.domain

This exact query returns 112 results, which should definitely not be the case as there are not 11312 companies with the keyword Apparel in my database.

Any ideas on what I'm doing wrong

EDIT

This is not a duplicate... I can modify my query, but that's not where the problem lays.

For example, let's take those 11424 results from Retail and put in a random phrase that we know will NEVER be in ANY of the results, we should get the same 11424 records:

SELECT
    company.domain,
    company.company_name,
    CONCAT_WS(
        ',',
        company.business_sector,
        company.tags
    ) AS 'tags',
    GROUP_CONCAT(
        employee.employee_id SEPARATOR ','
    ) AS 'employee_ids',
    COUNT(employee.employee_id) AS 'employees'
FROM
    company
INNER JOIN employee ON company.domain = employee.domain
WHERE
    (
        company.tags REGEXP '^Retail| Retail |Retail$'
        OR company.business_sector LIKE '%Retail%'
    )
AND (
    company.tags NOT REGEXP '^This phrase will never occur| This phrase will never occur |This phrase will never occur$'
    AND company.business_sector NOT LIKE '%This phrase will never occur%'
)
AND company.domain NOT IN (
    '@hotmail.com',
    '@gmail.com',
    '@aol.com'
)
GROUP BY
    company.domain

Instead of getting 11424, I get 135 records from this. How?

ThePerplexedOne
  • 2,920
  • 15
  • 30
  • You should fix your data structure so the tags are stored in a separate table, with one per row. – Gordon Linoff Dec 20 '16 at 12:10
  • @GordonLinoff Sure, but users need to be able to search by phrases, as well as words. E.g. `Contract Hire`. So I can't store `Contract` and `Hire` as separate words. The `tags` field is generated from a company description, and has common words like `and`, `they`, `of` etc removed from it. – ThePerplexedOne Dec 20 '16 at 12:17
  • Do you mean that a search for `Contract Hire` should NOT find `Hire Contract`? – Álvaro González Dec 20 '16 at 12:34
  • 1
    Maybe you should be using full-text search. – Barmar Dec 20 '16 at 12:38
  • You can simplify the regexp to `[[:<:]]Retail[[:>:]]`. See http://dev.mysql.com/doc/refman/5.7/en/regexp.html – Barmar Dec 20 '16 at 12:41
  • How many rows does `...WHERE company.tags REGEXP '^Retail| Retail |Retail$' GROUP BY company.domain` give you (so without the other conditions)? Also: can `business_sector` be `null`, which would make the result of `AND (NOT a AND NOT b)` different from `AND NOT (a or b)`? So if a lot/most of your rows would have `null` there, it could be another possible explanation for your troubles. – Solarflare Dec 20 '16 at 20:49
  • `x OR y AND z` means `x OR ( y AND z )` -- perhaps the expressions is messed up? – Rick James Dec 21 '16 at 23:32

3 Answers3

0

You really should normalize your data, storing the tags in a separate table so that you don't have to do super-complex logic like this.

In the meantime, your problem is with Boolean group. AND has precedence over OR, so your query should be

SELECT
    company.domain,
    company.company_name,
    CONCAT_WS(
        ',',
        company.business_sector,
        company.tags
    ) AS 'tags',
        GROUP_CONCAT(
            employee.employee_id SEPARATOR ','
        ) AS 'employee_ids',
        COUNT(employee.employee_id) AS 'employees'
FROM
    company
INNER JOIN employee ON company.domain = employee.domain
WHERE
    (company.tags REGEXP '^Retail| Retail |Retail$'
    OR company.business_sector LIKE '%Retail%')
AND company.tags NOT REGEXP '^Apparel| Apparel |Apparel$'
AND company.business_sector NOT LIKE '%Apparel%'
AND company.domain NOT IN (
    '@hotmail.com',
    '@gmail.com',
    '@aol.com'
)
GROUP BY
    company.domain

Pay close attention to the parentheses locations.

elixenide
  • 44,308
  • 16
  • 74
  • 100
  • Somehow I get even less results with this. – ThePerplexedOne Dec 20 '16 at 13:41
  • @ThePerplexedOne Well, yes. Your original query wasn't filtering everything, so you would expect fewer results with the correct query. – elixenide Dec 20 '16 at 13:52
  • Okay but it's skipping out companies who have the tag `Retail Sector`, they completely get ignored. – ThePerplexedOne Dec 20 '16 at 13:55
  • @ThePerplexedOne Your original query - the one that returns 11424 results - is wrong. You have incorrect Boolean logic there, making it return too many rows. As for the other problem - excluding companies with the tag "Retail Sector" - this is why you really should normalize your data. It's *very* hard - if not impossible - to write queries that handle cases like that when you stick everything in one field like you are doing. What if you have tags like "Retail", "Apparel", and "Retail Apparel"? How do you know what to do with those? The big problem here is the database design, not the query. – elixenide Dec 21 '16 at 04:57
0

In these cases you need to use (even overuse) parentheses to structure the ORs and ANDs of your WHERE clause. It's best to spell out the associativity you want in your filtering expressions.

Try something like this

WHERE  (
             company.tags REGEXP '^Retail| Retail |Retail$'
          OR company.business_sector LIKE '%Retail%')
AND NOT (
             company.tags REGEXP '^Apparel| Apparel |Apparel$'
         OR company.business_sector LIKE '%Apparel%'
)
AND NOT company.domain IN (
    '@hotmail.com',
    '@gmail.com',
    '@aol.com'
)
O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • This gives me one result with the JOIN. It returns the first company with all of the employees it would have found for other companies. – ThePerplexedOne Dec 20 '16 at 13:37
0

I found something that works perfect for now, I'm using a MATCH AGAINST full-text search method:

SELECT
    company.domain,
    company.company_name,
    CONCAT_WS(
        ',',
        company.business_sector,
        company.tags
    ) AS 'tags',
    GROUP_CONCAT(
        employee.employee_id SEPARATOR ','
    ) AS 'employee_ids',
    COUNT(employee.employee_id) AS 'employees',
    COUNT(ct_connections.id) AS 'already_connected'
FROM
    company
INNER JOIN employee ON company.domain = employee.domain
LEFT JOIN ct_connections ON employee.email = ct_connections.email
AND ct_connections.client_id = 1
WHERE
    (
        MATCH (company.tags) AGAINST ('Retail')
        OR company.business_sector LIKE '%Retail%'
    )
AND (
    NOT MATCH (company.tags) AGAINST ('Apparel')
    AND company.business_sector NOT LIKE '%Apparel%'
    AND NOT MATCH (company.tags) AGAINST ('Footwear')
    AND company.business_sector NOT LIKE '%Footwear%'
)
AND company.domain NOT IN (
    '@hotmail.com',
    '@gmail.com',
    '@aol.com'
)
GROUP BY
    company.domain
ThePerplexedOne
  • 2,920
  • 15
  • 30