As alluded to in my question, I worked on a version using a JavaScript UDF which solves this, albeit more slowly than the answer I accepted. For completeness I'm posting it here, because perhaps someone (like myself in the future) may find it useful.
CREATE TEMPORARY FUNCTION CONTAINS_ANY(str STRING, fragments ARRAY<STRING>)
RETURNS STRING
LANGUAGE js AS """
  for (var i in fragments) {
    if (str.indexOf(fragments[i]) >= 0) {
      return fragments[i];
    }
  }
  return null;
""";
WITH record AS (
  -- The strings to be searched (one row per comment)
  SELECT text AS name
  FROM `bigquery-public-data.hacker_news.comments`
  WHERE text IS NOT NULL
), fragment AS (
  -- The fragments to look for, de-duplicated via GROUP BY
  SELECT name AS name, COUNT(*)
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  WHERE name IS NOT NULL
  GROUP BY name
), fragment_array AS (
  -- One array of fragments per fragment length, to keep each array small
  SELECT ARRAY_AGG(name) AS names, COUNT(*) AS count
  FROM fragment
  GROUP BY LENGTH(name)
), records_with_fragments AS (
  -- Records containing at least one fragment (may produce duplicates per record)
  SELECT record.name,
         CONTAINS_ANY(record.name, fragment_array.names) AS fragment_name
  FROM record
  INNER JOIN fragment_array
    ON CONTAINS_ANY(record.name, fragment_array.names) IS NOT NULL
)
-- Keep one row per record name, dropping the helper rownum column
SELECT * EXCEPT(rownum) FROM (
  SELECT record.name,
         records_with_fragments.fragment_name,
         ROW_NUMBER() OVER (PARTITION BY record.name) AS rownum
  FROM record
  INNER JOIN records_with_fragments
    ON records_with_fragments.name = record.name
    AND records_with_fragments.fragment_name IS NOT NULL
) WHERE rownum = 1
The idea is that the list of fragments is small enough to be processed in an array, similar to Felipe's answer using regular expressions. The first thing I do is create a fragment_array table which is grouped by the fragment lengths ... a cheap way of preventing an over-sized array, which I found can cause UDF timeouts.
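For what it's worth, a quick way to see how big each of those per-length arrays will be is a diagnostic along these lines (not part of the solution above, just something I'd run while tuning it):

SELECT
  LENGTH(name) AS name_length,
  COUNT(DISTINCT name) AS fragment_count
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name IS NOT NULL
GROUP BY name_length
ORDER BY name_length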
Next I create a table called records_with_fragments that joins those arrays to the original records, finding only those records which contain a matching fragment using the JavaScript UDF CONTAINS_ANY(). This will result in a table containing some duplicates, since one record may match fragments from more than one of the length-grouped arrays.
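If it helps, the UDF can be sanity-checked in isolation. This is just a throwaway query with made-up inputs, repeating the function definition so it runs on its own:

CREATE TEMPORARY FUNCTION CONTAINS_ANY(str STRING, fragments ARRAY<STRING>)
RETURNS STRING
LANGUAGE js AS """
  for (var i in fragments) {
    if (str.indexOf(fragments[i]) >= 0) {
      return fragments[i];
    }
  }
  return null;
""";
SELECT
  CONTAINS_ANY('I was chatting with Alice yesterday', ['Alice', 'Bob']) AS hit,   -- returns 'Alice'
  CONTAINS_ANY('there are no names in this one', ['Alice', 'Bob']) AS miss;       -- returns NULL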
The final SELECT then pulls in the original record table, joins to records_with_fragments to determine which fragment matched, and also uses the ROW_NUMBER() function to eliminate duplicates, i.e. only keeping the first row of each record as uniquely identified by its name.
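As an aside, the same de-duplication could also be expressed with BigQuery's QUALIFY clause instead of the wrapping subquery. The snippet below is only a sketch of a drop-in replacement for that final SELECT, relying on the same CTEs defined above:

SELECT record.name,
       records_with_fragments.fragment_name
FROM record
INNER JOIN records_with_fragments
  ON records_with_fragments.name = record.name
WHERE records_with_fragments.fragment_name IS NOT NULL
QUALIFY ROW_NUMBER() OVER (PARTITION BY record.name) = 1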
Now, the reason I do the join in the final query is that in my actual data there are more fields I want besides just the string being matched. Earlier on in my actual pipeline I create a table of DISTINCT strings, which then later needs to be re-joined to recover those fields.
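To illustrate the shape of that, here is a purely hypothetical sketch; the id column just stands in for the extra fields my real data carries, and the matching step itself is elided:

WITH source AS (
  SELECT id, text
  FROM `bigquery-public-data.hacker_news.comments`
  WHERE text IS NOT NULL
), record AS (
  SELECT DISTINCT text AS name    -- the de-duplicated strings that get matched
  FROM source
)
-- ... the fragment matching above runs against record ...
SELECT source.id, source.text
FROM source
INNER JOIN record
  ON record.name = source.text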
Voila! Not the most elegant, but it gets the job done.