1

I'm working with the patents-public-data.patents.publications_201710 table in Google's BigQuery. My goal is to identify potential corporate-owned patents. Since an inventor must be an individual and not a corporation, I can theoretically do so by identifying records where the assignee is not found in the list of inventors. The tricky part is that there are often multiple inventors and even multiple assignees for the same publication_number.

The table originally looks like:

Row publication_number  assignee                        inventor     
1   US-7011573-B2       Mcarthur Richard C              MCARTHUR RICHARD C.  
                        Holmes Dennis G                 HOLMES DENNIS G.     
                        Mcarthur Ronald J               MCARTHUR RONALD J.   
2   US-8746747-B2   IPS Corporation—Weld-On Division    MCPHERSON TERRY R

I've tried the following (unsuccessful) query:

#standard sql
SELECT
  p.publication_number,
  assignee,
  inventor
FROM
  `patents-public-data.patents.publications_201710` AS p,
  p.assignee assignee,
  p.inventor inventor
WHERE
  assignee not in (inventor) #want to run this within each p.publication_number-assignee group somehow
  AND p.publication_number IN ('US-8746747-B2',
    'US-7011573-B2')

This query produces the following output:

Row publication_number  assignee                            inventor     
1   US-7011573-B2       Mcarthur Richard C                  MCARTHUR RICHARD C.  
2   US-7011573-B2       Mcarthur Richard C                  HOLMES DENNIS G.     
3   US-7011573-B2       Mcarthur Richard C                  MCARTHUR RONALD J.   
4   US-7011573-B2       Holmes Dennis G                     MCARTHUR RICHARD C.  
5   US-7011573-B2       Holmes Dennis G                     HOLMES DENNIS G.
6   US-7011573-B2       Holmes Dennis G                     MCARTHUR RONALD J.   
7   US-7011573-B2       Mcarthur Ronald J                   MCARTHUR RICHARD C.  
8   US-7011573-B2       Mcarthur Ronald J                   HOLMES DENNIS G.     
9   US-7011573-B2       Mcarthur Ronald J                   MCARTHUR RONALD J.   
10  US-8746747-B2       IPS Corporation—Weld-On Division    MCPHERSON TERRY R   

First, I'd like to see if the assignee is contained in (not just equal to) the inventor variable since assignee omits the "." after the middle initial. Frequently the string in the assignee variable equals the inventor, but not always. Some examples of exceptions are:

  • assignee: "Daryl A. KRUPA"; inventor: "KRUPA Daryl A."
  • assignee: "KRUPADANAM Gazula Levi DAVID"; inventor: "DAVID KRUPADANAM, Gazula Levi"

I know I'll end up having to live with some false positives (record is actually assigned to an individual and not a corporation), but if I can address some of these issues through my query I'd prefer that.

Second, assuming the first concern is addressed, I can't just see if the assignee is contained in the inventor variable because that would inappropriately retain rows 2-4, 6-8.

My desired output is simply:

Row publication_number  assignee                            inventor     
1   US-8746747-B2       IPS Corporation—Weld-On Division    MCPHERSON TERRY R   

I've found somewhat relevant posts here and here, but they don't fully address my question or don't use SQL.

What is the appropriate standard SQL query for my desired output?

user19329
  • 57
  • 5
  • You would show sample data for each table. It is unclear what the table look like and your use of the archaic join method (`','`) does not provide this information. – Gordon Linoff Jan 05 '18 at 17:52
  • @GordonLinoff updated question. There is only one table and no unique identifier for either assignee or inventor. Thank you for the reply. – user19329 Jan 05 '18 at 18:04

2 Answers2

2

Below is for BigQuery Standard SQL

#standardSQL
CREATE TEMP FUNCTION normalizeString(phrase STRING) AS ((
  SELECT AS STRUCT phrase original, STRING_AGG(LOWER(word), ' ' ORDER BY word) normalized 
  FROM UNNEST(SPLIT(REGEXP_REPLACE(phrase, r'[,.]', ''), ' ')) word
));
CREATE TEMP FUNCTION removeDups(arr1 ARRAY<STRING>, arr2 ARRAY<STRING>) AS (ARRAY(
  SELECT a.original 
  FROM UNNEST(ARRAY(SELECT normalizeString(a) b FROM UNNEST(arr1) a ORDER BY b.normalized)) a, 
  UNNEST(ARRAY(SELECT normalizeString(a) b FROM UNNEST(arr2) a ORDER BY b.normalized)) i 
  GROUP BY 1 
  HAVING COUNTIF(a.normalized = i.normalized) = 0
));
SELECT publication_number,
  removeDups(assignee, inventor) assignee,
  removeDups(inventor, assignee) inventor
FROM `patents-public-data.patents.publications_201710`
WHERE publication_number IN ('US-8746747-B2', 'US-7011573-B2')
AND (ARRAY_LENGTH(removeDups(assignee, inventor)) > 0
OR ARRAY_LENGTH(removeDups(inventor, assignee)) > 0)   

You can test / play with above using below example with dummy data

#standardSQL
CREATE TEMP FUNCTION normalizeString(phrase STRING) AS ((
  SELECT AS STRUCT phrase original, STRING_AGG(LOWER(word), ' ' ORDER BY word) normalized 
  FROM UNNEST(SPLIT(REGEXP_REPLACE(phrase, r'[,.]', ''), ' ')) word
));
CREATE TEMP FUNCTION removeDups(arr1 ARRAY<STRING>, arr2 ARRAY<STRING>) AS (ARRAY(
  SELECT a.original 
  FROM UNNEST(ARRAY(SELECT normalizeString(a) b FROM UNNEST(arr1) a ORDER BY b.normalized)) a, 
  UNNEST(ARRAY(SELECT normalizeString(a) b FROM UNNEST(arr2) a ORDER BY b.normalized)) i 
  GROUP BY 1 
  HAVING COUNTIF(a.normalized = i.normalized) = 0
));
WITH `patents-public-data.patents.publications_201710` AS (
  SELECT 'US-8746747-B2' publication_number, ['IPS Corporation—Weld-On Division'] assignee, ['MCPHERSON TERRY R'] inventor UNION ALL 
  SELECT 'US-7011573-B2', ['Mcarthur Richard C', 'Holmes Dennis G', 'Mcarthur Ronald J'], ['MCARTHUR RICHARD C.', 'HOLMES DENNIS G.', 'MCARTHUR RONALD J.'] UNION ALL 
  SELECT 'TestA', ['Daryl A. KRUPA'], ['KRUPA Daryl A.'] UNION ALL 
  SELECT 'TestB', ['KRUPADANAM Gazula Levi DAVID'], ['DAVID KRUPADANAM, Gazula Levi'] 
)
SELECT publication_number,
  removeDups(assignee, inventor) assignee,
  removeDups(inventor, assignee) inventor
FROM `patents-public-data.patents.publications_201710`
WHERE publication_number IN ('US-8746747-B2', 'US-7011573-B2', 'TestA', 'TestB')
AND (ARRAY_LENGTH(removeDups(assignee, inventor)) > 0
OR ARRAY_LENGTH(removeDups(inventor, assignee)) > 0)  

Note: you can control logic of "names normalization" in normalizeString function. In example I provided - I just simply removing dot and comma - but you might want to enhance this

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • This works excellently. I did additionally exclude ";" to reduce false positives even further. I mention this just in case someone else would like to use this in the future. – user19329 Jan 05 '18 at 21:04
1

The following gets the patent/assignee information where the assignee is not an inventor:

SELECT p.publication_number, a.assignee
FROM `patents-public-data.patents.publications_201710` p JOIN
     p.assignee a
     ON p.assignee = a.assignee  LEFT JOIN -- guessing what the join keys are
     p.inventor i
     ON a.assignee = i.inventor  -- guessing what the join keys are
WHERE p.publication_number IN ('US-8746747-B2', 'US-7011573-B2') AND
      i.inventor IS NULL;

It is unclear what fields to use for the joins.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • That query does not work in my setting. Unfortunately, there is only one table and not a complete relational database in BigQuery. The assignee and inventor variables are housed in the same table and repeated for each record as now shown in the original question above. There is no unique identifier for either assignee or inventor, which is the source of a similar frustration with this table that I asked about in this post: [link](https://stackoverflow.com/questions/48037397/googles-big-query-using-sql-associate-the-assignee-name-and-harmonized-assigne). – user19329 Jan 05 '18 at 18:37