
I have a datum that I need to find within the database, for example 'dsfsdfsads'. But there are 100+ tables and views to search through. I have found numerous queries written for other databases that can find a specific string within the database; example posts are below. However, I don't see the same for BigQuery. I found this question: Is it possible to do a full text search in all tables in BigQuery?, but that post feels incomplete and the two links provided in the answer do not answer my question.

Examples of other database queries capable of finding specific string:

Find a string by searching all tables in SQL Server Management Studio 2008

Search all tables, all columns for a specific value SQL Server

How do I search an SQL Server database for a string?

Kirby
  • BigQuery is engineered for large-scale datasets; it's a data lake that scales to petabytes. What you want to accomplish is unfeasible. – Pentium10 Dec 07 '20 at 19:29

1 Answer


I am not sure why it doesn't suit you to search through your database with a wildcard table, as in the post you mentioned, because I have run this sample query against a public dataset and it works just fine:

SELECT *
FROM `bigquery-public-data.baseball.*` b
WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r'Cubs')
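To see why this works: TO_JSON_STRING serializes the whole row into one JSON string, so the regex matches the value no matter which column holds it. A rough Python analogue, using a made-up row in place of real BigQuery data:

```python
import json
import re

# Hypothetical row, standing in for one BigQuery row.
row = {"team": "Cubs", "year": 2016, "wins": 103}

# TO_JSON_STRING(b) turns the entire row into a single JSON string,
serialized = json.dumps(row)

# so REGEXP_CONTAINS finds the value regardless of which column it is in.
print(re.search(r"Cubs", serialized) is not None)  # True
```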

I guess it is because one of the limitations is that the wildcard table functionality does not support views.
Do you have a lot of them?

In that case you can use the wildcard only for your tables and filter out the views with _TABLE_SUFFIX or with a less general wildcard (it depends on the names of your views).
In general, with wildcard tables, using _TABLE_SUFFIX can greatly reduce the number of bytes scanned, which reduces the cost of running your queries. So use it also if you suspect some tables are more likely than others to contain the string.
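For example, to restrict the scan to tables whose names share a common prefix (the dataset name, prefix, and search string here are placeholders, not from your schema):

```sql
SELECT *
FROM `YOUR_PROJECT_ID.YOUR_DATASET.*` b
WHERE _TABLE_SUFFIX LIKE 'lookup%'   -- hypothetical table-name prefix
  AND REGEXP_CONTAINS(TO_JSON_STRING(b), r'dsfsdfsads')
```

With a dataset-level wildcard like this, _TABLE_SUFFIX holds the full table name, so any LIKE or equality filter on it prunes tables before they are scanned.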

For the views (or the whole dataset), you could:
• Iterate by calling the BigQuery API through one of the client libraries, with a parallelism module such as Python's multiprocessing.
• Iterate by calling the REST API from a bash script.
• Iterate by using the bq command from a bash script.

If you get stuck with the programmatic part, post a new question and add the link here.

EDIT:

Here are two examples for you (bash and Python). I tried them both and they work, but any comments that help improve them are of course welcome.

Python:

  1. Install the client library
pip install --upgrade google-cloud-bigquery
(The script below uses Python's built-in multiprocessing module, so no extra package is needed for that.)
  2. Create filename.py. Change YOUR_PROJECT_ID and YOUR_DATASET.

from google.cloud import bigquery
import multiprocessing

def search(dataset_id):
    """
    Lists and filters your dataset to keep only views
    """
    client = bigquery.Client()
    
    tables = client.list_tables(dataset_id)
    views = []
    for table in tables:
        if table.table_type == 'VIEW':
            views.append(table.table_id)
    return views

def query(dataset_id, view):
    """
    Searches for the string in your views and prints the first one it finds.
    You can change or remove 'LIMIT 1' if needed.
    """
    client = bigquery.Client()
    query_job = client.query(
        """
        SELECT *
        FROM `{}.{}` b
        WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r"true")
        LIMIT 1
        """.format(dataset_id, view)
    )

    results = query_job.result()  # Waits for job to complete.
    for row in results:
        print(row)

if __name__ == '__main__':
    # TODO: Set dataset_id to the ID of the dataset that contains the tables you are listing.
    dataset_id = 'YOUR_PROJECT_ID.YOUR_DATASET'
    views = search(dataset_id)

    processes = []
    for i in views:
        p = multiprocessing.Process(target=query, args=(dataset_id, i))
        p.start()
        processes.append(p)
    for process in processes:
        process.join()

Run python filename.py
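As a side note, since each per-view query is I/O-bound (waiting on BigQuery), a thread pool is a lighter alternative to spawning one process per view. A sketch of the same fan-out, with a stand-in function in place of the real query() call and hypothetical view names:

```python
from concurrent.futures import ThreadPoolExecutor

def query_view(view):
    # Stand-in for the query() function above, which would run the
    # BigQuery job; returning the view name lets the caller see
    # where a match came from.
    return "searched {}".format(view)

views = ["view_a", "view_b", "view_c"]  # hypothetical view names
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(query_view, views))
print(results)  # ['searched view_a', 'searched view_b', 'searched view_c']
```

pool.map also preserves input order and collects return values, which multiprocessing.Process alone does not do.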


Bash:

  1. Install jq (json parser) and test it
sudo apt-get install jq

Test

echo '{ "name":"John", "age":31, "city":"New York" }' | jq .

Output:

{
  "name": "John",
  "age": 31,
  "city": "New York"
}


  2. Create filename.sh. Change YOUR_PROJECT_ID and YOUR_DATASET.
#!/bin/bash
# List every table/view in the dataset as JSON.
RESULTS=$(bq ls --format prettyjson YOUR_DATASET)
DETAILS=$(echo "${RESULTS}" | jq -c '.[]')
for d in $DETAILS
do
        ID=$(echo $d | jq -r .tableReference.tableId)
        table_type=$(echo $d | jq -r '.type')
        if [[ $table_type == "VIEW" ]]
          then
                bq query --use_legacy_sql=false \
                'SELECT *
                FROM
                `YOUR_PROJECT_ID`.YOUR_DATASET.'$ID' b
                WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r"true")
                LIMIT 1'
          fi

done

Run bash filename.sh

Ksign
  • My issue is that the data I need to find is contained within the views. The tables are just lookup tables while the dynamic data is currently stored in views. I will definitely create a new question and post it once formed as I am not yet familiar with any of the suggestion you listed. Thank you so much for the help. – WouldLiketoKnowMore Dec 10 '20 at 21:56
  • @WouldLiketoKnowMore, I have edited my answer with two programmatic options for you to try. Hope this helps. – Ksign Dec 15 '20 at 14:41
  • It took me a bit to be able to get this to run with the python script, security issues were slowing my ability to get this to run. I have looked at one entry that it returns and I see that it is working properly to find that string. I am not that familiar with Python, but is there a way to just return the table/view where that entry was located? I.e. instead of for row in results: print(row) is there a way to do for row in results: print(view) – WouldLiketoKnowMore Dec 30 '20 at 19:00
  • Instead of print (row), you can use the following line in order to print information about the dataset and the view where the string was found. The following line of code works because both the dataset_id and the view variables are inside the scope of the function. print(f"The string was found in the dataset: {dataset_id}, in the view: {view}") – rodvictor Jan 19 '21 at 16:51