
I have a set of text files that are uploaded to Google Cloud Storage every 5 minutes. I want to load them into BigQuery every 5 minutes as well (since the text files arrive in Cloud Storage every 5 minutes). I know that text files can't be loaded into BigQuery directly. What is the best approach for this?

Sample of a text file: [screenshot in the original question, not reproduced here]

Thanks in advance.

3 Answers

1

Here is an alternative approach, which uses an event-based Cloud Function to load data into BigQuery. Create a Cloud Function with "Trigger Type" set to Cloud Storage. As soon as a file is uploaded to the Cloud Storage bucket, the Cloud Function is triggered and the data from Cloud Storage is loaded into BigQuery.

import pandas as pd
from google.cloud import bigquery

def bqDataLoad(event, context):
    # The Cloud Storage event payload carries the bucket and object name
    bucketName = event['bucket']
    blobName = event['name']
    fileName = "gs://" + bucketName + "/" + blobName

    bigqueryClient = bigquery.Client()
    # Placeholder dataset and table names - replace with your own
    tableRef = bigqueryClient.dataset("bq-dataset-name").table("bq-table-name")

    # Reading a gs:// path with pandas requires the gcsfs package
    dataFrame = pd.read_csv(fileName)

    # load_table_from_dataframe requires the pyarrow package
    bigqueryJob = bigqueryClient.load_table_from_dataframe(dataFrame, tableRef)
    bigqueryJob.result()  # wait for the load job to complete
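
If the files ever get large (see the comment about multi-GB files below), a variant that avoids reading the whole file into a DataFrame is to let BigQuery load the delimited file straight from Cloud Storage. This is only a minimal sketch and not part of the original answer: the pipe delimiter, the autodetect setting, and the dataset/table names are assumptions.

from google.cloud import bigquery

def bqLoadFromUri(event, context):
    # Build the gs:// URI of the object that triggered the function
    uri = "gs://" + event['bucket'] + "/" + event['name']

    bigqueryClient = bigquery.Client()

    # Let BigQuery parse the delimited text file itself instead of
    # materialising it in memory with pandas first
    jobConfig = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        field_delimiter="|",   # assumption: pipe-delimited text files
        autodetect=True,       # infer the schema from the file
    )

    bigqueryJob = bigqueryClient.load_table_from_uri(
        uri, "bq-dataset-name.bq-table-name", job_config=jobConfig
    )
    bigqueryJob.result()  # wait for the load job to finish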
Soumendra Mishra
  • Thanks, that's the best approach so far. But I have a few questions: my files are text files. I added a sample file to the question. So how do I convert it to CSV? –  Aug 09 '20 at 06:57
  • In the text files there are several topics, so first I have to select one topic, write it back to CSV, and then load it into BigQuery (different topics have different data fields, e.g. topic "cat" has two columns and "bat" has three columns). –  Aug 09 '20 at 06:59
  • By default it expects a comma-delimited file, but you can use any delimiter of your choice. Use the separator option while reading the file. Example: `dataFrame = pd.read_csv(fileName, sep="|")` – Soumendra Mishra Aug 09 '20 at 07:01
  • Yes, but can we read text files using pandas.read_csv()? Do you have any idea how to convert them? –  Aug 09 '20 at 07:13
  • You can use a delimited text file as-is by providing a separator. Example: `dataFrame = pd.read_csv("file.txt", sep="|")` – Soumendra Mishra Aug 09 '20 at 07:21
  • Could you tell me what you mean by blobName? Is it a subfolder, or the path from the bucket to the folder that contains the text files? –  Aug 09 '20 at 11:52
  • Yes, it is the file name including the folder and sub-folder path (example: `folder/sub-folder/file.txt`) – Soumendra Mishra Aug 09 '20 at 12:17
  • In case of a large file, like 5 GB, should we use Dask? – Tiago Medici Jul 06 '21 at 14:26
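
A hedged sketch of the topic-splitting step discussed in this comment thread, assuming a pipe-delimited file whose first field is the topic label and whose widest topic has three data columns; the column names and the "cat" topic value are placeholders for illustration only.

import pandas as pd

# Assumption: "|"-delimited file, first field = topic, at most three data fields
columns = ["topic", "field1", "field2", "field3"]
dataFrame = pd.read_csv("file.txt", sep="|", names=columns)

# Keep one topic, drop the columns that topic does not use,
# and write it back out as a CSV ready for BigQuery
catFrame = dataFrame[dataFrame["topic"] == "cat"].dropna(axis=1, how="all")
catFrame.to_csv("cat.csv", index=False)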
0

You can take advantage of BigQuery transfers.

  1. Create an empty BigQuery table with Schema (edit as text) Text:STRING
  2. Transform your .txt files into .csv files
  3. Create the BigQuery transfer from Google Cloud Storage
  4. Upload your .csv files into the GCS bucket
  5. Check if your transfer was successful

For now, this service transfers newly added files every hour, with a minimum file age limitation of 1 hour that is expected to be removed soon.

The service checks for new files that are older than 1 hour, counted from the time they were uploaded to the bucket. For example:

  • text1.csv was uploaded at 4:46
  • text2.csv was uploaded at 5:01
  • text3.csv was uploaded at 5:06

Results:
  • The transfer run of 5:00 will not transfer any file
  • The transfer run of 6:00 will transfer text1.csv
  • The transfer run of 7:00 will transfer text2.csv and text3.csv

For step 2, you need to process your text files so that they are accepted by BigQuery. I think the easiest way is to use .csv files. Edit your .txt file as follows:

  • add the character " at the beginning and at the end of the text, e.g. "I am going to the market to buy vegetables."
  • 'save as' the file as text1.csv
  • name the files so they share the same beginning characters, e.g. text[...].csv, so that you can use wildcards
  • repeat this for your next files (text2.csv, text3.csv, text4.csv ...)

You also need to make sure of the following (see the sketch after this list):

  • your text doesn't contain " characters inside the text - replace them with the ' character
  • make sure your whole text is on a single line, as embedded newlines are not supported
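
A minimal Python sketch of this conversion, applying the rules above (wrap the text in double quotes, replace inner double quotes with single quotes, collapse newlines); the text*.txt file names are placeholders.

import glob

# Convert text1.txt, text2.txt, ... into BigQuery-friendly CSV files
for txtPath in glob.glob("text*.txt"):
    with open(txtPath, "r") as f:
        text = f.read()

    # Single quoted CSV field: no inner double quotes, no newlines
    text = text.replace('"', "'").replace("\n", " ").strip()

    csvPath = txtPath.replace(".txt", ".csv")
    with open(csvPath, "w") as f:
        f.write('"' + text + '"\n')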

For step 3, find below the suggested transfer configurations:

  • Schedule options:

Custom --> every 1 hours

  • Cloud Storage URI:

yourbucket/yourfolder/text*

The transfer will pick up all the files that start with the name text

  • Write preference:

APPEND

  • File format:

CSV

For step 5, verify on the Transfer details page each hour whether the transfer was successful. If you get errors, the whole batch of files will not be transferred. Use the CLI (see the command below) to find out which file has issues and the nature of the error. You will need to delete the respective file from the bucket, correct it, and re-upload it.

bq --format=prettyjson show -j [bqts_...]

Also preview your BigQuery table to see your transferred texts.

MonicaPC
  • Thanks. Do you have any idea how to transform a set of .txt files located in Cloud Storage into .csv files in bulk? All of them are in the same format. –  Aug 06 '20 at 06:45
  • I think you can run this script https://stackoverflow.com/a/41323026/9398998 to add the characters, and as per this post https://unix.stackexchange.com/questions/32100/how-to-apply-the-same-awk-action-to-different-files save them with new names in a new folder of your bucket. You might want to start a new question with the "bash" and "shell" tags to get a customised reply. – MonicaPC Aug 06 '20 at 17:08
0

This is an alternative option that hasn't been suggested thus far: it is possible to open the .txt file in pandas (say in GCP AI Notebooks) and then upload it using the pandas function `pandas.DataFrame.to_gbq`. This is very straightforward and described in detail here. I find this the optimal solution if one is appending, because it is not necessary to describe the schema.
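
A minimal sketch of that approach, with placeholder file, dataset, project, and table names; reading a gs:// path in pandas additionally requires the gcsfs package, `to_gbq` requires pandas-gbq, and the "|" separator is an assumption.

import pandas as pd

# Reading a gs:// path needs gcsfs installed; the separator is an assumption
dataFrame = pd.read_csv("gs://your-bucket/your-file.txt", sep="|")

# Append to an existing table; pandas-gbq infers the schema from the DataFrame
dataFrame.to_gbq(
    destination_table="your_dataset.your_table",
    project_id="your-project-id",
    if_exists="append",
)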

sbecon