
The script below, written against the old BigQuery Export schema, runs fine. But I want to replicate this code according to the new export schema, since the BigQuery schema has changed. Please help, because in the new BigQuery Export schema I cannot find any corresponding record for event_dim (event_dim belongs to the old BigQuery Export schema).

Here is the link for the BigQuery Export schema: click here

SELECT
  user_dim.app_info.app_instance_id,
  (SELECT MIN(timestamp_micros) FROM UNNEST(event_dim)) AS min_time,
  (SELECT MAX(timestamp_micros) FROM UNNEST(event_dim)) AS max_time,
  event.name,
  params.value.int_value AS engagement_time
FROM
  `xxx.app_events_*`,
  UNNEST(event_dim) AS event,
  UNNEST(event.params) AS params,
  UNNEST(user_dim.user_properties) AS user_params
WHERE
  (event.name = "user_engagement" AND params.key = "engagement_time_msec")
  AND (user_params.key = "access" AND user_params.value.value.string_value = "true")
  AND PARSE_DATE('%Y%m%d', event.date) >= DATE_SUB("{{upto_date (yyyy-mm-dd)}}", INTERVAL {{last n days}} DAY)
  AND PARSE_DATE('%Y%m%d', event.date) <= "{{upto_date (yyyy-mm-dd)}}"

I tried the query below, but I want app_instance, min_time, max_time, event_name, and engagement_time in one SELECT statement. And since I am using GROUP BY, I am not able to get all of those (app_instance, min_time, max_time, event_name, engagement_time) at the same time. Please help.

SELECT
  user_pseudo_id,
  MIN(event_timestamp) AS min_time,
  MAX(event_timestamp) AS max_time
FROM
  `xxx.app_events_*` AS T,
  T.event_params,
  T.user_properties
WHERE
  (event_name = "user_engagement" AND event_params.key = "engagement_time_msec")
  AND (user_properties.key = "access" AND user_properties.value.string_value = "true")
  AND PARSE_DATE('%Y%m%d', event_date) >= DATE_SUB("{{upto_date (yyyy-mm-dd)}}", INTERVAL {{last n days}} DAY)
  AND PARSE_DATE('%Y%m%d', event_date) <= "{{upto_date (yyyy-mm-dd)}}"
GROUP BY 1

2 Answers


It is true that there was a schema change in the Google Analytics for Firebase BigQuery Export. Although there is no clear mapping of the old fields to the new ones, the SQL query provided in the documentation to migrate existing BQ datasets from the old schema to the new one gives some hints about how these fields have changed.

I share the migration_script.sql SQL query below, just for reference, but let me pinpoint the most relevant changes for your use case:

  • event_dim is aliased as event in the SQL query, but has no direct representation in the new schema, because event_dim is no longer a nested field: UNNEST(event_dim) AS event
  • event_dim.timestamp_micros is mapped as event_timestamp: event.timestamp_micros AS event_timestamp
  • event_dim.name is mapped as event_name: event.name AS event_name
  • event_param.value.int_value is mapped as event_params.value.int_value: event_param.value.int_value AS int_value
  • user_dim.user_properties is mapped as user_properties, and all its nested values follow the same structure: UNNEST(user_dim.user_properties) AS user_property) AS user_properties

So, in summary, the schema change has focused on unnesting several of the fields for simplicity, in such a way that, for example, instead of having to access event_dim.name (which required unnesting and complicated the query), you can query the field event_name directly.

With this in mind, I am sure you will be able to adapt your query to the new schema, and it will probably look much simpler, given that you will not have to unnest so many fields.


Just for clarification, let me share with you a couple of sample BQ queries comparing the old and the new schema (they are using public Firebase tables, so you should be able to run them out-of-the-box):

# Old Schema - UNNEST() required because there are nested fields
SELECT
  user_dim.app_info.app_instance_id,
  MIN(event.timestamp_micros) AS min_time,
  MAX(event.timestamp_micros) AS max_time,
  event.name
FROM
  `firebase-public-project.com_firebase_demo_ANDROID.app_events_20180503`,
  UNNEST(event_dim) AS event
WHERE
  event.name = "user_engagement"
GROUP BY
  user_dim.app_info.app_instance_id,
  event.name

As compared to:

# New Schema - UNNEST() not required because there are no nested fields
SELECT
  user_pseudo_id,
  MIN(event_timestamp) AS min_time,
  MAX(event_timestamp) AS max_time,
  event_name
FROM
  `firebase-public-project.analytics_153293282.events_20180815`
WHERE
  event_name = "user_engagement"
GROUP BY
  user_pseudo_id,
  event_name

These queries are equivalent, but reference tables with the old and new schema respectively. Please note that, as your query is more complex, you may need to add some UNNEST() calls in order to access the remaining nested fields in the table.

Additionally, you may want to have a look at these samples that can help you with some ideas on how to write queries with the new schema.


EDIT 2

My understanding is that a query like the one below should allow you to query all of the fields in a single statement. I am grouping by all the non-aggregated/filtered fields, but depending on your use case (this is definitely something you would need to work out on your own) you may want to apply a different strategy in order to query the non-grouped fields (e.g. use a MIN/MAX aggregate, etc.).

SELECT
  user_pseudo_id,
  MIN(event_timestamp) AS min_time,
  MAX(event_timestamp) AS max_time,
  event_name,
  par.value.int_value AS engagement_time
FROM
  `firebase-public-project.analytics_153293282.events_20180815`,
  UNNEST(event_params) as par
WHERE
  event_name = "user_engagement" AND par.key = "engagement_time_msec"
GROUP BY
  user_pseudo_id,
  event_name,
  par.value.int_value
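
If grouping by par.value.int_value fragments the results more than you want (each distinct engagement time becomes its own row), one alternative strategy, sketched here under the assumption that summing engagement time per user is acceptable for your use case, is to aggregate the parameter instead of grouping by it:

```sql
SELECT
  user_pseudo_id,
  MIN(event_timestamp) AS min_time,
  MAX(event_timestamp) AS max_time,
  event_name,
  -- Aggregating instead of grouping: one row per user, total engagement time
  SUM(par.value.int_value) AS total_engagement_time
FROM
  `firebase-public-project.analytics_153293282.events_20180815`,
  UNNEST(event_params) AS par
WHERE
  event_name = "user_engagement" AND par.key = "engagement_time_msec"
GROUP BY
  user_pseudo_id,
  event_name
```

Whether SUM, AVG, or MAX is the right aggregate depends entirely on what engagement_time means in your analysis, so treat this only as one possible shape of the query.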

ANNEX

migration_script.sql:

  SELECT
  @date AS event_date,
  event.timestamp_micros AS event_timestamp,
  event.previous_timestamp_micros AS event_previous_timestamp,
  event.name AS event_name,
  event.value_in_usd AS event_value_in_usd,
  user_dim.bundle_info.bundle_sequence_id AS event_bundle_sequence_id,
  user_dim.bundle_info.server_timestamp_offset_micros AS event_server_timestamp_offset,
  (
  SELECT
    ARRAY_AGG(STRUCT(event_param.key AS key,
        STRUCT(event_param.value.string_value AS string_value,
          event_param.value.int_value AS int_value,
          event_param.value.double_value AS double_value, 
          event_param.value.float_value AS float_value) AS value))
  FROM
    UNNEST(event.params) AS event_param) AS event_params,
  user_dim.first_open_timestamp_micros AS user_first_touch_timestamp,
  user_dim.user_id AS user_id,
  user_dim.app_info.app_instance_id AS user_pseudo_id,
  "" AS stream_id,
  user_dim.app_info.app_platform AS platform,
  STRUCT( user_dim.ltv_info.revenue AS revenue,
    user_dim.ltv_info.currency AS currency ) AS user_ltv,
  STRUCT( user_dim.traffic_source.user_acquired_campaign AS name,
      user_dim.traffic_source.user_acquired_medium AS medium,
      user_dim.traffic_source.user_acquired_source AS source ) AS traffic_source,
  STRUCT( user_dim.geo_info.continent AS continent,
    user_dim.geo_info.country AS country,
    user_dim.geo_info.region AS region,
    user_dim.geo_info.city AS city ) AS geo,
  STRUCT( user_dim.device_info.device_category AS category,
    user_dim.device_info.mobile_brand_name,
    user_dim.device_info.mobile_model_name,
    user_dim.device_info.mobile_marketing_name,
    user_dim.device_info.device_model AS mobile_os_hardware_model,
    @platform AS operating_system,
    user_dim.device_info.platform_version AS operating_system_version,
    user_dim.device_info.device_id AS vendor_id,
    user_dim.device_info.resettable_device_id AS advertising_id,
    user_dim.device_info.user_default_language AS language,
    user_dim.device_info.device_time_zone_offset_seconds AS time_zone_offset_seconds,
    IF(user_dim.device_info.limited_ad_tracking, "Yes", "No") AS is_limited_ad_tracking ) AS device,
  STRUCT( user_dim.app_info.app_id AS id,
    @firebase_app_id  AS firebase_app_id,
    user_dim.app_info.app_version AS version,
    user_dim.app_info.app_store AS install_source ) AS app_info,
  (
  SELECT
    ARRAY_AGG(STRUCT(user_property.key AS key,
        STRUCT(user_property.value.value.string_value AS string_value,
          user_property.value.value.int_value AS int_value,
          user_property.value.value.double_value AS double_value,
          user_property.value.value.float_value AS float_value,
          user_property.value.set_timestamp_usec AS set_timestamp_micros ) AS value))
  FROM
    UNNEST(user_dim.user_properties) AS user_property) AS user_properties
FROM
  `SCRIPT_GENERATED_TABLE_NAME`,
  UNNEST(event_dim) AS event
dsesto
  • sir, you mentioned "event_dim" is mapped as "event". But I checked both schemas, and there is no such Record called "event" in the new schema corresponding to "event_dim"! Please help – subhendu rana Sep 26 '18 at 12:00
  • I am sorry if my answer was confusing. What I meant is that "event_dim" is named as "event" in the SQL query used for the transformation, but in fact "event_dim" is no longer a *RECORD* column in the schema, so all the nested fields have become "normal" fields. I have updated the answer providing a simple example for you to understand the comparison, but I hope you understand that, for your specific query, you will have to work on re-writing it from your side, with the help of the information I provided in my answer. – dsesto Sep 26 '18 at 12:28
  • sir I tried, but in my query I just need the replacement of "event_dim". Could you please tell me how "(SELECT MAX(timestamp_micros) FROM UNNEST(event_dim))" can be represented in the new schema? Because I don't want event_timestamp as a single value, I want it as something like the above. – subhendu rana Sep 28 '18 at 10:06
  • Hi @subhendurana could you elaborate on what you mean by *can be represented in array*? I have run a query like the one you shared in your question but with a public dataset and [it does not return any array](https://imgur.com/a/qA7gwmx), so I am not sure what is the difference between the query `# Old Schema` and the `# New Schema` that I shared with you. In any case, if you want to learn more about constructing arrays and working with them in Standard SQL, you can have a look at the corresponding [documentation page](https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays). – dsesto Sep 28 '18 at 10:49
  • sorry for the confusion of calling it an array, which it actually is not. I updated my question, adding a part below where I tried writing the query according to the new schema. Now the problem is that I am not able to get all those (app_instance, min_time, max_time, event_name, engagement_time) in one SELECT statement. Please help. – subhendu rana Sep 28 '18 at 12:20
  • I believe your original issue is now resolved (how to map the values in the old Export schema vs. the new one), but you have more questions. So as to keep the questions and answers clear for you and the community, we should keep different questions in different posts. As your issues are related, let me edit my answer (***EDIT 2***) in order to see if it helps, but let me clarify: 1) you should be able to keep working on the answer based on the info already provided, 2) you should create a new question for new issues, as these interactions are already getting complicated to follow. – dsesto Oct 01 '18 at 07:32
  • to answer your first question -> sir, I already researched this before I posted the question, but I could not implement it for "event_dim" according to the new schema. Could you please help me rewrite the query I provided (written for the old schema) according to the new BigQuery schema? – subhendu rana Oct 01 '18 at 07:55
  • @subhendurana I have posted a new answer, with a simplified version of your query, which I hope is useful for you. I hope you understand that, without knowing your specific use case and meaning of the previous query, there are some things that do not make sense (e.g. the "min_time" and "max_time" fields), and therefore there is no straightforward way to transform them. Removing those fields (which do not make sense anymore in the new schema, because of the reasons explained in my new answer), I was able to translate the query. – dsesto Oct 01 '18 at 15:16
  • Last, I would like to say that this is as far as I will be able to go, so for any further question, you should probably work on them from your side taking as a basis the information I provided, or re-think the query, because maybe it does not make sense in the new schema as it is, so you probably need to adapt it. – dsesto Oct 01 '18 at 15:16

As I believe my previous answer provides some general ideas for the community, I will keep it and write a new one in order to be more specific to your use case.

First of all, I would like to clarify that in order to adapt a query (just like you are asking us to do), one needs a clear understanding of the statement, the objective of the query, the expected results, and the data to play with. As this is not the case here, it is difficult to work with it, even more so considering that some of the query's functionality is unclear. For example, in order to obtain the "min_time" and "max_time" for each event, you are taking the min and max value across multiple events, which does not make clear sense to me (it may, depending on your use case, which is why I suggested that you provide more details or work more on the query yourself).

Moreover, the new schema "flattens" events, in such a way that each event is written on a different row (you can easily check this by running a SELECT COUNT(*) FROM 'table_with_old_schema' and comparing it to SELECT COUNT(*) FROM 'table_with_new_schema'; you will see that the second one has many more rows). Therefore your query does not make sense anymore: events are no longer grouped, so you cannot pick a minimum and maximum between nested fields.

This being clarified, and having removed some fields that cannot be directly adapted to the new schema (you may be able to adapt this from your side, but this would require some additional effort and understanding of what did those fields mean to you in your previous query), here there are two queries that provide exactly the same results, when run against the same table, with different schema:

Query against a table with the old schema:

SELECT
  user_dim.app_info.app_instance_id,
  event.name,
  params.value.int_value engagement_time
FROM
  `DATASET.app_events_YYYYMMDD`,
  UNNEST(event_dim) AS event,
  UNNEST(event.params) AS params,
  UNNEST(user_dim.user_properties) AS user_params
WHERE
  (event.name = "user_engagement"
    AND params.key = "engagement_time_msec")
  AND (user_params.key = "plays_quickplay"
    AND user_params.value.value.string_value = "true")
ORDER BY 1, 2, 3

Query against the same table, with the new schema:

SELECT
  user_pseudo_id,
  event_name,
  params.value.int_value engagement_time
FROM
  `DATASET.events_YYYYMMDD`,
  UNNEST(event_params) AS params,
  UNNEST(user_properties) AS user_params
WHERE
  (event_name = "user_engagement"
    AND params.key = "engagement_time_msec")
  AND (user_params.key = "plays_quickplay"
    AND user_params.value.string_value = "true")
ORDER BY 1, 2, 3

Again, for this I am using the following table from the public dataset: firebase-public-project.com_firebase_demo_ANDROID.app_events_YYYYMMDD, so I had to change some filters and remove others in order to retrieve sensible results against that table. Feel free to modify or add the ones you need for your use case.

dsesto
  • sir thank you for the clarification. Things are better understandable now. I am working with the query and posted the whole query as you asked me to do. I asked it in this question: https://stackoverflow.com/questions/52439556/bigquery-new-schema-session-duration-distribution-table-in-firebase?noredirect=1#comment91827855_52439556. I needed to have a session-wise distribution table. – subhendu rana Oct 02 '18 at 19:56
  • So my query is kind of an amalgamation of the 2 queries below: https://stackoverflow.com/questions/48670590/session-duration-distribution-table-in-firebase and https://stackoverflow.com/questions/42546815/how-to-calculate-session-and-session-duration-in-firebase-analytics-raw-data – subhendu rana Oct 02 '18 at 20:03
  • If you have looked into the query you might have understood that I need to calculate a session-wise distribution, and I need to use event_timestamp somehow to calculate MaxTime and MinTime per app_instance_id. Could you please suggest some way? – subhendu rana Oct 03 '18 at 06:31
  • sir could you please help by suggesting any way to calculate max_time and min_time, given that you have mentioned "your query does not make sense anymore, because events are not grouped anymore, and then you cannot pick a minimum and maximum between nested fields"? – subhendu rana Oct 04 '18 at 08:41