1

I get a resources exceeded error when I run the following query , how can I rewrite this query to avoid that error ?

SELECT
  XXX_lnaourj_ID,
  XXX_DR_Tma_LC,
  XXX_CR_Tma_LC,
  XXX_Amount_LC,
  XXX_TLAffect_LC,
  XXX_Entry_Desc,
  XXX_DocType,
  XXX_TCode,
  XXX_Company,
  XXX_Posted_By,
  XXX_Parked_By,
  XXX_Accounting_Date,
  XXX_Creation_Date
FROM
  [acl_data.XXX_JE_All_Formatted]
WHERE
  XXX_lnaourj_ID IN (
  SELECT
    XXX_lnaourj_ID
  FROM
    [acl_data.XXX_JE_All_Formatted]
  WHERE
    XXX_Creation_Date - XXX_Accounting_Date > 30 )
ORDER BY
  XXX_lnaourj_ID ASC;
user1965449
  • 2,849
  • 6
  • 34
  • 51

2 Answers2

1

Since you are using same table inside IN semijoin that you use in the main SELECT, query can be simplified to avoid such self semi-join (and avoid doing table scan twice). Depending on how big the result is, removing ORDER BY may help too:

SELECT
  XXX_lnaourj_ID,
  XXX_DR_Tma_LC,
  XXX_CR_Tma_LC,
  XXX_Amount_LC,
  XXX_TLAffect_LC,
  XXX_Entry_Desc,
  XXX_DocType,
  XXX_TCode,
  XXX_Company,
  XXX_Posted_By,
  XXX_Parked_By,
  XXX_Accounting_Date,
  XXX_Creation_Date
FROM
  [acl_data.XXX_JE_All_Formatted]
WHERE
  XXX_Creation_Date - XXX_Accounting_Date > 30
Mosha Pasumansky
  • 13,206
  • 5
  • 32
  • 55
  • i provided same answer yesterday but than realized it returns only part of what original query in question returns - http://stackoverflow.com/a/36188832/5221944 - it still can work depends on nature of actual data, but conceptually not the same as original query – Mikhail Berlyant Mar 31 '16 at 05:04
  • But, I need the data to be ordered by that field , how can I avoid the self semi-join ? – user1965449 Mar 31 '16 at 06:35
  • Also,because the same XXX_lnaourj_ID could be present in both the cases i.e. where XXX_Creation_Date - XXX_Accounting_Date > 30 and XXX_Creation_Date - XXX_Accounting_Date < 30 , however if the difference is > 30 then I need all the rows for the XXX_lnaourj_ID . Hope I am clear. Thank you. – user1965449 Mar 31 '16 at 06:39
  • Based on this comment, I don't think your original query accomplishes this logic. If you have two rows for XXX_lnaourj_ID, one satisfies the > 30, another doesn't, the outer query will deliver both rows. Mosha's query seems to describe what you're seeking logically. Please let us know if we're missing something. If you need to sort ~150 million rows, I suggest leveraging something like Dataproc or Dataflow, as BigQuery has ORDER BY limits with cardinality in the order of magnitutude of dozens of millions of rows. – vgt Mar 31 '16 at 19:50
  • Also, it appears that in your query the innermost GROUP BY is unnecessary. You're deduping down to a unique set of XXX_lnaourj_ID values, and when you do "where XXX_lnaourj_ID in [SET]", duplication doesn't matter. – vgt Mar 31 '16 at 19:57
  • @vgt I need both the rows >30 as well as <30 if there is at least one row where the difference is >30 ,the innermost group by is not in my original query but it is in one of the answers, is there a reference document on how to use Dataflow to run queries , I am under the impression that Dataflow is used for Data transformation in the pipeline only and not to run any queries? – user1965449 Mar 31 '16 at 20:53
  • So sounds like Mosha's query should work, just change the last bit to: WHERE ABS(XXX_Creation_Date - XXX_Accounting_Date) > 30 You are trying to find rows that fit this criteria. Your original query finds IDs that fit this criteria, then return ALL rows for the found ID, regardless of criteria fit. – vgt Apr 01 '16 at 03:43
1

Try cheating Bigquery by adding "group by" in the sub-query. it sometimes help for me.

SELECT
  XXX_lnaourj_ID,
  XXX_DR_Tma_LC,
  XXX_CR_Tma_LC,
  XXX_Amount_LC,
  XXX_TLAffect_LC,
  XXX_Entry_Desc,
  XXX_DocType,
  XXX_TCode,
  XXX_Company,
  XXX_Posted_By,
  XXX_Parked_By,
  XXX_Accounting_Date,
  XXX_Creation_Date
FROM
  [acl_data.XXX_JE_All_Formatted]
WHERE
  XXX_lnaourj_ID IN (
  SELECT
    XXX_lnaourj_ID
  FROM
    [acl_data.XXX_JE_All_Formatted]
  WHERE
    XXX_Creation_Date - XXX_Accounting_Date > 30
  GROUP BY
    XXX_lnaourj_ID )
ORDER BY
  XXX_lnaourj_ID ASC

If this does not work... Bigquery is not a fan of joins... Try this:

Select *
 from  
(
Select *, 
max(XXX_Creation_Date - XXX_Accounting_Date) over(partition by  XXX_lnaourj_ID) as maxdatediff
 From [acl_data.XXX_JE_All_Formatted]
) 
where maxdatediff>30
N.N.
  • 3,094
  • 21
  • 41
  • I am surprised there is no way to do this . Looks like BigQuery after all ,has its own limitations , this would be a simple query in SQL in any DBMS. – user1965449 Mar 31 '16 at 19:05
  • See my edit for another option using window function instead of join – N.N. Mar 31 '16 at 19:11
  • if i would be able to vote twice on @N.N. two-in-answer - i would do so - but because i cannot - i do just ones. such queries usually run with no problem for small table like yours (162 million entries). if you still have problem - something else is going on - i recommend provide your jobid to google people so they can check what exactly the issue – Mikhail Berlyant Mar 31 '16 at 19:29
  • Thank you ,I have not tried this query yet , but this looks like it will only return the rows with XXX_lnaourj_ID where XXX_Creation_Date - XXX_Accounting_Date > 30, I need all the rows for XXX_lnaourj_ID if there is at least one row with XXX_Creation_Date - XXX_Accounting_Date > 30 ,there could be multiple rows with XXX_lnaourj_ID with the difference > or < 30 , I hope I am clear . Thanks again , I really appreciate that . – user1965449 Mar 31 '16 at 20:20
  • The over() part of the window function spreads the result over all the partition records. – N.N. Mar 31 '16 at 21:25
  • meaning records that have maxdatediff <30 will also be pulled ? – user1965449 Mar 31 '16 at 21:43
  • Nope! But all the records for given XXX_lnaourj_ID where at least one record has XXX_Creation_Date - XXX_Accounting_Date > 30 will be pulled – Mikhail Berlyant Mar 31 '16 at 21:46
  • it is important for understanding: records that have maxdatediff <30 will NOT be pulled :o) – Mikhail Berlyant Mar 31 '16 at 21:51
  • That's exactly what is needed , that is pull all records (maxdatediff <30 as well as maxdatediff > 30) for a given XXX_lnaourj_ID if at least one record has maxdatediff > 30 for that XXX_lnaourj_ID . Thank you so much guys N.N and Mikhail. – user1965449 Mar 31 '16 at 21:57
  • should the where clause maxdatediff>30 be inside the parentheses or outside ? – user1965449 Mar 31 '16 at 23:47
  • It should be out. I suggest that you 1st run the subquery to understand the way window functions work. – N.N. Apr 01 '16 at 04:07