
I'm writing a UDF to process Google Analytics data, and getting the "UDF out of memory" error message when I try to process multiple rows. I downloaded the raw data and found the largest record and tried running my UDF query on that, with success. Some of the rows have up to 500 nested hits, and the size of the hit record (by far the largest component of each row of the raw GA data) does seem to have an effect on how many rows I can process before getting the error.

For example, the query

select 
    user.ga_user_id, 
    ga_session_id, 
        ...
from 
    temp_ga_processing(
        select 
            fullVisitorId, 
            visitNumber, 
                   ...            
        from [79689075.ga_sessions_20160201] limit 100)

returns the error, but

from [79689075.ga_sessions_20160201] where totals.hits = 500 limit 1) 

does not.

I was under the impression that any memory limitations were per-row? I've tried several techniques, such as setting row = null; before emit(return_dict); (where return_dict is the processed data) but to no avail.

The UDF itself doesn't do anything fancy; I'd paste it here but it's ~45 kB in length. It essentially does a bunch of things along the lines of:

function temp_ga_processing(row, emit) {
  topic_id = -1;
  hit_numbers = [];
  first_page_load_hits = [];
  return_dict = {};
  return_dict["user"] = {};
  return_dict["user"]["ga_user_id"] = row.fullVisitorId;
  return_dict["ga_session_id"] = row.fullVisitorId.concat("-".concat(row.visitNumber));
  for(i=0;i<row.hits.length;i++) {
    hit_dict = {};
    hit_dict["page"] = {};
    hit_dict["time"] = row.hits[i].time;
    hit_dict["type"] = row.hits[i].type;
    hit_dict["page"]["engaged_10s"] = false;
    hit_dict["page"]["engaged_30s"] = false;
    hit_dict["page"]["engaged_60s"] = false;

    add_hit = true;
    for(j=0;j<row.hits[i].customMetrics.length;j++) {
      if(row.hits[i].customDimensions[j] != null) {
        if(row.hits[i].customMetrics[j]["index"] == 3) {
          metrics = {"video_play_time": row.hits[i].customMetrics[j]["value"]};
          hit_dict["metrics"] = metrics;
          metrics = null;
          row.hits[i].customDimensions[j] = null;
        }
      }
    }

    hit_dict["topic"] = {};
    hit_dict["doctor"] = {};
    hit_dict["doctor_location"] = {};
    hit_dict["content"] = {};

    if(row.hits[i].customDimensions != null) {
      for(j=0;j<row.hits[i].customDimensions.length;j++) {
        if(row.hits[i].customDimensions[j] != null) {
          if(row.hits[i].customDimensions[j]["index"] == 1) {
            hit_dict["topic"] = {"name": row.hits[i].customDimensions[j]["value"]};
            row.hits[i].customDimensions[j] = null;
            continue;
          }
          if(row.hits[i].customDimensions[j]["index"] == 3) {
            if(row.hits[i].customDimensions[j]["value"].search("doctor") > -1) {
              return_dict["logged_in_as_doctor"] = true;
            }
          }
          // and so on...
        }
      }
    }
    if(row.hits[i]["eventInfo"]["eventCategory"] == "page load time" && row.hits[i]["eventInfo"]["eventLabel"].search("OUTLIER") == -1) {
      elre = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
      if(elre != null) {
        if(parseInt(elre[0].split(":")[1]) <= 60000) {
          first_page_load_hits.push(parseFloat(row.hits[i].hitNumber));
          if(hit_dict["page"]["page_load"] == null) {
            hit_dict["page"]["page_load"] = {};
          }
          hit_dict["page"]["page_load"]["sample"] = 1;
          page_load_time_re = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
          if(page_load_time_re != null) {
            hit_dict["page"]["page_load"]["page_load_time"] = parseFloat(page_load_time_re[0].split(':')[1])/1000;
          }
        }
        // and so on...  
      }
    }
  }
  row = null;
  emit(return_dict);
}

The job ID is realself-main:bquijob_4c30bd3d_152fbfcd7fd

Grayson
  • 1. The JavaScript processing environment has limited memory available per query. UDF queries that accumulate too much local state may fail due to memory exhaustion; 2. The amount of data that your UDF outputs when processing a single row should be approximately 5 MB or less; 3. You might want to provide your UDF code – Mikhail Berlyant Feb 20 '16 at 03:01
  • @MikhailBerlyant Yes, I'm aware of these things (I read Google's UDF page several times before I started writing this). I mentioned above that I tried it on the single largest row, which indicates I'm returning < 5 MB per row. I've edited my question above to include a snippet of my UDF (the whole thing is ~42 kB, too big to paste here). – Grayson Feb 20 '16 at 07:08
  • @Grayson : I've had to hack your code up quite a bit from the example to get it to run (I don't have access to the code in your GCS bucket) - but this example does work with my in-flight changes. I will update again here when the changes are in production. – thomaspark Mar 25 '16 at 18:24

3 Answers


Update Aug 2016: We have pushed out an update that will allow the JavaScript worker to use twice as much RAM. We will continue to monitor jobs that have failed with JS OOM to see if more increases are necessary; in the meantime, please let us know if you have further jobs failing with OOM. Thanks!

Update: this issue was related to limits we had on the size of the UDF code. It looks like V8's optimize+recompile pass of the UDF code generated a data segment that was bigger than our limits, but this was only happening when the UDF ran over a "sufficient" number of rows. I'm meeting with the V8 team this week to dig into the details further.

@Grayson - I was able to run your job over the entire 20160201 table successfully; the query takes 1-2 minutes to execute. Could you please verify that this works on your side?


We've gotten a few reports of similar issues that seem related to the number of rows processed. I'm sorry for the trouble; I'll be doing some profiling on our JavaScript runtime to find out if and where memory is being leaked. Stay tuned for the analysis.

In the meantime, if you're able to isolate any specific rows that cause the error, that would also be very helpful.

thomaspark
  • Great, thanks. It's not really any specific rows (really just combinations of arbitrary rows) that fail. Do you know roughly what the ETA will be for the analysis and profiling of the JS runtime? – Grayson Feb 26 '16 at 22:30
  • It's top of my list for investigation in March. – thomaspark Feb 29 '16 at 18:00
  • I've been out of office since March 1, just got back yesterday. Looking into this now! – thomaspark Mar 15 '16 at 18:14
  • We have the same problem with one of our UDFs; we also tested it on the largest record we could find and it worked fine. It fails when run on a lot of records (~10 million records) – user3688176 Mar 19 '16 at 23:21
  • @thomaspark The "UDF out of memory" error is really weird. We run the same query with the same UDF; sometimes it succeeds, other times it fails. We also tried running a failing UDF on a table piecemeal, on the first half of the table first, then on the second half, and it worked. We also tried deleting every variable as soon as it is no longer used, and using fewer variables, all to no avail... Btw, we are running it on a table with 3,500 rows and 6.6 MB. And the function goes through a JSON field and finds min/max values. Nothing fancy – user3688176 Mar 21 '16 at 22:44
  • I hear you. Interestingly, we're running much smaller heaps on V8 than the minimum that the V8 team usually tests with, even for really rubbish hardware. So we're exploring some odd boundaries of V8 memory management. Profiling is underway... – thomaspark Mar 23 '16 at 16:51
  • For the moment we avoided the problem by minimizing the depth of if/then blocks. Every time we reduced the depth (without changing the overall workload/logic) it started working for a while; then it failed again and we had to reduce it again, and so on. So I would investigate along those lines. For example, if (c) { if (d) { do... } } might fail whereas if (c && d) { do.. } might succeed – user3688176 Mar 23 '16 at 22:06
  • Ok, I've been digging into our heap and memory settings. I've got a potential fix in the works and am validating against customer queries. @user3688176 : do you have a job id I can check out for repro? – thomaspark Mar 24 '16 at 21:23
  • Fix deployed, OP's query works! @user3688176, digging into yours. – thomaspark Mar 28 '16 at 20:46
  • @thomaspark Sorry for the delay, worked on other things for a while. Unfortunately I'm still running into the same issue, the job ID is realself-main:job_HxotgoXlSfuZJGf9j1GlhIqa7rk – Grayson May 10 '16 at 22:29
  • I have a change coming out this month that will allow JS queries to use 10x more memory - please follow https://code.google.com/p/google-bigquery/issues/detail?id=527 to be notified when this is available. – thomaspark Jul 18 '16 at 23:10

A UDF will fail on anything but very small datasets if it has a lot of if/then levels, such as:

if () {
    if () {
        if () {
            etc.

We had to track down and remove the deepest if/then statement.

But that is not enough. In addition, when you pass the data into the UDF, run a "GROUP EACH BY" on all the variables in the inner SELECT. This will force BQ to send the output to multiple "workers"; otherwise it will also fail.
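
A minimal legacy-SQL sketch of that workaround, using hypothetical names (my_udf, col_a/col_b/col_c and [my_dataset.my_table] stand in for the real UDF, its input columns and the source table):

SELECT
  output_col
FROM
  my_udf(
    SELECT
      col_a,
      col_b,
      col_c
    FROM [my_dataset.my_table]
    -- hypothetical columns; the GROUP EACH BY forces BigQuery to shuffle
    -- the UDF's input across multiple workers
    GROUP EACH BY col_a, col_b, col_c)

Note that grouping by every selected column is effectively a DISTINCT over the UDF's input, so it only preserves the original results if the input rows are already unique.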

I've wasted 3 days of my life on this annoying bug. Argh.

user3688176
  • Hang tight, I may have a fix for you. If you can paste a job id for the failing query, I'll see it works with my updates. – thomaspark Mar 24 '16 at 21:29

I love the concept of parsing my logs in BigQuery, but I've got the same problem. I get

Error: Resources exceeded during query execution.

The Job Id is bigquery-looker:bquijob_260be029_153dd96cfdb, if that at all helps.

I wrote a very basic parser that does a simple match and returns rows. It works just fine on a 10K-row data set, but I run out of resources when trying to run it against a 3M-row logfile.

Any suggestions for a work around?

Here is the JavaScript code.

function parseLogRow(row, emit) {

  // Reassemble the log line from the (possibly split) input columns.
  var r = (row.logrow ? row.logrow : "") + (typeof row.l2 !== "undefined" ? " " + row.l2 : "") + (row.l3 ? " " + row.l3 : "");
  var ts = null;
  var category = null;
  var user = null;
  var message = null;
  var db = null;
  var found = false;
  if (r) {
      // Expected format: "YYYY-MM-DD hh:mm:ss.mmm +ZZZZ [category|user|db] :: message"
      var m = r.match(/^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (\+|\-)\d\d\d\d) \[([^|]*)\|([^|]*)\|([^\]]*)\] :: (.*)/ )
      if( m){
        ts = new Date(m[1])/1000
        category = m[3] || null
        user = m[4] || null
        db = m[5] || null
        message = m[6] || null
        found = true
      }
      else {
        message = r
        found = false
      }
   }

  emit({
    ts:  ts,
    category: category,
    user: user,
    db: db,
    message: message,
    found: found
    });
}

bigquery.defineFunction(
  'parseLogRow',                           // Name of the function exported to SQL
  ['logrow',"l2","l3"],                    // Names of input columns
  [
    {'name': 'ts', 'type': 'timestamp'},  // Output schema
    {'name': 'category', 'type': 'string'},
    {'name': 'user', 'type': 'string'},
    {'name': 'db', 'type': 'string'},
    {'name': 'message', 'type': 'string'},
    {'name': 'found', 'type': 'boolean'},
  ],
  parseLogRow                          // Reference to JavaScript UDF
);
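
For reference, a UDF registered this way would presumably be invoked from legacy SQL roughly as in the sketch below; [my_project:logs.raw_logs] is a hypothetical stand-in for the 3M-row log table.

SELECT
  ts, category, user, db, message, found
FROM
  parseLogRow(
    -- hypothetical table reference; replace with the actual log table
    SELECT logrow, l2, l3
    FROM [my_project:logs.raw_logs])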
Lloyd Tabb