4

Not sure if the question makes sense, but it's what I'm observing. My Azure Function uses a BlobTrigger to process PDF files uploaded to Blob Storage. Everything works fine until I upload several blobs at once, at which point, using the code below, I observe the following:

  • The first context.getLogger() correctly logs each blob that triggers the Function.

  • In the Azure File Share, each PDF file is correctly saved.

  • The second context.getLogger() in many cases returns incorrect results (from one of the other files), as if variables are being shared between instances of my Function. Note that lines[19] is unique for each PDF.

  • I notice similar behavior later on in my code where data from the wrong PDF is logged.

EDIT: to be clear, I understand the logs won't be in order when multiple instances run in parallel. However, rather than getting 10 unique results for lines[19] when I upload 10 files, the majority of the results are duplicates. The problem worsens later in my code, where, based on X, I want to do Y, and 9 out of 10 invocations produce garbage data.

Main.class

public class Main {
    @FunctionName("veninv")
    @StorageAccount("Storage")
    public void blob(
            @BlobTrigger(
                    name = "blob",
                    dataType = "binary",
                    path = "veninv/{name}")
                byte[] content,
            @BindingName("name") String blobname,
            final ExecutionContext context
            ) throws Exception {

        context.getLogger().info("BlobTrigger by: " + blobname + " (" + content.length + " bytes)");

        // Write the byte[] to a file in the Azure Functions file storage
        // (tempdir is defined elsewhere in the class and omitted here)
        File tempfile = new File(tempdir, blobname);
        OutputStream os = new FileOutputStream(tempfile);
        os.write(content);
        os.close();

        String[] lines = Pdf.getLines(tempfile);
        context.getLogger().info(lines[19]);
    }
}

Pdf.class

public static String[] getLines(File PDF) throws Exception {
    PDDocument doc = PDDocument.load(PDF);
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(doc);
    lines = text.split(System.getProperty("line.separator"));
    doc.close();
    return lines;
}

I don't really understand what's going on here, so hoping for some assistance.

AlexanderJ
  • Aren't those files getting processed in parallel? Your functions can serve multiple requests at same time, and you cannot really expect logs to be in order. – Abhilash Aug 28 '20 at 21:45
  • Open your files and check their content. It won't be shared and similarly the variables. – Abhilash Aug 28 '20 at 21:49
  • When I go to Monitor > Invocations and check each invocation, aren't the logs kept together there? In either case, when I upload 10 files I would expect 10 different lines being printed (regardless of what order they come in), but at times I just receive multiple duplicate lines, all from the same file. – AlexanderJ Aug 28 '20 at 21:51
  • After a lot of debugging I'm now convinced variables do leak from one instance to another. I have several cases now where logging the same variable multiple times without doing anything to it shows it changing at random (based on data from another instance). I think for my solution I'll need to use a Queue to prevent more than 1 blob from being processed at a time. – AlexanderJ Aug 29 '20 at 00:03
  • @AlexanderJ, since the root cause is now clear after several discussions over the answers below, could you please mark the most suitable answer as "Accepted" to bring this to a conclusion and also to help others who might come across this thread searching for a similar problem. – krishg Sep 04 '20 at 23:27

3 Answers

5

Yes, Azure Function invocations can share variables. I'd need to see all the code to be 100% certain, but it looks like the lines object is declared as static, and it could be shared across invocations. Try changing it from a static String[] to a local String[] and see if the problem goes away.

Because Azure Functions are so easy to get off the ground, it's easy to forget about the execution environment. Your function invocations aren't as isolated as they appear. There is a parent thread calling your function, and many static variables aren't "thread safe." A static variable represents global state, so it is globally accessible, and it is not attached to any particular object instance. The "staticness" of the variable relates to the memory space it sits in, not its value. So, the same variable is accessible from all class instances in which it is referenced.
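To make the sharing concrete, here is a minimal standalone sketch (not the OP's code; plain threads stand in for concurrent invocations, and the names are made up) showing how a static field lets one caller clobber another's result:

public class StaticLeakDemo {

    // Shared by every thread in the same JVM: the last writer wins.
    static String[] lines;

    static String[] getLines(String text) {
        lines = text.split(",");
        try {
            Thread.sleep(10); // widen the window so the race is easy to observe
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return lines; // may already hold another thread's data
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> System.out.println("A sees: " + getLines("from-A")[0]));
        Thread b = new Thread(() -> System.out.println("B sees: " + getLines("from-B")[0]));
        a.start();
        b.start();
        a.join();
        b.join();
        // With the shared static field, both threads frequently print the same value.
    }
}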

PS. You've solved the issue in your answer here by reducing concurrency, but that may come at a cost to scalability; I'd recommend load testing it. Also, static variables can be useful. Many are thread-safe, and you do want to use those in Azure Functions, such as your httpClient or sqlClient DB connections! Give number three a read, here.
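As an example of a static that is safe to share, here is a small sketch (assuming Java 11+, using the JDK's java.net.http.HttpClient, which is documented as immutable and safe to use from multiple threads; the class and method names are illustrative, not from the question):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Downloader {

    // One client per process: thread-safe, and reusing it avoids
    // exhausting sockets by creating a new client on every invocation.
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}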

Troy Witthoeft
  • So this has nothing to do with Azure Function :). Obviously 'static' field would cause this issue anywhere sharing across multiple instances of the class. And Function by no means claims 'total' isolation of each execution. Thanks for detailed reply. – krishg Aug 30 '20 at 09:41
  • That's correct. It's most likely a 'static' thing. But I think it's fair to point out that these types of issues are less intuitive to detect, because Azure Functions and "serverless" in general encourage us not to think about the servers (execution environment). I have seen a few devs (myself included) get lulled into ignoring static variables due to the "magic" of serverless. It's awesome and fun, but also easy to forget there is indeed a server (shared environment) calling your function. – Troy Witthoeft Aug 31 '20 at 12:32
  • Yeah, but that might not be a totally fair expectation. Technically "serverless" helps you build and run your applications without "THINKING" about (provisioning or managing) and paying for "servers", unlike some other PaaS offerings. It does not necessarily mean it can magically handle "application"-level quirks like this (and it doesn't claim to anywhere!). That's why you still need to think about certain aspects, e.g. https://docs.microsoft.com/en-us/azure/azure-functions/functions-best-practices . And I agree with you, it's sometimes easy to forget that there are servers behind server-less :) – krishg Aug 31 '20 at 13:29
  • Meh, I'm just trying to cut OP a little slack. This really IS a complex topic. "Don't use static" is a good rule of thumb, but it is an oversimplification to be sure. This is REALLY about thread safety: some static variables (`static String[]`) aren't thread-safe and others (`static HttpClient`) are! Who knows until peeking at their source? And I don't know about you, but that's not how I spend my Fridays. :D For some good uses of Azure Function static variables (including DB connections) see number 3 here = https://medium.com/@raunaknarooka/tips-and-tricks-for-azure-functions-c9db11aee3dc. – Troy Witthoeft Aug 31 '20 at 20:21
  • yes, static is recommended for some things like connection management (e.g. the http client)... it's already documented in the MS official best practices link I shared in my comment (point #6 https://learn.microsoft.com/en-us/azure/azure-functions/functions-best-practices#share-and-manage-connections )... but not for anything like business rule state (it advises writing stateless functions in point #3 https://learn.microsoft.com/en-us/azure/azure-functions/functions-best-practices#write-functions-to-be-stateless ) – krishg Aug 31 '20 at 23:14
  • Apologies for the late reply, but thanks a lot, this was exactly it! I'm not an experienced programmer yet and was using static methods across my classes without fully understanding how they work. I will make sure to pay more attention to these things next time :) – AlexanderJ Sep 04 '20 at 22:41
  • Glad you got it sorted! Only knew the answer because it got me too! Good luck out there. – Troy Witthoeft Sep 06 '20 at 02:11
  • lost a day debugging :-( – Valentin Petkov Feb 03 '22 at 00:57
  • @Valentin_Petkov. Yes, it is NOT fun to learn this. Since this answer was written - two years ago in 2020 - it looks like dotnet 5 and dotnet 6 let you use [isolated functions](https://learn.microsoft.com/en-us/azure/azure-functions/dotnet-isolated-process-guide#:~:text=NET%20isolated%20process%20functions%2C%20which,NET%205.0%20and%20) that mitigate this. – Troy Witthoeft Feb 03 '22 at 04:31
1

No, it's quite hard to believe that Azure Functions could have such a serious isolation issue. I see some potential problems which might be causing this in your case:

  1. Are you sure you are uploading each file to a different, unique blob every time? You can check by logging the blobname param.
  2. Since you store the file in a temp directory with File tempfile = new File(tempdir, blobname);, if the blob name is the same as mentioned in #1, the file would be overwritten (last write wins). If it's possible to construct the PDF directly from bytes or a stream, consider that instead of creating an intermediate file in the filesystem. If I am not wrong you are using PDFBox, which supports loading from a byte[]: https://pdfbox.apache.org/docs/2.0.3/javadocs/index.html?org/apache/pdfbox/pdmodel/PDDocument.html (check the load overload that accepts byte[]; see the sketch after this list). I have also answered your other question related to this.
  3. Check whether you have a static field causing this.
  4. You don't need the separate queue you are thinking of introducing. You won't need it at all once the actual issue is fixed, and the blob trigger already uses an internal queue anyway; the default concurrency is 24, but you can configure it in host.json: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-blob-trigger?tabs=java#concurrency-and-memory-usage
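For item 2, here is a rough sketch of what getLines could look like if it loads the PDF straight from the trigger's byte[] (assuming PDFBox 2.x, which the javadoc link above refers to; switching the parameter from File to byte[] is my change, not code from the question):

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Pdf {

    // Loads the PDF straight from the trigger payload: no temp file is written,
    // so nothing can collide in the function's file storage, and 'lines' is local.
    public static String[] getLines(byte[] content) throws Exception {
        try (PDDocument doc = PDDocument.load(content)) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            String text = pdfStripper.getText(doc);
            return text.split(System.getProperty("line.separator"));
        }
    }
}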

UPDATE:

Looks like in your Pdf class you declared 'lines' somewhere outside the method as static, which is the root cause of this problem. It has nothing to do with Functions, just the devil of static :)
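For contrast, the problematic version presumably looked roughly like this (reconstructed from the snippet in the question, assuming PDFBox 2.x imports; only the static field declaration is added):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Pdf {

    // Shared across every invocation running in the same JVM,
    // so concurrent executions overwrite each other's result.
    static String[] lines;

    public static String[] getLines(File PDF) throws Exception {
        PDDocument doc = PDDocument.load(PDF);
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String text = pdfStripper.getText(doc);
        lines = text.split(System.getProperty("line.separator"));
        doc.close();
        return lines;
    }
}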

Below is the corrected code (note that the 'lines' variable is now local to the method):

public static String[] getLines(File PDF) throws Exception {
    PDDocument doc = PDDocument.load(PDF);
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(doc);
    String[] lines = text.split(System.getProperty("line.separator"));
    doc.close();
    return lines;
}
krishg
  • 1. All blobs have a unique name and represent unique files. When I run my program, for x blobs I end up with x unique files in tempdir as expected. 2. Thanks, I wasn't aware; I'll try to load the byte[] directly and see if it changes anything. 3. Yeah, I came across that documentation a little bit earlier and I'll definitely look at adding that to my Function, as I would like to limit the number of concurrent SQL queries and Sharepoint uploads anyway. – AlexanderJ Aug 29 '20 at 15:44
  • Apologies for the late reply, but thanks a lot, this was exactly it! I'm not an experienced programmer yet and was using static methods across my classes without fully understanding how they work. I will make sure to pay more attention to these things next time :) – AlexanderJ Sep 04 '20 at 22:41
  • No problem. Glad that you reached out to the community. 'Static' can sometimes end up being a monster or a saviour, leaving even experienced developers clueless :). – krishg Sep 04 '20 at 22:59
0

Just wanted to share that changing host.json to the following, to stop concurrent function invocations, appears to have fixed my issue:

{
    "version": "2.0",
    "extensions": {
        "queues": {
            "batchSize": 1,
            "newBatchThreshold": 0
        }
    }
}

Massive thanks to @KrishnenduGhosh-MSFT for their help. I'm still unsure why concurrent function invocation led to the issues I was experiencing, but given that my program also connects to a SQL database and a SharePoint site (both of which are throttled), sequential processing is the best solution regardless.

AlexanderJ
  • Glad to hear that, thanks. But I would be happier to find out the root cause of your actual issue with the concurrent blob trigger. I will try to simulate your scenario from the code you shared in your question (hopefully that's all of it) and see if I can catch the devil :). – krishg Aug 29 '20 at 20:13