This is a very broad, high-level question, and the answer depends on the logic that consumes the files. File
represents a file on a filesystem, so if you have a component that requires its input to be an instance of File,
then writing the data to a local temp folder is the correct thing to do. Beam doesn't provide a better abstraction for this case.
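If you do go the temp-file route, a minimal sketch could look like the following. It assumes you already have the remote object's contents as an InputStream (how you obtain it is up to you), and TempFileBridge and toTempFile are hypothetical names for illustration, not Beam APIs.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only; not part of Beam.
final class TempFileBridge {
    private TempFileBridge() {}

    /** Copies the whole stream into a new local temp file and returns it as a File. */
    static File toTempFile(InputStream in) throws IOException {
        // createTempFile already creates an empty file, hence REPLACE_EXISTING below.
        Path tmp = Files.createTempFile("beam-input-", ".tmp");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Don't leave a partial file behind if the copy fails.
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp.toFile();
    }
}
```

The caller is responsible for deleting the returned file once the File-based component is done with it, otherwise temp files will accumulate on the worker.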
However, I would recommend updating the logic that currently handles Files to accept other kinds of input as well.
You have likely hit an issue caused by tight coupling and a lack of separation of concerns. That is, you have a component that takes in a File,
opens it, handles errors while opening it, reads it, parses data from it, and maybe even validates and processes the data. All of these are separate concerns and should probably be handled by separate components that you can combine and swap out when needed, for example:
- a class that knows how to deal with a filesystem and turn a path into a byte stream;
- a similar class that knows how to fetch a file over HTTP (e.g. the GCS use case) and turn it into a byte stream;
- a component that knows how to parse the byte stream into data;
- a component that processes the parsed data;
- anything else, which can probably live wherever is convenient.
This way you can easily implement any other sources for your component, compose and test them independently.
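The split described above could be sketched roughly like this in plain Java. All the names here (ByteSource, LocalFileSource, HttpSource, LineParser, RecordProcessor) are hypothetical, and the line-based parsing and the counting "business logic" are placeholder assumptions standing in for whatever your component really does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Anything that can be opened as a byte stream (hypothetical abstraction). */
interface ByteSource {
    InputStream open() throws IOException;
}

/** Knows about the local filesystem only. */
final class LocalFileSource implements ByteSource {
    private final Path path;
    LocalFileSource(Path path) { this.path = path; }
    @Override public InputStream open() throws IOException {
        return Files.newInputStream(path);
    }
}

/** Knows about fetching over HTTP only. */
final class HttpSource implements ByteSource {
    private final URL url;
    HttpSource(URL url) { this.url = url; }
    @Override public InputStream open() throws IOException {
        return url.openStream();
    }
}

/** Turns a byte stream into records; here, trimmed non-empty UTF-8 lines. */
final class LineParser {
    private LineParser() {}
    static List<String> parse(InputStream in) throws IOException {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    records.add(trimmed);
                }
            }
        }
        return records;
    }
}

/** Placeholder business logic: it neither knows nor cares where the bytes came from. */
final class RecordProcessor {
    private RecordProcessor() {}
    static int countRecords(ByteSource source) throws IOException {
        return LineParser.parse(source.open()).size();
    }
}
```

Because ByteSource has a single method, a test can pass an in-memory source as a lambda, for example `() -> new ByteArrayInputStream(bytes)`, without touching the filesystem or the network.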
For example, you could implement your logic as two joined PCollections, one of which reads from the GCS location directly, parses the text lines, and runs the actual business logic on them before being joined with the other PCollection.