
I have done a lot of research to figure out how a data flow graph (DFG) can be created for an application from its source code. There are DFGs available online for certain applications, such as an MP3 decoder, JPEG compression, and an H.263 decoder.

I haven't been able to figure out how I can create a DFG for an application such as HEVC from its source code. Are there any tools that can automatically generate data flow graphs for such elaborate applications, or does it have to be done manually?

Please advise me regarding this matter.

EDIT: I used Doxygen for HEVC and could see how different functions interact with each other. However, every function had many entry and exit points, and the output of Doxygen became too confusing to follow after a while.

I also looked at StreamIt: http://camlunity.ru/swap/Library/Conflux/Stream%20Programming/streamit-cc_stream_graph_programming_language.pdf

It seemed handy, but the graphs it generated for even simpler applications (like an MP3 decoder) were too complex. In order to generate a coherent DFG, will I have to rewrite the entire source code?

a_sid
  • Please clarify: when you say "any application", do you mean "*any* application in *any* programming language"? Or "any application for some *specific* programming language"? – Ira Baxter Jan 05 '17 at 23:45
  • @IraBaxter Thank you very much for your reply. I mean any application in any programming language. – a_sid Jan 06 '17 at 00:01
  • @IraBaxter I am particularly curious about large and complex applications such as HEVC. – a_sid Jan 06 '17 at 00:38
  • HEVC is coded in C? C++? – Ira Baxter Jan 06 '17 at 00:45
  • @IraBaxter It is coded in C++. [https://bitbucket.org/multicoreware/x265/wiki/Coding] – a_sid Jan 06 '17 at 00:55
  • @IraBaxter There are data flow graphs available for applications such as MP3 Decoder, H.263 Decoder and so on. How were those data flow graphs developed? – a_sid Jan 06 '17 at 01:00
  • Don't know. Maybe somebody used Clang or some such, which, as a modern compiler, probably builds data flow graphs, so that would work for C++. Clang covers C and C++, but wouldn't cover "all languages" by a long shot. – Ira Baxter Jan 06 '17 at 01:02
  • @IraBaxter After looking at those graphs, I thought it would be possible to develop those graphs for any application. Is it likely that those graphs were developed by people who programmed the application? – a_sid Jan 06 '17 at 01:04
  • Not likely those graphs were developed by hand, especially not if the program has any scale to it. 10 lines of C++ code can produce data flow graphs that will break your eyeballs; see the example I reference in my answer. – Ira Baxter Jan 06 '17 at 10:16
  • @IraBaxter Thank you for your help. I would have selected your answer sooner but I wanted to study the information you posted in your answer in more detail. – a_sid Jan 07 '17 at 22:09
  • @IraBaxter The graph in the link I posted corresponds to a real application and has a finite number of nodes. Every node represents a step of the application (Huffman encoding and so on) and the edges represent the data flowing from one step to another. Does this also require a tool to build? – a_sid Jan 07 '17 at 22:15

1 Answer


You want to extract data flow graphs from arbitrary languages. You imply you want a single way to do it. This isn't practical by hand... you need a tool.

Such a tool is singularly hard to build.

To do this, for each language you must be able to:

  • Define the language to the tool, in the form you find it in practice (not just the language reference manual version). C++ in the wild is bent in a lot of funny ways compared to the standard.
  • Parse programs in the language as found in the field, perhaps as one file, perhaps as tens of thousands; some programs aren't small.
  • Build structures representing the language elements and their relations to one another (this is often done as an abstract syntax tree).
  • Determine for each literal what its actual value is; "a\xbc" has very different values depending on whether the language thinks it is ASCII or Unicode text with escape sequences.
  • Find all the identifiers in the code, and determine for each one which definitional/type information is associated with it according to the language's scoping rules.
  • Determine the sources of data (literal values, inputs from the outside world, results of expressions) and track where those data values are used in other parts of the program, across various control flow constructs (a toy sketch of this step follows this list).
  • Presumably draw some picture of the resulting data flow.
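
To make that data-tracking step concrete, here is a minimal sketch, using Python's own `ast` module as a stand-in parser, that records where each name is defined and emits a def-to-use edge whenever that name is read later. It deliberately ignores branches, loops, calls, scopes, pointers, and aliases; handling those soundly for a language like C++ in the wild is exactly where the years of engineering go. The tiny input program and all names are invented for the example.

```python
import ast

# Tiny straight-line example program whose data flow we will extract.
source = """
x = input()
y = int(x) * 2
z = y + 1
print(z)
"""

class DefUse(ast.NodeVisitor):
    """Records the last definition site of each name and emits a
    def -> use edge every time that name is read afterwards."""
    def __init__(self):
        self.defs = {}    # name -> line of its most recent definition
        self.edges = []   # (def_line, use_line, name)

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            self.defs[node.id] = node.lineno
        elif node.id in self.defs:
            self.edges.append((self.defs[node.id], node.lineno, node.id))

du = DefUse()
du.visit(ast.parse(source))
for def_line, use_line, name in du.edges:
    print(f"line {def_line} --[{name}]--> line {use_line}")
```

Even this toy prints a usable graph for straight-line code; the hard part described above is doing the same thing correctly across real control flow and real language semantics.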

Each of these tasks is hard by itself, because the languages tend to be complex. Most language tools that can do this at all (mostly compilers) do it only for one dialect of the language.

To do it for more than one language/dialect, you need a tool that can be configured with all the details of each language, and you have to configure it for all the languages of interest. [Realistically you cannot "do them all"; there are thousands of computer languages in use right now.]

Even limiting yourself to the "everyday" common programming languages, this is an enormous amount of work; it can take a few years to do all this well for a single mainstream language. You won't succeed in doing this by yourself.

My company builds a single, unified tool that is intended to be capable of doing this: the DMS Software Reengineering Toolkit. The simple "secret" is to realize that the machinery needed to accomplish the above tasks is actually very similar across languages, and can be designed to be configured for a particular language with relatively modest (that doesn't mean "small") effort.

After 20 linear years of engineering with a team of PhD-level engineers, we have parsers (even this is hard) for a surprising variety of languages, with full-up data flow analyzers of the type you are talking about for C++ (check this link for examples), C, COBOL, and almost Java 8.

I don't know of any other unified tools that are this far down the path toward your ideal. (Rascal/MPL has some ambitions, but it is a research tool at this point; they don't do C or C++ at all.) Check my bio before you decide I'm clueless about this. We're only part way there, with many languages and scale battles left to fight.

[The goal of DMS isn't data flow analysis; that's just a stepping stone. The goal is automated code transformation, which requires data flow analysis to be done safely and correctly.]

Of course, you could just hope to find a separate tool for each language. You wouldn't get consistent quality or consistent style/granularity of data flow graphs from separate tools from different authors, if you can indeed get a full set of such tools at all.

Ira Baxter
  • OK, upvote. I don't want to write an answer that competes with this one... BUT the real answer is that the data flows that the OP wants to see are just not present in the data flows of real implementations in most cases. There will usually be things in the source corresponding to the big blocks, but the data flows between them will be complicated and stateful. – Matt Timmermans Jan 06 '17 at 01:43
  • @MattTimmermans: There are the dataflows in the code, and then there are abstract dataflows across a more abstract design; I think that insight is right. To see the abstract dataflows, you have to find the concrete ones and "abstract them up" while abstracting the code elements at the same time. We are doing some of that for design recovery of factory process control programs. But the OP asked specifically for dataflows on the source code. – Ira Baxter Jan 06 '17 at 02:40
  • @IraBaxter Thank you for the detailed response. I have learned a lot from your answer. – a_sid Jan 07 '17 at 22:03
  • @IraBaxter I closely read your answer and also the examples on your website. However, I am curious to know how the graph in this paper was made: (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.1932&rep=rep1&type=pdf) – a_sid Jan 07 '17 at 22:06
  • @IraBaxter The graph is on page 6 and corresponds to JPEG. Does one need to use a tool like yours to build a graph like the one shown in the paper? – a_sid Jan 07 '17 at 22:08
  • "The JPEG algorithm was *specified* as a *task graph*". They drew that graph themselves. It isn't a data flow graph; it specifiies "tasks" (big computational chunks that pass data implicitly) and the order in which the tasks have to run. (arc from A to B means A runs to completion before B starts). Presumably they implement the algorithm as big chunks of code that are sequenced according to that graph. What their paper does (extremely fast skim) is allocate the tasks to various processors to minimize execution time. You don't need my kind of tool to hand-draw this kind of graph. – Ira Baxter Jan 07 '17 at 22:20
  • ...mostly because they are structuring the code, by hand, to match the graph, rather than extracting real dataflow [much finer grain] from the code. ...but what do you intend to do with this kind of graph? (See my PARLANSE programming language, in which you can specify task graphs directly and have them executed on a symmetric multiprocessor: http://www.semanticdesigns.com/Products/Parlanse/ParlanseParallelism.html) – Ira Baxter Jan 07 '17 at 22:22
  • @IraBaxter Thank you for the swift reply. I intend to use this graph for allocating tasks to different processors (just like the paper in the link). However, I don't see these for many applications (just a few like JPEG and MP3 Decoder, for example). – a_sid Jan 07 '17 at 22:50
  • @IraBaxter I want to be able to create one for an application like HEVC, and we normally have manuals and source code for these applications. So, will I have to restructure the code of HEVC to build a similar graph? – a_sid Jan 07 '17 at 22:53
  • Roughly. The notion of that paper (and many "task-graph scheduler" papers like it) is that a very smart engineer figures out the ideal task graph, and then forces the application structure to match it; then the scheduler guys can claim various kinds of Wonderful Scheduling to minimize execution time. Where this falls apart is that you have working code (e.g., HEVC) that you don't want to break; now you have to invent such an ideally schedulable task graph and bend the HEVC code to match it. The first thing you figure out is that HEVC is a complicated beast... – Ira Baxter Jan 07 '17 at 23:11
  • ...and just knowing roughly how it passes data and sequences activities is going to be difficult; for that, you *do* want the automated extraction. Then you can decide what parts of the *real* graph are useful, what parts are awkward, and how you want to hand-modify the code to make a workable/schedulable task graph. This is way harder than it sounds; people pretty much can't do it for complex applications. Now the idea is to take the real extracted graph, and schedule *that*; but it isn't 15 task nodes, it is thousands at much finer grain. No easy answers. – Ira Baxter Jan 07 '17 at 23:13
  • The purpose of our PARLANSE language is to let programmers write such fine-grain task graphs directly, and be able to compile and run them. Always. PARLANSE succeeds. Sometimes it finds a lot of parallelism as provided. Sometimes there isn't much, and the programmer gets to rearrange the code. But having a reliable way to code and execute these graphs makes it far, far easier to experiment and optimize. – Ira Baxter Jan 07 '17 at 23:15
  • @IraBaxter _a very smart engineer figures out the ideal task-graph, and then forces the application structure to match it_. How might this ideal task-graph be created in the first place? I am particularly confused about how the data dependencies are being obtained for each of the nodes. – a_sid Jan 08 '17 at 03:15
  • @IraBaxter So in order to effectively use PARLANSE, users have to write the code for their applications in the style of task graphs and then compile and run that code? – a_sid Jan 08 '17 at 03:19
  • Each task computes the values it computes. The task graph indicates which tasks *complete* before other tasks run. If task A computes value x, and task A runs before task B, then task B can read value x computed by task A. There is your data dependence. You can turn this around the other way: if some computation Q needs information x from some computation P, then you can construct a task graph with task A containing computation P and task B containing computation Q (a toy sketch at the end of this thread illustrates the idea). If you determine the smallest possible operations (e.g., "add", "compare") and construct a fine-grain data flow graph ... – Ira Baxter Jan 08 '17 at 06:56
  • ... of the type that you asked for as the question behind this discussion, then you can see a vast set of tiny operations connected by dataflows. You get a task graph by grouping various operations into various tasks; if you have 10 task nodes, then somehow you have to place all the operations in the task nodes in such a way that each task only uses data computed by a previous task. Pushing vast numbers of tiny operations into small numbers of tasks pretty much doesn't work; too many data dependences, and you end up with one task. Doing such partitions is *really* hard with "big grain" tasks ... – Ira Baxter Jan 08 '17 at 06:59
  • ... which is why PARLANSE offers *small* grain tasks. This actually gives you a chance to group things in a way that actually gets you parallelism. (PARLANSE is quite real as a programming language; we build working software applications in it of several million lines.) – Ira Baxter Jan 08 '17 at 07:00
  • @IraBaxter I now understand how data dependencies for different nodes are calculated. I want to confirm this about PARLANSE: If I want to program an application using PARLANSE, will **I have to instruct PARLANSE** what to parallelize and what I want to be executed sequentially? – a_sid Jan 08 '17 at 20:14
  • @IraBaxter Do you think there are any tools which **can detect** parallelism in an application whose developers did not program with parallelism in mind? – a_sid Jan 08 '17 at 20:20
  • The PARLANSE language design assumes you are going to tell it what the task parallelism is. The compiler is allowed to "discover" more and take advantage of it; the programmer may have insisted that two tasks run in serial order, but if the compiler can detect no data dependencies between them, it is allowed to rearrange the task order. The current PARLANSE compiler doesn't actually do this. Many supercomputing compilers try to detect parallelism with some degree of success. The real problem with complex codes is that this is very hard to do correctly, and you can't parallelize safely if you get it wrong. – Ira Baxter Jan 08 '17 at 21:31
  • What people have figured out over the years is that compiler discovery of parallelism is hard; it is better to write your code in such a way that it is easy to parallelize, and so that you can state where the parallelism is. Applications that were not designed to be parallel are hard for compilers and people to parallelize. Often parallelizing an application requires significant rewriting to succeed. Data flow analysis tells you what parallelism is *already* there due to the (absence of) data dependencies, which is why it is a valuable first step in going parallel. – Ira Baxter Jan 08 '17 at 21:33
  • @IraBaxter I see. This is very impressive. I now know that you have to know the application thoroughly in order to program it using tools like PARLANSE. If I want to build a task graph for HEVC, I'll have to use a data extraction tool like yours to get the data flow graph, study the interactions among different functions and then modify the code to make a task graph out of it. – a_sid Jan 08 '17 at 21:49
  • @IraBaxter You have pretty much cleared every question I had regarding building task graphs. Thank you very much again. – a_sid Jan 08 '17 at 21:51
  • @MattTimmermans If data flows between _blocks are complicated and stateful_, is it not feasible to represent applications with acyclic graphs? – a_sid Apr 01 '17 at 14:33
  • Iterations in your application will necessarily cause cycles in your task graph: A -> B -> ... Z -> A. Now you have a new problem: if B is waiting for data from A, is it waiting for data from the first iteration or the second? Suddenly tracking the actual values becomes an issue. If your task graph is acyclic, only "one set" of data flows from A to B and you can't have a problem tracking "which one". This is not an easy problem to solve with a static task graph; adding tracking data to computed values just slows down the computation. ... – Ira Baxter Apr 01 '17 at 18:07
  • ... sometimes people do this. Sometimes people extend the task graph at runtime by gluing another copy of it in at the place where there was a cycle. Now you have to keep track of the task graph at runtime. No easy answers. If you want to understand this in great detail, you should go read a paper or book on "dataflow computing". – Ira Baxter Apr 01 '17 at 18:09
  • @IraBaxter Thank you for the feedback. If one is developing an application (it could be a computer vision or dsp application), how do they take these matters into account? Is this where **data-flow abstraction** comes into play? – a_sid Apr 03 '17 at 18:33
  • People largely do this by reasoning about the data flows manually. Requests for tools to help do this occur a lot in the parallel computation world. Mostly nobody answers these requests; it takes too much machinery to build by hand. Our approach of solving such problems with shared infrastructure is trying to change the economics of building such tools. (Data flow abstraction, AFAIK, is simply bundling many dataflows from one program region to another into a single one; this in essence produces a control flow ordering constraint on a computation.) – Ira Baxter Apr 03 '17 at 19:56
  • @IraBaxter _Requests for tools to help do this occur a lot in the parallel computation world._ Do you mean that the developers create the application and then they expect to apply these tools to obtain the data flows? – a_sid Apr 05 '17 at 00:31
  • The Fortran world is full of guys with legacy codes that they want to performance tune. It isn't easy because they don't understand the dataflow well. (After they understand it, they often don't know how to restructure the program to make parallelism possible). Mostly they don't know to look for such tools, because you pretty much can't find them. – Ira Baxter Apr 05 '17 at 04:09
  • @IraBaxter You also wrote in one of your earlier comments that _To see the abstract dataflows, you have to find the concrete ones and "abstract them up" while abstracting the code elements at the same time. We are doing some of that for design recovery of factory process control programs._ – a_sid Apr 05 '17 at 05:48
  • @IraBaxter Is there any place where I can read more about this project? – a_sid Apr 05 '17 at 05:52
  • Unfortunately, no. This is commercial work we have done for Dow Chemical that has a very nice advance in reverse engineering in it. We have yet to write a good research report on this topic. – Ira Baxter Apr 05 '17 at 06:15
  • @IraBaxter I see. Do you recommend any book or paper to learn more about dataflow computing? – a_sid Apr 05 '17 at 11:35
  • Start with this introduction to classic dataflow computing: https://www.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?media=onur-740-fall13-module5.2.1-dataflow-part1.pdf The ideas are good, but you can't build real programs that way; the granularity of the operators is too small and the synchronization overhead is too large, so it overwhelms the actual computing work. The secret is to aggregate/abstract up: take *groups* of the tiny data flow operators and combine them into a sequential computation; now the overhead is paid on a group basis rather than per operator... – Ira Baxter Apr 05 '17 at 14:19
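
As an illustrative aside on the task-graph discussion in this thread: below is a toy sketch (in Python, not PARLANSE) of the idea that an edge in a task graph is exactly a data dependence. Each task is a plain function that declares which earlier tasks' results it reads, and the runner submits any task whose inputs are ready, in parallel where the graph allows. The task names and the miniature pipeline are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# task name -> (function, names of the tasks whose results it consumes)
graph = {
    "read":    (lambda: [3, 1, 4, 1, 5], []),
    "evens":   (lambda xs: [x for x in xs if x % 2 == 0], ["read"]),
    "odds":    (lambda xs: [x for x in xs if x % 2 == 1], ["read"]),
    "combine": (lambda e, o: (sum(e), sum(o)), ["evens", "odds"]),
}

def run(graph):
    """Execute an acyclic task graph, running independent tasks in parallel."""
    done = {}                 # task name -> computed result
    pending = dict(graph)
    with ThreadPoolExecutor() as pool:
        while pending:
            # A task is ready when every task it reads from has finished.
            ready = {name: task for name, task in pending.items()
                     if all(dep in done for dep in task[1])}
            futures = {name: pool.submit(fn, *(done[dep] for dep in deps))
                       for name, (fn, deps) in ready.items()}
            for name, fut in futures.items():
                done[name] = fut.result()
                del pending[name]
    return done

print(run(graph)["combine"])  # prints (4, 10)
```

Note how "evens" and "odds" are submitted together once "read" finishes: the absence of a data dependence between them is precisely the parallelism discussed above. A cycle in the graph would stall this runner, which is the iteration problem the last few comments describe.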