
I am attempting a source-to-source transformation of ARM assembly (specifically, ARMv8-A), and I need a formal grammar for it. Ideally this would be an ANTLR grammar for ARMv8-A, but a grammar for any version of ARM in any format would help.

Strangely, I haven't been able to find one. Is there really no formal grammar for any version of ARM?

Uclydde
  • Not sure what you mean by 'formal grammar'. AArch64 compilers (e.g. GCC, Clang) have different syntaxes, though very similar in most cases – user3124812 Jan 21 '22 at 04:23
  • BTW, what does "do a source code transformation" mean? – user3124812 Jan 21 '22 at 04:23
  • @user3124812 by formal grammar, I mean: https://en.wikipedia.org/wiki/Formal_grammar By source code transformation, I mean: https://en.wikipedia.org/wiki/Source-to-source_compiler – Uclydde Jan 21 '22 at 04:33

1 Answer


TL;DR: There is no official formal grammar. The formal specification is the binary encoding, which every CPU manufacturer documents. The specific assembler syntax/grammar is left to tool creators to invent.


There are accepted forms of the encoding, for instance register-direct operands and the various operand checks that need to be made. Some instructions have multiple encodings (several binary values that compute similar results), and forensics can use this to see which tools were used. Different assemblers (ARM's own assembler versus GNU as for ARM) often support slightly different notations for operands. To work around this, people often use the C preprocessor to translate generic assembly to a target assembler using conditional substitution.

There are not usually formal grammars for assemblers because the languages are so simple. Assembly is like a goto program: there are no loop constructs, so everything is very linear. An assembler might have a 'lex' (or syntax) type file of accepted mnemonics and pseudo-instructions, but instructions can be put in any order without the assembler complaining (although the resulting program may crash due to garbage register values).
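To illustrate how little structure a line of assembly has, here is a minimal sketch of a line splitter in Python. It is not taken from any real assembler: the function names `parse_line` and `split_operands`, the `//` comment marker, and the `label: mnemonic operand, operand` line shape are all assumptions chosen for illustration, and real toolchains differ in exactly these details.

```python
import re

def split_operands(text):
    """Split on commas that are not inside [...] or {...}."""
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch in "[{":
            depth += 1
        elif ch in "]}":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current.strip())
            current = ""
        else:
            current += ch
    if current.strip():
        parts.append(current.strip())
    return parts

def parse_line(line):
    """Break one source line into label, mnemonic and operand list."""
    # Assumption: "//" starts a comment (other assemblers use "@" or ";").
    code = line.split("//", 1)[0].strip()
    label = None
    m = re.match(r"([A-Za-z_.$][\w.$]*):\s*(.*)$", code)
    if m:
        label, code = m.group(1), m.group(2)
    if not code:
        return {"label": label, "mnemonic": None, "operands": []}
    fields = code.split(None, 1)
    operands = split_operands(fields[1]) if len(fields) > 1 else []
    return {"label": label, "mnemonic": fields[0].lower(), "operands": operands}

print(parse_line("loop: add x0, x1, x2 // running sum"))
print(parse_line("ldr x0, [x1, #8]"))
```

Everything beyond this (which mnemonics exist, which operand forms each one accepts) is a table lookup rather than grammar, which is why a source-to-source tool usually has to target one specific assembler's dialect.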

The general documentation is just a mnemonic with a binary encoding. The encoding is simple because the hardware (CPU) just examines certain bits to determine the form. For instance, ALU instructions:

  • ADD - add and don't set condition codes
  • SUB - subtract and don't set condition codes
  • ADDS - add and set condition codes
  • SUBS - subtract and set condition codes.

They have two source registers (R0-R15 are candidates) and one destination (R0-R15), so the registers typically take 12 bits of the 32-bit instruction word. The ARM has a 'conditional execution' portion that uses four bits. (These figures are for the classic 32-bit ARM encoding; AArch64 uses 5-bit register fields and has no per-instruction condition field.) The assembler just needs to select the leading portion (instruction type) and stuff the operands into the remaining bits. The same approach applies to all architectures. The difficulty comes with labels, where you need to compute offsets from one instruction to another. This is the main job of an assembler. Otherwise, it is a one-to-one mapping/translation and there is only a very limited grammar.
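The bit-stuffing described above can be sketched in a few lines of Python. This is only an illustration for the classic 32-bit ARM (A32) encoding with a plain register operand and no shift; it does not cover AArch64, and the function names `encode_alu` and `encode_branch` are hypothetical, not part of any real tool.

```python
COND_AL = 0xE  # "always" condition code

# A32 data-processing opcodes for the two operations listed above
OPCODES = {"ADD": 0b0100, "SUB": 0b0010}

def encode_alu(mnemonic, rd, rn, rm, cond=COND_AL):
    """Encode <op>{S} Rd, Rn, Rm (register operand, no shift) as an A32 word."""
    name = mnemonic.upper()
    # A trailing S on a known mnemonic means "set condition codes"
    set_flags = name.endswith("S") and name[:-1] in OPCODES
    base = name[:-1] if set_flags else name
    if base not in OPCODES:
        raise ValueError(f"unsupported mnemonic: {mnemonic}")
    for reg in (rd, rn, rm):
        if not 0 <= reg <= 15:
            raise ValueError(f"register out of range: {reg}")
    return ((cond << 28) | (OPCODES[base] << 21) | (int(set_flags) << 20)
            | (rn << 16) | (rd << 12) | rm)

def encode_branch(branch_addr, target_addr, cond=COND_AL):
    """Encode B <label>; in A32 the PC reads as the branch address plus 8."""
    offset = target_addr - (branch_addr + 8)
    if offset % 4 or not -(1 << 25) <= offset < (1 << 25):
        raise ValueError("branch target misaligned or out of range")
    imm24 = (offset >> 2) & 0xFFFFFF  # word offset, two's complement in 24 bits
    return (cond << 28) | (0b1010 << 24) | imm24

print(hex(encode_alu("ADD", 0, 1, 2)))   # ADD  R0, R1, R2
print(hex(encode_alu("SUBS", 0, 1, 2)))  # SUBS R0, R1, R2
print(hex(encode_branch(0x0, 0x0)))      # branch to itself
```

The ALU case is a pure table lookup plus shifts, while the branch case needs the addresses of both the instruction and its target, which is the label bookkeeping mentioned above.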

Related: Pre-processor as an assembler

artless noise
  • There are ANTLR 4 grammars for some other assembly languages, including Intel 8086: https://github.com/antlr/grammars-v4/blob/master/asm/asm8086/asm8086.g4 Is there something different about 8086 that makes a grammar more useful? – Uclydde Jan 21 '22 at 19:12
  • Right, that is something that someone produced. It is not part of an x86 manual. So you can create one. However, you can see that it is mostly useless, and it covers one particular form of x86 assembler, of which there are many! This relates to the pseudo-ops and arguments. An example from the same site is [MASM](https://github.com/antlr/grammars-v4/blob/master/asm/masm/MASM.g4), which is also x86... It is the particular point of the question I asked (Related). I understand your pain; the answer tries to describe the situation. – artless noise Jan 21 '22 at 19:15
  • Contrast this with C and other languages, where there is usually a formal BNF specification that is part of the **STANDARD**. This would be in an ARM or Intel document if there were a standard assembler language. But each tool chain can have different constructs. Usually on the ARM, they have sub-sets which are compatible. In the x86 world the formats can be completely incompatible. – artless noise Jan 21 '22 at 19:27
  • If this answer is not correct, then the OP is looking for a list of ANTLR grammars for a specific assembler implementation, and the question is not relevant to Stack Overflow. – artless noise Jan 25 '22 at 16:17