A procedure for parsing an x86 machine code instruction

Question

I have been thinking about disassemblers and how to go from machine code back into assembly (or machine code back into some intermediate form that can be evaluated like in a VM). That led me to xed which is a nice project, but it is extremely complex and difficult to understand. I found the rough piece of code which I was looking for which essentially comes down to:

decode(decoder) {
  prefix_scanner(decoder)
  opcode_scanner(decoder)
  modrm_scanner(decoder)
  sib_scanner(decoder)
  disp_scanner(decoder)
  imm_scanner(decoder)
}

That kind of hints at a basic process for parsing instruction input bits into a struct or object of some sort.

This one might even be better, but it doesn't have much activity, though it says it's thorough and complete. They have this function (kind of uniquely implemented code, where everything is global variables... seems like a demo more than a module):

function DecodeInstruction()
{
  //Reset Prefix adjustments, and vector setting adjustments.

  Reset();

  var out = ""; //The instruction code that will be returned back from this function.

  //Record the starting position.

  InstructionPos = GetPosition();

  //First read any opcodes (prefix) that act as adjustments to the main three operand decode functions ^DecodeRegValue()^,
  //^Decode_ModRM_SIB_Address()^, and ^DecodeImmediate()^.

  DecodePrefixAdjustments();

  //Only continue if an invalid opcode is not read by DecodePrefixAdjustments() for cpu bit mode setting.

  if( !InvalidOp )
  {
    //Decode the instruction.

    DecodeOpcode();

    //-------------------------------------------------------------------------------------------------------------------------
    //Intel Larrabee CCCCC condition codes.
    //-------------------------------------------------------------------------------------------------------------------------

    if( Opcode >= 0x700 && Instruction.slice(-1) === "," )
    {
      Instruction = Instruction.split(",");

      //CMP conditions.

      if( Opcode >= 0x720 && Opcode <= 0x72F )
      {
        IMMValue = VectorRegister >> 2;

        if( Float || ( IMMValue !== 3 && IMMValue !== 7 ) )
        {
          Instruction = Instruction[0] + ConditionCodes[IMMValue] + Instruction[1];
        }
        else { Instruction = Instruction[0] + Instruction[1]; }

        IMMValue = 0; VectorRegister &= 0x03;
      }

      //Else High/Low.

      else
      {
        Instruction = Instruction[0] + ( ( ( VectorRegister & 1 ) === 1 ) ? "H" : "L" ) + Instruction[1];
      }
    }

    //Setup the X86 Decoder for which operands the instruction uses.

    DecodeOperandString();

    //Now only some instructions can vector extend, and that is only if the instruction is an SIMD Vector format instruction.

    if( !Vect && Extension > 0 && Opcode <= 0x400 ) { InvalidOp = true; }

    //The Width Bit setting must match the vector numbers size otherwise this create an invalid operation code in MVEX/EVEX unless the Width bit is ignored.

    if( Vect && !IgnoresWidthbit && Extension >= 2 )
    {
      InvalidOp = ( ( SIMD & 1 ) !== ( WidthBit & 1 ) ); //Note use, and ignore width bit pastern EVEX.
    }
    if( Opcode >= 0x700 ) { WidthBit ^= IgnoresWidthbit; } //L1OM Width bit invert.
  }

  //If the instruction is invalid then set the instruction to "???"

  if( InvalidOp )
  {
    out = "???" //set the returned instruction to invalid
  }

  //Else finish decoding the valid instruction.

  else
  {
    //Decode each operand along the Decoder array in order, and deactivate them.

    DecodeOperands();

    /*-------------------------------------------------------------------------------------------------------------------------
    3DNow Instruction name is encoded by the next byte after the ModR/M, and Reg operands.
    -------------------------------------------------------------------------------------------------------------------------*/

    if( Opcode === 0x10F )
    {
      //Lookup operation code.

      Instruction = M3DNow[ BinCode[CodePos] ]; NextByte();

      //If Invalid instruction.

      if( Instruction === "" || Instruction == null )
      {
        Instruction = "???"; InsOperands = "";
      }
    }

    /*-------------------------------------------------------------------------------------------------------------------------
    Synthetic virtual machine operation codes.
    -------------------------------------------------------------------------------------------------------------------------*/

    else if( Instruction === "SSS" )
    {
      //The Next two bytes after the static opcode is the select synthetic virtual machine operation code.

      var Code1 = BinCode[CodePos]; NextByte();
      var Code2 = BinCode[CodePos]; NextByte();

      //No operations exist past 4 in value for both bytes that combine to the operation code.

      if( Code1 >= 5 || Code2 >= 5 ) { Instruction = "???"; }

      //Else calculate the operation code in the 5x5 map.

      else
      {
        Instruction = MSynthetic[ ( Code1 * 5 ) + Code2 ];

        //If Invalid instruction.

        if( Instruction === "" || Instruction == null )
        {
          Instruction = "???";
        }
      }
    }

    //32/16 bit instructions 9A, and EA use Segment, and offset with Immediate format.

    if( Opcode === 0x9A || Opcode === 0xEA )
    {
      var t = InsOperands.split(",");
      InsOperands = t[1] + ":" +t[0];
    }

    //**Depending on the operation different prefixes replace others for  HLE, or MPX, and branch prediction.
    //if REP prefix, and LOCK prefix are used together, and the current decoded operation allows HLE XRELEASE.

    if(PrefixG1 === Mnemonics[0xF3] && PrefixG2 === Mnemonics[0xF0] && XRelease)
    {
      PrefixG1 = "XRELEASE"; //Then change REP to XRELEASE.
    }

    //if REPNE prefix, and LOCK prefix are used together, and the current decoded operation allows HLE XACQUIRE.

    if(PrefixG1 === Mnemonics[0xF2] && PrefixG2 === Mnemonics[0xF0] && XAcquire)
    {
      PrefixG1 = "XACQUIRE"; //Then change REP to XACQUIRE
    }

    //Depending on the order that the Repeat prefix, and Lock prefix is used flip Prefix G1, and G2 if HLEFlipG1G2 it is true.

    if((PrefixG1 === "XRELEASE" || PrefixG1 === "XACQUIRE") && HLEFlipG1G2)
    {
      t = PrefixG1; PrefixG1 = PrefixG2; PrefixG2 = t;
    }

    //if HT is active then it is a jump instruction check and adjust for the HT,and HNT prefix.

    if(HT)
    {
      if (SegOverride === Mnemonics[0x2E])
      {
        PrefixG1 = "HNT";
      }
      else if (SegOverride === Mnemonics[0x3E])
      {
        PrefixG1 = "HT";
      }
    }

    //else if Prefix is REPNE switch it to BND if operation is a MPX instruction.

    if(PrefixG1 === Mnemonics[0xF2] && BND)
    {
      PrefixG1 = "BND";
    }

    //Before the Instruction is put together check the length of the instruction if it is longer than 15 bytes the instruction is undefined.

    if ( InstructionHex.length > 30 )
    {
      //Calculate how many bytes over.

      var Dif32 = ( ( InstructionHex.length - 30 ) >> 1 );

      //Limit the instruction hex output to 15 bytes.

      InstructionHex = InstructionHex.substring( 0, 30 );

      //Calculate the Difference between the Disassembler current position.

      Dif32 = Pos32 - Dif32;

      //Convert Dif to unsignified numbers.

      if( Dif32 < 0 ) { Dif32 += 0x100000000; }

      //Convert to strings.

      for (var S32 = Dif32.toString(16) ; S32.length < 8; S32 = "0" + S32);
      for (var S64 = Pos64.toString(16) ; S64.length < 8; S64 = "0" + S64);

      //Go to the Calculated address right after the Instruction UD.

      GotoPosition( S64 + S32 );

      //Set prefixes, and operands to empty strings, and set Instruction to UD.

      PrefixG1 = "";PrefixG2 = ""; Instruction = "???"; InsOperands = "";
    }

    //Put the Instruction sequence together.

    out = PrefixG1 + " " + PrefixG2 + " " + Instruction + " " + InsOperands;

    //Remove any trailing spaces because of unused prefixes.

    out = out.replace(/^[ ]+|[ ]+$/g,'');

    //Add error suppression if used.

    if( Opcode >= 0x700 || RoundMode !== 0 )
    {
      out += RoundModes[ RoundMode ];
    }

    //Return the instruction.
  }

  return( out );
}

My first tangential question is, how do they know how to do this? I didn't see any algorithms for implementing a disassembler in the Intel manuals. And the Intel manuals didn't seem to have any straightforward data ready to be used in code that implements their various tables and such. So it seems to me you would have to do a vast amount of work to summarize the Intel manuals into its essence, discovering every aspect of it. And then you can start to think about implementing a disassembler. Is there an easier way? I would like to see about implementing a disassembler, but would first like to find a good source of computer readable data on the instructions and operands and such, that is nice and clean, to help in the process. Hence my interest in xed and its datafiles. But I don't think xed's going to work, it seems too much of a mess.

The main question though is, what is a pseudocode procedure (or even better, a JavaScript procedure) that implements instruction disassembly (conversion from machine code to anything else, some data structure, or even just emitting tokens or something)? Is the above JavaScript one good enough, or is there a more straightforward implementation? Knowing what it would be like in pseudocode might make things easier to understand.

I am mainly wondering for two reasons. First, I want to implement a disassembler and a generator. Second, I don't see how you can parse through bytes (and bits of bytes) both efficiently and in a way that doesn't require you to scan ahead and back all over the place to figure out the boundaries of a 1-15 byte instruction. I don't see in my head yet how you can easily tell where an instruction starts and end, so having a pseudocode demo show the procedure would make it visible, and make it easier to think about how to write a real parser.

The machine language of Intel processors is huge and complex. The manuals give all information for generating the opcodes, but you need to dive... — , Jan 30 '21 at 11:11
VEX and EVEX prefixes make disassembly more complex, especially in 32-bit mode, but other than that the basic machine-code format is fairly regular: `[prefixes] opcode [modrm [sib] [disp]] [immediate]`, with opcodes being 1 byte, or 0F ?? 2-byte. You just need a table of which opcodes (including mandatory prefixes...) imply a ModRM and/or an immediate of some width to be able to at least find instruction lengths, and only a bit more info (about which order the operands are in, and any special cases) for other instructions. Like `vpblendvb` using an imm8 to encode another register. — Peter Cordes, Jan 30 '21 at 18:35
It’s easy to tell where an instruction begins—it’s immediately after the previous instruction or the target of a jump or call. It’s not so easy to figure out where it ends; in fact it’s pretty complicated. But it doesn’t “require you to scan ahead and back all over the place”, you just look at each byte in turn. — prl, Jan 31 '21 at 00:25
[_Enumerating x86-64 – It’s Not as Easy as Counting_](https://www.unomaha.edu/college-of-information-science-and-technology/research-labs/_files/enumerating-x86-64-instructions.pdf) — Lance, Jan 31 '21 at 07:01
I do stuff like this by ISET database instead of direct op code decoding (however for much simpler platform) see [What's the proper implementation for hardware emulation?](https://stackoverflow.com/a/18911590/2521214) at the bottom of my answer there you see a sample of the database holding all ISET data (and link to the whole thing). This is used for both disassembling and compiling (and of coarse for emulation too) so you might to do similar stuff. — Spektre, Jan 31 '21 at 08:56

A procedure for parsing an x86 machine code instruction

0 Answers0