
I want to create a parser in Python. I have finished the lexical section and now I want to build a simple parser for the C language, but I don't know how to write grammar rules for `if` or `for`, how to pass the tokens detected by the lexer to the parser in PLY, or how to test the parser with a simple string.

And this is my Lexer:

import ply.lex as lex
import ply.yacc as yacc
#
# List of token names.   This is always required
tokens = [
   'NUMBER',
   'PLUS',
   'MINUS',
   'MULT',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
   'ID',
   'COMMA',
   'SEMICOLON',
   'LEFTBRACE',
   'RIGHTBRACE',
   'ASSIGN',
   'EQUAL',
]

reserved={
    'while' : 'WHILE',
    'else' : 'ELSE',
    'if' : 'IF',
    'for' : 'FOR',
    'switch':'SWITCH',
    'case':'CASE',
    'do' : 'DO',
    'break': 'BREAK',
    'return' : 'RETURN',
    'int' : 'INT',
    'float' : 'FLOAT',
    'double' : 'DOUBLE',
    'continue' : 'CONTINUE',
    'struct' : 'STRUCT',
    'union' : 'UNION',
    'char' : 'CHAR',
    'printf':'PRINTF',
    'scanf' : 'SCANF',
}
tokens += reserved.values()
# Regular expression rules for simple tokens
# (Keywords such as 'if' and 'for' are recognized inside t_ID via the
# reserved table above, so they need no string rules of their own.)
t_ASSIGN = r'='
t_EQUAL = r'=='
t_LEFTBRACE = r'{'
t_RIGHTBRACE = r'}'
t_PLUS = r'\+'
t_MINUS   = r'-'
t_MULT   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'
t_COMMA = r','
t_SEMICOLON = r';'
# A regular expression rule with some action code
#
def t_ID(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    if t.value in reserved:
        t.type = reserved[ t.value ]
    return t
#
def t_NUMBER(t):
  r'\d+'
  try:
    t.value = int(t.value)
  except ValueError:
    print("Line %d: Number %s is too large!" % (t.lineno, t.value))
    t.value = 0
  return t
#
# Define a rule so we can track line numbers
def t_newline(t):
  r'\n+'
  t.lexer.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
  print("Illegal character '%s'" % t.value[0])
  t.lexer.skip(1)
def t_COMMENT(t):
    r'//.*'
    pass    # no return value: C line comment discarded
# Build the lexer
lexer=lex.lex()

# Test it out
data = "3 + 4"
data1="int gcd(int u, int v){ if(v==2) return u; else return gcd(v,u-u/v*v);}"
# Give the lexer some input
lexer.input(data)

# Tokenize
while True:
   tok = lexer.token()
   if not tok:
       break            # no more input
   print("This is a token: (%s, %s)" % (tok.type, tok.value))
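
From the PLY documentation I pieced together the following minimal sketch of what I think the parser side should look like, but I am not sure it is the right approach. The trimmed-down lexer, the fictitious `IFX` precedence token (to resolve the dangling-else conflict) and the tuple-based AST are just my own guesses, not anything official:

```python
import ply.lex as lex
import ply.yacc as yacc

# --- minimal lexer: same idea as above, trimmed to what this sketch needs ---
reserved = {'if': 'IF', 'else': 'ELSE', 'for': 'FOR', 'return': 'RETURN'}
tokens = ['NUMBER', 'ID', 'PLUS', 'MINUS', 'MULT', 'DIVIDE',
          'LPAREN', 'RPAREN', 'LEFTBRACE', 'RIGHTBRACE',
          'SEMICOLON', 'ASSIGN', 'EQUAL'] + list(reserved.values())

t_PLUS, t_MINUS, t_MULT, t_DIVIDE = r'\+', r'-', r'\*', r'/'
t_LPAREN, t_RPAREN = r'\(', r'\)'
t_LEFTBRACE, t_RIGHTBRACE = r'\{', r'\}'
t_SEMICOLON = r';'
t_EQUAL = r'=='          # longer pattern, so PLY tries it before '='
t_ASSIGN = r'='
t_ignore = ' \t\n'

def t_ID(t):
    r'[A-Za-z_][A-Za-z0-9_]*'
    t.type = reserved.get(t.value, 'ID')   # keywords come out of the same rule
    return t

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()

# --- parser: each p_* docstring is one grammar rule ---
precedence = (
    ('nonassoc', 'IFX'),     # fictitious token, lower than ELSE, so that an
    ('nonassoc', 'ELSE'),    # "else" is shifted and binds to the nearest "if"
    ('right', 'ASSIGN'),
    ('left', 'EQUAL'),
    ('left', 'PLUS', 'MINUS'),
    ('left', 'MULT', 'DIVIDE'),
)

def p_statement_if(p):
    'statement : IF LPAREN expression RPAREN statement %prec IFX'
    p[0] = ('if', p[3], p[5])

def p_statement_if_else(p):
    'statement : IF LPAREN expression RPAREN statement ELSE statement'
    p[0] = ('if-else', p[3], p[5], p[7])

def p_statement_for(p):
    'statement : FOR LPAREN expression SEMICOLON expression SEMICOLON expression RPAREN statement'
    p[0] = ('for', p[3], p[5], p[7], p[9])

def p_statement_return(p):
    'statement : RETURN expression SEMICOLON'
    p[0] = ('return', p[2])

def p_statement_expr(p):
    'statement : expression SEMICOLON'
    p[0] = ('expr', p[1])

def p_statement_block(p):
    'statement : LEFTBRACE statements RIGHTBRACE'
    p[0] = ('block', p[2])

def p_statements(p):
    '''statements : statement
                  | statements statement'''
    p[0] = [p[1]] if len(p) == 2 else p[1] + [p[2]]

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression MULT expression
                  | expression DIVIDE expression
                  | expression EQUAL expression
                  | ID ASSIGN expression'''
    p[0] = (p[2], p[1], p[3])   # build a small tuple AST

def p_expression_group(p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

def p_expression_atom(p):
    '''expression : NUMBER
                  | ID'''
    p[0] = p[1]

def p_error(p):
    print("Syntax error at %r" % (p.value if p else 'EOF'))

parser = yacc.yacc(write_tables=False, debug=False)

# Feed a plain string: yacc pulls tokens itself via lexer.token()
# when you pass the lexer explicitly.
ast = parser.parse("if (v == 2) x = 1; else x = x + 1;", lexer=lexer)
print(ast)
```

If this is roughly right, then `parser.parse(some_string, lexer=lexer)` seems to be all that is needed to connect the two stages and test the parser with a simple string, but please correct me if the grammar rules are wrong.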

Thanks for your help.

testa hunter
    A language written in C... parsing C. Interesting. You might want to try [LLVM](http://llvm.org/) instead, but knock yourself out. – Bob Dylan Jan 04 '16 at 15:44
    Here is a Yacc grammar for C: http://www.lysator.liu.se/c/ANSI-C-grammar-y.html – Hugh Bothwell Jan 04 '16 at 15:46
  • A parser for real C is a lot harder than it looks, even if somebody hands you a baseline grammar which is reasonably correct; nobody writes ANSI C code. You have to worry about which dialects (what GCC accepts is not the same as what MS accepts) and the preprocessor (macros, includes, and conditionals). A specific problem is parsing type declarations, which traditional parsers cannot do without incredible hacks. See http://stackoverflow.com/questions/243383/why-cant-c-be-parsed-with-a-lr1-parser/1004737#1004737 (about C++ but the same exact problem exists in C). Then there's MS header files... – Ira Baxter Jan 04 '16 at 19:04
  • You could save yourself the effort. A complete C99 parser has already been written in pure Python, see https://github.com/eliben/pycparser. – James Jan 05 '16 at 10:45
  • My purpose is a tiny compiler – testa hunter Jan 05 '16 at 11:07

0 Answers