How can I parse a C format string in Python?

Question

I have this code in my C file:

printf("Worker name is %s and id is %d", worker.name, worker.id);

I want, with Python, to be able to parse the format string and locate the "%s" and "%d".

So I want to have a function:

>>> my_function("Worker name is %s and id is %d")
[Out1]: ((15, "%s"), (28, "%d"))

I've tried to achieve this using libclang's Python bindings and with pycparser, but I didn't see how this can be done with these tools.

I've also tried using regex to solve this, but it is not simple at all: think about cases where the printf has "%%s" and the like.

Both gcc and clang obviously do this as part of compiling. Has no one exported this logic to Python?

All I want to do is simply locate the "%d" and "%s" inside the string (to know their indexes, if you will), not convert this to a Python print — speller, May 03 '15 at 07:42
You cannot easily parse it with a simple regex; you need to handle it char by char. — Antti Haapala -- Слава Україні, May 03 '15 at 07:45
This is of course possible, but not simple; I'd rather avoid it. It's weird that this logic, which is inside gcc and clang, is not available in Python, not even in C parsing libraries — speller, May 03 '15 at 07:48
If you want to parse the format string with a regex, you might be able to handle the `%%`-case with lookbehind. On a side note, I think there might be a small typo in the function call (`worker, id` rather than `work.id`). — SwiftsNamesake, May 03 '15 at 08:21
`Both gcc and clang obviously do this as part of compiling` No. This is done at runtime. `gcc` sees simply *a string*. — hek2mgl, May 03 '15 at 08:37
Actually both of them also *do* do this as part of compiling when generating warnings for `-Wformat`. *C compiler is not required to do this*, which does not mean that no C compiler does it :D — Antti Haapala -- Слава Україні, May 03 '15 at 08:53
@AnttiHaapala Hey! You are right! gcc does indeed look at the format strings and generate warnings if something isn't OK with printf. However, it does not *compile* them, does it? — hek2mgl, May 03 '15 at 10:56

dawg · Accepted Answer · 2015-05-03T21:16:46.180

You can certainly find properly formatted candidates with a regex.

Take a look at the definition of the C Format Specification. (Using Microsoft's, but use what you want.)

It is:

%[flags] [width] [.precision] [{h | l | ll | w | I | I32 | I64}] type

You also have the special case of %% which becomes % in printf.

You can translate that pattern into a regex:

(                                 # start of capture group 1
%                                 # literal "%"
(?:                               # first option
(?:[-+0 #]{0,5})                  # optional flags
(?:\d+|\*)?                       # width
(?:\.(?:\d+|\*))?                 # precision
(?:h|l|ll|w|I|I32|I64)?           # size
[cCdiouxXeEfgGaAnpsSZ]            # type
) |                               # OR
%%)                               # literal "%%"

And then into a Python regex:

import re

lines='''\
Worker name is %s and id is %d
That is %i%%
%c
Decimal: %d  Justified: %.6d
%10c%5hc%5C%5lc
The temp is %.*f
%ss%lii
%*.*s | %.3d | %lC | %s%%%02d'''

cfmt=r'''
(                                  # start of capture group 1
%                                  # literal "%"
(?:                                # first option
(?:[-+0 #]{0,5})                   # optional flags
(?:\d+|\*)?                        # width
(?:\.(?:\d+|\*))?                  # precision
(?:h|l|ll|w|I|I32|I64)?            # size
[cCdiouxXeEfgGaAnpsSZ]             # type
) |                                # OR
%%)                                # literal "%%"
'''

for line in lines.splitlines():
    # Python 3 print(); start(1)/group(1) give the index and text of each match
    print('"{}"\n\t{}\n'.format(line,
          tuple((m.start(1), m.group(1)) for m in re.finditer(cfmt, line, flags=re.X))))

Prints:

"Worker name is %s and id is %d"
    ((15, '%s'), (28, '%d'))

"That is %i%%"
    ((8, '%i'), (10, '%%'))

"%c"
    ((0, '%c'),)

"Decimal: %d  Justified: %.6d"
    ((9, '%d'), (24, '%.6d'))

"%10c%5hc%5C%5lc"
    ((0, '%10c'), (4, '%5hc'), (8, '%5C'), (11, '%5lc'))

"The temp is %.*f"
    ((12, '%.*f'),)

"%ss%lii"
    ((0, '%s'), (3, '%li'))

"%*.*s | %.3d | %lC | %s%%%02d"
    ((0, '%*.*s'), (8, '%.3d'), (15, '%lC'), (21, '%s'), (23, '%%'), (25, '%02d'))
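
To get exactly the shape the question asks for, the same pattern can be wrapped in a small helper that drops the `%%` matches. This is only a sketch: the names `C_FORMAT_RE` and `find_conversions` are made up for illustration, and the pattern is the Microsoft-flavoured one from this answer, not a complete C99 grammar (it does not know the `hh`, `j`, `z`, `t` or `L` modifiers, for example).

```python
import re

# Same pattern as in the answer, compiled once in verbose mode.
# Matching "%%" as its own alternative is what stops "%%s" from
# being reported as "%s".
C_FORMAT_RE = re.compile(r'''
(                                  # start of capture group 1
%                                  # literal "%"
(?:                                # first option
(?:[-+0 #]{0,5})                   # optional flags
(?:\d+|\*)?                        # width
(?:\.(?:\d+|\*))?                  # precision
(?:h|l|ll|w|I|I32|I64)?            # size
[cCdiouxXeEfgGaAnpsSZ]             # type
) |                                # OR
%%)                                # literal "%%"
''', re.VERBOSE)


def find_conversions(fmt):
    """Return ((index, spec), ...) for each conversion in fmt, skipping "%%"."""
    return tuple(
        (m.start(1), m.group(1))
        for m in C_FORMAT_RE.finditer(fmt)
        if m.group(1) != "%%"
    )


print(find_conversions("Worker name is %s and id is %d"))
# ((15, '%s'), (28, '%d'))
```

Filtering after the match, instead of removing the `%%` branch from the pattern, keeps the escape handling in one place: the regex still consumes both percent signs, so the character after them is never mistaken for a type.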

score 1 · Answer 2 · answered May 03 '15 at 07:57

A simple implementation might be the following generator:

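def find_format_specifiers(s):
    # True when the previous character was a "%" that opened a specifier or a "%%" pair
    last_percent = False
    for i in range(len(s)):
        if s[i] == "%" and not last_percent:
            # The bounds check keeps a trailing "%" from raising IndexError
            if i + 1 < len(s) and s[i+1] != "%":
                yield (i, s[i:i+2])
            last_percent = True
        else:
            last_percent = False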

>>> list(find_format_specifiers("Worker name is %s and id is %d but %%q"))
[(15, '%s'), (28, '%d')]

This can be fairly easily extended to handle additional format specifier information like width and precision, if needed.
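
One way such an extension might look is sketched below. It is an assumption-laden illustration, not part of the original answer: the function name `scan_format`, the helper `_skip_number_or_star` and the lookup tables are hypothetical, and the flag, length-modifier and type sets follow the C99 `printf` grammar. Anything that does not parse as a complete specifier is simply skipped.

```python
FLAGS = "-+ #0"
LENGTHS = ("hh", "ll", "h", "l", "j", "z", "t", "L")  # two-letter modifiers first
TYPES = "diouxXeEfFgGaAcspn"
DIGITS = "0123456789"


def _skip_number_or_star(s, j):
    """Advance j past a "*" or a run of decimal digits, if present."""
    if j < len(s) and s[j] == "*":
        return j + 1
    while j < len(s) and s[j] in DIGITS:
        j += 1
    return j


def scan_format(s):
    """Yield (index, spec) for each conversion in s, skipping "%%"."""
    i, n = 0, len(s)
    while i < n:
        if s[i] != "%":
            i += 1
        elif s.startswith("%%", i):
            i += 2                                  # escaped percent sign
        else:
            j = i + 1
            while j < n and s[j] in FLAGS:          # flags
                j += 1
            j = _skip_number_or_star(s, j)          # width
            if j < n and s[j] == ".":               # precision
                j = _skip_number_or_star(s, j + 1)
            for length in LENGTHS:                  # length modifier
                if s.startswith(length, j):
                    j += len(length)
                    break
            if j < n and s[j] in TYPES:             # conversion type
                yield (i, s[i:j + 1])
                i = j + 1
            else:
                i += 1                              # malformed: step past the "%"


print(list(scan_format("%*.*s | %.3d | %lld%%%02d")))
```

Every index is checked against the string length before it is read, so a format string that ends in a lone `%` is ignored and does not raise.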

answered May 03 '15 at 07:57

Greg Hewgill

Strangely enough `"%-0.3%"` is a valid format specifier (meaning `"%"` and not using any argument) – 6502 May 03 '15 at 08:18
Yeah, as mentioned my answer doesn't handle any extra embellishment between the leading `%` and the type specifier because the OP didn't ask for that. – Greg Hewgill May 03 '15 at 08:21
Sorry for the noise... i realized now that the OP is asking about C formatting strings, not Python old-style formatting strings – 6502 May 03 '15 at 08:23
@6502 Your concerns are valid, even in C. However, writing a top printf parser would be probably too much for sunday morning - at least for me :) – hek2mgl May 03 '15 at 08:25
... I now realized that, and therefore deleted my answer. However, for a stable solution we'll *need* a perfect format string parser which handles all the edge cases, since those *edge cases* exist in almost every software project. – hek2mgl May 03 '15 at 08:31

score 0 · Answer 3 · answered May 03 '15 at 09:28

This is iterative code I have written that prints the index of %s, %d, or any such format specifier:

import re

def myfunc(text):
    # Grab the first parenthesised group, e.g. the argument list of printf(...)
    match = re.search(r'\(.*?\)', text)
    if match:
        # Remove the parentheses and double quotes (indexes are relative to this stripped string)
        new_str = match.group().translate(str.maketrans('', '', '()"'))
        print(new_str)
        parse(new_str)
    else:
        print("No match")

def parse(text):
    # Report each "%", the character that follows it, and its index
    g = text.find('%')
    while g != -1:
        print(" %", text[g+1:g+2], " = ", g)
        g = text.find('%', g + 1)

myfunc(input())

Hope it helps.

Thank you! It's a great start for me, even though it doesn't cover all the cases - such as %*d and stuff like that — speller, May 03 '15 at 11:39
