
Background

I work with Watusimoto on the game Bitfighter. We use a variation of LuaWrapper to connect our C++ objects with Lua objects in the game. We also use a variation of Lua called lua-vec to speed up vector operations.

We have been chasing a bug that has eluded us for some time: random crashes that suggest corrupt metatables. See here for Watusimoto's post on the issue. I'm not sure a corrupt metatable is actually the cause, and I have seen some really odd behavior that I want to ask about here.

The Problem Manifestation

As an example, we create an object and add it to a level like this:

t = TextItem.new()
t:setText("hello")
levelgen:addItem(t)

However, the game will sometimes (not always) crash with an error:

attempt to call missing or unknown method 'addItem' (a nil value)

Using a suggestion given in answer to Watusimoto's post mentioned above, I have changed the last line to the following:

local ok, res = pcall(function() levelgen:addItem(t) end)

if not ok then
    local s = "Invalid levelgen value: "..tostring(levelgen).." "..type(levelgen).."\n"

    for k, v in pairs(getmetatable(levelgen)) do 
        s = s.."meta "..tostring(k).." "..tostring(v).."\n"
    end

    error(res..s)
end

This prints out the metatable for levelgen if something went wrong calling a method from it.

However, and this is crazy, when it fails and prints out the metatable, the metatable is exactly how it should be (with the correct addItem call and everything). If I print the metatable for levelgen upon script load, and when it fails using pcall above, they are identical, every call and pointer to userdata is the same and as it should be.

It is as though the metatable for levelgen is spontaneously disappearing at random.

Would anyone have any idea what is going on?

Thank you

Note: This doesn't happen only with the levelgen object. For instance, it has also happened on the TextItem object mentioned above. In fact, that same code crashes on my computer at the line levelgen:addItem(t), but crashes on another developer's computer at the line t:setText("hello") with the same kind of error message: missing or unknown method 'setText' (a nil value)

raptor
  • Have you tried running your code under Valgrind? You might have a memory corruption issue from C++. – nneonneo Feb 17 '13 at 01:10
  • Hi, yes, we've run it through valgrind several times now, and after fixing the memory bugs that were there (in unrelated code) the problem has still persisted – raptor Feb 17 '13 at 01:19
  • Very curious. Are you able to reduce the issue to a short testcase? – nneonneo Feb 17 '13 at 01:22
  • The code I've shown above is the short test case. It runs when a level loads and will randomly crash (usually not the first time the level loads, but after restarting the level in the same game session, the 2nd or 3rd... sometimes not until the 8th time or so) – raptor Feb 17 '13 at 01:29
  • I take it you can't make this self-contained without pulling in the whole game engine? – nneonneo Feb 17 '13 at 01:30
  • Have you tested this with any other versions of Lua to rule out a rare but possible Lua bug? – Andrew T Finnell Feb 17 '13 at 04:31
  • Oops. I forgot to mention that we use a variation of Lua called 'lua-vec'. I have updated the question. As for testing other versions, I have updated lua-vec library to use Lua 5.1.5 (it was 5.1.4), but no success. We do currently have a branch in our repo to convert to 5.2, but we have a ways to go before it stops crashing :) – raptor Feb 17 '13 at 05:25
  • Is your `__index` the same as the metatable? If so what happens if you factor `__index` into a separate table? – finnw Feb 17 '13 at 14:20
  • Just as a quick note: levelgen is short for level generator, a script that generates items in a game level. Not totally relevant, but I hate weird variable names taken out of context! – Watusimoto Feb 17 '13 at 21:01
  • I'm fairly certain this is a bug in LuaWrapper. I'll let you know when I figure out what exactly. – Alex Feb 21 '13 at 06:03

3 Answers


As with any mystery, you will need to peel it back layer by layer. I recommend going through the same steps Lua goes through and trying to detect where the path taken diverges from your expectations:

What does getmetatable(levelgen).__index return? If it's a table, then check its content for addItem. If it's a function, then try to call it with (table, "addItem") and see what it returns.

Check if getmetatable returns a reference to the same object before and after the call (or when it fails).

Are there several levels of metatable indirection that the call is going through? If so, try to follow the same path with explicit calls and see where the differences are.

Are you using weak keys that may cause values to disappear if there are no other references?

Can you provide a "default" value when you detect that it fails, and then continue to see if it "finds" this method again later? Or, once it's broken, is it broken for every call after that?

What if you save a proper value for addItem and "fix" it when you detect it's broken?

What if you simply handle the error (as you do) and call it 10 times? Would it show valid results at least once (after it fails)? 100 times? If you keep calling the same method when it works, will it fail? This may help you to come up with a more reproducible error.

I'm not familiar enough with LuaWrapper to provide more specific questions, but these are the steps I'd take if I were you.

Paul Kulchenko

I strongly suspect the issue is that you have a class or struct similar to this:

struct Foo
{
    Bar bar;  // first member, so it starts at the same address as the Foo itself
    // Other fields follow
};

And that you've exposed both Foo and Bar to Lua via LuaWrapper. The important bit here is that bar is the first field in your Foo struct. Alternatively, you may have some class that inherits from some other base class, with both the derived and base class exposed to Lua via LuaWrapper.

LuaWrapper uses a function called an Identifier to uniquely track each object (for example, to tell whether a given object has already been added to the Lua state). By default it uses the object's address as the key. In cases like the one above, it is possible that both Foo and Bar have the same address in memory, and thus LuaWrapper can get confused.

This may result in grabbing the wrong object's metatable when attempting to look up a method. Clearly, since it's looking at the wrong metatable it won't find the method you want, and so it will appear as if your metatable has mysteriously lost entries.
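To make the address collision concrete, here is a minimal C++ sketch. It does not use LuaWrapper at all: `AddressRegistry` and `lookupAfterCollision` are hypothetical names, and the map is only a stand-in for any cache that is keyed on the raw object address.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical types mirroring the Foo/Bar layout described above.
struct Bar { int value = 0; };
struct Foo {
    Bar bar;        // first member: starts at the same address as the Foo
    int other = 0;
};

// Hypothetical stand-in for an address-keyed object cache.
// This is NOT LuaWrapper's real data structure, only an illustration.
using AddressRegistry = std::unordered_map<const void*, std::string>;

// Registers a Foo and its first member by address, then returns the
// type name the registry reports when asked about the Foo.
std::string lookupAfterCollision() {
    Foo foo;
    AddressRegistry registry;

    registry[&foo] = "Foo";
    registry[&foo.bar] = "Bar";  // same key, so the "Foo" entry is overwritten

    // Both pointers refer to the same location in memory.
    assert(static_cast<const void*>(&foo) ==
           static_cast<const void*>(&foo.bar));

    return registry[&foo];
}
```

Calling `lookupAfterCollision()` reports the Foo as a "Bar", which is the same kind of mix-up that would make a method lookup land in the wrong metatable.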

I've checked in a change that tracks each object's data per-type rather than in one giant pile. If you update your copy of LuaWrapper to the latest one from the repository, I'm fairly certain your problem will be fixed.
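The idea of tracking per type can be sketched roughly like this. It is an illustration under assumptions, not the actual LuaWrapper change: `TypedRegistry`, `makeKey`, and `countEntriesForAliasedObjects` are made-up names, and the real library's storage may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>

// Hypothetical types: Bar is the first member of Foo, so they share an address.
struct Bar { int value = 0; };
struct Foo { Bar bar; int other = 0; };

// Hypothetical registry keyed on (static type, address) instead of address alone.
using TypedKey = std::pair<std::type_index, std::uintptr_t>;
using TypedRegistry = std::map<TypedKey, std::string>;

// Builds a key from the pointer's static type and its numeric address.
template <typename T>
TypedKey makeKey(const T* obj) {
    return TypedKey(std::type_index(typeid(T)),
                    reinterpret_cast<std::uintptr_t>(obj));
}

// Registers a Foo and its first member, then returns how many distinct
// entries the registry holds.
std::size_t countEntriesForAliasedObjects() {
    Foo foo;
    TypedRegistry registry;

    registry[makeKey(&foo)] = "Foo";
    registry[makeKey(&foo.bar)] = "Bar";  // same address, different type: new entry

    // The Foo entry is still intact.
    assert(registry[makeKey(&foo)] == "Foo");

    return registry.size();
}
```

Because the type is part of the key, the two objects keep separate entries even though their addresses are equal. Note that this only separates objects of different types; two objects of the same type at the same address would still collide.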

Alex
  • I think it is possible that your diagnosis is correct, but for the wrong reason. It is possible that the classes are getting confused, but not because of inheritance, but rather because we are cycling objects rapidly and C++ may be reassigning an address for a different object type before Lua's garbage collection has had a chance to clean things up. Subclassing is not the issue here. That said, your recent changes to LuaWrapper may resolve the issue. – Watusimoto Feb 25 '13 at 23:43
  • Wow, thanks for finding this! We'll port over the changes and let you know. – raptor Feb 26 '13 at 01:22
  • Hm, I hadn't considered the case of reusing addresses faster than the Lua GC is collecting them. Perhaps there's a need for some kind of luaW_forget to instruct LuaWrapper to completely forget that it's tracking a given object. An alternate solution might be to write your own identifier function that guarantees a unique value for each object (if you have any way to uniquely identify objects you could leverage that). – Alex Feb 26 '13 at 06:12
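One way the "unique value for each object" idea from the comment above might look is sketched below. Everything here is an assumption for illustration: `Tracked`, `OtherItem`, and `idsStayDistinctAcrossReuse` are hypothetical names, the sketch does not call any LuaWrapper function, and the counter is not thread-safe.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical base class that gives every object a serial number which is
// never handed out twice, unlike a heap address.
class Tracked {
public:
    Tracked() : mId(nextId()) {}
    Tracked(const Tracked&) : mId(nextId()) {}            // a copy is a new object
    Tracked& operator=(const Tracked&) { return *this; }  // assignment keeps own id
    virtual ~Tracked() = default;

    std::uint64_t id() const { return mId; }

private:
    // Not thread-safe; fine for a single-threaded sketch.
    static std::uint64_t nextId() {
        static std::uint64_t counter = 0;
        return ++counter;
    }
    std::uint64_t mId;
};

struct TextItem : Tracked {};   // name borrowed from the question
struct OtherItem : Tracked {};  // hypothetical second type

// Destroys one object and creates another of a different type, which the
// allocator may or may not place at the same address, then reports whether
// the two ids differ.
bool idsStayDistinctAcrossReuse() {
    Tracked* first = new TextItem();
    const std::uint64_t firstId = first->id();
    delete first;                        // the address is now free for reuse

    Tracked* second = new OtherItem();
    const std::uint64_t secondId = second->id();
    delete second;

    assert(secondId > firstId);          // serial numbers only ever grow
    return firstId != secondId;
}
```

An identifier built on `id()` would stay unique even when C++ recycles an address before Lua's garbage collector has released the old userdata, at the cost of storing one extra integer per object.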

After merging with upstream LuaWrapper (commit 3c54015), this issue has disappeared. It appears to have been a bug in LuaWrapper.

Thanks Alex!

raptor