A Bug Caused by a Missing Initialization

2014-11-27

2021-08-27

optimize

Symptoms

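Recently I was porting one of my company's projects using the WP8 port of cocos2dx 2.2.2. After a round of configuration and debugging, the game could run on a real device in debug mode. Debug builds are slow, though, so any demo needed a release build. After another round of configuration I got the release build running on the device from VS2012. I tapped around and every feature worked; there may have been some minor bugs, but nothing that affected the overall behavior. Just as I was happily about to show it to someone, disaster struck: launched outside the IDE, the program was full of unexplained crashes, and a few random taps were enough to bring it down. Had everything I had just seen been an illusion? I reconnected the IDE and launched the program again to find out why it crashed at random, and then things got weird: with the IDE attached, the crash could no longer be reproduced. The first thing that came to mind was the legendary Heisenbug: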

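A Heisenbug has this property: when you try to trace it, it suddenly disappears or changes its behavior. The name comes from the uncertainty principle of the German physicist Heisenberg, which describes and explains why, in quantum physics, the position and velocity of a particle cannot both be known precisely at the same time. A related phenomenon is the observer effect: when you observe an object, you can never purely observe it without changing it. Problems produced by the observer effect are what we call Heisenbugs.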

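To summarize the symptoms: a debug build never crashes at random, whether it is launched from the IDE or by the user. A release build does not crash either when it is launched from the IDE. But as soon as the release build is launched by the user, that is, run without any development tool attached, it is full of unexplained crashes.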

Tracking down the problem

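First, you need to know that the so-called Heisenbug is nonsense; the problem is always just cleverly hidden. With a sound approach the culprit can always be found. A running application is nothing but a stream of 0s and 1s. It is not exposed to ultraviolet rays and does not mutate for no reason, nor does your act of observing it trigger some quantum collapse that turns 0s into 1s or 1s into 0s.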

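Still, this problem was undoubtedly hard to track down, because there was no effective way to locate the line of code where it occurred. When the application crashed inside a closed phone system, I did not know how to dump the process memory for analysis. Even with a dump, without a deep understanding of the machine code generated from C/C++ you would not know where the problem was. Fortunately, I knew that one particular screen crashed every time it was loaded, and that screen was simple enough: it consisted of some images and Labels. Images are used heavily throughout the program, so they were unlikely to be the problem. The Labels were special in that they were created from a UI configuration file. After disabling the related code the crash did go away, so the problem could be narrowed down to the Labels.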

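That still did not say what kind of problem caused the crash. Programs usually crash because of null pointers, dangling pointers, running out of memory, and the like. Since the problem could not be located quickly, the only option was to add log statements line by line until the failing spot was found. Unfortunately, WP8 phones have nothing like Android's logcat to receive a program's error output, which means you cannot see the logs you write.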

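Everyone knows that exception handling is another powerful tool, so I wrapped the higher-level code around the failure in a try block. That was still frustrating, because even if you catch something, you do not know what you caught: you cannot see it. So the first problem to solve is visibility. I added a log panel on the topmost layer of the game UI to display whatever I wanted to see. In the end it showed me that the crash came from a std::bad_alloc exception, and after all that groping in the dark I was thrilled to finally see it. Here are two articles that explain how to display exceptions through a XAML UI: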
http://developer.nokia.com/community/wiki/Reporting_unhandled_exceptions_in_your_Windows_Phone_apps
http://www.jimmycollins.org/blog/?p=434

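Unfortunately, both approaches only capture exceptions raised by managed code and are powerless against exceptions coming from C/C++. Someone on Stack Overflow solved this problem: http://stackoverflow.com/questions/21274063/catching-unhandle-exception-in-c-cx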

C and C++ is just too low-level, it was not designed for this kind of trickery. The native code is called “unsafe” for a reason: it is totally your responsibility to write code that among other things never call methods on null pointers. If you fail, anything can happen, including but not limited to e.g. data corruption or security issues. 

But if you really know what you’re doing, the __try .. __except keywords should prevent the crash in this case. Or better yet, specify the /EHa compiler flag, and use try { } catch (...) C++ statements; with this flag catch (...) should catch the structured exceptions as well.
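The on-screen log idea described above can be sketched in portable C++ roughly as follows. This is only a minimal sketch, not the code from the project: `OnScreenLog` and `run_guarded` are hypothetical names, and a real version would draw the stored lines in a label on the topmost layer instead of just keeping them in a `std::deque`. An ordinary `catch` like this one only sees C++ exceptions; structured exceptions need `/EHa` or `__try`/`__except`, as the quoted answer says.

```cpp
#include <cstddef>
#include <deque>
#include <exception>
#include <functional>
#include <new>
#include <string>

// Hypothetical helper: keeps the most recent messages so that a UI layer
// can draw them on top of the game when no debugger console is available.
class OnScreenLog {
public:
    explicit OnScreenLog(std::size_t capacity) : capacity_(capacity) {}

    void add(const std::string& line) {
        if (capacity_ == 0) return;
        if (lines_.size() == capacity_) lines_.pop_front();  // drop the oldest line
        lines_.push_back(line);
    }

    const std::deque<std::string>& lines() const { return lines_; }

private:
    std::size_t capacity_;
    std::deque<std::string> lines_;
};

// Runs `work`. If it throws, records what was thrown instead of letting the
// exception escape. Returns true only if `work` completed normally.
bool run_guarded(const std::function<void()>& work, OnScreenLog& log) {
    try {
        work();
        return true;
    } catch (const std::bad_alloc& e) {
        // Most specific handler first: this is the case that mattered here.
        log.add(std::string("bad_alloc: ") + e.what());
    } catch (const std::exception& e) {
        log.add(std::string("exception: ") + e.what());
    } catch (...) {
        log.add("unknown exception");
    }
    return false;
}
```

A call site would wrap the suspicious code, for example `run_guarded([&] { loadScreen(); }, log)`, where `loadScreen` stands for whatever code builds the crashing screen.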

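Now we can be sure that the exception is std::bad_alloc, which is normally caused by a failed memory allocation. Reviewing the code showed no memory allocation at the application level, so the error had to be inside the cocos2dx engine itself. That being the case, the log-display code added at the application level is not suitable for inserting into the engine's internals: it would bring in a lot of circular header includes. A clever design could avoid that, but it is clearly not the right way to solve the problem.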

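Thinking back to my experience developing PC client games, we would often create a local log file and record information in it; if necessary, the file could also be uploaded to a server for the developers to analyze. We can create such a file in the phone's storage as well. This approach is very useful for tracking down problems in user mode: simple file I/O, reading the system time, and appending a string is all it takes. Here is an implementation: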

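#include <cstdio>
#include <ctime>
#include <string>
#include "cocos2d.h"

USING_NS_CC;

// Appends one timestamped line to game_log.txt in the app's writable directory.
inline void write_file_log(const char* _log) {
    std::string filePath = CCFileUtils::sharedFileUtils()->getWritablePath();
    filePath += "game_log.txt";

    // Current time from the engine's clock
    struct cc_timeval now;
    CCTime::gettimeofdayCocos2d(&now, NULL);

    time_t timep = now.tv_sec;
    struct tm* tm = localtime(&timep);

    std::string log;
    if (tm != NULL) {
        // Timestamp prefix, laid out as year:month:day hour:minute:second
        CCString* timeStr = CCString::createWithFormat("%d:%.2d:%.2d %.2d:%.2d:%.2d        ",
            tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
            tm->tm_hour, tm->tm_min, tm->tm_sec);
        log = timeStr->getCString();
    }
    log += (_log != NULL) ? _log : "(null)";
    log += "\n";

    // Open in append mode and close right away, so the line is not left
    // sitting in a stdio buffer if the app crashes afterwards.
    FILE* file = fopen(filePath.c_str(), "a+");
    if (file) {
        fputs(log.c_str(), file);
        fclose(file);
    }
}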

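After inserting a lot of log statements into the code, I finally found that one variable had an abnormally large value. That value determines the size of a later memory allocation, so the code ended up requesting a huge block with new, which on a memory-constrained mobile device obviously leads to a crash. Digging further, I found that the bounding box of CCLabelTTF was not correctly initialized (the memset line below is the fix I added):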

void CCFreeTypeFont::newLine() 
{
    m_currentLine = new FTLineInfo();
    m_currentLine->width = 0;
    m_currentLine->pen.x = 0;
    m_currentLine->pen.y = 0;
    memset(&m_currentLine->bbox, 0, sizeof(m_currentLine->bbox)); // the fix: zero-initialize the bounding box
}

Without that initialization, the following computation leaves m_textWidth with an unpredictable value:

void CCFreeTypeFont::endLine() 
{
    if(m_currentLine)
    {
        m_lines.push_back(m_currentLine);
        m_textWidth = max(m_textWidth,m_currentLine->bbox.xMax - m_currentLine->bbox.xMin); // unpredictable if bbox was never initialized
        m_textHeight += m_lineHeight;
    }
}

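Perhaps in debug builds, and in release builds launched from the IDE, these fields happen to be initialized to 0, or to some uniform unexpected value, so that xMax - xMin never comes out as a huge number. In a release build in user mode (not launched from the IDE), however, the two values are garbage, and their difference can be enormous.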
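The effect can be illustrated with a small standalone sketch. It makes several assumptions: `BBox` is a stand-in with the same four `long` fields as FreeType's `FT_BBox`, the helper names are hypothetical, and the leftover values are passed in explicitly, because actually reading an uninitialized field is undefined behavior and cannot be shown safely.

```cpp
#include <algorithm>
#include <cstring>

// Stand-in for FreeType's FT_BBox, redefined here so the sketch compiles
// without FreeType.
struct BBox {
    long xMin, yMin, xMax, yMax;
};

// Same computation as in endLine(): widen the running text width.
long updated_text_width(long textWidth, const BBox& bbox) {
    return std::max(textWidth, bbox.xMax - bbox.xMin);
}

// Simulates a box whose fields still hold whatever was left in memory.
// The leftover values are chosen by the caller for the demonstration.
BBox stale_bbox(long leftoverXMin, long leftoverXMax) {
    BBox bbox;
    bbox.xMin = leftoverXMin;
    bbox.yMin = 0;
    bbox.xMax = leftoverXMax;
    bbox.yMax = 0;
    return bbox;
}

// What the memset fix guarantees: every field starts at zero.
BBox zeroed_bbox() {
    BBox bbox;
    std::memset(&bbox, 0, sizeof(bbox));
    return bbox;
}
```

With a zeroed box the width of an empty line stays at whatever it was before, while a stale box such as `stale_bbox(0, 1500000000L)` pushes the width to 1500000000, and any buffer sized from that width becomes far too large for a phone.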

Corroboration

The book Game Engine Architecture has this passage:

I mentioned above that it can be very tricky to debug problems using a release build, due primarily to the way the compiler optimizes the code. Ideally, every programmer would prefer to do all of his or her debugging in a debug build. However, this is often not possible. Sometimes a bug occurs so rarely that you’ll jump at any chance to debug the problem, even if it occurs in a release build on someone else’s machine. Other bugs only occur in your release build, but magically disappear whenever you run the debug build. These dreaded release-only bugs are sometimes caused by uninitialized variables, because variables and dynamically allocated memory blocks are often set to zero in debug mode, but are left containing garbage in a release build. Other common causes of release-only bugs include code that has been accidentally omitted from the release build (e.g., when important code is erroneously placed inside an assertion statement), data structures whose size or data member packing changes between debug and release builds, bugs that are only triggered by inlining or compiler-introduced optimizations, and (in rare cases) bugs in the compiler’s optimizer itself, causing it to emit incorrect code in a fully optimized build.

Clearly, it behooves every programmer to be capable of debugging problems in a release build, unpleasant as it may seem. The best ways to reduce the pain of debugging optimized code is to practice doing it and to expand your skill set in this area whenever you have the opportunity. Here are a few tips.
