盒子
Contents
  1. Analyzing the crash stack
  2. Switching threads
  3. Inspecting variables
  4. Disassembly

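gdb debugging tips: analyzing a coredump from a crashing C++ program

I have recently been helping out a Linux project team. The code is all C/C++, so when something crashes we have to analyze the coredump file with gdb. This post records a typical case for future reference.

The background: one of our programs exposes a set of IPC interfaces for other applications to call. In one particular system environment, an application calling those interfaces caused our process to crash over and over.

Analyzing the crash stack

After loading the coredump into gdb, it tells us that the core was generated by /usr/bin/Demo, which crashed with SIGSEGV inside shared_ptr<Database>: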

Core was generated by `/usr/bin/Demo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 std::__shared_ptr<Database, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=0xf5e1b58c, this@entry=0xf5e1b554)
at /usr/include/c++/11.5.0/bits/shared_ptr_base.h:1152
1152 /usr/include/c++/11.5.0/bits/shared_ptr_base.h: No such file or directory.
[Current thread is 1 (LWP 13645)]

We print the call stack with the bt command:

(gdb) bt
#0 std::__shared_ptr<Database, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=0xf5e1b58c, this@entry=0xf5e1b554)
at /usr/include/c++/11.5.0/bits/shared_ptr_base.h:1152
#1 std::shared_ptr<Database>::shared_ptr (this=0xf5e1b58c, this@entry=0xf5e1b554) at /usr/include/c++/11.5.0/bits/shared_ptr.h:150
#2 Context::GetDatabase (this=0x0)
at /home/linjw/workspace/demo_code/src/common/context.cpp:141
#3 0x00688206 in AudioServiceImpl::GetVolume (this=<optimized out>)
at /home/linjw/workspace/demo_code/impl/src/audio/audio_service_impl.cpp:68
...

The crash does indeed happen while AudioServiceImpl::GetVolume is handling a get-volume request:

int AudioServiceImpl::GetVolume() {
    return context_.lock()->GetDatabase()->Get(DB_VOLUME, 50);
}

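One odd thing stands out in the stack: in Context::GetDatabase (this=0x0), the Context's this pointer is null. That means GetVolume was called before context_ had been initialized, so context_.lock() returned an empty shared_ptr.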

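Looking at the source, the program's initialization flow is as follows:

So initialization most likely got stuck partway, which meant AudioServiceImpl::OnCreate was never called. When an application then sent a udi request, AudioServiceImpl::GetVolume was invoked on a worker thread and crashed.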

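So there are two problems here:

  1. Initialization got stuck
  2. The initialization flow should be changed so that the IPC interfaces are registered only after initialization completes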

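Switching threads

Next, let's see where initialization actually got stuck. Initialization runs on the main thread, and the main thread's thread ID is the same as the process ID. The info inferior command shows that the pid is 13639: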

(gdb) info inferior
Num Description Connection Executable
* 1 process 13639 1 (core) /home/linjw/gdb/sysroots/usr/bin/Demo

13639 is therefore also the Target Id of our main thread. Listing all threads with info threads, we find that its gdb thread Id is 19:

(gdb) info threads
Id Target Id Frame
* 1 LWP 13645 std::__shared_ptr<Database, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=0xf5e1b58c, this@entry=0xf5e1b554)
at /usr/include/c++/11.5.0/bits/shared_ptr_base.h:1152
2 LWP 13642 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
3 LWP 13652 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
4 LWP 13647 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
5 LWP 13662 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
6 LWP 13654 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
7 LWP 13648 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
8 LWP 13655 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
9 LWP 13644 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
10 LWP 13656 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
11 LWP 13643 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
12 LWP 13657 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
13 LWP 13659 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
14 LWP 13649 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
15 LWP 13650 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
16 LWP 13651 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
17 LWP 13660 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
18 LWP 13661 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
19 LWP 13639 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47

We can then switch the context to the main thread with the thread 19 command:

(gdb) thread 19
[Switching to thread 19 (LWP 13639)]
#0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
47 ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.

Then use bt again to view the stack of thread 19:

(gdb) bt
#0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1 0xf7a98418 in __GI___ioctl (fd=0, request=1075858688) at ../sysdeps/unix/sysv/linux/ioctl.c:35
#2 0x00699160 in SetEnable (enable=<optimized out>)
at /home/linjw/workspace/demo_code/impl/src/system/mcu/mcu.cpp:55
#3 0x00620908 in BaseService::BaseService(std::weak_ptr<ILifecycle>)::{lambda(std::weak_ptr<Context>)#1}::operator()(std::weak_ptr<Context>) const (context=..., __closure=0x2245258)
at /home/linjw/workspace/demo_code/./src/common/base_service.h:45
#4 std::__invoke_impl<void, BaseService::BaseService(std::weak_ptr<ILifecycle>)::{lambda(std::weak_ptr<Context>)#1}&, std::weak_ptr<Context> >(std::__invoke_other, BaseService::BaseService(std::weak_ptr<ILifecycle>)::{lambda(std::weak_ptr<Context>)#1}&, std::weak_ptr<Context>&&) (
__f=...) at /usr/include/c++/11.5.0/bits/invoke.h:61
#5 std::__invoke_r<void, BaseService::BaseService(std::weak_ptr<ILifecycle>)::{lambda(std::weak_ptr<Context>)#1}&, std::weak_ptr<Context> >(BaseService::BaseService(std::weak_ptr<ILifecycle>)::{lambda(std::weak_ptr<Context>)#1}&, std::weak_ptr<Context>&&) (__fn=...)
at /usr/include/c++/11.5.0/bits/invoke.h:111
#6 std::_Function_handler<void (std::weak_ptr<Context>), BaseService::BaseService(std::weak_ptr<ILifecycle>)::{lambda(std::weak_ptr<Context>)#1}>::_M_invoke(std::_Any_data const&, std::weak_ptr<Context>&&) (__functor=..., __args#0=...)
at /usr/include/c++/11.5.0/bits/std_function.h:290
#7 0x0061a908 in std::function<void (std::weak_ptr<Context>)>::operator()(std::weak_ptr<Context>) const (__args#0=..., this=<optimized out>)
at /usr/include/c++/11.5.0/bits/std_function.h:247
#8 LifecycleObserver::OnCreate (context=..., this=<optimized out>)
at /home/linjw/workspace/demo_code/./src/common/lifecycle.h:47
#9 Context::Init (this=0x223f95c, dispatcher=...)
at /home/linjw/workspace/demo_code/src/common/context.cpp:61
#10 0x005b9fc8 in main (argc=<optimized out>, argv=<optimized out>)
at /home/linjw/workspace/demo_code/src/app/main.cpp:49

We can see it is stuck in the ioctl call on line 55 of mcu.cpp:

ioctl(ioctl_cmd.fd, DEMO_REQUEST, (unsigned long)&ioctl_cmd.args);

Inspecting variables

There is still something odd, though: the stack shows that the fd passed to ioctl is 0:

#1  0xf7a98418 in __GI___ioctl (fd=0, request=1075858688) at ../sysdeps/unix/sysv/linux/ioctl.c:35

An fd of 0 would be standard input. In the code, the fd comes from ioctl_cmd.fd:

bool SetEnable(bool enable)
{
    ...
    ioctl_cmd.fd = open(DEV_PATH, O_RDWR);
    ...
    ioctl(ioctl_cmd.fd, DEMO_REQUEST, (unsigned long)&ioctl_cmd.args);
    ...
}

So we can use f 2 to switch to the frame of the SetEnable function:

(gdb) f 2
#2 0x00699160 in SetEnable (enable=<optimized out>)
at /home/linjw/workspace/demo_code/impl/src/system/mcu/mcu.cpp:55
55 /home/linjw/workspace/demo_code/impl/src/system/mcu/mcu.cpp: No such file or directory.

Then p ioctl_cmd.fd prints the fd, which shows that it should actually be 27:

(gdb) p ioctl_cmd.fd
$1 = 27

We can also use p ioctl_cmd to print the complete contents of ioctl_cmd:

(gdb) p ioctl_cmd
$2 = {fd = 27, args = {ArgDemo = 0 '\000', Reserve = '\000' <repeats 12 times>}}

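The fd=0 in frame #1 is probably the result of something being off in the captured coredump. Since this problem reproduces every time once it appears, I added a log line and reproduced it, which confirmed that the fd is indeed not 0. Logs added before and after the ioctl call also confirmed that it really is stuck there.

Disassembly

If you want to go more hardcore, you can use disassemble to disassemble the function of the current frame: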

Dump of assembler code for function _Z21SetEnableb:
0x00699090 <+0>: ldrh r0, [r3, r5]
0x00699092 <+2>: movs r5, r0
0x00699094 <+4>: movs r2, #173 ; 0xad
0x00699096 <+6>: movs r3, r1
0x00699098 <+8>: str r1, [r4, #108] ; 0x6c
0x0069909a <+10>: movs r5, r1
0x0069909c <+12>: lsls r4, r2, #4
0x0069909e <+14>: lsls r0, r2, #9
0x006990a0 <+16>: lsls r4, r6, #1
0x006990a2 <+18>: add r0, r0
0x006990a4 <+20>: ldrh r6, [r4, r5]
0x006990a6 <+22>: movs r5, r0
0x006990a8 <+24>: movs r2, #157 ; 0x9d
0x006990aa <+26>: movs r3, r1
0x006990ac <+28>: str r5, [r6, #108] ; 0x6c
0x006990ae <+30>: movs r5, r1
0x006990b0 <+32>: lsls r4, r2, #4
0x006990b2 <+34>: lsls r0, r2, #9
0x006990b4 <+36>: lsls r4, r6, #1
0x006990b6 <+38>: subs r4, #0
0x006990b8 <+40>: ldrh r6, [r6, r5]
0x006990ba <+42>: movs r5, r0
0x006990bc <+44>: add r4, sp, #916 ; 0x394
0x006990be <+46>: movs r3, r2
...

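For this part you can refer to a note I wrote earlier: use i registers to view the register values, then work through the assembly to figure out exactly how the fault occurred: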

(gdb) i registers
r0 0x0 0
r1 0x40204d00 1075858688
r2 0x78f728 7927592
r3 0x1 1
r4 0x1 1
r5 0x78c204 7913988
r6 0x2 2
r7 0xff995ba4 4288240548
r8 0x78c204 7913988
r9 0x22456b8 35935928
r10 0x223f970 35912048
r11 0x5ec1d5 6210005
r12 0x36 54
sp 0xff995ad0 0xff995ad0
lr 0x699161 6918497
pc 0x699160 0x699160 <SetEnable(bool)+208>
cpsr 0x810030 8454192
fpscr 0x80000000 -2147483648

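PS: The instruction addresses shown by gdb's disassemble are offset from those you get by running objdump directly on the executable. That is because the instruction addresses are shifted once the executable is loaded into memory. The offset can be confirmed from /proc/{pid}/maps: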

00590000-00776000 r-xp 00000000 103:0d 2184 /usr/bin/Demo
00785000-0078e000 r--p 001e5000 103:0d 2184 /usr/bin/Demo
0078e000-00791000 rw-p 001ee000 103:0d 2184 /usr/bin/Demo
...
f7b10000-f7b28000 r-xp 00000000 103:0d 1039 /lib/libgcc_s.so.1
f7b28000-f7b37000 ---p 00018000 103:0d 1039 /lib/libgcc_s.so.1
...