pointyhat panic

John Baldwin

2009-06-16 12:12:03 UTC

This is from the RPC re-work. I had thought that this was fixed. You
shouldn't see this on the latest -CURRENT, but Doug will have more
details.

Any datepoint when these fixes went in? I upgraded pointyhat last month
exactly to get the latest fixes in, but could be there were more since
then.

You want the socket upcall locking changes in 193272 (committed June 1). You
will also want subsequent commits to the RPC and NFS code by Rick Macklem to
close a few more races. I think Rick still has one other patch that pho@ is
stress testing as well.

--
John Baldwin

John Baldwin

2009-06-16 12:12:48 UTC

cpuid = 2
KDB: enter: panic
[thread pid 0 tid 100029 ]
Stopped at kdb_enter+0x3d: movq $0,0x3f5fb8(%rip)
db> bt
Tracing pid 0 tid 100029 td 0xffffff00018e1000
kdb_enter() at kdb_enter+0x3d
panic() at panic+0x17b
_mtx_lock_flags() at _mtx_lock_flags+0xc5
clnt_vc_soupcall() at clnt_vc_soupcall+0x273
sowakeup() at sowakeup+0xf8
tcp_do_segment() at tcp_do_segment+0x23c9
tcp_input() at tcp_input+0x9ec
ip_input() at ip_input+0xbc
ether_demux() at ether_demux+0x1ed
ether_input() at ether_input+0x171
em_rxeof() at em_rxeof+0x201
em_handle_rxtx() at em_handle_rxtx+0x4b
taskqueue_run() at taskqueue_run+0x96
taskqueue_thread_loop() at taskqueue_thread_loop+0x3f
fork_exit() at fork_exit+0x12a
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffffff240a6d40, rbp = 0 ---
The box is in kdb on serial console for now. May 9 -CURRENT, I think.

This happened again. The trigger was this (^C of a find on a busy
netapp volume with a lot of other concurrent nfs traffic to the same
pointyhat# find . -name \*.bz2 -mmin -10
^Cnfs server dumpster:/vol/vol4/pointyhat: not responding
nfs server dumpster:/vol/vol4/pointyhat: not responding
nfs server dumpster:/vol/vol4/pointyhat: not responding
nfs server dumpster:/vol/vol4/pointyhat: not responding
nfs server dumpster:/vol/vol4/pointyhat: not responding
nfs server dumpster:/vol/vol4/pointyhat: not responding
nfs server dumpster:/vol/vol4/pointyhat: not responding
nfs server dumpster:/vol/vol4/pointyhat: not responding
nfs server dumpster:/vol/vol4/pointyhat: not responding
nfs server dumpster:/vol/vol4/pointyhat: not responding
nfs server dumpster:/vol/vol4/pointyhat: not responding
nfs server dumpster:/vol/vol4/pointyhat: not responding
load: 4.54 cmd: find 93357 [rpccon] 11.19u 111.62s 0% 4848k
About 5-10 minutes later the machine panicked. I'll try updating to a
newer -CURRENT.
Kris
This sounds like nearly exactly the same symptoms I noticed on
a -current machine a few months ago. I was doing a du on an
NFS mount, decided to ctrl-c it, got the "not responding" messages
for a while, and a few minutes later the system panicked. I hadn't
had a chance to report it yet, but I did find a workaround:
it is stable if I remove "intr" from the NFS mount options.
Hope this helps a little.

These should be fixed in the latest HEAD. It would be good to
re-enable "intr" and test it before 8.0 is released.

--
John Baldwin

Erwin Lansing

2009-06-16 16:15:27 UTC

Post by John Baldwin

This is from the RPC re-work. I had thought that this was fixed. You
shouldn't see this on the latest -CURRENT, but Doug will have more
details.

Any datepoint when these fixes went in? I upgraded pointyhat last month
exactly to get the latest fixes in, but could be there were more since
then.

You want the socket upcall locking changes in 193272 (committed June 1). You
will also want subsequent commits to the RPC and NFS code by Rick Macklem to
close a few more races. I think Rick still has one other patch that pho@ is
stress testing as well.

OK, in that case we definitely need to upgrade; my upgrade was before
those. Thanks for the info.

Best,
-erwin

--
Erwin Lansing (o_ _o) http://droso.org
\\\_\ /_///
The rest is silence <____) (____> ***@lansing.dk

Kris Kennaway

2009-06-21 12:10:55 UTC

Post by John Baldwin

This is from the RPC re-work. I had thought that this was fixed. You
shouldn't see this on the latest -CURRENT, but Doug will have more
details.

Any datepoint when these fixes went in? I upgraded pointyhat last month
exactly to get the latest fixes in, but could be there were more since
then.

You want the socket upcall locking changes in 193272 (committed June 1). You
will also want subsequent commits to the RPC and NFS code by Rick Macklem to
close a few more races. I think Rick still has one other patch that pho@ is
stress testing as well.

Got another deadlock after upgrading. Again, busy NFS volume, and
^C'ing a recursive find hung in rpccon state:

db> bt 89596
Tracing pid 89596 tid 102493 td 0xffffff0089260000
sched_switch() at sched_switch+0x17c
mi_switch() at mi_switch+0x21d
sleepq_switch() at sleepq_switch+0x123
sleepq_timedwait() at sleepq_timedwait+0x4d
_sleep() at _sleep+0x301
clnt_reconnect_call() at clnt_reconnect_call+0x5d3
nfs_request() at nfs_request+0x225
nfs_statfs() at nfs_statfs+0x197
__vfs_statfs() at __vfs_statfs+0x28
kern_fstatfs() at kern_fstatfs+0x286
fstatfs() at fstatfs+0x34
syscall() at syscall+0x1af
Xfast_syscall() at Xfast_syscall+0xd0
--- syscall (397, FreeBSD ELF64, fstatfs), rip = 0x800726dcc, rsp =
0x7fffffffe1a8, rbp = 0x1000 ---

These are mounted with intr, I'll try disabling that next.

Kris

Kris Kennaway

2009-06-21 12:16:58 UTC

Post by Kris Kennaway

Post by John Baldwin

This is from the RPC re-work. I had thought that this was fixed. You
shouldn't see this on the latest -CURRENT, but Doug will have more
details.

Any datepoint when these fixes went in? I upgraded pointyhat last month
exactly to get the latest fixes in, but could be there were more since
then.

You want the socket upcall locking changes in 193272 (committed June
1). You will also want subsequent commits to the RPC and NFS code by
Rick Macklem to close a few more races. I think Rick still has one
other patch that pho@ is stress testing as well.

Got another deadlock after upgrading. Again, busy NFS volume, and
^C'ing a recursive find hung in rpccon state:
db> bt 89596
Tracing pid 89596 tid 102493 td 0xffffff0089260000
sched_switch() at sched_switch+0x17c
mi_switch() at mi_switch+0x21d
sleepq_switch() at sleepq_switch+0x123
sleepq_timedwait() at sleepq_timedwait+0x4d
_sleep() at _sleep+0x301
clnt_reconnect_call() at clnt_reconnect_call+0x5d3
nfs_request() at nfs_request+0x225
nfs_statfs() at nfs_statfs+0x197
__vfs_statfs() at __vfs_statfs+0x28
kern_fstatfs() at kern_fstatfs+0x286
fstatfs() at fstatfs+0x34
syscall() at syscall+0x1af
Xfast_syscall() at Xfast_syscall+0xd0
--- syscall (397, FreeBSD ELF64, fstatfs), rip = 0x800726dcc, rsp =
0x7fffffffe1a8, rbp = 0x1000 ---
These are mounted with intr, I'll try disabling that next.
Kris
_______________________________________________
http://lists.freebsd.org/mailman/listinfo/freebsd-current

Rick Macklem

2009-06-22 01:21:34 UTC

Post by Kris Kennaway
Got another deadlock after upgrading. Again, busy NFS volume, and ^C'ing a
recursive find hung in rpccon state:
db> bt 89596
Tracing pid 89596 tid 102493 td 0xffffff0089260000
sched_switch() at sched_switch+0x17c
mi_switch() at mi_switch+0x21d
sleepq_switch() at sleepq_switch+0x123
sleepq_timedwait() at sleepq_timedwait+0x4d
_sleep() at _sleep+0x301
clnt_reconnect_call() at clnt_reconnect_call+0x5d3
nfs_request() at nfs_request+0x225
nfs_statfs() at nfs_statfs+0x197
__vfs_statfs() at __vfs_statfs+0x28
kern_fstatfs() at kern_fstatfs+0x286
fstatfs() at fstatfs+0x34
syscall() at syscall+0x1af
Xfast_syscall() at Xfast_syscall+0xd0
--- syscall (397, FreeBSD ELF64, fstatfs), rip = 0x800726dcc, rsp =
0x7fffffffe1a8, rbp = 0x1000 ---
These are mounted with intr, I'll try disabling that next.

There are two sleeps in clnt_rc.c. One of them optionally does a PCATCH
and returns when interrupted via ^C, but the other one (which it is
sleeping on above), doesn't. I've emailed Kris a small patch that
changes that for him to test.

If anyone else wants to test the patch, just email me for a copy, rick
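
For readers unfamiliar with PCATCH, the difference between the two sleeps can be illustrated with a toy user-space model. This is not the code in clnt_rc.c and not the patch mentioned above: model_sleep(), reconnect_loop(), and the round limit are hypothetical names invented for the illustration, and the PCATCH value is only assumed to match FreeBSD's sys/param.h. The model also assumes that every reconnect attempt fails, as it would with an unreachable server.

```python
import errno

PCATCH = 0x100  # assumed to match sys/param.h; only used as a flag bit here


def model_sleep(flags, signal_pending):
    """Toy stand-in for a timed kernel sleep; returns an errno-style code."""
    if (flags & PCATCH) and signal_pending:
        return errno.EINTR        # the sleep is cut short by the signal
    return errno.EWOULDBLOCK      # the sleep runs until its timeout expires


def reconnect_loop(flags, signal_pending, max_rounds):
    """Retry a connect that always fails, sleeping between attempts."""
    for attempt in range(1, max_rounds + 1):
        if model_sleep(flags, signal_pending) == errno.EINTR:
            return ("interrupted", attempt)
    return ("still retrying", max_rounds)


print(reconnect_loop(PCATCH, True, 5))  # sleep that honours ^C
print(reconnect_loop(0, True, 5))       # sleep that ignores ^C
```

In this model the loop that sleeps without PCATCH never notices the pending signal, which is consistent with a process that stays in rpccon after ^C.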

Adrenalin

2010-03-09 23:19:30 UTC

Hi, I would like to know if this bug has been fixed in the FreeBSD 8 release,
since I have hit it 3 times already on a busy box that uses NFS heavily (with
lots of files).
Unfortunately my processes are not compiled with debug symbols (so I cannot
get a backtrace), but all the php-cgi processes are stuck in the "rpccon" state
just like described here. I cannot kill them and I cannot cleanly reboot;
a manual restart is required.

FreeBSD g4.torrentsmd.com 8.0-RELEASE FreeBSD 8.0-RELEASE #0: Sat Nov 21
15:02:08 UTC 2009 ***@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC
amd64

4063 www 1 52 0 82576K 26320K rpccon 3 1:40 0.00% php-cgi
4078 www 1 48 0 83600K 26768K rpccon 1 1:37 0.00% php-cgi
4129 www 1 52 0 83600K 26740K rpccon 1 1:31 0.00% php-cgi
4159 www 1 55 0 82832K 26216K rpccon 0 1:24 0.00% php-cgi
4184 www 1 54 0 90768K 34104K rpccon 0 1:16 0.00% php-cgi
4174 www 1 50 0 82832K 23396K rpccon 0 1:15 0.00% php-cgi
4258 www 1 55 0 82064K 24224K rpccon 1 1:06 0.00% php-cgi

I believe the error was triggered when these messages appeared:
Mar 9 20:00:31 sv kernel: nfs server s:/path/pah/paf: not responding
Mar 9 20:00:36 sv last message repeated 23 times

My fstab looks like this; I use the -b flag:
sv:/path/pah/paf /path/fap/hap/afh nfs rw,-b 0 0

Since it's a remote box, I'm afraid of botching the kernel recompilation
from "Stable", and I'm not even sure it will help. Do you have any
suggestions? Thank you.

Nicu.

Post by Rick Macklem

Got another deadlock after upgrading. Again, busy NFS volume, and ^C'ing
a recursive find hung in rpccon state:
db> bt 89596
Tracing pid 89596 tid 102493 td 0xffffff0089260000
sched_switch() at sched_switch+0x17c
mi_switch() at mi_switch+0x21d
sleepq_switch() at sleepq_switch+0x123
sleepq_timedwait() at sleepq_timedwait+0x4d
_sleep() at _sleep+0x301
clnt_reconnect_call() at clnt_reconnect_call+0x5d3
nfs_request() at nfs_request+0x225
nfs_statfs() at nfs_statfs+0x197
__vfs_statfs() at __vfs_statfs+0x28
kern_fstatfs() at kern_fstatfs+0x286
fstatfs() at fstatfs+0x34
syscall() at syscall+0x1af
Xfast_syscall() at Xfast_syscall+0xd0
--- syscall (397, FreeBSD ELF64, fstatfs), rip = 0x800726dcc, rsp =
0x7fffffffe1a8, rbp = 0x1000 ---
These are mounted with intr, I'll try disabling that next.
There are two sleeps in clnt_rc.c. One of them optionally does a PCATCH
and returns when interrupted via ^C, but the other one (which it is
sleeping on above), doesn't. I've emailed Kris a small patch that
changes that for him to test.
If anyone else wants to test the patch, just email me for a copy, rick

Rick Macklem

2010-03-10 02:44:14 UTC

Post by Adrenalin
Hi, I would like to know if this bug has been fixed in the FreeBSD 8 release,
since I have hit it 3 times already on a busy box that uses NFS heavily (with
lots of files).
Unfortunately my processes are not compiled with debug symbols (so I cannot
get a backtrace), but all the php-cgi processes are stuck in the "rpccon" state
just like described here. I cannot kill them and I cannot cleanly reboot;
a manual restart is required.

I suspect that what you are seeing is fixed by:
http://people.freebsd.org/~rmacklem/patches/freebsd8-clntvc.patch

It is in head and stable/8, but did not make the FreeBSD8.0 release.

I believe that the patch is safe to add to a FreeBSD8.0 release kernel.

rick

Adrenalin

2010-03-10 11:38:18 UTC

Rick, thank you for the answer.

I'll "put my heart into the teeth" and will try to go for the stable.

This is kind of critical I guess; I think it should really go into the
release. It's the first time I've run into such a nasty bug with FreeBSD.

Post by Rick Macklem

Post by Adrenalin
Hi, I would like to know if this bug has been fixed in the FreeBSD 8 release,
since I have hit it 3 times already on a busy box that uses NFS heavily (with
lots of files).
Unfortunately my processes are not compiled with debug symbols (so I cannot
get a backtrace), but all the php-cgi processes are stuck in the "rpccon" state
just like described here. I cannot kill them and I cannot cleanly reboot;
a manual restart is required.

I suspect that what you are seeing is fixed by:
http://people.freebsd.org/~rmacklem/patches/freebsd8-clntvc.patch
It is in head and stable/8, but did not make the FreeBSD8.0 release.
I believe that the patch is safe to add to a FreeBSD8.0 release kernel.
rick

Julian Elischer

2010-03-10 18:31:33 UTC

Post by Adrenalin
Rick, thank you for the answer.
I'll "put my heart into the teeth" and will try to go for the stable.
This is kind of critical I guess; I think it should really go into the
release. It's the first time I've run into such a nasty bug with FreeBSD.

Unfortunately the release has 'left the station', however I guess we
might be able to add a pointer in the erratum document.
I'm not sure who is in charge of that doc. (re?)

Rick Macklem

2010-03-11 21:01:10 UTC

Post by Adrenalin
I'll "put my heart into the teeth" and will try to go for the stable.

You can just apply this one patch to your 8.0 kernel, there is no need
to go to stable/8 (at least for this fix).

Post by Adrenalin
This is kind of critical I guess; I think it should really go into the
release. It's the first time I've run into such a nasty bug with FreeBSD.

It wasn't identified until the release was headed out the door, so it
couldn't make it in, but will be in 8.1.

rick

Adrenalin

2010-03-22 18:26:47 UTC

That's strange: after recompiling the latest 8_0, which contains the patch (
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/rpc/clnt_vc.c.diff?r1=1.8.2.2.2.1;r2=1.8.2.2.2.2),
it got stuck again after 5 days with the same symptoms. I've also got some
processes in the nfs state:

FreeBSD .. 8.0-RELEASE-p2 FreeBSD 8.0-RELEASE-p2 #0: Tue Mar 16 22:56:51 EET
2010 ***@..:/usr/obj/usr/src/sys/MYGEN amd64

When attaching the debugger to an rpccon process, it is stuck here:
#0 0x000000080124051c in stat () from /lib/libc.so.7

http://img705.imageshack.us/img705/741/10032219218.png

Can I debug the kernel online, or how else can I help you solve
the problem?

Rick Macklem

2010-03-25 14:16:53 UTC

Post by Adrenalin
That's strange: after recompiling the latest 8_0, which contains the patch (
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/rpc/clnt_vc.c.diff?r1=1.8.2.2.2.1;r2=1.8.2.2.2.2),
it got stuck again after 5 days with the same symptoms. I've also got some
processes in the nfs state:
FreeBSD .. 8.0-RELEASE-p2 FreeBSD 8.0-RELEASE-p2 #0: Tue Mar 16 22:56:51 EET
2010 ***@..:/usr/obj/usr/src/sys/MYGEN amd64
When attaching the debugger to an rpccon process, it is stuck here:
#0 0x000000080124051c in stat () from /lib/libc.so.7
http://img705.imageshack.us/img705/741/10032219218.png
Can I debug the kernel online, or how else can I help you solve
the problem?

Well, sleeping in "rpccon" means that the TCP connect has failed after a
soconnect() call. If you can get into a kernel debugger, there is a
global structure with more error information in it.
It is called: rpc_createerr
- and it has 2 enums, followed by an int. The first enum should be 12
(RPC_SYSTEMERR), which is what gets it to tsleep(.."rpccon"..), the
second enum doesn't apply to this case and the int after them should
be the errno of the soconnect() failure. (The way the code is currently
written, it could either be an error return from soconnect() or a value
set in so_error after soconnect() returns, while it is in the process
of connecting.)

So, if you can get to that 3rd field, the value there might help tell
why the TCP connect is failing. Otherwise, all I can suggest is poking
around and trying to figure out why TCP connects are failing.
- wedged network interface
- routing problem
- network infrastructure problem
...
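
Once the three values of rpc_createerr described above have been read out of the debugger, interpreting them could be sketched as follows. This is only a sketch: describe_createerr() is a hypothetical helper, the two-enums-then-int layout and the value 12 for RPC_SYSTEMERR are taken from the description above, and the errno table lists just a handful of connect-related numbers as they appear in FreeBSD's sys/errno.h (worth checking against the running system). The sample inputs are made up.

```python
RPC_SYSTEMERR = 12  # value quoted in the description above

# A few connect-related errno numbers as defined in FreeBSD's sys/errno.h.
# They are hard-coded because errno numbering differs between operating
# systems, so the analysis host's own errno module may not match.
FREEBSD_ERRNO = {
    48: "EADDRINUSE",
    49: "EADDRNOTAVAIL",
    51: "ENETUNREACH",
    54: "ECONNRESET",
    60: "ETIMEDOUT",
    61: "ECONNREFUSED",
    65: "EHOSTUNREACH",
}


def describe_createerr(first_enum, second_enum, err):
    """Interpret the three fields; second_enum is accepted but not used,
    since it does not apply to the rpccon case."""
    if first_enum != RPC_SYSTEMERR:
        return ("first enum is %d, not RPC_SYSTEMERR (12); "
                "the errno field is not meaningful here" % first_enum)
    name = FREEBSD_ERRNO.get(err, "not in this table")
    return "soconnect() failure, errno %d (%s)" % (err, name)


print(describe_createerr(12, 0, 61))
print(describe_createerr(12, 0, 60))
```

A refused connection and a timed-out connection point at quite different causes, so the third field narrows down which of the items listed above to look at first.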
(Btw, I was driven a little batty at UofG because the campus network
switch I was on would decide to inject TCP RSTs into new connection
attempts for some reason. I finally was able to determine this by
looking at packet traces on both client and server and see the RSTs
coming out of the network on the client end, but never sent on the
server end. It was some Cisco related parameter/issue that was never
resolved.)
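
The trace comparison described in that aside can be sketched roughly as below: collect the RSTs the server actually sent, then report any RST in the client-side trace that has no counterpart. The Segment record and injected_rsts() are hypothetical, the addresses and sequence numbers are invented, and the sketch assumes both traces have already been parsed into such records and that no NAT rewrites addresses or ports between the two capture points.

```python
from collections import namedtuple

# Hypothetical record for one captured TCP segment.
Segment = namedtuple("Segment", "src sport dst dport seq flags")


def key(seg):
    """Identify a segment by its endpoints and sequence number."""
    return (seg.src, seg.sport, seg.dst, seg.dport, seg.seq)


def injected_rsts(client_trace, server_trace):
    """Return RSTs seen in the client trace that the server never sent."""
    sent_by_server = {key(s) for s in server_trace if "R" in s.flags}
    return [s for s in client_trace
            if "R" in s.flags and key(s) not in sent_by_server]


syn = Segment("10.0.0.1", 1021, "10.0.0.2", 2049, 999, "S")
rst = Segment("10.0.0.2", 2049, "10.0.0.1", 1021, 1000, "R")

client_trace = [syn, rst]   # the client saw a RST come back
server_trace = [syn]        # the server only ever saw the SYN

for seg in injected_rsts(client_trace, server_trace):
    print("RST not sent by server:", seg)
```

An empty result would suggest the RSTs really come from the server, in which case the server side is the place to look.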

Hopefully others with more TCP expertise can make suggestions w.r.t.
why the TCP connects are failing?

Good luck with it, rick