Discussion:
[OpenAFS-Doc] Process signals
Jason Edgecombe
2007-08-25 14:27:35 UTC
Hi All,

Derrick gave a very useful tidbit on the -info list. You can use kill
-XCPU on the fileserver to print out a list of connected clients.
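
For example (pgrep and the dump-file locations below are my assumptions
about a typical installation, and should be verified as part of documenting
this):

kill -XCPU `pgrep -x fileserver`

I believe the host and client lists end up in dump files under the
server's local directory (e.g. /usr/afs/local/hosts.dump and clients.dump),
but that's exactly the sort of detail that needs writing down.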

This isn't in the fileserver man page. Are there any objections to
putting it in there? What section should it go under, "description"
possibly?

What other useful signals are lurking out there, and where can we find out
what they are and what they do? I think these need to be documented.

Sincerely,
Jason

-------- Original Message --------
Subject: Re: [OpenAFS] IP-based ACLs failing
Date: Sat, 25 Aug 2007 01:19:55 -0400
From: Derrick Brashear <***@gmail.com>
To: Stephen Joyce <***@physics.unc.edu>
CC: openafs-***@openafs.org
References: <***@hellbender.physics.unc.edu>



kill -XCPU the fileserver, and look at the host list. I bet the IP
addresses you care about show "alternate" addresses (presumably illegit).
Jeffrey Altman
2007-08-25 15:17:48 UTC
Post by Jason Edgecombe
Hi All,
Derrick gave a very useful tidbit on the -info list. You can use kill
-XCPU on the fileserver to print out a list of connected clients.
This isn't in the fileserver man page. Are there any objections to
putting it in there? What section should it go under, "description"
possibly?
What other useful signals are lurking out there, and where can we find out
what they are and what they do? I think these need to be documented.
Sincerely,
Jason
There should be a section on troubleshooting. I would put it there.
Russ Allbery
2007-08-25 16:07:02 UTC
Post by Jason Edgecombe
Derrick gave a very useful tidbit on the -info list. You can use kill
-XCPU on the fileserver to print out a list of connected clients.
This isn't in the fileserver man page. Are there any objections to putting
it in there? What section should it go under, "description" possibly?
A separate troubleshooting section sounds right to me as well.
Post by Jason Edgecombe
What other useful signals are lurking out there, and where can we find out
what they are and what they do? I think these need to be documented.
Here's an internal Stanford document that includes a bunch of
Stanford-specific tricks but has some additional details like that worth
extracting.

Author: Russ Allbery <***@stanford.edu>
Subject: Debugging AFS file server load problems
Revision: $Id: debug-fileserver,v 1.6 2005/01/21 01:25:10 eagle Exp $

The basic metric of whether an AFS file server is doing well is its
blocked connection count. We regularly monitor this in two ways, once via
Nagios to send pages and mail if it goes over a fairly low number, and
once for the statistics page:

<http://www.stanford.edu/services/afs/cellinfo/clients.html>

If the blocked connection count is ever above 0, the server is having
problems replying to clients in a timely fashion. If it gets above 10,
roughly, there will be user-noticeable slowness. (The total number of
connections is a mostly irrelevant number that increases essentially
monotonically for as long as the server has been running and then drops
back to zero when it's restarted.)

To determine the blocked connection count by hand, run:

/usr/afsws/etc/rxdebug <server> | grep waiting_for

Each line returned is a blocked connection.
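
A rough sketch of the sort of check we run from Nagios (not the actual
plugin; substitute the server name for <server> as above):

blocked=`/usr/afsws/etc/rxdebug <server> | grep -c waiting_for`
if [ "$blocked" -gt 0 ]; then
    echo "WARNING: $blocked blocked connections on <server>"
fi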

The most common cause of blocked connections rising on a server is some
process somewhere performing an abnormal number of accesses to that server
and its volumes. If multiple servers show blocked connections, the
most likely explanation is that there is a volume replicated between those
servers that is absorbing an abnormally high access rate.

To get an access count on all the volumes on a server, run:

vos listvol <server> -long

and save the output in a file. The results will look like a bunch of vos
examine output for each volume on the server. Look for lines like:

40065 accesses in the past day (i.e., vnode references)

and look for volumes with an abnormally high number of accesses. Anything
over 10,000 is fairly high, but some of our core infrastructure volumes
like users.a, pubsw, systems, group.homepage, and the like will have that
many hits routinely. Anything over 100,000 is generally abnormally high.
The count resets about once a day.
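
A rough way to pull the busiest volumes out of that saved output (a
sketch; it assumes the listing was saved to volumes.txt and keys off the
"On-line" header lines and the "accesses in the past day" lines):

awk '/On-line|Off-line/ { vol = $1 }
     /accesses in the past day/ && $1 > 10000 { print $1, vol }' volumes.txt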

Another approach that can be used to narrow the possibilities for a
replicated volume, when multiple servers are having trouble, is to find
all replicated volumes for that server. Run:

lvldbs <server>

where <server> is one of the servers having problems. This refreshes the
VLDB cache in /afs/ir/service/afs/data for that server. Then run:

shortvldb <server> <partition>

to get a list of all volumes on that server and partition, including every
other server that they're replicated to. So, for example, if volumes are
replicated on afssvr19 /vicepa, afssvr23, and afssvr22, commands like:

lvldbs afssvr19
shortvldb afssvr19 a | grep '22.' | grep '23.'

will show you all of the volumes replicated across those three servers.
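
(lvldbs and shortvldb are local wrappers; with stock OpenAFS tools, roughly
the same information can be read straight out of the VLDB, for example:

vos listvldb -server afssvr19 -partition a

which lists every volume with a site on that server and partition along
with its other replication sites.)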

Once the volume causing the problem has been identified, the best way to
deal with the problem is to move that volume to another server with a low
load. Often the volume name alone is enough to tell what's going on, by
scanning the cluster for scripts run by that user (if it's a user volume)
or using that program (if it's a pubsw volume).

If you still need additional information about who's hitting that server,
sometimes you can guess at that information from the failed callbacks in
the FileLog log in /var/log/afs on the server, or from the output of:

/usr/afsws/etc/rxdebug <server> -rxstats

but the best way is to turn on debugging output from the file server.
(Warning: This generates a *lot* of output into FileLog on the AFS
server.) To do this, log on to the AFS server, find the PID of the
fileserver process, and do:

kill -TSTP <pid>

This will raise the debugging level so that you'll start seeing what
people are actually doing on the server. You can do this up to three more
times to get even more output if needed. To reset the debugging level
back to normal, use:

kill -HUP <pid>

(No, this won't terminate the file server.) Be sure to reset debugging
back to normal when you're done, or the AFS server may well fill its disks
with debugging output.
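
Putting the sequence together (pgrep is an assumption; use whatever you
normally use to find the fileserver PID):

pid=`pgrep -x fileserver`
kill -TSTP $pid      # raise the debugging level; repeat up to three more times
tail -f /var/log/afs/FileLog
kill -HUP $pid       # reset debugging to normal when you're done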

The lines of the debugging output that I've found the most useful for
debugging load problems are:

SAFS_FetchStatus, Fid = 2003828163.77154.82248, Host 171.64.15.76
SRXAFS_FetchData, Fid = 2003828163.77154.82248

(partly truncated to highlight the interesting information). The Fid
identifies the volume and the vnode within that volume; the volume ID is
the first long number. So, for example, this was:

afssvr5:~> vos examine 2003828163
pubsw.matlab61                   2003828163 RW    1040060 K  On-line
    afssvr5.Stanford.EDU /vicepa
    RWrite  2003828163 ROnly  2003828164 Backup  2003828165
    MaxQuota    3000000 K
    Creation    Mon Aug  6 16:40:55 2001
    Last Update Tue Jul 30 19:00:25 2002
    86181 accesses in the past day (i.e., vnode references)

    RWrite: 2003828163     ROnly: 2003828164     Backup: 2003828165
    number of sites -> 3
       server afssvr5.Stanford.EDU partition /vicepa RW Site
       server afssvr11.Stanford.EDU partition /vicepd RO Site
       server afssvr5.Stanford.EDU partition /vicepa RO Site

and from the Host information one can tell what system is accessing that
volume.

Note that the output of vos examine also includes the access count, so
once the problem has been identified, vos examine can be used to see if
the access count is still increasing. Also remember that you can run vos
examine on, e.g., pubsw.matlab61.readonly to see the access counts on the
read-only replica on all of the servers that it's located on.
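
Once debugging is turned up, a quick way to rank the volumes generating
the most traffic is something like (a rough sketch against the debug lines
shown above):

grep 'Fid = ' /var/log/afs/FileLog |
    sed 's/.*Fid = \([0-9]*\)\..*/\1/' |
    sort | uniq -c | sort -rn | head

and then feed the top volume IDs to vos examine.
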
--
Russ Allbery (***@stanford.edu) <http://www.eyrie.org/~eagle/>
Jason Edgecombe
2007-08-25 17:32:22 UTC
Post by Russ Allbery
A separate troubleshooting section sounds right to me as well.
"perldoc pod2man" lists a "diagnostic" section, but not a
troubleshooting section. The diagnostic section lists all of the
program's messages and what they mean.

Which is more appropriate and descriptive? Troubleshooting or
Diagnostic? I'm leaning towards diagnostic.

Jason
Russ Allbery
2007-08-25 17:53:33 UTC
Post by Jason Edgecombe
Post by Russ Allbery
A separate troubleshooting section sounds right to me as well.
"perldoc pod2man" lists a "diagnostic" section, but not a
troubleshooting section. The diagnostic section lists all of the
program's messages and what they mean.
Which is more appropriate and descriptive? Troubleshooting or
Diagnostic? I'm leaning towards diagnostic.
I think Troubleshooting is slightly clearer, but I don't have a strong
preference.
--
Russ Allbery (***@stanford.edu) <http://www.eyrie.org/~eagle/>
Christopher D. Clausen
2007-08-25 18:00:06 UTC
Post by Russ Allbery
Post by Jason Edgecombe
Post by Russ Allbery
A separate troubleshooting section sounds right to me as well.
"perldoc pod2man" lists a "diagnostic" section, but not a
troubleshooting section. The diagnostic section lists all of the
program's messages and what they mean.
Which is more appropriate and descriptive? Troubleshooting or
Diagnostic? I'm leaning towards diagnostic.
I think Troubleshooting is slightly clearer, but I don't have a strong
preference.
Isn't kill -TSTP the same as specifying a -debug XXX number as a command
line parameter to the various server binaries? Can't you just include
this information under the -debug option description?

<<CDC
Russ Allbery
2007-08-25 18:24:04 UTC
Post by Christopher D. Clausen
Isn't kill -TSTP the same as specifying a -debug XXX number as a command
line parameter to the various server binaries?
Yes.
Post by Christopher D. Clausen
Can't you just include this information under the -debug option
description?
I don't think anyone would ever find it there.

The other option is to have a section explicitly titled SIGNALS and put
this stuff there, which isn't a bad idea. I think we could still use a
TROUBLESHOOTING section, though, for various other things.
--
Russ Allbery (***@stanford.edu) <http://www.eyrie.org/~eagle/>
Jason Edgecombe
2007-08-25 20:52:48 UTC
Post by Russ Allbery
Post by Jason Edgecombe
Derrick gave a very useful tidbit on the -info list. You can use kill
-XCPU on the fileserver to print out a list of connected clients.
This isn't in the fileserver man page. Are there any objections to putting
it in there? What section should it go under, "description" possibly?
A separate troubleshooting section sounds right to me as well.
Post by Jason Edgecombe
What other useful signals are lurking out there, and where can we find out
what they are and what they do? I think these need to be documented.
Here's an internal Stanford document that includes a bunch of
Stanford-specific tricks but has some additional details like that worth
extracting.
I have attached the diff for the updated fileserver.pod man page which
includes the signals and your document.

Sincerely,
Jason
Jeffrey Altman
2007-08-25 21:04:18 UTC
My guess is that 'lvldbs' and 'shortvldb' are Stanford-specific scripts.
As such, they shouldn't be referenced in a man page.
Jason Edgecombe
2007-08-25 21:32:34 UTC
Post by Jeffrey Altman
My guess is that 'lvldbs' and 'shortvldb' are Stanford-specific scripts.
As such, they shouldn't be referenced in a man page.
You're right. Here is the second draft with that paragraph removed and
including Andrew Deason's correction.

Thanks,
Jason
