Nagios NRPE to Monitor Remote Linux Server

NRPE Remote Server Installation and Setup
Create Nagios user account on remote server to be monitored:

# useradd nagios
# passwd nagios
Download and Install Nagios Plugins:

# mkdir -p /opt/Nagios/Nagios_Plugins
# cd /opt/Nagios/Nagios_Plugins
Save the downloaded file to the directory /opt/Nagios/Nagios_Plugins:

http://www.nagios.org/download/download.php

As of this writing Nagios 3.0.6 (Stable) and Nagios Plugins 1.4.13 (Stable)

Extract Files:

# tar xzf nagios-plugins-1.4.13.tar.gz

# cd nagios-plugins-1.4.13
Compile and Configure Nagios Plugins

** You need the openssl-devel package installed to compile plugins with ssl support. **

# yum -y install openssl-devel
Install the Plugins:

# ./configure --with-nagios-user=nagios --with-nagios-group=nagios
# make
# make install
The permissions on the plugin directory and the plugins need to be changed to the nagios user:

# chown nagios.nagios /usr/local/nagios
# chown -R nagios.nagios /usr/local/nagios/libexec
The xinetd package is also needed:

# yum install xinetd
Download and Install the NRPE Daemon

# mkdir -p /opt/Nagios/Nagios_NRPE
# cd /opt/Nagios/Nagios_NRPE
Save the downloaded file to the directory /opt/Nagios/Nagios_NRPE:

http://www.nagios.org/download/download.php

As of this writing NRPE 2.12 (Stable)

Extract the Files:

# tar -xzf nrpe-2.12.tar.gz
# cd nrpe-2.12
Compile and Configure NRPE

** You need the openssl-devel package installed to compile NRPE with ssl support. **

# yum -y install openssl-devel
Install NRPE:

# ./configure

General Options:
-------------------------
NRPE port: 5666
NRPE user: nagios
NRPE group: nagios
Nagios user: nagios
Nagios group: nagios

# make all

# make install-plugin

# make install-daemon

# make install-daemon-config

# make install-xinetd
Post NRPE Configuration

Edit Xinetd NRPE entry:

Add the Nagios monitoring server's IP address to the "only_from" directive:

# vi /etc/xinetd.d/nrpe

only_from = 127.0.0.1 <Nagios_server_IP>
Edit services file entry:

Add entry for nrpe daemon

# vi /etc/services

nrpe 5666/tcp # NRPE
Restart Xinetd and Set to start at boot:

# chkconfig xinetd on

# service xinetd restart
Test NRPE Daemon Install

Check NRPE daemon is running and listening on port 5666:

# netstat -at |grep nrpe
Output should be:

tcp 0 0 *:nrpe *:* LISTEN
Check NRPE daemon is functioning:

# /usr/local/nagios/libexec/check_nrpe -H localhost
Output should be NRPE version:

NRPE v2.12
Open Port 5666 on Firewall

Make sure to open port 5666 on the firewall of the remote server so that the Nagios monitoring server can access the NRPE daemon.
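For example, on a CentOS/RHEL box using iptables (a minimal sketch; adapt the rule position and the persistence mechanism to your own firewall setup):

# iptables -I INPUT -p tcp --dport 5666 -j ACCEPT
# service iptables save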

Nagios Monitoring Host Server Setup
Download and Install the NRPE Plugin

# mkdir -p /opt/Nagios/Nagios_NRPE
# cd /opt/Nagios/Nagios_NRPE
Save the downloaded file to the directory /opt/Nagios/Nagios_NRPE:

http://www.nagios.org/download/download.php

As of this writing NRPE 2.12 (Stable)

Extract the Files:

# tar -xzf nrpe-2.12.tar.gz
# cd nrpe-2.12
Compile and Configure NRPE

# ./configure

# make all

# make install-plugin
Test Connection to NRPE daemon on Remote Server

Let's now make sure that the check_nrpe plugin on our Nagios server can talk to the NRPE daemon on the remote server we want to monitor. Replace <remote_host_IP> with the remote server's IP address.

# /usr/local/nagios/libexec/check_nrpe -H <remote_host_IP>
NRPE v2.12
Create NRPE Command Definition

A command definition needs to be created so that the check_nrpe plugin can be used by Nagios.

# vi /usr/local/nagios/etc/objects/commands.cfg
Add the following:

###############################################################################
# NRPE CHECK COMMAND
#
# Command to use NRPE to check remote host systems
###############################################################################

define command{
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
Create Linux Object template

In order to add the remote Linux machine to Nagios, we need to create an object template file and add some object definitions.

Create new linux-box-remote object template file:

# vi /usr/local/nagios/etc/objects/linux-box-remote.cfg
Add the following and replace the values "host_name", "alias", and "address" with the values that match your setup:

** The "host_name" you set in the second "define host" section must match the "host_name" in each "define service" section **

define host{
name linux-box-remote ; Name of this template
use generic-host ; Inherit default values
check_period 24x7
check_interval 5
retry_interval 1
max_check_attempts 10
check_command check-host-alive
notification_period 24x7
notification_interval 30
notification_options d,r
contact_groups admins
register 0 ; DONT REGISTER THIS – ITS A TEMPLATE
}

define host{
use linux-box-remote ; Inherit default values from a template
host_name Centos5 ; The name we’re giving to this server
alias Centos5 ; A longer name for the server
address 192.168.0.5 ; IP address of the server
}

define service{
use generic-service
host_name Centos5
service_description CPU Load
check_command check_nrpe!check_load
}
define service{
use generic-service
host_name Centos5
service_description Current Users
check_command check_nrpe!check_users
}
define service{
use generic-service
host_name Centos5
service_description /dev/hda1 Free Space
check_command check_nrpe!check_hda1
}
define service{
use generic-service
host_name Centos5
service_description Total Processes
check_command check_nrpe!check_total_procs
}
define service{
use generic-service
host_name Centos5
service_description Zombie Processes
check_command check_nrpe!check_zombie_procs
}
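The command names used above (check_load, check_users, check_hda1, check_total_procs, check_zombie_procs) must be defined in the nrpe.cfg on the remote server. The sample nrpe.cfg installed by "make install-daemon-config" defines them roughly as follows; verify the exact paths and thresholds in your own /usr/local/nagios/etc/nrpe.cfg:

command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_hda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/hda1
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200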
Activate the linux-box-remote.cfg template:

# vi /usr/local/nagios/etc/nagios.cfg
And add:

# Definitions for monitoring remote Linux machine
cfg_file=/usr/local/nagios/etc/objects/linux-box-remote.cfg
Verify Nagios Configuration Files:

# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Total Warnings: 0
Total Errors: 0
Restart Nagios:

# service nagios restart
Check the Nagios monitoring server to verify that the remote Linux box was added and is being monitored!

Troubleshooting
NRPE ./configure error:

checking for SSL headers… configure: error: Cannot find ssl headers
Solution:

You need to install the openssl-devel package

# yum -y install openssl-devel
CHECK_NRPE: Error - Could not complete SSL handshake
Solution:

This is most likely not a problem with SSL but rather with xinetd access restrictions.

Check the following files:

/etc/xinetd.d/nrpe

/etc/hosts.allow

/etc/hosts.deny

How to add a SATA disk to Solaris 10 6/06

A common task is connecting another SATA disk to a system. If you are used to running Linux, you will find it logical that Linux just sees the disk and you can fdisk it right away. Solaris is another story.

To let Solaris know a new disk has been added, I run the command:

devfsadm -vC

It scans the system and will add and remove new and old devices. When the command completes, it is time for another command:

format

This command lets you choose a disk to format and create a Solaris label on it. After creating the label, have fun slicing the disk. Preparing a disk is done like this:

# format

Searching for disks…done

AVAILABLE DISK SELECTIONS:

0. c2d0

/pci@0,0/pci-ide@8/ide@1/cmdk@0,0

1. c3d0

/pci@0,0/pci-ide@7/ide@0/cmdk@0,0

Specify disk (enter its number): 1

selecting c3d0

Controller working list found

[disk formatted, defect list found]

FORMAT MENU:

disk – select a disk

type – select (define) a disk type

partition – select (define) a partition table

current – describe the current disk

format – format and analyze the disk

fdisk – run the fdisk program

repair – repair a defective sector

show – translate a disk address

label – write label to the disk

analyze – surface analysis

defect – defect list management

backup – search for backup labels

verify – read and display labels

save – save new disk/partition definitions

volname – set 8-character volume name

!<cmd> – execute <cmd>, then return

quit

format> fdisk

No fdisk table exists. The default partition for the disk is:

a 100% “SOLARIS System” partition

Type “y” to accept the default partition, otherwise type “n” to edit the

partition table.

y

format> lABEL

`lABEL’ is not expected.

format> label

Ready to label disk, continue? y

format> verify

Warning: Primary label on disk appears to be different from

current label.

Warning: Check the current partitioning and ‘label’ the disk or use the

‘backup’ command.

Primary label contents:

Volume name =

ascii name =

pcyl = 30400

ncyl = 30398

acyl = 2

bcyl = 0

nhead = 255

nsect = 63

Part Tag Flag Cylinders Size Blocks

0 unassigned wm 0 0 (0/0/0) 0

1 unassigned wm 0 0 (0/0/0) 0

2 backup wu 0 – 30397 232.86GB (30398/0/0) 488343870

3 unassigned wm 0 0 (0/0/0) 0

4 unassigned wm 0 0 (0/0/0) 0

5 unassigned wm 0 0 (0/0/0) 0

6 unassigned wm 0 0 (0/0/0) 0

7 unassigned wm 0 0 (0/0/0) 0

8 boot wu 0 – 0 7.84MB (1/0/0) 16065

9 alternates wm 1 – 2 15.69MB (2/0/0) 32130

format> par

PARTITION MENU:

0 – change `0′ partition

1 – change `1′ partition

2 – change `2′ partition

3 – change `3′ partition

4 – change `4′ partition

5 – change `5′ partition

6 – change `6′ partition

7 – change `7′ partition

select – select a predefined table

modify – modify a predefined partition table

name – name the current table

print – display the current table

label – write partition map and label to the disk

!<cmd> – execute <cmd>, then return

quit

partition> mod

Select partitioning base:

0. Current partition table (original)

1. All Free Hog

Choose base (enter number) [0]? 1

Part Tag Flag Cylinders Size Blocks

0 root wm 0 0 (0/0/0) 0

1 swap wu 0 0 (0/0/0) 0

2 backup wu 0 – 30397 232.86GB (30398/0/0) 488343870

3 unassigned wm 0 0 (0/0/0) 0

4 unassigned wm 0 0 (0/0/0) 0

5 unassigned wm 0 0 (0/0/0) 0

6 usr wm 0 0 (0/0/0) 0

7 unassigned wm 0 0 (0/0/0) 0

8 boot wu 0 – 0 7.84MB (1/0/0) 16065

9 alternates wm 1 – 2 15.69MB (2/0/0) 32130

Do you wish to continue creating a new partition

table based on above table[yes]?

Free Hog partition[6]?

Enter size of partition ‘0’ [0b, 0c, 0.00mb, 0.00gb]:

Enter size of partition ‘1’ [0b, 0c, 0.00mb, 0.00gb]:

Enter size of partition ‘3’ [0b, 0c, 0.00mb, 0.00gb]:

Enter size of partition ‘4’ [0b, 0c, 0.00mb, 0.00gb]:

Enter size of partition ‘5’ [0b, 0c, 0.00mb, 0.00gb]:

Enter size of partition ‘7’ [0b, 0c, 0.00mb, 0.00gb]:

Part Tag Flag Cylinders Size Blocks

0 root wm 0 0 (0/0/0) 0

1 swap wu 0 0 (0/0/0) 0

2 backup wu 0 – 30397 232.86GB (30398/0/0) 488343870

3 unassigned wm 0 0 (0/0/0) 0

4 unassigned wm 0 0 (0/0/0) 0

5 unassigned wm 0 0 (0/0/0) 0

6 usr wm 3 – 30397 232.84GB (30395/0/0) 488295675

7 unassigned wm 0 0 (0/0/0) 0

8 boot wu 0 – 0 7.84MB (1/0/0) 16065

9 alternates wm 1 – 2 15.69MB (2/0/0) 32130

Okay to make this the current partition table[yes]?

Enter table name (remember quotes): “datadsk”

Ready to label disk, continue? y

partition> q

FORMAT MENU:

disk – select a disk

type – select (define) a disk type

partition – select (define) a partition table

current – describe the current disk

format – format and analyze the disk

fdisk – run the fdisk program

repair – repair a defective sector

show – translate a disk address

label – write label to the disk

analyze – surface analysis

defect – defect list management

backup – search for backup labels

verify – read and display labels

save – save new disk/partition definitions

volname – set 8-character volume name

!<cmd> – execute <cmd>, then return

quit

format> save

Saving new disk and partition definitions

Enter file name[“./format.dat”]:

format> verify

Warning: Primary label on disk appears to be different from

current label.

Warning: Check the current partitioning and ‘label’ the disk or use the

‘backup’ command.

Primary label contents:

Volume name =

ascii name =

pcyl = 30400

ncyl = 30398

acyl = 2

bcyl = 0

nhead = 255

nsect = 63

Part Tag Flag Cylinders Size Blocks

0 unassigned wm 0 0 (0/0/0) 0

1 unassigned wm 0 0 (0/0/0) 0

2 backup wu 0 – 30397 232.86GB (30398/0/0) 488343870

3 unassigned wm 0 0 (0/0/0) 0

4 unassigned wm 0 0 (0/0/0) 0

5 unassigned wm 0 0 (0/0/0) 0

6 unassigned wm 3 – 30397 232.84GB (30395/0/0) 488295675

7 unassigned wm 0 0 (0/0/0) 0

8 boot wu 0 – 0 7.84MB (1/0/0) 16065

9 alternates wm 1 – 2 15.69MB (2/0/0) 32130

format> q

Two more steps to go: first a filesystem is needed, and after that the disk must be mounted. Creating a filesystem is easy:

# newfs /dev/dsk/c3d0s6
newfs: construct a new file system /dev/rdsk/c3d0s6: (y/n)? y
Warning: 4870 sector(s) in last cylinder unallocated
/dev/rdsk/c3d0s6: 488295674 sectors in 79476 cylinders of 48 tracks, 128 sectors
238425.6MB in 4968 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
…………………………………………………………………….
………………..
super-block backups for last 10 cylinder groups at:
487395104, 487493536, 487591968, 487690400, 487788832, 487887264, 487985696,
488084128, 488182560, 488280992

And mounting the disk:

# mount /dev/dsk/c3d0s6 /mnt

# df -h /mnt/
Filesystem size used avail capacity Mounted on
/dev/dsk/c3d0s6 229G 64M 227G 1% /mnt

Well, that was easy, wasn't it?

To check the partition information:

# prtvtoc /dev/dsk/c1t1d0s2

Add the new partition details to /etc/vfstab so the filesystems are mounted after a reboot:

/dev/dsk/c1t1d0s6 /dev/rdsk/c1t1d0s6 /oradata ufs 2 yes logging
/dev/dsk/c1t2d0s6 /dev/rdsk/c1t2d0s6 /orabackup ufs 2 yes logging
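The mount points must exist before the vfstab entries can be mounted. A minimal sketch for the first entry above:

# mkdir /oradata
# mount /oradata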

How to create a self-signed SSL Certificate

Overview

The following is an extremely simplified view of how SSL is implemented and what part the certificate plays in the entire process.

Normal web traffic is sent unencrypted over the Internet. That is, anyone with access to the right tools can snoop all of that traffic. Obviously, this can lead to problems, especially where security and privacy are necessary, such as with credit card data and bank transactions. The Secure Sockets Layer is used to encrypt the data stream between the web server and the web client (the browser).

SSL makes use of what is known as asymmetric cryptography, commonly referred to as public key cryptography. With public key cryptography, two keys are created, one public, one private. Anything encrypted with either key can only be decrypted with its corresponding key. Thus if a message or data stream were encrypted with the server's private key, it can be decrypted only using its corresponding public key, ensuring that the data only could have come from the server.

If SSL utilizes public key cryptography to encrypt the data stream traveling over the Internet, why is a certificate necessary? The technical answer to that question is that a certificate is not really necessary – the data is secure and cannot easily be decrypted by a third party. However, certificates do serve a crucial role in the communication process. The certificate, signed by a trusted Certificate Authority (CA), ensures that the certificate holder is really who he claims to be. Without a trusted signed certificate, your data may be encrypted, however, the party you are communicating with may not be whom you think. Without certificates, impersonation attacks would be much more common.

Step 1: Generate a Private Key

The openssl toolkit is used to generate an RSA Private Key and CSR (Certificate Signing Request). It can also be used to generate self-signed certificates which can be used for testing purposes or internal usage.

The first step is to create your RSA Private Key. This key is a 1024 bit RSA key which is encrypted using Triple-DES and stored in a PEM format so that it is readable as ASCII text.

openssl genrsa -des3 -out server.key 1024

Generating RSA private key, 1024 bit long modulus
…………………………………………………++++++
……..++++++
e is 65537 (0x10001)
Enter PEM pass phrase:
Verifying password – Enter PEM pass phrase:

Step 2: Generate a CSR (Certificate Signing Request)

Once the private key is generated a Certificate Signing Request can be generated. The CSR is then used in one of two ways. Ideally, the CSR will be sent to a Certificate Authority, such as Thawte or Verisign who will verify the identity of the requestor and issue a signed certificate. The second option is to self-sign the CSR, which will be demonstrated in the next section.

During the generation of the CSR, you will be prompted for several pieces of information. These are the X.509 attributes of the certificate. One of the prompts will be for “Common Name (e.g., YOUR name)”. It is important that this field be filled in with the fully qualified domain name of the server to be protected by SSL. If the website to be protected will be https://public.akadia.com, then enter public.akadia.com at this prompt. The command to generate the CSR is as follows:

openssl req -new -key server.key -out server.csr

Country Name (2 letter code) [GB]:CH
State or Province Name (full name) [Berkshire]:Bern
Locality Name (eg, city) [Newbury]:Oberdiessbach
Organization Name (eg, company) [My Company Ltd]:Akadia AG
Organizational Unit Name (eg, section) []:Information Technology
Common Name (eg, your name or your server’s hostname) []:public.akadia.com
Email Address []:martin dot zahn at akadia dot ch
Please enter the following ‘extra’ attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:

Step 3: Remove Passphrase from Key

One unfortunate side-effect of the pass-phrased private key is that Apache will ask for the pass-phrase each time the web server is started. Obviously this is not necessarily convenient as someone will not always be around to type in the pass-phrase, such as after a reboot or crash. mod_ssl includes the ability to use an external program in place of the built-in pass-phrase dialog, however, this is not necessarily the most secure option either. It is possible to remove the Triple-DES encryption from the key, thereby no longer needing to type in a pass-phrase. If the private key is no longer encrypted, it is critical that this file only be readable by the root user! If your system is ever compromised and a third party obtains your unencrypted private key, the corresponding certificate will need to be revoked. With that being said, use the following command to remove the pass-phrase from the key:

cp server.key server.key.org
openssl rsa -in server.key.org -out server.key

The newly created server.key file has no more passphrase in it.

-rw-r--r-- 1 root root 745 Jun 29 12:19 server.csr
-rw-r--r-- 1 root root 891 Jun 29 13:22 server.key
-rw-r--r-- 1 root root 963 Jun 29 13:22 server.key.org

Step 4: Generating a Self-Signed Certificate

At this point you will need to generate a self-signed certificate because you either don’t plan on having your certificate signed by a CA, or you wish to test your new SSL implementation while the CA is signing your certificate. This temporary certificate will generate an error in the client browser to the effect that the signing certificate authority is unknown and not trusted.

To generate a temporary certificate which is good for 365 days, issue the following command:

openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt
Signature ok
subject=/C=CH/ST=Bern/L=Oberdiessbach/O=Akadia AG/OU=Information
Technology/CN=public.akadia.com/Email=martin dot zahn at akadia dot ch
Getting Private key
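You can inspect the resulting certificate to double-check the subject and validity period:

openssl x509 -noout -text -in server.crt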

Step 5: Installing the Private Key and Certificate

When Apache with mod_ssl is installed, it creates several directories in the Apache config directory. The location of this directory will differ depending on how Apache was compiled.

cp server.crt /usr/local/apache/conf/ssl.crt
cp server.key /usr/local/apache/conf/ssl.key

Step 6: Configuring SSL Enabled Virtual Hosts

These directives belong inside an SSL-enabled virtual host definition, for example (using mod_ssl's conventional _default_ address):

<VirtualHost _default_:443>
SSLEngine on
SSLCertificateFile /usr/local/apache/conf/ssl.crt/server.crt
SSLCertificateKeyFile /usr/local/apache/conf/ssl.key/server.key
SetEnvIf User-Agent ".*MSIE.*" nokeepalive ssl-unclean-shutdown
CustomLog logs/ssl_request_log \
"%t %h %{SSL_PROTOCOL}x %{SSL_CIPHER}x \"%r\" %b"
</VirtualHost>

Step 7: Restart Apache and Test

/etc/init.d/httpd stop
/etc/init.d/httpd start

https://public.akadia.com

How to find the WWN (World Wide Name) in Sun Solaris

World Wide Names (WWNs) are unique 8-byte (64-bit) identifiers in SCSI or Fibre Channel, similar to the MAC addresses on a Network Interface Card (NIC).
Talking about WWN names, there are also:
World Wide Port Name (WWPN), a WWN assigned to a port on a fabric, which is what you will be looking for most of the time.
World Wide Node Name (WWNN), a WWN assigned to a node/device on a Fibre Channel fabric.
To find the WWN numbers of your HBA card in Sun Solaris, you can use one of the following procedures.
Using fcinfo (Solaris 10 only)
This is probably the easiest way to find the WWN numbers on your HBA card. Here you can see the HBA Port WWN (WWPN) and the Node WWN (WWNN) of the two ports on the installed QLogic HBA card.
This is also useful for finding the model number, firmware version, FCode, supported and current speeds, and the port status of the HBA card/port.

root@ sunserver:/root# fcinfo hba-port | grep WWN
HBA Port WWN: 2100001b32xxxxxx
Node WWN: 2000001b32xxxxxx
HBA Port WWN: 2101001b32yyyyyy
Node WWN: 2001001b32yyyyyy
For detailed info, including make and model number, firmware, FCode, and current status and supported/current speeds:
root@ sunserver:/root# fcinfo hba-port
HBA Port WWN: 2100001b32xxxxxx
OS Device Name: /dev/cfg/c2
Manufacturer: QLogic Corp.
Model: 375-3356-02
Firmware Version: 4.04.01
FCode/BIOS Version: BIOS: 1.24; fcode: 1.24; EFI: 1.8;
Type: N-port
State: online
Supported Speeds: 1Gb 2Gb 4Gb
Current Speed: 4Gb
Node WWN: 2000001b32xxxxxx
HBA Port WWN: 2101001b32yyyyyy
OS Device Name: /dev/cfg/c3
Manufacturer: QLogic Corp.
Model: 375-3356-02
Firmware Version: 4.04.01
FCode/BIOS Version: BIOS: 1.24; fcode: 1.24; EFI: 1.8;
Type: unknown
State: offline
Supported Speeds: 1Gb 2Gb 4Gb
Current Speed: not established
Node WWN: 2001001b32yyyyyy

Using scli

root@ sunserver:/root# scli -i | egrep "Node Name|Port Name"
Node Name : 20-00-00-1B-32-XX-XX-XX
Port Name : 21-00-00-1B-32-XX-XX-XX
Node Name : 20-01-00-1B-32-YY-YY-YY
Port Name : 21-01-00-1B-32-YY-YY-YY

For more detailed info on the HBA cards, run scli -i on its own. The output is similar to fcinfo but also provides the model name and serial number.

root@ sunserver:/root# scli -i
——————————————————————————
Host Name : sunserver
HBA Model : QLE2462
HBA Alias :
Port : 1
Port Alias :
Node Name : 20-00-00-1B-32-XX-XX-XX
Port Name : 21-00-00-1B-32-XX-XX-XX
Port ID : 11-22-33
Serial Number : AAAAAAA-bbbbbbbbbb
Driver Version : qlc-20080514-2.28
FCode Version : 1.24
Firmware Version : 4.04.01
HBA Instance : 2
OS Instance : 2
HBA ID : 2-QLE2462
OptionROM BIOS Version : 1.24
OptionROM FCode Version : 1.24
OptionROM EFI Version : 1.08
OptionROM Firmware Version : 4.00.26
Actual Connection Mode : Point to Point
Actual Data Rate : 2 Gbps
PortType (Topology) : NPort
Total Number of Devices : 2
HBA Status : Online
——————————————————————————
Host Name : sunserver
HBA Model : QLE2462
HBA Alias :
Port : 2
Port Alias :
Node Name : 20-01-00-1B-32-YY-YY-YY
Port Name : 21-01-00-1B-32-YY-YY-YY
Port ID : 00-00-00
Serial Number : AAAAAAA-bbbbbbbbbb
Driver Version : qlc-20080514-2.28
FCode Version : 1.24
Firmware Version : 4.04.01
HBA Instance : 3
OS Instance : 3
HBA ID : 3-QLE2462
OptionROM BIOS Version : 1.24
OptionROM FCode Version : 1.24
OptionROM EFI Version : 1.08
OptionROM Firmware Version : 4.00.26
Actual Connection Mode : Unknown
Actual Data Rate : Unknown
PortType (Topology) : Unidentified
Total Number of Devices : 0
HBA Status : Loop down

Using prtconf
root@ sunserver:/root# prtconf -vp | grep -i wwn
port-wwn: 2100001b.32xxxxxx
node-wwn: 2000001b.32xxxxxx
port-wwn: 2101001b.32yyyyyy
node-wwn: 2001001b.32yyyyyy
Using prtpicl
root@ sunserver:/root# prtpicl -v | grep wwn
:node-wwn 20 00 00 1b 32 xx xx xx
:port-wwn 21 00 00 1b 32 xx xx xx
:node-wwn 20 01 00 1b 32 yy yy yy
:port-wwn 21 01 00 1b 32 yy yy yy

Using luxadm
Run the following command to obtain the physical path to the HBA Ports
root@ sunserver:/root$ luxadm -e port
/devices/pci@400/pci@0/pci@9/SUNW,qlc@0/fp@0,0:devctl CONNECTED
/devices/pci@400/pci@0/pci@9/SUNW,qlc@0,1/fp@0,0:devctl NOT CONNECTED

With the physical path obtained from the above command, we can trace the WWN numbers as follows. Here I use the physical path to the port that is connected:
root@ sunserver:/root$ luxadm -e dump_map /devices/pci@400/pci@0/pci@9/SUNW,qlc@0/fp@0,0:devctl
Pos Port_ID Hard_Addr Port WWN Node WWN Type
0 123456 0 1111111111111111 2222222222222222 0x0 (Disk device)
1 789123 0 1111111111111111 2222222222222222 0x0 (Disk device)
2 453789 0 2100001b32xxxxxx 2000001b32xxxxxx 0x1f (Unknown Type,Host Bus Adapter)

Hope this helps. If you know of any more ways then please feel free to post them in the comments and I shall add them to the article.

Solaris 10 memory usage analysis

It so happens that I need to get a bit more insight into what's eating all the RAM on one of my Solaris boxes. Whenever this happens I can never remember all the various incantations, so I'm putting them all here for future reference.
Most of these need to run as root.

prstat -a -s rss
- Quick overview of top processes ordered by physical memory consumption, plus memory consumption per user. Note that shared memory is counted against every process that maps it: if you have lots of processes all sharing (say) a 1 GB chunk of shared memory, each process will show up as using 1 GB (very noticeable with Oracle, where there can be 100 processes each with a hook into the multi-GB SGA).
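To see how much of a single process's footprint is shared versus private, pmap can break its address space down by mapping (the pid below is a placeholder):

pmap -x 1234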

ls -l /proc/{pid}/as
A nice easy way to see the address space (total memory usage) of a single process. Good for when you want to see the memory usage of a set of processes that is too large to fit into prstat, e.g.:

# is apache leaking?
for pid in `pgrep httpd`
do
ls -l /proc/$pid/as
done
vmstat -S 3
Am I swapping? Watch the si/so (swap in/out) columns; if they're not 0, you need more RAM.

vmstat 3
Am I thinking about swapping? The sr (Scan Rate) column tells you when you’re starting to run low on memory, and the kernel is scanning physical memory to find blocks that can be swapped out. c.f. Myth: Using swap is bad for performance

echo "::memstat" | mdb -k
How much memory is being used by the kernel and/or the (UFS) file system caches? (N.B. the kernel memory usage includes the ZFS ARC cache; see below.) Warning: this can take several minutes to run, and sucks up a lot of CPU time.

kstat -m zfs
How much memory is the ZFS ARC cache using? (N.B. if you have lots of ZFS data, this can be a very big number; the ARC cache will use up to (system RAM - 1 GB), but it should release RAM as soon as other apps need it.)
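To pull out just the current ARC size, kstat's parseable output helps (a one-liner sketch):

kstat -p zfs:0:arcstats:size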

chroot login HOWTO
Introduction
Testing new cross-toolchains properly requires overriding your target system’s core shared libraries with the newly created, and probably buggy, ones. Doing this without nuking your target system requires creating a sandbox of some sort. This document describes one way to set up a chroot jail on your target system suitable for running gcc and glibc remote regression tests (or many other purposes).
1. Set Up and Test a Jail
Two shell scripts are provided as an example of how to set up a jail on Linux. Please read and understand them. They are provided as executable shell scripts, but they should not be run blindly. You may need to edit the two scripts to reflect your target system's peculiarities and the needs of the programs you intend to run in the jail. The scripts as furnished have been tested both on Debian x86 and on an embedded PPC Linux system based on busybox, and contain everything needed to run the gcc and glibc regression tests (which isn't much). They assume that either busybox or ash is available, so if you're running this on a workstation, you'll need to install ash before running them. (ash is preferred over bash simply because it has fewer dependencies; you can use bash, but you'll need to modify the scripts to add the additional libraries used by bash, e.g. ncurses.)
The first shell script, mkjail.sh, makes a tarball containing core shared libraries and the jail’s etc/passwd file. The second shell script, initjail.sh, unpacks that tarball into the jail, and adds crucial /dev entries, a /proc filesystem, /etc files, core programs like sh, and non-toolchain shared libraries, and appends a given file to the jail’s /etc/passwd file.

The two-step process is exactly what you need when testing cross-toolchains; you run the first script on the development system, and the second script on the target system. However, you should run them both on the target system initially and verify that the jail works before testing them with your newly compiled and probably buggy toolchain’s shared libraries.

1.1. Setting up a jail
For example, here’s how to use the scripts to set up a minimal jail containing a copy of the system libraries and system /etc/passwd file:
$ sh mkjail.sh / /etc/passwd
$ su
# zcat jail.tar.gz | sh initjail.sh myjail

1.2. Testing the Jail
Once you have the jail set up, test it by hand using the /usr/sbin/chroot program to run a shell inside the jail, e.g.
$ su
# /usr/sbin/chroot `pwd`/myjail /bin/sh

(Note: the argument to chroot must be an absolute path, else exec fails. This appears to be a bug in Linux.) If you can’t run a shell inside the jail, try running something easier, e.g. /bin/true (the simplest program there is):
$ su
# /usr/sbin/chroot `pwd`/myjail /bin/true

Once you get that working, go back to trying the shell. The most common cause of programs not running in the jail is missing shared libraries; to fix that, just copy the missing libraries from the system into the corresponding place in the jail. (Note that shared libraries consist of one or two files, and zero or more symbolic links; take care to not follow symbolic links when copying. In the provided scripts, I use the -d option to /bin/cp to avoid dereferencing links.)
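One way to find the missing libraries is to run ldd against the failing binary outside the jail and check that each library it needs exists inside the jail; a rough sketch (the jail directory and binary are examples, and ldd output formats vary):

for lib in `ldd /bin/sh | awk '/=>/ {print $3}'`
do
  test -f myjail$lib || echo "missing: $lib"
done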
2. Configure Jail Login Scheme
Specific users can be configured such that the moment they log in, a wrapper program (see chrootshell.c) jails the user in his home directory using the chroot system call, looks up his record in the jail’s private /etc/passwd file, and uses it to set his current directory and transfer control to his preferred shell. Every program and shared library the user executes then comes from the jail, not from the surrounding system.
Only users who have the root password can set up jails, partly because setting up a jail requires mounting /proc inside the jail. Users with the root password can set up jails on behalf of lesser users (though you might want to take away their ability to run setuid root programs if you do that, else they might be able to get out of the jail…)

2.1. Install chrootshell
chrootshell will be installed setuid root, so first audit the source (chrootshell.c) for security problems, then compile and install it with the commands
$ cc chrootshell.c -o chrootshell
$ su
# install -o root -m 4755 chrootshell /sbin/chrootshell

You probably need to add the line
/sbin/chrootshell

to /etc/shells, else login may refuse to run it.
2.2. Create outer /etc/passwd entry
For each user you want to give access to a jail, create a new /etc/passwd entry for each user/jail combination, with home directory set to the jail’s top directory, and the shell set to /sbin/chrootshell, e.g.
fred2:x:1000:1000:Fred Smith's Jail:/home/fred/jail:/sbin/chrootshell

2.3. Create inner /etc/passwd entry
For each user created in the previous step, create a new /etc/passwd entry in the jail’s /etc/passwd. This entry should list a real home directory inside the jail, and a real shell, so it’ll be quite different from the one in the system /etc/passwd. For example:
fred2:x:1000:1000:Fred Smith In Jail:/home/fred2:/bin/sh

2.4. Test and Troubleshoot Jail Login
Try logging in as a jailed user via telnet, e.g.
$ telnet localhost
username: fred2
password: xxxx
$ ls

When you do an ls, you should see just the files in that user’s home directory inside the jail.
If telnet just closes the connection after login, first you need to figure out whether the problem is before or after chrootshell starts. Begin by listing the target system’s logfiles, e.g.

$ su
# cd /var/log
# ls -ltr

and examine the most recently changed log files for telnet or pam permission problems. If you see any, it means you haven't even gotten to chrootshell yet.
If it looks like chrootshell is started, but then aborts, you can turn on more helpful logging by uncommenting the line

/* #define DEBUG_PRINTS */

and recompiling and reinstalling chrootshell. This will cause verbose error messages to be sent directly to the file /var/log/chrootshell.log. The error messages tell you what line chrootshell.c is failing in, which should tell you enough to solve the problem.
3. Setting Up Berkeley r-utilities (rsh, rlogin, rcp)
For truly convenient (but insecure) access to the jail, you may want to set up the Berkeley r-utilities (rsh, rlogin, and rcp). For instance, they are the native and probably preferred way of running the gcc and glibc test suites on remote embedded systems with tcp/ip connectivity.
Incidentally, rcp and rsh are the reason I used a C program for chrootshell rather than a shell script. A shell script version of chrootshell seems to fail to handle some of the remote commands sent by the gcc regression tests due to quoting problems.

3.1. Installing r-utilities clients and servers
Beware: this is highly insecure!
To do this on a Red Hat or Debian Linux workstation, install the rsh-server and rsh-client packages. You can also do this by downloading, building, and installing ftp://ftp.gnu.org/gnu/inetutils/inetutils-1.4.2.tar.gz. (Don't use version 1.4.1; the rshd in that version did not set up the remote environment properly.) For an embedded target, the only parts of inetutils you really need to build and install are rcp, rshd, and rlogind. (Yes, you need rcp even if you just want to run the server side.)

Once the binaries are installed on the target, you need to configure the system to run them.

rshd is run via inetd or xinetd or the like.
If your target system uses inetd, here’s an example line to add to /etc/inetd.conf:

shell stream tcp nowait.1000 root /bin/rshd

The “.1000” tells inetd to expect 1000 sessions per minute, which should be sufficient to handle the gcc regression tests. (If you leave this off, inetd will stop spawning rshd after about the 40th session in a one-minute period.)
If your target system uses xinetd, you probably need to set the "cps" field of /etc/xinetd.d/rsh to a large value such as "200 5" for xinetd to handle the expected traffic. I haven't tested this, but here's what I think /etc/xinetd.d/rsh should look like:

service shell
{
socket_type = stream
wait = no
cps = 200 5
user = root
server = /usr/sbin/in.rshd
disable = no
}

rlogind is normally run standalone by giving the -d option. For instance, add this line to the target system’s startup scripts:

/bin/rlogind -d

If you want to allow remote access by root (which is highly insecure, but useful in limited situations, as you’ll see below), add the -o option.
3.2. Opening up a security hole for the r-utilities
If your systems use a firewall, you’ll need to open up TCP ports 513 (the ‘login’ service) and 514 (the ‘shell’ service). Note that this is a highly insecure thing to do, and should only be done inside a network entirely disconnected from the Internet or any large LAN, or at least well-shielded from it by another firewall.
If your system is using PAM, you may need to add entries in /etc/pam.d for rsh and rlogin. (If you installed the rsh-servers package, it probably added these entries for you.)

You probably need to add entries either to the systemwide /etc/hosts.equiv file (see ‘man hosts.equiv’) or the per-user .rhosts file (see ‘man rhosts’).
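For example, to let user fred connect from workstation wshost without a password (both names are hypothetical), the jailed account's ~/.rhosts on the target would contain:

wshost fred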

3.3. Testing and Troubleshooting rsh
First, verify you can run ‘ls’ remotely without the jail. Assuming the target system’s hostname is ‘target’, and assuming you have an account with the same name on both systems, give the following command on your workstation:
$ rsh target ls

If this doesn’t work, look at the most recently modified files in the target’s /var/log directory for interesting error messages. If that doesn’t explain the problem, try running tcpdump and looking at the packets being sent and received. If that doesn’t explain the problem, you can run rshd under strace to see what files it’s opening — maybe it’s not looking where you expected for security info. (This requires modifying /etc/inetd.conf to invoke strace -o /tmp/strace.log /bin/rshd instead of just /bin/rshd.) When all else fails, you could build rsh from source and step through it in the debugger.
Once that’s working, verify that you can spawn 200 commands in quick succession, e.g.

x=0; while test $x -lt 200; do rsh 11.3.4.1 ls; x=$(($x+1)); echo $x; done

and once that works, really overstress the system by verifying you can spawn 200 overlapping commands, e.g.
x=0; while test $x -lt 200; do rsh -n 11.3.4.1 ls & true ; x=$(($x+1)); echo $x; done

If you get the error
poll: protocol failure in circuit setup

around the 40th iteration, then you may need to edit /etc/inetd.conf and add .1000 to the end of the 'nowait' keyword (or edit /etc/xinetd.d/rsh and add the cps parameter) as described above.
3.4. Testing rlogin
Verify you can get a remote shell with either the command ‘rlogin target’ (or its synonym ‘rsh target’; rsh invokes rlogin if it’s run without a command).
3.5. Testing rcp
Verify you can copy files to the target system using rcp, e.g.
rcp /etc/hosts target:

4. Setting up chroot jails remotely
The entire reason I wrote this document was to help me achieve the goal of fully automated build-and-test of cross-toolchains. This requires a way to remotely script replacing the contents of a chroot jail with new system libraries. Once a jail has been set up for the user you plan to run the remote tests as, and it has passed all the above tests, you should be able to blow away the old jail's contents and recreate the jail remotely, filled with the system shared libraries you plan to test. For example, assuming the toolchain you want to test is in /opt/my_toolchain, and the user 'jailuser' is set up to be inside the jail on the target, you can rebuild it with the following commands:
$ TARGET=11.3.4.1
$ JAILUSER=jailuser
$ rcp $JAILUSER@$TARGET:/jail/etc/passwd jail_etc_passwd
$ sh mkjail.sh result/sh4-unknown-linux-gnu/gcc-3.3-glibc-2.2.5/sh4-unknown-linux-gnu jail_etc_passwd
$ rcp initjail.sh root@$TARGET:
$ cat jail.tar.gz | rsh -l root $TARGET /bin/sh initjail.sh /jail

Then test the jail to make sure it still works, e.g.
$ rsh -l jailuser target ls
$ rcp /etc/hosts jailuser@target:/tmp

Summary
This document has shown how to set up a target system to allow remote login and file copy into accounts running in a chroot jail. If everything described here works properly, you should be all set to run gcc regression tests remotely on the target system.
About this document
Ideas and code snippets taken variously from:
the very similar http://www.tjw.org/chroot-login-HOWTO, by Terry J. White and Brian Rhodes.
SVR4’s login’s feature whereby a “*” in the shell field of /etc/passwd triggered a chroot, http://www.mcsr.olemiss.edu/cgi-bin/man-cgi?login+1
Martin P. Ibert’s 1993 post, http://groups.google.com/groups?selm=HNLAB98U%40heaven7.in-berlin.de
Ulf Bartelt’s 1994 post, http://groups.google.com/groups?selm=1994Jun5.144526.9091%40solaris.rz.tu-clausthal.de
Mike Makonnen’s June 2003 post, http://groups.google.com/groups?selm=bbeuh2%2416j0%241%40FreeBSD.csie.NCTU.edu.tw

Solaris Fault Management

The Solaris Fault Management Facility is designed to be integrated into the Service Management Facility to provide a self-healing capability to Solaris 10 systems.

The fmd daemon is responsible for monitoring several aspects of system health.

The fmadm config command shows the current configuration for fmd.

The Fault Manager logs can be viewed with fmdump -v and fmdump -e -v.

fmadm faulty will list any devices flagged as faulty.

fmstat shows statistics gathered by fmd.

Fault Management

With Solaris 10, Sun has implemented a daemon, fmd, to track and react to fault management. In addition to sending traditional syslog messages, the system sends binary telemetry events to fmd for correlation and analysis. Solaris 10 implements default fault management operations for several pieces of hardware in Sparc systems, including CPU, memory, and I/O bus events. Similar capabilities are being implemented for x64 systems.

Once the problem is defined, failing components may be offlined automatically without a system crash, or other corrective action may be taken by fmd. If a service dies as a result of the fault, the Service Management Facility (SMF) will attempt to restart it and any dependent processes.

The Fault Management Facility reports error messages in a well-defined and explicit format. Each error code is uniquely specified by a Universal Unique Identifier (UUID) related to a document on the Sun web site at http://www.sun.com/msg/.

Resources are uniquely identified by a Fault Managed Resource Identifier (FMRI). Each Field Replaceable Unit (FRU) has its own FMRI. FMRIs are associated with one of the following conditions:

  • ok: Present and available for use.
  • unknown: Not present or not usable, perhaps because it has been offlined or unconfigured.
  • degraded: Present and usable, but one or more problems have been identified.
  • faulted: Present but not usable; unrecoverable problems have been diagnosed and the resource has been disabled to prevent damage to the system.

The fmdump -V -u eventid command can be used to pull information on the type and location of the event. (The eventid is included in the text of the error message provided to syslog.) The -e option can be used to pull error log information rather than fault log information.

Statistical information on the performance of fmd can be viewed via the fmstat command. In particular, fmstat -m modulename provides information for a given module.
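For example (the UUID below is a placeholder; use the one from your own syslog message, and a module name as reported by fmadm config):

# fmdump -V -u bf36f0ea-9e47-42b5-fc6f-c0d979c4c8f4
# fmstat -m cpumem-retire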

The fmadm command provides administrative support for the Fault Management Facility. It allows us to load and unload modules and view and update the resource cache. The most useful capabilities of fmadm are provided through the following subcommands:

  • config: Display the configuration of component modules.
  • faulty: Display faulted resources. With the -a option, list cached resource information. With the -i option, list persistent cache identifier information, instead of most recent state and UUID.
  • load /path/module: Load the module.
  • unload module: Unload module; the module name is the same as reported by fmadm config.
  • rotate logfile: Schedule rotation for the specified log file. Used with the logadm configuration file.

Additional Resources

Amy Rich’s Predictive Self-Healing Article

Mike Shapiro’s magazine article and presentation contain a good discussion of the architectural underpinnings of the Fault Manager.

Gavin Maltby reports on AMD fault management in this blog entry.

matty’s blog has a short introduction to Fault Management on Solaris.

How to Create a Link Aggregation – Bonding on Solaris

Before You Begin


Note – Link aggregation only works on full-duplex, point-to-point links that operate at identical speeds. Make sure that the interfaces in your aggregation conform to this requirement.
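
You can verify the speed and duplex of the candidate interfaces with dladm (Solaris 10) before building the aggregation:

# dladm show-dev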


If you are using a switch in your aggregation topology, make sure that you have done the following on the switch:

  • Configured the ports to be used as an aggregation

  • If the switch supports LACP, configured LACP in either active mode or passive mode

  1. Assume the Primary Administrator role, or become superuser.

    The Primary Administrator role includes the Primary Administrator profile. To create the role and assign the role to a user, see Chapter 2, “Working With the Solaris Management Console (Tasks),” in System Administration Guide: Basic Administration.

  2. Determine which interfaces are currently installed on your system.

    # dladm show-link
  3. Determine which interfaces have been plumbed.

    # ifconfig -a
  4. Create an aggregation.

    # dladm create-aggr -d interface key

    interface

    Represents the device name of the interface to become part of the aggregation.

    key

    Is the number that identifies the aggregation. The lowest key number is 1. Zeroes are not allowed as keys.

    For example:

    # dladm create-aggr -d bge0 -d bge1 1
  5. Configure and plumb the newly created aggregation.

    # ifconfig aggrkey plumb IP-address up

    For example:

    # ifconfig aggr1  plumb 192.168.84.14 up
  6. Check the status of the aggregation you just created.

    # dladm show-aggr

    You receive the following output:

    key: 1 (0x0001) policy: L4      address: 0:3:ba:7:84:5e (auto)
               device   address           speed         duplex  link    state
               bge0     0:3:ba:7:84:5e    1000  Mbps    full    up      attached
               bge1     0:3:ba:7:84:5e    0     Mbps    unknown down    standby

    The output shows that an aggregation with the key of 1 and a policy of L4 was created. Note that the interfaces are known by the MAC address 0:3:ba:7:84:5e, which is the system MAC address.

  7. (Optional) Make the IP configuration of the link aggregation persist across reboots.

    1. For link aggregations with IPv4 addresses, create an /etc/hostname.aggrkey file. For IPv6-based link aggregations, create an /etc/hostname6.aggrkey file.

    2. Enter the IPv4 or IPv6 address of the link aggregation into the file.

      For example, you would create the following file for the aggregation that is created in this procedure:

      # vi /etc/hostname.aggr1
      192.168.84.14
    3. Perform a reconfiguration boot.

      # reboot -- -r
    4. Verify that the link aggregation configuration you entered in the /etc/hostname.aggrkey file has been configured.

      # ifconfig -a
      .
      .
      aggr1: flags=1000843 <UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
              inet 192.168.84.14 netmask ff000000 broadcast 192.255.255.255

Example 6-4   Creating a Link Aggregation

This example shows the commands that are used to create a link aggregation with two devices, bge0 and bge1, and the resulting output.

# dladm show-link
ce0             type: legacy    mtu: 1500       device: ce0
ce1             type: legacy    mtu: 1500       device: ce1
bge0            type: non-vlan  mtu: 1500       device: bge0
bge1            type: non-vlan  mtu: 1500       device: bge1
bge2            type: non-vlan  mtu: 1500       device: bge2
# ifconfig -a
lo0: flags=2001000849 <UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
ce0: flags=1000843 <UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 192.168.84.253 netmask ffffff00 broadcast 192.168.84.255
        ether 0:3:ba:7:84:5e
# dladm create-aggr -d bge0 -d bge1 1
# ifconfig aggr1 plumb 192.168.84.14 up
# dladm show-aggr
key: 1 (0x0001) policy: L4      address: 0:3:ba:7:84:5e (auto)
     device   address           speed         duplex  link    state
     bge0     0:3:ba:7:84:5e    1000  Mbps    full    up      attached
     bge1     0:3:ba:7:84:5e    0     Mbps    unknown down    standby

# ifconfig -a
lo0: flags=2001000849 <UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
ce0: flags=1000843 <UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 192.168.84.253 netmask ffffff00 broadcast 192.168.84.255
        ether 0:3:ba:7:84:5e
aggr1: flags=1000843 <UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 192.168.84.14 netmask ff000000 broadcast 192.255.255.255
        ether 0:3:ba:7:84:5e

Note that the two interfaces that were used for the aggregation were not previously plumbed by ifconfig.

How to Modify an Aggregation

This procedure shows how to make the following changes to an aggregation definition:

  • Modifying the policy for the aggregation

  • Changing the mode for the aggregation

  • Removing an interface from the aggregation

  1. Assume the Primary Administrator role, or become superuser.

    The Primary Administrator role includes the Primary Administrator profile. To create the role and assign the role to a user, see Chapter 2, “Working With the Solaris Management Console (Tasks),” in System Administration Guide: Basic Administration.

  2. Modify the aggregation to change the policy.

    # dladm modify-aggr -P policy key

    policy

    Represents one or more of the policies L2, L3, and L4, as explained in Policies and Load Balancing.

    key

    Is a number that identifies the aggregation. The lowest key number is 1. Zeroes are not allowed as keys.

  3. If LACP is running on the switch to which the devices in the aggregation are attached, modify the aggregation to support LACP.

    If the switch runs LACP in passive mode, be sure to configure active mode for your aggregation.

    # dladm modify-aggr -l LACP-mode -t timer-value key

    -l LACP-mode

    Indicates the LACP mode in which the aggregation is to run. The values are active, passive, and off.

    -t timer-value

    Indicates the LACP timer value, either short or long.

    key

    Is a number that identifies the aggregation. The lowest key number is 1. Zeroes are not allowed as keys.

Example 6-5   Modifying a Link Aggregation

This example shows how to modify the policy of aggregation aggr1 to L2 and then turn on active LACP mode.

# dladm modify-aggr -P L2 1
# dladm modify-aggr -l active -t short 1
# dladm show-aggr
key: 1 (0x0001) policy: L2      address: 0:3:ba:7:84:5e (auto)
     device   address           speed         duplex  link    state
     bge0     0:3:ba:7:84:5e    1000  Mbps    full    up      attached
     bge1     0:3:ba:7:84:5e    0     Mbps    unknown down    standby

fmdump man page

fmdump(1M)


Name

    fmdump – fault management log viewer

Synopsis

    fmdump [-efmvV] [-c class] [-R dir] [-t time] [-T time]
         [-u uid] [-n name[.name]*[=value]] [file]

Description

    The fmdump utility can be used to display the contents of any of the log files associated with the Solaris Fault Manager, fmd(1M). The Fault Manager runs in the background on each Solaris system. It receives telemetry information relating to problems detected by the system software, diagnoses these problems, and initiates proactive self-healing activities such as disabling faulty components.

    The Fault Manager maintains two sets of log files for use by administrators and service personnel:

    error log

    A log which records error telemetry, the symptoms of problems detected by the system.

    fault log

    A log which records fault diagnosis information, the problems believed to explain these symptoms.

    By default, fmdump displays the contents of the fault log, which records the result of each diagnosis made by the fault manager or one of its component modules.

    An example of a default fmdump display follows:

    # fmdump
    TIME                 UUID                                 SUNW-MSG-ID
    Dec 28 13:01:27.3919 bf36f0ea-9e47-42b5-fc6f-c0d979c4c8f4 FMD-8000-11
    Dec 28 13:01:49.3765 3a186292-3402-40ff-b5ae-810601be337d FMD-8000-11
    Dec 28 13:02:59.4448 58107381-1985-48a4-b56f-91d8a617ad83 FMD-8000-OW
    ...

    Each problem recorded in the fault log is identified by:

    • The time of its diagnosis

    • A Universal Unique Identifier (UUID) that can be used to uniquely identify this particular problem across any set of systems

    • A message identifier that can be used to access a corresponding knowledge article located at Sun's web site, http://www.sun.com/msg/

    If a problem requires action by a human administrator or service technician or affects system behavior, the Fault Manager also issues a human-readable message to syslogd(1M). This message provides a summary of the problem and a reference to the knowledge article on the Sun web site, http://www.sun.com/msg/.

    You can use the -v and -V options to expand the display from a single-line summary to increased levels of detail for each event recorded in the log. The -c, -t, -T, and -u options can be used to filter the output by selecting only those events that match the specified class, range of times, or uuid.

    If more than one filter option is present on the command-line, the options combine to display only those events that are selected by the logical AND of the options. If more than one instance of the same filter option is present on the command-line, the like options combine to display any events selected by the logical OR of the options. For example, the command:

    # fmdump -u uuid1 -u uuid2 -t 02Dec03

    selects events whose attributes are (uuid1 OR uuid2) AND (time on or after 02Dec03).

Options

    The following options are supported:

    -c class

    Select events that match the specified class. The class argument can use the glob pattern matching syntax described in sh(1). The class represents a hierarchical classification string indicating the type of telemetry event. More information about Sun’s telemetry protocol is available at Sun’s web site, http://www.sun.com/msg/.

    -e

    Display events from the fault management error log instead of the fault log. This option is shorthand for specifying the pathname of the error log file.

    The error log file contains Private telemetry information used by Sun's automated diagnosis software. This information is recorded to facilitate post-mortem analysis of problems and event replay, and should not be parsed or relied upon for the development of scripts and other tools. See attributes(5) for information about Sun's rules for Private interfaces.

    -f

    Follow the growth of the log file by waiting for additional data. fmdump enters an infinite loop where it will sleep for a second, attempt to read and format new data from the log file, and then go back to sleep. This loop can be terminated at any time by sending an interrupt (Control-C).

    -m

    Print the localized diagnosis message associated with each entry in the fault log.

    -n name[.name]*[=value]

    Select fault log or error log events, depending on the -e option, that have properties with a matching name (and optionally a matching value). For string properties the value can be a regular expression match. Regular expression syntax is described in the EXTENDED REGULAR EXPRESSIONS section of the regex(5) manual page. Be careful when using the characters:

    $  *  {  ^  |  (  )  \

    …in a regular expression, because these are meaningful to the shell. It is safest to enclose any of these in single quotes. For numeric properties, the value can be octal, hex, or decimal.

    -R dir

    Use the specified root directory for the log files accessed by fmdump, instead of the default root (/).

    -t time

    Select events that occurred at or after the specified time. The time can be specified using any of the following forms:

    mm/dd/yy hh:mm:ss

    Month, day, year, hour in 24-hour format, minute, and second. Any amount of whitespace can separate the date and time. The argument should be quoted so that the shell interprets the two strings as a single argument.

    mm/dd/yy hh:mm

    Month, day, year, hour in 24-hour format, and minute. Any amount of whitespace can separate the date and time. The argument should be quoted so that the shell interprets the two strings as a single argument.

    mm/dd/yy

    12:00:00AM on the specified month, day, and year.

    ddMonyy hh:mm:ss

    Day, month name, year, hour in 24-hour format, minute, and second. Any amount of whitespace can separate the date and time. The argument should be quoted so that the shell interprets the two strings as a single argument.

    ddMonyy hh:mm

    Day, month name, year, hour in 24-hour format, and minute. Any amount of whitespace can separate the date and time. The argument should be quoted so that the shell interprets the two strings as a single argument.

    Mon dd hh:mm:ss

    Month, day, hour in 24-hour format, minute, and second of the current year.

    yyyy-mm-dd [T hh:mm[:ss]]

    Year, month, day, and optional hour in 24-hour format, minute, and second. The second, or hour, minute, and second, can be optionally omitted.

    ddMonyy

    12:00:00AM on the specified day, month name, and year.

    hh:mm:ss

    Hour in 24-hour format, minute, and second of the current day.

    hh:mm

    Hour in 24-hour format and minute of the current day.

    Tns | Tnsec

    T nanoseconds ago where T is an integer value specified in base 10.

    Tus | Tusec

    T microseconds ago where T is an integer value specified in base 10.

    Tms | Tmsec

    T milliseconds ago where T is an integer value specified in base 10.

    Ts | Tsec

    T seconds ago where T is an integer value specified in base 10.

    Tm | Tmin

    T minutes ago where T is an integer value specified in base 10.

    Th | Thour

    T hours ago where T is an integer value specified in base 10.

    Td | Tday

    T days ago where T is an integer value specified in base 10.

    You can append a decimal fraction of the form .n to any -t option argument to indicate a fractional number of seconds beyond the specified time.

    -T time

    Select events that occurred at or before the specified time. time can be specified using any of the time formats described for the -t option.
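
    For example, to display events from the last two days, or from a specific window (the dates below are arbitrary examples in the mm/dd/yy format):

    # fmdump -t 2d
    # fmdump -t '10/06/07 10:00' -T '10/06/07 14:00'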

    -u uuid

    Select fault diagnosis events that exactly match the specified uuid. Each diagnosis is associated with a Universal Unique Identifier (UUID) for identification purposes. The -u option can be combined with other options such as -v to show all of the details associated with a particular diagnosis.

    If the -e option and -u option are both present, the error events that are cross-referenced by the specified diagnosis are displayed.
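
    For example, to show full details for one diagnosis (the UUID below is a made-up placeholder; use one reported in your own fault log):

    # fmdump -v -u '6b2e4d1c-5f3a-4e8b-9d0a-1c2e3f4a5b6c'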

    -v

    Display verbose event detail. The event display is enlarged to show additional common members of the selected events.

    -V

    Display very verbose event detail. The event display is enlarged to show every member of the name-value pair list associated with each event. In addition, for fault logs, the event display includes a list of cross-references to the corresponding errors that were associated with the diagnosis.

Operands

    The following operands are supported:

    file

    Specifies an alternate log file to display instead of the system fault log. The fmdump utility determines the type of the specified log automatically and produces appropriate output for the selected log.
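
    For example, to display a rotated fault log (the .0 suffix is shown for illustration; your rotation names may differ):

    # fmdump /var/fm/fmd/fltlog.0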

Examples


    Example 1 Retrieving Given Class from fmd Log

    Use any of the following commands to retrieve information about a specified class from the fmd log. The complete class name is ereport.io.ddi.context.

    # fmdump -Ve -c 'ereport.io.ddi.context'
    # fmdump -Ve -c 'ereport.*.context'
    # fmdump -Ve -n 'class=ereport.io.ddi.context'
    # fmdump -Ve -n 'class=ereport.*.context'

    Any of the preceding commands produces the following output:

    Oct 06 2007 11:53:20.975021712 ereport.io.ddi.context
            nvlist version: 0
                    class = ereport.io.ddi.context
                    ena = 0x1b03a15ecf00001
                    detector = (embedded nvlist)
                    nvlist version: 0
                            version = 0x0
                            scheme = dev
                            device-path = /
                    (end detector)
    
                    __ttl = 0x1
                    __tod = 0x470706b0 0x3a1da690


    Example 2 Retrieving Specific Detector Device Path from fmd Log

    The following command retrieves a detector device path from the fmd log.

    # fmdump -Ve -n 'detector.device-path=.*/disk@1,0$'
    Oct 06 2007 12:04:28.065660760 ereport.io.scsi.disk.rqs
    nvlist version: 0
           class = ereport.io.scsi.disk.rqs
           ena = 0x453ff3732400401
           detector = (embedded nvlist)
                    nvlist version: 0
                            version = 0x0
                            scheme = dev
                            device-path = /pci@0,0/pci1000,3060@3/disk@1,0
                    (end detector)
    
                    __ttl = 0x1
                    __tod = 0x4707094c 0x3e9e758

Exit Status

    The following exit values are returned:

    0

    Successful completion. All records in the log file were examined successfully.

    1

    A fatal error occurred that prevented any log file data from being examined, such as a failure to open the specified file.

    2

    Invalid command-line options were specified.

    3

    The log file was opened successfully, but one or more log file records were not displayed, either due to an I/O error or because the records themselves were malformed. fmdump issues a warning message for each record that could not be displayed, then continues and attempts to display the remaining records.
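
    These exit codes make fmdump convenient to call from scripts. A minimal sketch in sh (the messages are illustrative):

    fmdump > /dev/null 2>&1
    case $? in
    0) echo "all fault log records examined" ;;
    3) echo "warning: some records could not be displayed" ;;
    *) echo "fmdump failed" ;;
    esac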

Files

    /var/fm/fmd

    Fault management log directory

    /var/fm/fmd/errlog

    Fault management error log

    /var/fm/fmd/fltlog

    Fault management fault log

Attributes

    See attributes(5) for descriptions of the following attributes:

    ATTRIBUTE TYPE           ATTRIBUTE VALUE
    Availability             SUNWfmd
    Interface Stability      See below.

    The command-line options are Evolving. The human-readable error log output is Private. The human-readable fault log output is Evolving.

See Also

    sh(1), regex(5), attributes(5)

Notes

    Fault logs contain references to records stored in error logs that can be displayed using fmdump -V to understand the errors that were used in the diagnosis of a particular fault. These links are preserved if an error log is renamed as part of log rotation. They can be broken by removing an error log file, or by moving it to another filesystem directory. fmdump cannot display error information for such broken links; it continues to display any and all information present in the fault log.

Installing and Configuring Nagios Plugins & NRPE on Solaris 10

Here’s a step-by-step installation of the Nagios plugins and NRPE on Solaris 10 x86 (as the remote host):

useradd -c "nagios system user" -d /usr/local/nagios -m nagios
chown nagios:nagios /usr/local/nagios/
cd /usr/local/src # or wherever you like to put source code
wget http://internap.dl.sourceforge.net/sourceforge/nagios/nrpe-2.12.tar.gz
wget http://internap.dl.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.11.tar.gz
gunzip nagios-plugins-1.4.11.tar.gz
tar -xvf nagios-plugins-1.4.11.tar
gunzip nrpe-2.12.tar.gz
tar -xvf nrpe-2.12.tar

First we’ll compile the nagios plugins:

cd nagios-plugins-1.4.11
./configure
make
make install
chown -R nagios:nagios /usr/local/nagios/libexec
cd ..

Run a quick check to make sure the plugins are working:

/usr/local/nagios/libexec/check_disk -w 10 -c 5 -p /
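
If the plugins built correctly, check_disk prints a one-line status along these lines (the numbers are illustrative and will differ on your system):

DISK OK - free space: / 2643 MB (57% inode=96%);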

Next, we’ll compile NRPE. Normally at this point we would just run `cd nrpe-2.12; ./configure`. Unfortunately, the configure script cannot find the SSL headers and libraries on Solaris 10. You get errors like this:

checking for SSL headers... configure: error: Cannot find ssl headers

checking for SSL libraries... configure: error: Cannot find ssl libraries

The answer to this is, of course, to tell configure where to find them:

cd nrpe-2.12
./configure --with-ssl=/usr/sfw/ --with-ssl-lib=/usr/sfw/lib/

NRPE 2.12 currently has a bug: it assumes all systems define two syslog facilities that Solaris doesn’t have, so if you try to compile it you get the following errors:

nrpe.c: In function `get_log_facility’:
nrpe.c:617: error: `LOG_AUTHPRIV’ undeclared (first use in this function)
nrpe.c:617: error: (Each undeclared identifier is reported only once
nrpe.c:617: error: for each function it appears in.)
nrpe.c:619: error: `LOG_FTP’ undeclared (first use in this function)
*** Error code 1
make: Fatal error: Command failed for target `nrpe’
Current working directory /usr/local/src/nrpe-2.12/src
*** Error code 1
make: Fatal error: Command failed for target `all’

Unfortunately, the fix at this time is to comment out the code that calls these two facilities, lines 616-619, in src/nrpe.c:

/*else if(!strcmp(varvalue,"authpriv"))
log_facility=LOG_AUTHPRIV;
else if(!strcmp(varvalue,"ftp"))
log_facility=LOG_FTP;*/

UPDATE: You no longer need to comment out these lines; just replace them with the following, which maps the missing facilities to ones Solaris does provide:

else if(!strcmp(varvalue,"authpriv"))
log_facility=LOG_AUTH;
else if(!strcmp(varvalue,"ftp"))
log_facility=LOG_DAEMON;

Now it will compile:

# make all
cd ./src/; make ; cd ..
gcc -g -O2 -I/usr/sfw//include/openssl -I/usr/sfw//include -DHAVE_CONFIG_H -o nrpe nrpe.c utils.c -L/usr/sfw/lib/ -lssl -lcrypto -lnsl -lsocket ./snprintf.o
gcc -g -O2 -I/usr/sfw//include/openssl -I/usr/sfw//include -DHAVE_CONFIG_H -o check_nrpe check_nrpe.c utils.c -L/usr/sfw/lib/ -lssl -lcrypto -lnsl -lsocket

*** Compile finished ***

Next install the new binaries:

# make install
cd ./src/ && make install
make install-plugin
.././install-sh -c -m 775 -o nagios -g nagios -d /usr/local/nagios/libexec
.././install-sh -c -m 775 -o nagios -g nagios check_nrpe /usr/local/nagios/libexec
make install-daemon
.././install-sh -c -m 775 -o nagios -g nagios -d /usr/local/nagios/bin
.././install-sh -c -m 775 -o nagios -g nagios nrpe /usr/local/nagios/bin

Optionally, install the sample config file (recommended if you don’t already have a standard config):

# make install-daemon-config
./install-sh -c -m 775 -o nagios -g nagios -d /usr/local/nagios/etc
./install-sh -c -m 644 -o nagios -g nagios sample-config/nrpe.cfg /usr/local/nagios/etc

Modify the nrpe.cfg file with your settings:

vi /usr/local/nagios/etc/nrpe.cfg
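
The essentials are the daemon identity and the command definitions NRPE is allowed to run. A minimal sketch (the thresholds are illustrative; note that in this inetd/SMF setup, host access is enforced by tcp_wrappers below rather than by allowed_hosts):

nrpe_user=nagios
nrpe_group=nagios
allowed_hosts=127.0.0.1
command[check_disk]=/usr/local/nagios/libexec/check_disk -w 10 -c 5 -p /
command[check_load]=/usr/local/nagios/libexec/check_load -w 5,4,3 -c 10,8,6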

With Solaris 10, we don’t use inetd or xinetd directly, but SMF. Thankfully, we can convert inetd entries into the SMF repository with the inetconv command. So first, add the following entry to /etc/services:

nrpe 5666/tcp # NRPE

Then add the following line to the end of /etc/inet/inetd.conf:

nrpe stream tcp nowait nagios /usr/sfw/sbin/tcpd /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -i

Next, we need to convert it to SMF:

# inetconv
nrpe -> /var/svc/manifest/network/nrpe-tcp.xml
Importing nrpe-tcp.xml …Done
# inetconv -e
svc:/network/nrpe/tcp:default enabled

Check to make sure it went online:

# svcs svc:/network/nrpe/tcp:default
STATE STIME FMRI
online 15:53:39 svc:/network/nrpe/tcp:default
# netstat -a | grep nrpe
*.nrpe *.* 0 0 49152 0 LISTEN

Check the default installed parameters:

# inetadm -l svc:/network/nrpe/tcp:default
SCOPE NAME=VALUE
name="nrpe"
endpoint_type="stream"
proto="tcp"
isrpc=FALSE
wait=FALSE
exec="/usr/sfw/sbin/tcpd -c /usr/local/nagios/etc/nrpe.cfg -i"
arg0="/usr/local/nagios/bin/nrpe"
user="nagios"
default bind_addr=""
default bind_fail_max=-1
default bind_fail_interval=-1
default max_con_rate=-1
default max_copies=-1
default con_rate_offline=-1
default failrate_cnt=40
default failrate_interval=60
default inherit_env=TRUE
default tcp_trace=FALSE
default tcp_wrappers=FALSE
default connection_backlog=10

Change it so that it uses tcp_wrappers:

# inetadm -m svc:/network/nrpe/tcp:default tcp_wrappers=TRUE

And check to make sure it took effect:

# inetadm -l svc:/network/nrpe/tcp:default
SCOPE NAME=VALUE
name="nrpe"
endpoint_type="stream"
proto="tcp"
isrpc=FALSE
wait=FALSE
exec="/usr/sfw/sbin/tcpd -c /usr/local/nagios/etc/nrpe.cfg -i"
arg0="/usr/local/nagios/bin/nrpe"
user="nagios"
default bind_addr=""
default bind_fail_max=-1
default bind_fail_interval=-1
default max_con_rate=-1
default max_copies=-1
default con_rate_offline=-1
default failrate_cnt=40
default failrate_interval=60
default inherit_env=TRUE
default tcp_trace=FALSE
tcp_wrappers=TRUE
default connection_backlog=10

Modify your hosts.allow and hosts.deny to only allow your Nagios server access to the NRPE port. Note that tcpd always looks at hosts.allow first, so even though we reject everyone in the hosts.deny file, the IP addresses specified in hosts.allow are still allowed.
/etc/hosts.allow:

nrpe: LOCAL, 10.0.0.45

/etc/hosts.deny:

nrpe: ALL

Finally, check to make sure you have everything installed correctly (should return version information):

/usr/local/nagios/libexec/check_nrpe -H localhost
NRPE v2.12

Optionally, modify any firewalls between your Nagios server and the remote host to allow port 5666.
Don’t forget to configure your Nagios server to check your new service; a minimal sketch of a server-side service definition follows.
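
For example, on the monitoring server, a service that calls the check_nrpe command definition created earlier might look like this (the host name and service template are placeholders for your own configuration):

define service{
        use                     generic-service
        host_name               solaris10-host
        service_description     Root Partition
        check_command           check_nrpe!check_disk
        }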
