Nagios NRPE to Monitor Remote Linux Server

NRPE Remote Server Installation and Setup
Create a nagios user account on the remote server to be monitored:

# useradd nagios
# passwd nagios
Download and Install Nagios Plugins:

# mkdir -p /opt/Nagios/Nagios_Plugins
# cd /opt/Nagios/Nagios_Plugins
Save file to directory /opt/Nagios

http://www.nagios.org/download/download.php

As of this writing Nagios 3.0.6 (Stable) and Nagios Plugins 1.4.13 (Stable)

Extract Files:

# tar xzf nagios-plugins-1.4.13.tar.gz

# cd nagios-plugins-1.4.13
Compile and Configure Nagios Plugins

** You need the openssl-devel package installed to compile plugins with ssl support. **

# yum -y install openssl-devel
Install the plugins:

# ./configure --with-nagios-user=nagios --with-nagios-group=nagios
# make
# make install
The plugin directory and the plugins themselves need to be owned by the nagios user:

# chown nagios:nagios /usr/local/nagios
# chown -R nagios:nagios /usr/local/nagios/libexec
The xinetd package is also needed:

# yum install xinetd
Download and Install the NRPE Daemon

# mkdir -p /opt/Nagios/Nagios_NRPE
# cd /opt/Nagios/Nagios_NRPE
Save file to directory /opt/Nagios

http://www.nagios.org/download/download.php

As of this writing NRPE 2.12 (Stable)

Extract the Files:

# tar -xzf nrpe-2.12.tar.gz
# cd nrpe-2.12
Compile and Configure NRPE

** You need the openssl-devel package installed to compile NRPE with ssl support. **

# yum -y install openssl-devel
Install NRPE:

# ./configure

General Options:
-------------------------
NRPE port: 5666
NRPE user: nagios
NRPE group: nagios
Nagios user: nagios
Nagios group: nagios

# make all

# make install-plugin

# make install-daemon

# make install-daemon-config

# make install-xinetd
Post NRPE Configuration

Edit Xinetd NRPE entry:

Add the Nagios monitoring server's IP address to the "only_from" directive:

# vi /etc/xinetd.d/nrpe

only_from = 127.0.0.1
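After make install-xinetd, the generated file looks roughly like this (exact contents may vary by NRPE version; 192.168.0.2 stands in for your monitoring server's address, appended to only_from):

```
service nrpe
{
    flags           = REUSE
    socket_type     = stream
    port            = 5666
    wait            = no
    user            = nagios
    group           = nagios
    server          = /usr/local/nagios/bin/nrpe
    server_args     = -c /usr/local/nagios/etc/nrpe.cfg --inetd
    log_on_failure  += USERID
    disable         = no
    only_from       = 127.0.0.1 192.168.0.2
}
```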
Edit services file entry:

Add entry for nrpe daemon

# vi /etc/services

nrpe 5666/tcp # NRPE
Restart Xinetd and Set to start at boot:

# chkconfig xinetd on

# service xinetd restart
Test NRPE Daemon Install

Check NRPE daemon is running and listening on port 5666:

# netstat -at | grep nrpe
Output should be:

tcp        0      0 *:nrpe                  *:*                     LISTEN
Check NRPE daemon is functioning:

# /usr/local/nagios/libexec/check_nrpe -H localhost
Output should be NRPE version:

NRPE v2.12
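check_nrpe exits with the standard Nagios plugin codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A small illustrative wrapper, using the check_nrpe path from this install:

```shell
#!/bin/sh
# Map a Nagios plugin exit code to its state name.
nagios_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Run check_nrpe (path from this install) and report its state.
CHECK_NRPE=/usr/local/nagios/libexec/check_nrpe
if [ -x "$CHECK_NRPE" ]; then
    "$CHECK_NRPE" -H localhost >/dev/null 2>&1
    echo "NRPE check: $(nagios_state $?)"
fi
```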
Open Port 5666 on Firewall

Make sure to open port 5666 on the firewall of the remote server so that the Nagios monitoring server can access the NRPE daemon.
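On a RHEL/CentOS-style system of this era, that can be done with iptables. A sketch only; 192.168.0.2 is a placeholder for the monitoring server's IP:

```
# Allow the Nagios server to reach the NRPE daemon on TCP 5666
iptables -I INPUT -p tcp -s 192.168.0.2 --dport 5666 -j ACCEPT
service iptables save
```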

Nagios Monitoring Host Server Setup
Download and Install the NRPE Plugin

# mkdir -p /opt/Nagios/Nagios_NRPE
# cd /opt/Nagios/Nagios_NRPE
Save file to directory /opt/Nagios

http://www.nagios.org/download/download.php

As of this writing NRPE 2.12 (Stable)

Extract the Files:

# tar -xzf nrpe-2.12.tar.gz
# cd nrpe-2.12
Compile and Configure NRPE

# ./configure

# make all

# make install-plugin
Test Connection to NRPE daemon on Remote Server

Let's now make sure that the check_nrpe plugin on our Nagios server can talk to the NRPE daemon on the remote server we want to monitor. Replace <remote_host> with the remote server's IP address.

# /usr/local/nagios/libexec/check_nrpe -H <remote_host>
NRPE v2.12
Create NRPE Command Definition

A command definition needs to be created in order for the check_nrpe plugin to be used by Nagios.

# vi /usr/local/nagios/etc/objects/commands.cfg
Add the following:

###############################################################################
# NRPE CHECK COMMAND
#
# Command to use NRPE to check remote host systems
###############################################################################

define command{
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
Create Linux Object template

In order to add the remote Linux machine to Nagios, we need to create an object template file and add some object definitions.

Create new linux-box-remote object template file:

# vi /usr/local/nagios/etc/objects/linux-box-remote.cfg
Add the following, replacing the "host_name", "alias", and "address" values with ones that match your setup:

** The "host_name" you set in the host definition must match the "host_name" in each service definition. **

define host{
name linux-box-remote ; Name of this template
use generic-host ; Inherit default values
check_period 24x7
check_interval 5
retry_interval 1
max_check_attempts 10
check_command check-host-alive
notification_period 24x7
notification_interval 30
notification_options d,r
contact_groups admins
register 0 ; DON'T REGISTER THIS - IT'S A TEMPLATE
}

define host{
use linux-box-remote ; Inherit default values from a template
host_name Centos5 ; The name we're giving to this server
alias Centos5 ; A longer name for the server
address 192.168.0.5 ; IP address of the server
}

define service{
use generic-service
host_name Centos5
service_description CPU Load
check_command check_nrpe!check_load
}
define service{
use generic-service
host_name Centos5
service_description Current Users
check_command check_nrpe!check_users
}
define service{
use generic-service
host_name Centos5
service_description /dev/hda1 Free Space
check_command check_nrpe!check_hda1
}
define service{
use generic-service
host_name Centos5
service_description Total Processes
check_command check_nrpe!check_total_procs
}
define service{
use generic-service
host_name Centos5
service_description Zombie Processes
check_command check_nrpe!check_zombie_procs
}
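The five service definitions above differ only in their description and check command, so a small shell loop can generate them. This is only an illustrative helper, not part of Nagios itself:

```shell
#!/bin/sh
# Emit a "define service" block for each "description|nrpe_command" pair.
gen_services() {
    host=$1
    shift
    for pair in "$@"; do
        desc=${pair%%|*}
        cmd=${pair##*|}
        printf 'define service{\n'
        printf '    use                 generic-service\n'
        printf '    host_name           %s\n' "$host"
        printf '    service_description %s\n' "$desc"
        printf '    check_command       check_nrpe!%s\n' "$cmd"
        printf '}\n\n'
    done
}

# Regenerate the five checks shown above:
gen_services Centos5 \
    'CPU Load|check_load' \
    'Current Users|check_users' \
    '/dev/hda1 Free Space|check_hda1' \
    'Total Processes|check_total_procs' \
    'Zombie Processes|check_zombie_procs'
```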
Activate the linux-box-remote.cfg template:

# vi /usr/local/nagios/etc/nagios.cfg
And add:

# Definitions for monitoring remote Linux machine
cfg_file=/usr/local/nagios/etc/objects/linux-box-remote.cfg
Verify Nagios Configuration Files:

# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Total Warnings: 0
Total Errors: 0
Restart Nagios:

# service nagios restart
Check the Nagios web interface to confirm that the remote Linux box has been added and is being monitored.

Troubleshooting
NRPE ./configure error:

checking for SSL headers... configure: error: Cannot find ssl headers
Solution:

You need to install the openssl-devel package

# yum -y install openssl-devel
CHECK_NRPE: Error – Could not complete SSL handshake
Solution:

This is most likely not a problem with SSL but rather with xinetd access restrictions.

Check the following files:

/etc/xinetd.d/nrpe

/etc/hosts.allow

/etc/hosts.deny

Howto add a SATA disk to Solaris 10 6/06

Written by Administrator
Friday, 14 July 2006

A common task is connecting another SATA disk to the system. If you are used to running Linux, you will find it logical that Linux just sees the disk and you can fdisk it right away. Solaris is another story.

To let Solaris know a new disk has been added, I run the command:

devfsadm -vC

It scans the system, adding new devices and removing stale ones. When the command completes, it is time for another one:

format

This command lets you choose a disk to format and create a Solaris label on it. After creating the label, have fun slicing the disk. Preparing a disk is done like this:

# format

Searching for disks…done

AVAILABLE DISK SELECTIONS:

0. c2d0

/pci@0,0/pci-ide@8/ide@1/cmdk@0,0

1. c3d0

/pci@0,0/pci-ide@7/ide@0/cmdk@0,0

Specify disk (enter its number): 1

selecting c3d0

Controller working list found

[disk formatted, defect list found]

FORMAT MENU:

disk – select a disk

type – select (define) a disk type

partition – select (define) a partition table

current – describe the current disk

format – format and analyze the disk

fdisk – run the fdisk program

repair – repair a defective sector

show – translate a disk address

label – write label to the disk

analyze – surface analysis

defect – defect list management

backup – search for backup labels

verify – read and display labels

save – save new disk/partition definitions

volname – set 8-character volume name

!<cmd> - execute <cmd>, then return

quit

format> fdisk

No fdisk table exists. The default partition for the disk is:

a 100% "SOLARIS System" partition

Type "y" to accept the default partition, otherwise type "n" to edit the
partition table.

y

format> lABEL

`lABEL’ is not expected.

format> label

Ready to label disk, continue? y

format> verify

Warning: Primary label on disk appears to be different from
current label.

Warning: Check the current partitioning and 'label' the disk or use the
'backup' command.

Primary label contents:

Volume name =

ascii name =

pcyl = 30400

ncyl = 30398

acyl = 2

bcyl = 0

nhead = 255

nsect = 63

Part       Tag   Flag   Cylinders     Size        Blocks
 0  unassigned   wm     0             0           (0/0/0)          0
 1  unassigned   wm     0             0           (0/0/0)          0
 2      backup   wu     0 - 30397     232.86GB    (30398/0/0)  488343870
 3  unassigned   wm     0             0           (0/0/0)          0
 4  unassigned   wm     0             0           (0/0/0)          0
 5  unassigned   wm     0             0           (0/0/0)          0
 6  unassigned   wm     0             0           (0/0/0)          0
 7  unassigned   wm     0             0           (0/0/0)          0
 8        boot   wu     0 - 0         7.84MB      (1/0/0)      16065
 9  alternates   wm     1 - 2         15.69MB     (2/0/0)      32130

format> par

PARTITION MENU:

0 - change `0' partition
1 - change `1' partition
2 - change `2' partition
3 - change `3' partition
4 - change `4' partition
5 - change `5' partition
6 - change `6' partition
7 - change `7' partition

select – select a predefined table

modify – modify a predefined partition table

name – name the current table

print – display the current table

label – write partition map and label to the disk

!<cmd> - execute <cmd>, then return

quit

partition> mod

Select partitioning base:

0. Current partition table (original)

1. All Free Hog

Choose base (enter number) [0]? 1

Part       Tag   Flag   Cylinders     Size        Blocks
 0        root   wm     0             0           (0/0/0)          0
 1        swap   wu     0             0           (0/0/0)          0
 2      backup   wu     0 - 30397     232.86GB    (30398/0/0)  488343870
 3  unassigned   wm     0             0           (0/0/0)          0
 4  unassigned   wm     0             0           (0/0/0)          0
 5  unassigned   wm     0             0           (0/0/0)          0
 6         usr   wm     0             0           (0/0/0)          0
 7  unassigned   wm     0             0           (0/0/0)          0
 8        boot   wu     0 - 0         7.84MB      (1/0/0)      16065
 9  alternates   wm     1 - 2         15.69MB     (2/0/0)      32130

Do you wish to continue creating a new partition table based on above table[yes]?
Free Hog partition[6]?
Enter size of partition '0' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '1' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '3' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '4' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '5' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '7' [0b, 0c, 0.00mb, 0.00gb]:

Part       Tag   Flag   Cylinders     Size        Blocks
 0        root   wm     0             0           (0/0/0)          0
 1        swap   wu     0             0           (0/0/0)          0
 2      backup   wu     0 - 30397     232.86GB    (30398/0/0)  488343870
 3  unassigned   wm     0             0           (0/0/0)          0
 4  unassigned   wm     0             0           (0/0/0)          0
 5  unassigned   wm     0             0           (0/0/0)          0
 6         usr   wm     3 - 30397     232.84GB    (30395/0/0)  488295675
 7  unassigned   wm     0             0           (0/0/0)          0
 8        boot   wu     0 - 0         7.84MB      (1/0/0)      16065
 9  alternates   wm     1 - 2         15.69MB     (2/0/0)      32130

Okay to make this the current partition table[yes]?

Enter table name (remember quotes): "datadsk"

Ready to label disk, continue? y

partition> q

FORMAT MENU:

disk – select a disk

type – select (define) a disk type

partition – select (define) a partition table

current – describe the current disk

format – format and analyze the disk

fdisk – run the fdisk program

repair – repair a defective sector

show – translate a disk address

label – write label to the disk

analyze – surface analysis

defect – defect list management

backup – search for backup labels

verify – read and display labels

save – save new disk/partition definitions

volname – set 8-character volume name

!<cmd> - execute <cmd>, then return

quit

format> save

Saving new disk and partition definitions

Enter file name["./format.dat"]:

format> verify

Warning: Primary label on disk appears to be different from
current label.

Warning: Check the current partitioning and 'label' the disk or use the
'backup' command.

Primary label contents:

Volume name =

ascii name =

pcyl = 30400

ncyl = 30398

acyl = 2

bcyl = 0

nhead = 255

nsect = 63

Part       Tag   Flag   Cylinders     Size        Blocks
 0  unassigned   wm     0             0           (0/0/0)          0
 1  unassigned   wm     0             0           (0/0/0)          0
 2      backup   wu     0 - 30397     232.86GB    (30398/0/0)  488343870
 3  unassigned   wm     0             0           (0/0/0)          0
 4  unassigned   wm     0             0           (0/0/0)          0
 5  unassigned   wm     0             0           (0/0/0)          0
 6  unassigned   wm     3 - 30397     232.84GB    (30395/0/0)  488295675
 7  unassigned   wm     0             0           (0/0/0)          0
 8        boot   wu     0 - 0         7.84MB      (1/0/0)      16065
 9  alternates   wm     1 - 2         15.69MB     (2/0/0)      32130

format> q

Two more steps to go: first a filesystem is needed, and after that the disk must be mounted. Creating a filesystem is easy:

# newfs /dev/dsk/c3d0s6
newfs: construct a new file system /dev/rdsk/c3d0s6: (y/n)? y
Warning: 4870 sector(s) in last cylinder unallocated
/dev/rdsk/c3d0s6: 488295674 sectors in 79476 cylinders of 48 tracks, 128 sectors
238425.6MB in 4968 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
................................................................................
....................
super-block backups for last 10 cylinder groups at:
487395104, 487493536, 487591968, 487690400, 487788832, 487887264, 487985696,
488084128, 488182560, 488280992

And mounting the disk :

# mount /dev/dsk/c3d0s6 /mnt

# df -h /mnt/
Filesystem size used avail capacity Mounted on
/dev/dsk/c3d0s6 229G 64M 227G 1% /mnt

Well, that was easy, wasn't it?

To check partition information:

# prtvtoc /dev/dsk/c1t1d0s2

Add the new partitions to /etc/vfstab so they are mounted again after a reboot:

/dev/dsk/c1t1d0s6 /dev/rdsk/c1t1d0s6 /oradata ufs 2 yes logging
/dev/dsk/c1t2d0s6 /dev/rdsk/c1t2d0s6 /orabackup ufs 2 yes logging
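prtvtoc reports slice sizes in sectors; a little awk turns the sector counts into gigabytes. The sample data below (matching the disk above, and assuming the usual 512-byte sectors) keeps the snippet self-contained; on a live system you would pipe prtvtoc itself in:

```shell
#!/bin/sh
# Sample prtvtoc output; comment lines start with '*'.
prtvtoc_sample='*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       2      5    01          0  488343870  488343869
       6      4    00      48195  488295675  488343869'

# $5 is the sector count; 512 bytes per sector assumed.
echo "$prtvtoc_sample" | awk '!/^\*/ {
    gb = $5 * 512 / (1024 * 1024 * 1024)
    printf "slice %s: %.2f GB\n", $1, gb
}'
```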

How to create a self-signed SSL Certificate

Overview

The following is an extremely simplified view of how SSL is implemented and what part the certificate plays in the entire process.

Normal web traffic is sent unencrypted over the Internet. That is, anyone with access to the right tools can snoop all of that traffic. Obviously, this can lead to problems, especially where security and privacy are necessary, such as with credit card data and bank transactions. The Secure Sockets Layer is used to encrypt the data stream between the web server and the web client (the browser).

SSL makes use of what is known as asymmetric cryptography, commonly referred to as public key cryptography. With public key cryptography, two keys are created, one public, one private. Anything encrypted with one key can only be decrypted with the other. Thus if a message or data stream were encrypted with the server's private key, it can be decrypted only using its corresponding public key, ensuring that the data only could have come from the server.

If SSL utilizes public key cryptography to encrypt the data stream traveling over the Internet, why is a certificate necessary? The technical answer to that question is that a certificate is not really necessary – the data is secure and cannot easily be decrypted by a third party. However, certificates do serve a crucial role in the communication process. The certificate, signed by a trusted Certificate Authority (CA), ensures that the certificate holder is really who he claims to be. Without a trusted signed certificate, your data may be encrypted, however, the party you are communicating with may not be whom you think. Without certificates, impersonation attacks would be much more common.

Step 1: Generate a Private Key

The openssl toolkit is used to generate an RSA Private Key and CSR (Certificate Signing Request). It can also be used to generate self-signed certificates which can be used for testing purposes or internal usage.

The first step is to create your RSA Private Key. This key is a 1024 bit RSA key which is encrypted using Triple-DES and stored in a PEM format so that it is readable as ASCII text.

openssl genrsa -des3 -out server.key 1024

Generating RSA private key, 1024 bit long modulus
.........................................................++++++
........++++++
e is 65537 (0x10001)
Enter PEM pass phrase:
Verifying password - Enter PEM pass phrase:

Step 2: Generate a CSR (Certificate Signing Request)

Once the private key is generated a Certificate Signing Request can be generated. The CSR is then used in one of two ways. Ideally, the CSR will be sent to a Certificate Authority, such as Thawte or Verisign who will verify the identity of the requestor and issue a signed certificate. The second option is to self-sign the CSR, which will be demonstrated in the next section.

During the generation of the CSR, you will be prompted for several pieces of information. These are the X.509 attributes of the certificate. One of the prompts will be for "Common Name (e.g., YOUR name)". It is important that this field be filled in with the fully qualified domain name of the server to be protected by SSL. If the website to be protected will be https://public.akadia.com, then enter public.akadia.com at this prompt. The command to generate the CSR is as follows:

openssl req -new -key server.key -out server.csr

Country Name (2 letter code) [GB]:CH
State or Province Name (full name) [Berkshire]:Bern
Locality Name (eg, city) [Newbury]:Oberdiessbach
Organization Name (eg, company) [My Company Ltd]:Akadia AG
Organizational Unit Name (eg, section) []:Information Technology
Common Name (eg, your name or your server's hostname) []:public.akadia.com
Email Address []:martin dot zahn at akadia dot ch
Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:

Step 3: Remove Passphrase from Key

One unfortunate side-effect of the pass-phrased private key is that Apache will ask for the pass-phrase each time the web server is started. Obviously this is not necessarily convenient as someone will not always be around to type in the pass-phrase, such as after a reboot or crash. mod_ssl includes the ability to use an external program in place of the built-in pass-phrase dialog, however, this is not necessarily the most secure option either. It is possible to remove the Triple-DES encryption from the key, thereby no longer needing to type in a pass-phrase. If the private key is no longer encrypted, it is critical that this file only be readable by the root user! If your system is ever compromised and a third party obtains your unencrypted private key, the corresponding certificate will need to be revoked. With that being said, use the following command to remove the pass-phrase from the key:

cp server.key server.key.org
openssl rsa -in server.key.org -out server.key

The newly created server.key file has no more passphrase in it.

-rw-r--r-- 1 root root 745 Jun 29 12:19 server.csr
-rw-r--r-- 1 root root 891 Jun 29 13:22 server.key
-rw-r--r-- 1 root root 963 Jun 29 13:22 server.key.org

Step 4: Generating a Self-Signed Certificate

At this point you will need to generate a self-signed certificate because you either don’t plan on having your certificate signed by a CA, or you wish to test your new SSL implementation while the CA is signing your certificate. This temporary certificate will generate an error in the client browser to the effect that the signing certificate authority is unknown and not trusted.

To generate a temporary certificate which is good for 365 days, issue the following command:

openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt
Signature ok
subject=/C=CH/ST=Bern/L=Oberdiessbach/O=Akadia AG/OU=Information
Technology/CN=public.akadia.com/Email=martin dot zahn at akadia dot ch
Getting Private key
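Steps 1 through 4 can also be run non-interactively in one short script. This is a sketch that reuses the file names and subject from the example above; note that genrsa without -des3 produces an unencrypted key, so the passphrase-removal step is skipped entirely:

```shell
#!/bin/sh
set -e

# Step 1: private key, unencrypted (no -des3, so no passphrase prompts).
# 1024 bits to match the article; prefer 2048 or more in practice.
openssl genrsa -out server.key 1024

# Step 2: CSR with the X.509 attributes given on the command line
# instead of interactively.
openssl req -new -key server.key -out server.csr \
    -subj "/C=CH/ST=Bern/L=Oberdiessbach/O=Akadia AG/OU=Information Technology/CN=public.akadia.com"

# Step 4: self-sign the CSR, valid for 365 days.
openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt
```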

Step 5: Installing the Private Key and Certificate

When Apache with mod_ssl is installed, it creates several directories in the Apache config directory. The location of this directory will differ depending on how Apache was compiled.

cp server.crt /usr/local/apache/conf/ssl.crt
cp server.key /usr/local/apache/conf/ssl.key

Step 6: Configuring SSL Enabled Virtual Hosts

SSLEngine on
SSLCertificateFile /usr/local/apache/conf/ssl.crt/server.crt
SSLCertificateKeyFile /usr/local/apache/conf/ssl.key/server.key
SetEnvIf User-Agent ".*MSIE.*" nokeepalive ssl-unclean-shutdown
CustomLog logs/ssl_request_log \
"%t %h %{SSL_PROTOCOL}x %{SSL_CIPHER}x \"%r\" %b"

Step 7: Restart Apache and Test

/etc/init.d/httpd stop
/etc/init.d/httpd start

https://public.akadia.com

How to find the WWN (World Wide Name) in Sun Solaris

World Wide Names (WWNs) are unique 8-byte (64-bit) identifiers in SCSI or Fibre Channel, similar to the MAC addresses on a Network Interface Card (NIC).
Talking about WWN names, there are also:
World Wide Port Name (WWPN), a WWN assigned to a port on a fabric, which is what you will be looking for most of the time.
World Wide Node Name (WWNN), a WWN assigned to a node/device on a Fibre Channel fabric.
To find the WWN numbers of your HBA card in Sun Solaris, you can use one of the following procedures.
Using fcinfo (Solaris 10 only)
This is probably the easiest way to find the WWN numbers on your HBA card. Here you can see the HBA Port WWN (WWPN) and the Node WWN (WWNN) of the two ports on the installed QLogic HBA card.
This is also useful for finding the model number, firmware version, FCode, supported and current speeds, and the port status of the HBA card/port.

root@ sunserver:/root# fcinfo hba-port | grep WWN
HBA Port WWN: 2100001b32xxxxxx
Node WWN: 2000001b32xxxxxx
HBA Port WWN: 2101001b32yyyyyy
Node WWN: 2001001b32yyyyyy
For detailed info including make and model number, firmware, FCode, current status, and supported/current speeds:
root@ sunserver:/root# fcinfo hba-port
HBA Port WWN: 2100001b32xxxxxx
OS Device Name: /dev/cfg/c2
Manufacturer: QLogic Corp.
Model: 375-3356-02
Firmware Version: 4.04.01
FCode/BIOS Version: BIOS: 1.24; fcode: 1.24; EFI: 1.8;
Type: N-port
State: online
Supported Speeds: 1Gb 2Gb 4Gb
Current Speed: 4Gb
Node WWN: 2000001b32xxxxxx
HBA Port WWN: 2101001b32yyyyyy
OS Device Name: /dev/cfg/c3
Manufacturer: QLogic Corp.
Model: 375-3356-02
Firmware Version: 4.04.01
FCode/BIOS Version: BIOS: 1.24; fcode: 1.24; EFI: 1.8;
Type: unknown
State: offline
Supported Speeds: 1Gb 2Gb 4Gb
Current Speed: not established
Node WWN: 2001001b32yyyyyy

Using scli

root@ sunserver:/root# scli -i | egrep "Node Name|Port Name"
Node Name : 20-00-00-1B-32-XX-XX-XX
Port Name : 21-00-00-1B-32-XX-XX-XX
Node Name : 20-01-00-1B-32-YY-YY-YY
Port Name : 21-01-00-1B-32-YY-YY-YY

For more detailed info on the HBA cards, run scli as follows. The output is similar to fcinfo but also provides the model name and serial number.

root@ sunserver:/root# scli -i
------------------------------------------------------------------------------
Host Name : sunserver
HBA Model : QLE2462
HBA Alias :
Port : 1
Port Alias :
Node Name : 20-00-00-1B-32-XX-XX-XX
Port Name : 21-00-00-1B-32-XX-XX-XX
Port ID : 11-22-33
Serial Number : AAAAAAA-bbbbbbbbbb
Driver Version : qlc-20080514-2.28
FCode Version : 1.24
Firmware Version : 4.04.01
HBA Instance : 2
OS Instance : 2
HBA ID : 2-QLE2462
OptionROM BIOS Version : 1.24
OptionROM FCode Version : 1.24
OptionROM EFI Version : 1.08
OptionROM Firmware Version : 4.00.26
Actual Connection Mode : Point to Point
Actual Data Rate : 2 Gbps
PortType (Topology) : NPort
Total Number of Devices : 2
HBA Status : Online
------------------------------------------------------------------------------
Host Name : sunserver
HBA Model : QLE2462
HBA Alias :
Port : 2
Port Alias :
Node Name : 20-01-00-1B-32-YY-YY-YY
Port Name : 21-01-00-1B-32-YY-YY-YY
Port ID : 00-00-00
Serial Number : AAAAAAA-bbbbbbbbbb
Driver Version : qlc-20080514-2.28
FCode Version : 1.24
Firmware Version : 4.04.01
HBA Instance : 3
OS Instance : 3
HBA ID : 3-QLE2462
OptionROM BIOS Version : 1.24
OptionROM FCode Version : 1.24
OptionROM EFI Version : 1.08
OptionROM Firmware Version : 4.00.26
Actual Connection Mode : Unknown
Actual Data Rate : Unknown
PortType (Topology) : Unidentified
Total Number of Devices : 0
HBA Status : Loop down

Using prtconf
root@ sunserver:/root# prtconf -vp | grep -i wwn
port-wwn: 2100001b.32xxxxxx
node-wwn: 2000001b.32xxxxxx
port-wwn: 2101001b.32yyyyyy
node-wwn: 2001001b.32yyyyyy
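prtconf prints each WWN with a dot in the middle, while fcinfo prints a plain hex string and scli uses dash-separated byte pairs. These small helpers (illustrative only, shown with a made-up WWN) convert between the three spellings:

```shell
#!/bin/sh
# prtconf form "2100001b.32a1b2c3" -> fcinfo form "2100001b32a1b2c3"
wwn_flat() {
    echo "$1" | tr -d '.'
}

# prtconf form -> scli form "21-00-00-1B-32-A1-B2-C3"
wwn_dashed() {
    echo "$1" | tr -d '.' | sed 's/../&-/g; s/-$//' | tr 'a-f' 'A-F'
}

wwn_flat   2100001b.32a1b2c3    # prints 2100001b32a1b2c3
wwn_dashed 2100001b.32a1b2c3    # prints 21-00-00-1B-32-A1-B2-C3
```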
Using prtpicl
root@ sunserver:/root# prtpicl -v | grep wwn
:node-wwn 20 00 00 1b 32 xx xx xx
:port-wwn 21 00 00 1b 32 xx xx xx
:node-wwn 20 01 00 1b 32 yy yy yy
:port-wwn 21 01 00 1b 32 yy yy yy

Using luxadm
Run the following command to obtain the physical path to the HBA Ports
root@ sunserver:/root$ luxadm -e port
/devices/pci@400/pci@0/pci@9/SUNW,qlc@0/fp@0,0:devctl CONNECTED
/devices/pci@400/pci@0/pci@9/SUNW,qlc@0,1/fp@0,0:devctl NOT CONNECTED

With the physical path obtained from the above command, we can trace the WWN numbers as follows. Here I use the physical path to the port that is connected:
root@ sunserver:/root$ luxadm -e dump_map /devices/pci@400/pci@0/pci@9/SUNW,qlc@0/fp@0,0:devctl
Pos  Port_ID  Hard_Addr  Port WWN          Node WWN          Type
0    123456   0          1111111111111111  2222222222222222  0x0  (Disk device)
1    789123   0          1111111111111111  2222222222222222  0x0  (Disk device)
2    453789   0          2100001b32xxxxxx  2000001b32xxxxxx  0x1f (Unknown Type,Host Bus Adapter)

Hope this helps. If you know of any more ways, please feel free to post them in the comments and I shall amend the article.

Solaris 10 memory usage analysis

It so happens that I need to get a bit more insight into what's eating all the RAM on one of my Solaris boxes. Whenever this happens I can never remember all the various incantations, so I'm putting them all here for future reference.
Most of these need to run as root.

prstat -a -s rss
- Quick overview of top processes ordered by physical memory consumption, plus memory consumption per user. Note that if you have lots of processes all sharing (say) 1GB of shared memory, each process will show up as using that 1GB (very noticeable with Oracle, where there can be 100 processes, each with a hook into the multi-GB SGA).

ls -l /proc/{pid}/as
A nice, easy way to see the address space (total memory usage) of a single process. Good for when you want to see the memory usage of a set of processes too large to fit into prstat, e.g.:

# is apache leaking?
for pid in `pgrep httpd`
do
ls -l /proc/$pid/as
done
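The per-process figures from that loop can be totaled with awk. There is no /proc/<pid>/as on Linux, so the sketch below feeds in sample ls -l output to stay self-contained; on a real Solaris box you would pipe the loop's output in instead:

```shell
#!/bin/sh
# Sample output of "ls -l /proc/<pid>/as"; column 5 is the size in bytes.
as_sample='-rw-------   1 nagios nagios 268435456 Jan  1 00:00 /proc/101/as
-rw-------   1 nagios nagios 134217728 Jan  1 00:00 /proc/102/as'

# Sum the address-space sizes and print the total in MB.
echo "$as_sample" | awk '{ total += $5 }
    END { printf "total: %.1f MB\n", total / (1024 * 1024) }'
```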
vmstat -S 3
Am I swapping? Watch the swap in/out columns; if they're not 0, you need more RAM.

vmstat 3
Am I thinking about swapping? The sr (scan rate) column tells you when you're starting to run low on memory and the kernel is scanning physical memory to find pages that can be swapped out. Cf. Myth: Using swap is bad for performance.

echo "::memstat" | mdb -k
How much memory is being used by the kernel and/or the (UFS) file system caches? (N.B. the kernel memory usage includes the ZFS ARC cache - see below.) Warning: this can take several minutes to run, and sucks up a lot of CPU time.

kstat -m zfs
How much memory is the ZFS ARC cache using? (N.B. if you have lots of ZFS data, this can be a very big number; the ARC cache will use up to (system RAM - 1GB), but it should release RAM as soon as other apps need it.)

chroot login HOWTO

Introduction
Testing new cross-toolchains properly requires overriding your target system’s core shared libraries with the newly created, and probably buggy, ones. Doing this without nuking your target system requires creating a sandbox of some sort. This document describes one way to set up a chroot jail on your target system suitable for running gcc and glibc remote regression tests (or many other purposes).
1. Set Up and Test a Jail
Two shell scripts are provided as an example of how to set up a jail on Linux. Please read and understand them. They are provided as executable shell scripts, but they should not be run blindly. You may need to edit the two scripts to reflect your target system's peculiarities and the needs of the programs you intend to run in the jail. The scripts as furnished have been tested both on Debian x86 and on an embedded PPC Linux system based on busybox, and contain everything needed to run the gcc and glibc regression tests (which isn't much). They assume that either busybox or ash is available, so if you're running this on a workstation, you'll need to install ash before running them. (ash is preferred over bash simply because it has fewer dependencies; you can use bash, but you'll need to modify the scripts to add the additional libraries used by bash, e.g. ncurses.)
The first shell script, mkjail.sh, makes a tarball containing core shared libraries and the jail’s etc/passwd file. The second shell script, initjail.sh, unpacks that tarball into the jail, and adds crucial /dev entries, a /proc filesystem, /etc files, core programs like sh, and non-toolchain shared libraries, and appends a given file to the jail’s /etc/passwd file.

The two-step process is exactly what you need when testing cross-toolchains; you run the first script on the development system, and the second script on the target system. However, you should run them both on the target system initially and verify that the jail works before testing them with your newly compiled and probably buggy toolchain’s shared libraries.

1.1. Setting up a jail
For example, here’s how to use the scripts to set up a minimal jail containing a copy of the system libraries and system /etc/passwd file:
$ sh mkjail.sh / /etc/passwd
$ su
# zcat jail.tar.gz | sh initjail.sh myjail

1.2. Testing the Jail
Once you have the jail set up, test it by hand using the /usr/sbin/chroot program to run a shell inside the jail, e.g.
$ su
# /usr/sbin/chroot `pwd`/myjail /bin/sh

(Note: the argument to chroot must be an absolute path, else exec fails. This appears to be a bug in Linux.) If you can’t run a shell inside the jail, try running something easier, e.g. /bin/true (the simplest program there is):
$ su
# /usr/sbin/chroot `pwd`/myjail /bin/true

Once you get that working, go back to trying the shell. The most common cause of programs not running in the jail is missing shared libraries; to fix that, just copy the missing libraries from the system into the corresponding place in the jail. (Note that shared libraries consist of one or two files, and zero or more symbolic links; take care to not follow symbolic links when copying. In the provided scripts, I use the -d option to /bin/cp to avoid dereferencing links.)
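That "copy the missing shared libraries" step can be scripted. This is only a sketch: jail_copy is a hypothetical helper, it assumes GNU cp (for --parents), and it parses ldd output, so check it against your system before trusting it. As the text above advises, -d copies symlinks as links rather than dereferencing them:

```shell
#!/bin/sh
# Copy a dynamically linked program plus every shared library ldd
# reports into a jail directory, preserving directory layout.
jail_copy() {
    jail=$1
    prog=$2
    mkdir -p "$jail"
    cp -d --parents "$prog" "$jail"
    # ldd prints "libX => /path/libX (addr)" or "/path/ld-linux... (addr)";
    # take the path field from either form and skip pseudo-entries
    # like linux-vdso.so.1 that have no file behind them.
    for lib in $(ldd "$prog" | awk '$2 == "=>" { print $3 } NF == 2 { print $1 }'); do
        [ -e "$lib" ] && cp -d --parents "$lib" "$jail"
    done
    return 0
}

# Example: jail_copy "$PWD/myjail" /bin/sh
```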
2. Configure Jail Login Scheme
Specific users can be configured such that the moment they log in, a wrapper program (see chrootshell.c) jails the user in his home directory using the chroot system call, looks up his record in the jail’s private /etc/passwd file, and uses it to set his current directory and transfer control to his preferred shell. Every program and shared library the user executes then comes from the jail, not from the surrounding system.
Only users who have the root password can set up jails, partly because setting up a jail requires mounting /proc inside the jail. Users with the root password can set up jails on behalf of lesser users (though you might want to take away their ability to run setuid root programs if you do that, else they might be able to get out of the jail…)

2.1. Install chrootshell
chrootshell will be installed setuid root, so first audit the source (chrootshell.c) for security problems, then compile and install it with the commands
$ cc chrootshell.c -o chrootshell
$ su
# install -o root -m 4755 chrootshell /sbin/chrootshell

You probably need to add the line
/sbin/chrootshell

to /etc/shells, else login may refuse to run it.
2.2. Create outer /etc/passwd entry
For each user you want to give access to a jail, create a new /etc/passwd entry for each user/jail combination, with home directory set to the jail’s top directory, and the shell set to /sbin/chrootshell, e.g.
fred2:x:1000:1000:Fred Smith's Jail:/home/fred/jail:/sbin/chrootshell

2.3. Create inner /etc/passwd entry
For each user created in the previous step, create a new /etc/passwd entry in the jail’s /etc/passwd. This entry should list a real home directory inside the jail, and a real shell, so it’ll be quite different from the one in the system /etc/passwd. For example:
fred2:x:1000:1000:Fred Smith In Jail:/home/fred2:/bin/sh

2.4. Test and Troubleshoot Jail Login
Try logging in as a jailed user via telnet, e.g.
$ telnet localhost
username: fred2
password: xxxx
$ ls

When you do an ls, you should see just the files in that user’s home directory inside the jail.
If telnet just closes the connection after login, first you need to figure out whether the problem is before or after chrootshell starts. Begin by listing the target system’s logfiles, e.g.

$ su
# cd /var/log
# ls -ltr

and examine the most recently changed log files for telnet or PAM permission problems. If you see any, it means you haven’t even gotten to chrootshell yet.
If it looks like chrootshell is started, but then aborts, you can turn on more helpful logging by uncommenting the line

/* #define DEBUG_PRINTS */

and recompiling and reinstalling chrootshell. This will cause verbose error messages to be sent directly to the file /var/log/chrootshell.log. The error messages tell you what line chrootshell.c is failing in, which should tell you enough to solve the problem.
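The uncomment-rebuild-reinstall cycle can be done in three commands. A sketch, assuming GNU sed and a chrootshell.c commented exactly as shown above:

```shell
#!/bin/sh
# Turn on chrootshell's debug logging, then rebuild and reinstall it.
# The sed pattern matches the commented-out line verbatim.
sed -i 's|/\* *#define DEBUG_PRINTS *\*/|#define DEBUG_PRINTS|' chrootshell.c
cc chrootshell.c -o chrootshell
su -c 'install -o root -m 4755 chrootshell /sbin/chrootshell'
```

Remember to reverse the change once the problem is solved, or the log file will keep growing.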
3. Setting Up Berkeley r-utilities (rsh, rlogin, rcp)
For truly convenient (but insecure) access to the jail, you may want to set up the Berkeley r-utilities (rsh, rlogin, and rcp). For instance, they are the native and probably preferred way of running the gcc and glibc test suites on remote embedded systems with TCP/IP connectivity.
Incidentally, rcp and rsh are the reason I used a C program for chrootshell rather than a shell script. A shell script version of chrootshell seems to fail to handle some of the remote commands sent by the gcc regression test due to quoting problems.

3.1. Installing r-utilities clients and servers
Beware: this is highly insecure!
To do this on a Red Hat or Debian Linux workstation, install the rsh-server and rsh-client packages. You can also do this by downloading, building, and installing ftp://ftp.gnu.org/gnu/inetutils/inetutils-1.4.2.tar.gz. (Don’t use version 1.4.1; the rshd in that version did not set up the remote environment properly.) For an embedded target, the only parts of inetutils you really need to build and install are rcp, rshd, and rlogind. (Yes, you need rcp even if you just want to run the server side.)

Once the binaries are installed on the target, you need to configure the system to run them.

rshd is run via inetd or xinetd or the like.
If your target system uses inetd, here’s an example line to add to /etc/inetd.conf:

shell stream tcp nowait.1000 root /bin/rshd

The “.1000” tells inetd to allow up to 1000 sessions per minute, which should be sufficient to handle the gcc regression tests. (If you leave this off, inetd will stop spawning rshd after about the 40th session in a one-minute period.)
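If inetd is already configured with a plain “nowait”, the limit can be raised in place. A sketch using GNU sed; the sed address and the use of pidof are assumptions about a typical Linux target:

```shell
#!/bin/sh
# Raise inetd's per-minute spawn limit for the 'shell' service to 1000,
# whether or not a ".N" suffix is already present, then tell inetd to
# reread its configuration.
sed -i '/^shell[[:space:]]/s/nowait\(\.[0-9]*\)\{0,1\}/nowait.1000/' /etc/inetd.conf
kill -HUP "$(pidof inetd)"
```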
If your target system uses xinetd, you probably need to set the “cps” attribute of /etc/xinetd.d/rsh to a large value such as “200 5” for xinetd to handle the expected traffic. I haven’t tested this, but here’s what I think /etc/xinetd.d/rsh should look like:

service shell
{
socket_type = stream
wait = no
cps = 200 5
user = root
server = /usr/sbin/in.rshd
disable = no
}

rlogind is normally run standalone by giving the -d option. For instance, add this line to the target system’s startup scripts:

/bin/rlogind -d

If you want to allow remote access by root (which is highly insecure, but useful in limited situations, as you’ll see below), add the -o option.
3.2. Opening up a security hole for the r-utilities
If your systems use a firewall, you’ll need to open up TCP ports 513 (the ‘login’ service) and 514 (the ‘shell’ service). Note that this is a highly insecure thing to do, and should only be done inside a network entirely disconnected from the Internet or any large LAN, or at least well-shielded from it by another firewall.
If your system is using PAM, you may need to add entries in /etc/pam.d for rsh and rlogin. (If you installed the rsh-servers package, it probably added these entries for you.)

You probably need to add entries either to the systemwide /etc/hosts.equiv file (see ‘man hosts.equiv’) or the per-user .rhosts file (see ‘man rhosts’).
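Setting up a per-user .rhosts can be sketched as follows. rshd typically ignores the file unless it is owned by that user and not writable by anyone else, hence the chown and chmod; the host name ‘wkst’ and user ‘fred’ are placeholders:

```shell
#!/bin/sh
# Grant the workstation 'wkst' password-less rsh access to account 'fred'.
HOMEDIR=$(getent passwd fred | cut -d: -f6)
printf 'wkst fred\n' >> "$HOMEDIR/.rhosts"
chown fred "$HOMEDIR/.rhosts"   # must be owned by the user
chmod 600 "$HOMEDIR/.rhosts"    # must not be group/world writable
```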

3.3. Testing and Troubleshooting rsh
First, verify you can run ‘ls’ remotely without the jail. Assuming the target system’s hostname is ‘target’, and assuming you have an account with the same name on both systems, give the following command on your workstation:
$ rsh target ls

If this doesn’t work, look at the most recently modified files in the target’s /var/log directory for interesting error messages. If that doesn’t explain the problem, try running tcpdump and looking at the packets being sent and received. If that doesn’t explain the problem, you can run rshd under strace to see what files it’s opening — maybe it’s not looking where you expected for security info. (This requires modifying /etc/inetd.conf to invoke strace -o /tmp/strace.log /bin/rshd instead of just /bin/rshd.) When all else fails, you could build rsh from source and step through it in the debugger.
Once that’s working, verify that you can spawn 200 commands in quick succession, e.g.

x=0; while test $x -lt 200; do rsh 11.3.4.1 ls; x=$(($x+1)); echo $x; done

and once that works, really overstress the system by verifying you can spawn 200 overlapping commands, e.g.
x=0; while test $x -lt 200; do rsh -n 11.3.4.1 ls & true ; x=$(($x+1)); echo $x; done

If you get the error
poll: protocol failure in circuit setup

around the 40th iteration, then you may need to edit /etc/inetd.conf and add .1000 to the end of the ‘nowait’ keyword (or edit /etc/xinetd.d/rsh and add the cps parameter…) as described above.
3.4. Testing rlogin
Verify you can get a remote shell with the command ‘rlogin target’ or its synonym ‘rsh target’ (rsh invokes rlogin if it’s run without a command).
3.5. Testing rcp
Verify you can copy files to the target system using rcp, e.g.
rcp /etc/hosts target:

4. Setting up chroot jails remotely
The entire reason I wrote this document was to help me achieve the goal of fully automated build-and-test of cross-toolchains. This requires a way to remotely script replacing the contents of a chroot jail with new system libraries. Once a jail has been set up for the user you plan to run the remote tests as, and it has passed all the above tests, you should be able to blow away the old jail’s contents and recreate the jail remotely, filled with the system shared libraries you plan to test. For example, assuming the toolchain you want to test is in /opt/my_toolchain, and the user ‘jailuser’ is set up to be inside the jail on the target, you can rebuild it with the following commands:
$ TARGET=11.3.4.1
$ JAILUSER=jailuser
$ rcp $JAILUSER@$TARGET:/jail/etc/passwd jail_etc_passwd
$ sh mkjail.sh result/sh4-unknown-linux-gnu/gcc-3.3-glibc-2.2.5/sh4-unknown-linux-gnu jail_etc_passwd
$ rcp initjail.sh root@$TARGET:
$ cat jail.tar.gz | rsh -l root $TARGET /bin/sh initjail.sh /jail

Then test the jail to make sure it still works, e.g.
$ rsh -l jailuser target ls
$ rcp /etc/hosts jailuser@target:/tmp

Summary
This document has shown how to set up a target system to allow remote login and file copy into accounts running in a chroot jail. If everything described here works properly, you should be all set to run gcc regression tests remotely on the target system.
About this document
Ideas and code snippets taken variously from:
the very similar http://www.tjw.org/chroot-login-HOWTO, by Terry J. White and Brian Rhodes.
SVR4’s login’s feature whereby a “*” in the shell field of /etc/passwd triggered a chroot, http://www.mcsr.olemiss.edu/cgi-bin/man-cgi?login+1
Martin P. Ibert’s 1993 post, http://groups.google.com/groups?selm=HNLAB98U%40heaven7.in-berlin.de
Ulf Bartelt’s 1994 post, http://groups.google.com/groups?selm=1994Jun5.144526.9091%40solaris.rz.tu-clausthal.de
Mike Makonnen’s June 2003 post, http://groups.google.com/groups?selm=bbeuh2%2416j0%241%40FreeBSD.csie.NCTU.edu.tw

Solaris Fault Management

The Solaris Fault Management Facility is designed to be integrated into the Service Management Facility to provide a self-healing capability to Solaris 10 systems.

The fmd daemon is responsible for monitoring several aspects of system health.

The fmadm config command shows the current configuration for fmd.

The Fault Manager logs can be viewed with fmdump -v and fmdump -e -v.

fmadm faulty will list any devices flagged as faulty.

fmstat shows statistics gathered by fmd.

Fault Management

With Solaris 10, Sun has implemented a daemon, fmd, to track and react to faults. In addition to sending traditional syslog messages, the system sends binary telemetry events to fmd for correlation and analysis. Solaris 10 implements default fault management operations for several pieces of hardware in SPARC systems, including CPU, memory, and I/O bus events. Similar capabilities are being implemented for x64 systems.

Once the problem is defined, failing components may be offlined automatically without a system crash, or other corrective action may be taken by fmd. If a service dies as a result of the fault, the Service Management Facility (SMF) will attempt to restart it and any dependent processes.

The Fault Management Facility reports error messages in a well-defined and explicit format. Each error code is uniquely specified by a Universally Unique Identifier (UUID) related to a document on the Sun web site at http://www.sun.com/msg/.

Resources are uniquely identified by a Fault Managed Resource Identifier (FMRI). Each Field Replaceable Unit (FRU) has its own FMRI. FMRIs are associated with one of the following conditions:

  • ok: Present and available for use.
  • unknown: Not present or not usable, perhaps because it has been offlined or unconfigured.
  • degraded: Present and usable, but one or more problems have been identified.
  • faulted: Present but not usable; unrecoverable problems have been diagnosed and the resource has been disabled to prevent damage to the system.

The fmdump -V -u eventid command can be used to pull information on the type and location of the event. (The eventid is included in the text of the error message provided to syslog.) The -e option can be used to pull error log information rather than fault log information.

Statistical information on the performance of fmd can be viewed via the fmstat command. In particular, fmstat -m modulename provides information for a given module.

The fmadm command provides administrative support for the Fault Management Facility. It allows us to load and unload modules and view and update the resource cache. The most useful capabilities of fmadm are provided through the following subcommands:

  • config: Display the configuration of component modules.
  • faulty: Display faulted resources. With the -a option, list cached resource information. With the -i option, list persistent cache identifier information, instead of most recent state and UUID.
  • load /path/module: Load the module.
  • unload module: Unload module; the module name is the same as reported by fmadm config.
  • rotate logfile: Schedule rotation for the specified log file. Used with the logadm configuration file.

Additional Resources

Amy Rich’s Predictive Self-Healing Article

Mike Shapiro’s magazine article and presentation contain a good discussion of the architectural underpinnings of the Fault Manager.

Gavin Maltby reports on AMD fault management in this blog entry.

matty’s blog has a short introduction to Fault Management on Solaris.