LNXHC - Linux Health Checker

Health checks

  1. boot_runlevel_recommended
  2. cpu_capacity
  3. crypto_cca_stack
  4. crypto_cpacf
  5. crypto_opencryptoki_ckc
  6. crypto_opencryptoki_ckc_32bit
  7. crypto_opencryptoki_skc
  8. crypto_opencryptoki_skc_32bit
  9. crypto_openssl_ibmca_config
  10. crypto_openssl_stack
  11. crypto_openssl_stack_32bit
  12. crypto_z_module_loaded
  13. css_ccw_blacklist
  14. css_ccw_chpid_status
  15. css_ccw_device_availability
  16. css_ccw_device_usage
  17. css_ccw_driver_association
  18. fc_remote_port_state
  19. fs_boot_zipl_bootmap
  20. fs_fstab_dasd_devnodes
  21. fs_fstab_fsck_order
  22. fs_inode_usage
  23. fs_mount_option_ro
  24. fs_tmp_cleanup
  25. fs_usage
  26. fw_callhome
  27. fw_cpi
  28. log_syslog_rotate
  29. mem_swap_availability
  30. mem_usage
  31. net_bond_ineffective
  32. net_bond_qeth_ineffective
  33. net_dns_settings
  34. net_hsi_outbound_errors
  35. net_inbound_errors
  36. net_qeth_buffercount
  37. net_services_insecure
  38. proc_cpu_usage
  39. proc_load_avg
  40. proc_mem_oom_triggered
  41. proc_mem_usage
  42. proc_priv_dump
  43. ras_dump_kdump_on_panic
  44. ras_dump_on_panic
  45. ras_panic_on_oops
  46. scsi_dev_state
  47. sec_tty_root_login
  48. sec_users_uid_zero
  49. storage_dasd_cdl_part
  50. storage_dasd_eckd_blksize
  51. storage_dasd_nopav_zvm
  52. storage_dasd_pav_aliases
  53. storage_mp_ineffective
  54. storage_mp_path_state
  55. storage_mp_service_active
  56. storage_mp_zfcp_redundancy
  57. tty_console_getty
  58. tty_console_log_level
  59. tty_devnodes
  60. tty_hvc_iucv
  61. tty_idle_terminals
  62. tty_idle_users
  63. tty_usage
  64. zfcp_hba_npiv_active
  65. zfcp_hba_recovery_failed
  66. zfcp_hba_shared_chpids
  67. zfcp_lun_configured_available
  68. zfcp_lun_recovery_failed
  69. zfcp_target_port_recovery_failed
  70. zvm_priv_class

1. Health check "boot_runlevel_recommended"

Component

boot

Title

Check whether the recommended runlevel is used and set as default

Description

Running Linux with an unsuitable runlevel can mean that required services are not available, or it can mean that unnecessary processes degrade performance or security.

Linux runlevels are usually expressed as integers in the range 0 to 6, where 0 and 6 are reserved for halt and reboot. The meaning of runlevels 1 to 5 differs between distributions. See the "init" man page of your distribution for details.

Dependencies

(sys_distro=RHEL and sys_rhel_version>=5.0) or (sys_distro=SLES and sys_sles_version>=10)

Authors

Rajesh K Pirati <rapirati@in.ibm.com>

Parameter "recommended_runlevel"

Description

The recommended runlevel for the Linux instance. Valid values are integers in the range 1 to 5.

Default value

3

Exception "current_runlevel_differs"

Severity

medium

Summary

The current runlevel (&current_runlevel;) does not match the recommended runlevel (&param_recommended_runlevel;)

Explanation

The recommended runlevel for the Linux instance is &param_recommended_runlevel;, but currently runlevel &current_runlevel; is used.

Linux runlevels define which processes can run. Linux runlevels are usually expressed as integers in the range 0 to 6. Runlevels 0 and 6 are reserved for halt and reboot. The meaning of runlevels 1 to 5 differs between distributions.

Solution

To temporarily change the current runlevel, use the "init" command. For example, to change the runlevel to 3, issue:

init 3

To change the runlevel that is used after booting Linux, change the default runlevel in /etc/inittab. For example, to change the default runlevel from 5 to 3 change the line

id:5:initdefault

to

id:3:initdefault

In this line, the entry identifier before the first colon depends on the distribution and need not be "id" as shown in the example.

If Linux uses the correct runlevel, adjust the "recommended_runlevel" check parameter accordingly to prevent this warning in the future.
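The /etc/inittab edit described above can also be scripted. This is a minimal sketch, assuming a one-line initdefault entry as shown in the example; the helper name is hypothetical, and the output should be reviewed before it is written back to /etc/inittab:

```shell
# Hypothetical helper: rewrite the initdefault entry from one runlevel to
# another. The entry identifier before the first colon is preserved, because
# it varies between distributions.
change_default_runlevel() {   # usage: change_default_runlevel <from> <to>
  sed "s/^\([^:]*\):$1:initdefault/\1:$2:initdefault/"
}

# Preview the change on a sample line; on a real system, feed /etc/inittab in
# and review the result before replacing the file.
echo 'id:5:initdefault' | change_default_runlevel 5 3   # prints "id:3:initdefault"
```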

Reference

For information about the available runlevels and about changing the current runlevel, see the "init" man page.

Exception "default_runlevel_differs"

Severity

medium

Summary

The default runlevel (&default_runlevel;) does not match the recommended runlevel (&param_recommended_runlevel;)

Explanation

The recommended runlevel for the Linux instance is &param_recommended_runlevel;, but currently runlevel &default_runlevel; is set as default.

Linux runlevels define which processes can run. Linux runlevels are usually expressed as integers in the range 0 to 6. Runlevels 0 and 6 are reserved for halt and reboot. The meaning of runlevels 1 to 5 differs between distributions.

Solution

To change the runlevel that is used after booting Linux, change the default runlevel in /etc/inittab. For example, to change the default runlevel from 5 to 3 change the line

id:5:initdefault

to

id:3:initdefault

In this line, the entry identifier before the first colon depends on the distribution and need not be "id" as shown in the example.

If Linux uses the correct runlevel, adjust the "recommended_runlevel" check parameter accordingly to prevent this warning in the future.

Reference

For information about the available runlevels and about changing the current runlevel, see the "init" man page.

2. Health check "cpu_capacity"

Component

cpu

Title

Check whether the CPUs run with reduced capacity

Description

External events or reconfigurations might cause CPUs to run with reduced capacity. This check examines the CPU capacity-adjustment indication and capacity-change reason codes of the System z mainframe.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Parameter "acceptable_cap_adj"

Description

The lowest acceptable CPU capacity-adjustment indication. The default value is 100, for regular capacity. Lower values indicate reduced capacity. An exception is raised if the System z mainframe reports a capacity-adjustment indication below this value.

Change this value only if your System z mainframe intentionally runs with reduced capacity, for example, in power-saving mode. Valid values are integers in the range 1 to 100.

Default value

100

Parameter "expected_cap_rs"

Description

The expected capacity-change reason. The default value is 0, for regular operations without capacity changes. An exception is raised if the System z mainframe reports a capacity-change reason other than this value.

Change this value to 1 if your System z mainframe runs in power-saving mode.

Default value

0

Exception "capacity_reduced"

Severity

high

Summary

The System z mainframe runs with reduced capacity

Explanation

The CPUs of the System z mainframe run with reduced capacity, for example, in power-saving mode. This affects all Linux and other operating system instances running on this particular System z hardware.

Apart from intentional configuration for power-saving, reduced capacity can result from, for example, overheating of the hardware.

The lowest acceptable capacity-adjustment indication is "&param_acceptable_cap_adj;" with a capacity-change reason "&param_expected_cap_rs;".

The current capacity-adjustment indication is "&cap_adj_ind;" with a capacity-change reason "&cap_ch_rs;".

You can find the current capacity-adjustment values in /proc/sysinfo. Look for the "Capacity Adj. Ind." and "Capacity Ch. Reason" entries.
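The two /proc/sysinfo entries can be extracted with grep. This is a minimal sketch; the here-document stands in for /proc/sysinfo, which exists only on System z, and the values shown (100 and 0) are illustrative:

```shell
# Extract the capacity-adjustment indication and capacity-change reason
# from a sysinfo-style file. On a real System z system, pass /proc/sysinfo.
show_capacity() {
  grep -E '^Capacity (Adj\. Ind\.|Ch\. Reason)' "$1"
}

# Illustrative sample in place of the real /proc/sysinfo.
cat > sysinfo.sample <<'EOF'
Manufacturer:         IBM
Capacity Adj. Ind.:   100
Capacity Ch. Reason:  0
EOF

show_capacity sysinfo.sample
```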

Solution

If your System z mainframe intentionally runs with reduced capacity, adjust the check parameters accordingly to prevent this check from raising further exceptions. In other cases, contact your hardware support team.

Reference

For information about the capacity-change reason codes, see the "Principles of Operation" for your System z mainframe.

3. Health check "crypto_cca_stack"

Component

crypto/cca

Title

Verify the availability of System z cryptographic hardware support through a Common Cryptographic Architecture (CCA) stack

Description

Applications using cryptographic operations by linking to CCA libraries can only exploit System z cryptography hardware if the CCA stack is configured correctly.

Cryptographic coprocessor adapters must be available. Prerequisites for a correctly configured CCA stack that uses System z cryptographic hardware are a device driver that exploits cryptographic adapters and the csulcca library.

The CCA stack is required for secure key cryptography operations.

This health check verifies that:

  • The Cryptographic Coprocessor is available

  • Required RPMs, such as 'csulcca', are available

Dependencies

sys_platform=s390x

(sys_distro=RHEL and sys_rhel_version>=5.4) or (sys_distro=SLES and sys_sles_version>=11)

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "crypto_coprocessors_not_available"

Severity

high

Summary

Required Cryptographic Coprocessor is not available

Explanation

The required Cryptographic Coprocessor is not available. Secure key cryptographic functions and true random number generation are not supported.

To verify whether the required Cryptographic Coprocessor is available, issue:

lszcrypt

and look for the following line in the output:

card<nn>: CEX?C

where the question mark (?) denotes the series of the Cryptographic Coprocessor, for example 2 or 3.

Solution

If the Cryptographic Coprocessor is not attached to the Linux system, follow the procedure described in the 'Technical Guide' related to your System z. For example, for a System z server z196, refer to the 'IBM zEnterprise 196 Technical Guide'.

If the Cryptographic Coprocessor is attached but not online, you can use the 'chzcrypt' tool to set coprocessors online. See the 'Generic cryptographic device driver' chapter in 'Device Drivers, Features, and Commands'.

Reference

You can obtain Technical Guides from http://www.redbooks.ibm.com

You can obtain the 'Device Drivers, Features, and Commands' publication from http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "rpms_not_installed"

Severity

high

Summary

Required RPMs are not installed (&rpm_summ;)

Explanation

Required RPMs are not installed. The Linux system cannot exploit the cryptographic hardware. Some applications or libraries will emulate cryptographic operations in software instead, but this emulation will decrease the system performance exceedingly.

The following RPMs are required but not installed: &rpm;

To verify whether the required RPMs are installed, issue:

rpm -qa | grep "<RPMname>"

where <RPMname> is the name of a required RPM.

Examples:

To check for a single RPM whose name contains the string 'avahi', issue:

rpm -qa | grep "avahi"

To check for several RPMs at once, for example RPMs whose names contain the string 'avahi' or 'postfix', issue:

rpm -qa | grep -E "avahi|postfix"
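The per-package checks can also be combined into one report. This is a minimal sketch, assuming the output of 'rpm -qa' on stdin; the helper name and the package names are hypothetical:

```shell
# Hypothetical helper: given "rpm -qa" output on stdin, report each required
# package name (passed as an argument) that does not appear in the list.
report_missing() {
  local installed pkg
  installed=$(cat)
  for pkg in "$@"; do
    printf '%s\n' "$installed" | grep -q "^$pkg" || echo "missing: $pkg"
  done
}

# On a real system: rpm -qa | report_missing openCryptoki libica csulcca
# Illustrative installed list:
printf 'libica-2.1.1\nopenCryptoki-2.4\n' | report_missing openCryptoki libica csulcca
```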

Solution

Install the required RPMs by using the following command or by using the specific options available from your distribution.

rpm -ivh <RPMname> [<RPMname> ...]

where <RPMname> is the name of the required RPM.

You can download CCA RPMs from http://www.ibm.com/security/cryptocards/pciecc/ordersoftware.shtml

Reference

See the man page of the 'rpm' command.

4. Health check "crypto_cpacf"

Component

crypto/cpacf

Title

Confirm that CPACF is enabled

Description

The CP Assist for Cryptographic Functions (CPACF) accelerates symmetric cryptographic algorithms. This check verifies that CPACF is enabled on the system.

CPACF is a mandatory prerequisite for hardware-based acceleration of cryptographic operations in the following contexts:

  • the OpenSSL software stack (see also checks crypto_openssl_stack and crypto_openssl_stack_32bit)

  • the clear key openCryptoki (PKCS#11) software stack (see also checks crypto_opencryptoki_ckc and crypto_opencryptoki_ckc_32bit)

  • Linux kernel-internal cryptographic operations, such as dm-crypt and IPSec

CPACF is also required for the availability of the /dev/prng pseudo random number generator device.

Optionally, CPACF enables protected key operation for the Common Cryptographic Architecture (CCA) software stack (see also checks crypto_cca_stack, crypto_opencryptoki_skc, and crypto_opencryptoki_skc_32bit).

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Exception "cpacf_not_enabled"

Severity

medium

Summary

CPACF is not enabled

Explanation

The CP Assist for Cryptographic Functions (CPACF) feature is not enabled. As a result, hardware-based acceleration of cryptographic operations is not available in the OpenSSL stack, the clear key openCryptoki stack, and in the Linux kernel. The following health checks will not be applicable:

  • crypto_openssl_stack and crypto_openssl_stack_32bit

  • crypto_opencryptoki_ckc and crypto_opencryptoki_ckc_32bit

In addition, protected key cryptography is not available to CCA-based secure key cryptography stacks (see health checks crypto_cca_stack, crypto_opencryptoki_skc, and crypto_opencryptoki_skc_32bit).

To establish whether CPACF has been enabled on your hardware, issue:

# cat /proc/cpuinfo

If CPACF has been enabled, the listed features include "msa".

Solution

CPACF is activated using a no-charge enablement feature, FC 3863.

Contact your System z support team for further assistance.

Reference

None.

5. Health check "crypto_opencryptoki_ckc"

Component

crypto/opencryptoki

Title

Verify the availability of System z cryptographic hardware support for PKCS#11 clear key cryptographic operations

Description

Software that uses clear key cryptographic functions via openCryptoki (PKCS#11 API) can exploit System z cryptographic hardware if the openCryptoki clear key cryptographic stack is set up correctly. If the setup is incorrect or incomplete, Linux may in some cases emulate cryptographic operations by software. However, this emulation will decrease system performance.

The openCryptoki clear key cryptographic stack comprises openCryptoki together with the ICA token, the libica, possibly the System z cryptography kernel device driver and access to system cryptographic hardware features like CPACF and cryptographic adapters.

This health check verifies that:

  • The Cryptographic hardware (coprocessor and/or accelerator adapters) is available

  • Required RPMs, such as 'openCryptoki' and 'libica', are available

  • openCryptoki is initialized

  • The ICA token is configured

Dependencies

sys_platform=s390x

(sys_distro=RHEL and sys_rhel_version>=5.4) or (sys_distro=SLES and sys_sles_version>=11)

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "crypto_adapters_not_available"

Severity

medium

Summary

Required cryptographic adapters are not available (&crypto_hw;)

Explanation

Required cryptographic adapters are not available. The Linux system will emulate cryptographic operations by software. This emulation will decrease the system performance exceedingly.

The following cryptographic adapters are required but not available: &crypto_hw;

To verify whether a required cryptographic adapter is available, issue:

lszcrypt

and look for the following line in the output:

card<nn>: CEXxy
where
x denotes the series of the cryptographic adapter, such as 2 or 3.
y denotes the type of the cryptographic adapter: 'C' for Coprocessor, and 'A' for Accelerator.

Solution

If the cryptographic adapter is not attached to the Linux system, follow the procedure described in the 'Technical Guide' related to your System z. For example, for a System z server z196, refer to the 'IBM zEnterprise 196 Technical Guide'.

If the cryptographic adapter is attached but not online, you can use the 'chzcrypt' tool to set coprocessors online, see the 'Generic cryptographic device driver' chapter in 'Device Drivers, Features, and Commands'.

Reference

You can obtain Technical Guides from http://www.redbooks.ibm.com

You can obtain the 'Device Drivers, Features, and Commands' publication from http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "ica_token_not_configured"

Severity

high

Summary

The ICA token is not configured, cryptographic hardware cannot be exploited for clear key cryptography

Explanation

The ICA token is not configured. The cryptographic hardware cannot be exploited for clear key cryptographic operations. Some applications or libraries will emulate cryptographic operations, such as encryption and decryption, in software instead, but this emulation will decrease the system performance exceedingly.

To verify the state of the ICA token, issue:

 pkcsconf -t

If the ICA token is configured correctly, the following details will be displayed for the token that has the Model attribute set to 'IBM ICA':

Flags: 0x44D (RNG|LOGIN_REQUIRED|USER_PIN_INITIALIZED|CLOCK_ON_TOKEN|TOKEN_INITIALIZED)
where
       'USER_PIN_INITIALIZED' means that the user password has been changed, which is mandatory.
       'TOKEN_INITIALIZED' means that the token has been initialized and is ready for usage.
       If the flag 'TOKEN_INITIALIZED' is not displayed, you need to initialize the ICA token.
       If the flag 'SO_PIN_TO_BE_CHANGED' is displayed, the Security Officer's (SO) password still has the default value and needs to be changed.
       If the flag 'USER_PIN_TO_BE_CHANGED' is displayed, you need to change the user password.
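Because the Flags value is a bit mask, the individual flags can also be checked numerically. This is a minimal sketch, assuming the bit values of the standard PKCS#11 CKF_* constants; the helper name is hypothetical:

```shell
# Hypothetical helper: decode a pkcsconf "Flags" value into the flag names
# discussed above. The bit values are the PKCS#11 CKF_* constants.
decode_token_flags() {   # usage: decode_token_flags 0x44D
  local flags=$1 entry bit name
  for entry in 0x001:RNG 0x004:LOGIN_REQUIRED 0x008:USER_PIN_INITIALIZED \
               0x040:CLOCK_ON_TOKEN 0x400:TOKEN_INITIALIZED; do
    bit=${entry%%:*}
    name=${entry#*:}
    if (( flags & bit )); then
      echo "$name: set"
    else
      echo "$name: MISSING"
    fi
  done
}

decode_token_flags 0x44D
```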

Solution

To configure the ICA token:

  1. Find out in which slot the ICA token is available:

     pkcsconf -s
    Example output:
    Slot #1 Info
          Description: Linux <xxxxxxx> Linux (ICA)
    Use this slot number in the following commands.
  2. Initialize the ICA token:

     pkcsconf -c <slot_number> -I
  3. Change the SO password:

     pkcsconf -c <slot_number> -P
    Note: The default SO password is '87654321'.
  4. Initialize the user password:

     pkcsconf -c <slot_number> -u
  5. Change the user password:

     pkcsconf -c <slot_number> -p
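The slot lookup in step 1 can be scripted. This is a minimal sketch, assuming 'pkcsconf -s' output in the format shown above; the helper name and the sample Description line are hypothetical:

```shell
# Hypothetical helper: read "pkcsconf -s" style output on stdin and print the
# slot number of the slot whose description contains "(ICA)".
ica_slot() {
  awk '/^Slot #/  { slot = $2; gsub(/[^0-9]/, "", slot) }
       /\(ICA\)/  { print slot }'
}

# Illustrative sample; on a real system: pkcsconf -s | ica_slot
printf 'Slot #1 Info\n        Description: Linux host1 Linux (ICA)\n' | ica_slot   # prints "1"
```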

Reference

For more information about clear key cryptography, see http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100647

For more information about cryptographic hardware, see http://www.ibm.com/security/cryptocards/

For more information about openCryptoki, see http://www.ibm.com/developerworks/linux/library/s-pkcs/

Exception "opencryptoki_not_initialized"

Severity

high

Summary

openCryptoki is not initialized

Explanation

openCryptoki is not initialized, that is, the 'pkcs11_startup' script has not been called, the 'pkcsslotd' daemon has not been started, or both. Without the initialization of openCryptoki, the utilization of cryptographic hardware for PKCS#11 clear key cryptographic operations is not supported. The Linux system will emulate cryptographic operations by software. This emulation will decrease the system performance exceedingly.

Script 'pkcs11_startup' detects available tokens from installed shared object libraries and writes corresponding records to the 'pk_config_data' file. 'pkcs11_startup' should be run each time a new token has been installed or uninstalled. Daemon 'pkcsslotd' manages PKCS#11 objects, such as the ICA token, for openCryptoki. 'pkcsslotd' uses the information from the 'pk_config_data' file for token initialization.

  1. To verify whether the 'pkcs11_startup' script has been called, check
    whether the file '/var/lib/opencryptoki/pk_config_data' exists.
    If the file does not exist, run the script.
  2. To verify whether the 'pkcsslotd' daemon is running, issue:

     ps -elf | grep pkcsslotd | grep -v grep
    This command produces output only if the daemon is running.
    If no output is displayed, start the daemon.
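Both verification steps can be combined into a single script. This is a minimal sketch; the helper name is hypothetical, and an alternative configuration file path can be passed for testing:

```shell
# Hypothetical helper: report whether openCryptoki looks initialized, by
# checking for the pk_config_data file and a running pkcsslotd daemon.
check_opencryptoki() {
  local cfg=${1:-/var/lib/opencryptoki/pk_config_data}
  if [ -e "$cfg" ]; then
    echo "pk_config_data: present"
  else
    echo "pk_config_data: missing - run pkcs11_startup"
  fi
  # "[p]kcsslotd" keeps the grep process itself out of the match.
  if ps -e | grep -q '[p]kcsslotd'; then
    echo "pkcsslotd: running"
  else
    echo "pkcsslotd: not running - start the daemon"
  fi
}

check_opencryptoki
```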

Solution

To run the 'pkcs11_startup' script, issue

 pkcs11_startup

To start the 'pkcsslotd' daemon, issue

 /etc/init.d/pkcsslotd start

Reference

For more information, see the 'pkcsslotd', 'pkcs11_startup', and 'pk_config_data' man pages.

Exception "rpms_not_installed"

Severity

high

Summary

Required RPMs are not installed (&rpm_summ;)

Explanation

Required RPMs are not installed. The Linux system cannot exploit the cryptographic hardware. Some applications or libraries will emulate cryptographic operations in software instead, but this emulation will decrease the system performance exceedingly.

The following RPMs are required but not installed: &rpm;

To verify whether the required RPMs are installed, issue:

rpm -qa | grep "<RPMname>"

where <RPMname> is the name of a required RPM.

Examples:

To check for a single RPM whose name contains the string 'avahi', issue:

rpm -qa | grep "avahi"

To check for several RPMs at once, for example RPMs whose names contain the string 'avahi' or 'postfix', issue:

rpm -qa | grep -E "avahi|postfix"

Solution

Install the required RPMs by using the following command or by using the specific options available from your distribution.

rpm -ivh <RPMname> [<RPMname> ...]

where <RPMname> is the name of the required RPM.

Reference

See the man page of the 'rpm' command.

6. Health check "crypto_opencryptoki_ckc_32bit"

Component

crypto/opencryptoki

Title

Verify the availability of System z cryptographic hardware support for PKCS#11 clear key cryptographic operations

Description

Software that uses clear key cryptographic functions via openCryptoki (PKCS#11 API) can exploit System z cryptographic hardware if the openCryptoki clear key cryptographic stack is set up correctly. If the setup is incorrect or incomplete, Linux may in some cases emulate cryptographic operations by software. However, this emulation will decrease system performance.

The openCryptoki clear key cryptographic stack comprises openCryptoki together with the ICA token, the libica, possibly the System z cryptography kernel device driver and access to system cryptographic hardware features like CPACF and cryptographic adapters.

This health check verifies that:

  • The Cryptographic hardware (coprocessor and/or accelerator adapters) is available

  • Required RPMs, such as 'openCryptoki' and 'libica', are available

  • openCryptoki is initialized

  • The ICA token is configured

Dependencies

sys_platform=s390x

(sys_distro=RHEL and sys_rhel_version>=5.4) or (sys_distro=SLES and sys_sles_version>=11)

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "32bit_rpms_not_installed"

Severity

high

Summary

Required 32bit RPMs are not installed (&rpm_summ;)

Explanation

Required 32bit RPMs are not installed. The Linux system cannot exploit the cryptographic hardware. Some applications or libraries will emulate cryptographic operations in software instead, but this emulation will decrease the system performance exceedingly.

The following 32bit RPMs are required but not installed: &rpm;

To verify whether the required RPMs are installed, issue:

rpm -qa | grep "<RPMname>"

where <RPMname> is the name of a required RPM.

Examples:

To check for a single RPM whose name contains the string 'avahi', issue:

rpm -qa | grep "avahi"

To check for several RPMs at once, for example RPMs whose names contain the string 'avahi' or 'postfix', issue:

rpm -qa | grep -E "avahi|postfix"

Solution

Install the required RPMs by using the following command or by using the specific options available from your distribution.

rpm -ivh <RPMname> [<RPMname> ...]

where <RPMname> is the name of the required RPM.

Reference

See the man page of the 'rpm' command.

Exception "crypto_adapters_not_available"

Severity

medium

Summary

Required cryptographic adapters are not available (&crypto_hw;)

Explanation

Required cryptographic adapters are not available. The Linux system will emulate cryptographic operations by software. This emulation will decrease the system performance exceedingly.

The following cryptographic adapters are required but not available: &crypto_hw;

To verify whether a required cryptographic adapter is available, issue:

lszcrypt

and look for the following line in the output:

card<nn>: CEXxy
where
x denotes the series of the cryptographic adapter, such as 2 or 3.
y denotes the type of the cryptographic adapter: 'C' for Coprocessor, and 'A' for Accelerator.

Solution

If the cryptographic adapter is not attached to the Linux system, follow the procedure described in the 'Technical Guide' related to your System z. For example, for a System z server z196, refer to the 'IBM zEnterprise 196 Technical Guide'.

If the cryptographic adapter is attached but not online, you can use the 'chzcrypt' tool to set coprocessors online, see the 'Generic cryptographic device driver' chapter in 'Device Drivers, Features, and Commands'.

Reference

You can obtain Technical Guides from http://www.redbooks.ibm.com

You can obtain the 'Device Drivers, Features, and Commands' publication from http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "ica_token_not_configured"

Severity

high

Summary

The ICA token is not configured, cryptographic hardware cannot be exploited for clear key cryptography

Explanation

The ICA token is not configured. The cryptographic hardware cannot be exploited for clear key cryptographic operations. Some applications or libraries will emulate cryptographic operations, such as encryption and decryption, in software instead, but this emulation will decrease the system performance exceedingly.

To verify the state of the ICA token, issue:

 pkcsconf -t

If the ICA token is configured correctly, the following details will be displayed for the token that has the Model attribute set to 'IBM ICA':

Flags: 0x44D (RNG|LOGIN_REQUIRED|USER_PIN_INITIALIZED|CLOCK_ON_TOKEN|TOKEN_INITIALIZED)
where
       'USER_PIN_INITIALIZED' means that the user password has been changed, which is mandatory.
       'TOKEN_INITIALIZED' means that the token has been initialized and is ready for usage.
       If the flag 'TOKEN_INITIALIZED' is not displayed, you need to initialize the ICA token.
       If the flag 'SO_PIN_TO_BE_CHANGED' is displayed, the Security Officer's (SO) password still has the default value and needs to be changed.
       If the flag 'USER_PIN_TO_BE_CHANGED' is displayed, you need to change the user password.

Solution

To configure the ICA token:

  1. Find out in which slot the ICA token is available:

     pkcsconf -s
    Example output:
    Slot #1 Info
          Description: Linux <xxxxxxx> Linux (ICA)
    Use this slot number in the following commands.
  2. Initialize the ICA token:

     pkcsconf -c <slot_number> -I
  3. Change the SO password:

     pkcsconf -c <slot_number> -P
    Note: The default SO password is '87654321'.
  4. Initialize the user password:

     pkcsconf -c <slot_number> -u
  5. Change the user password:

     pkcsconf -c <slot_number> -p

Reference

For more information about clear key cryptography, see http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100647

For more information about cryptographic hardware, see http://www.ibm.com/security/cryptocards/

For more information about openCryptoki, see http://www.ibm.com/developerworks/linux/library/s-pkcs/

Exception "opencryptoki_not_initialized"

Severity

high

Summary

openCryptoki is not initialized

Explanation

openCryptoki is not initialized, that is, the 'pkcs11_startup' script has not been called, the 'pkcsslotd' daemon has not been started, or both. Without the initialization of openCryptoki, the utilization of cryptographic hardware for PKCS#11 clear key cryptographic operations is not supported. The Linux system will emulate cryptographic operations by software. This emulation will decrease the system performance exceedingly.

Script 'pkcs11_startup' detects available tokens from installed shared object libraries and writes corresponding records to the 'pk_config_data' file. 'pkcs11_startup' should be run each time a new token has been installed or uninstalled. Daemon 'pkcsslotd' manages PKCS#11 objects, such as the ICA token, for openCryptoki. 'pkcsslotd' uses the information from the 'pk_config_data' file for token initialization.

  1. To verify whether the 'pkcs11_startup' script has been called, check
    whether the file '/var/lib/opencryptoki/pk_config_data' exists.
    If the file does not exist, run the script.
  2. To verify whether the 'pkcsslotd' daemon is running, issue:

     ps -elf | grep pkcsslotd | grep -v grep
    This command produces output only if the daemon is running.
    If no output is displayed, start the daemon.

Solution

To run the 'pkcs11_startup' script, issue

 pkcs11_startup

To start the 'pkcsslotd' daemon, issue

 /etc/init.d/pkcsslotd start

Reference

For more information, see the 'pkcsslotd', 'pkcs11_startup', and 'pk_config_data' man pages.

7. Health check "crypto_opencryptoki_skc"

Component

crypto/opencryptoki

Title

Verify the availability of System z cryptographic hardware support for PKCS#11 secure key cryptographic operations

Description

Secure key cryptographic operations require a Cryptographic Coprocessor. In order to use secure key cryptography via the Public Key Cryptographic Standard 11 (PKCS#11) API, openCryptoki together with the CCA token must be installed and configured correctly.

This health check verifies that:

  • The Cryptographic Coprocessor is available

  • Required RPMs, such as 'openCryptoki' and 'csulcca', are available

  • openCryptoki is initialized

  • The CCA token is configured

Dependencies

sys_platform=s390x

(sys_distro=RHEL and sys_rhel_version>=5.4) or (sys_distro=SLES and sys_sles_version>=11)

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "cca_token_not_configured"

Severity

high

Summary

The CCA token is not configured, cryptographic hardware cannot be exploited for secure key cryptography

Explanation

The CCA token is not configured. The cryptographic hardware cannot be exploited for secure key cryptographic operations.

To verify the state of the CCA token, issue:

 pkcsconf -t

If the CCA token is configured correctly, the following details will be displayed for the token that has the Model attribute set to 'IBM CCA':

Flags: 0x44D (RNG|LOGIN_REQUIRED|USER_PIN_INITIALIZED|CLOCK_ON_TOKEN|TOKEN_INITIALIZED)
where
        'USER_PIN_INITIALIZED' means that the user password has been changed, which is mandatory.
        'TOKEN_INITIALIZED' means that the token has been initialized and is ready for usage.
        If the flag 'TOKEN_INITIALIZED' is not displayed, you need to initialize the CCA token.
        If the flag 'SO_PIN_TO_BE_CHANGED' is displayed, the Security Officer's (SO) password still has the default value and needs to be changed.
        If the flag 'USER_PIN_TO_BE_CHANGED' is displayed, you need to change the user password.

Solution

To configure the CCA token:

  1. Find out in which slot the CCA token is available:

     pkcsconf -s
    Example output:
    Slot #1 Info
          Description: Linux <xxxxxxx> Linux (CCA)
    Use this slot number in the following commands.
  2. Initialize the CCA token:

     pkcsconf -c <slot_number> -I
  3. Change the SO password:

     pkcsconf -c <slot_number> -P
    Note: The default SO password is '87654321'.
  4. Initialize the user password:

     pkcsconf -c <slot_number> -u
  5. Change the user password:

     pkcsconf -c <slot_number> -p

Reference

For more information about secure key cryptography, see http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100647

For more information about cryptographic hardware, see http://www.ibm.com/security/cryptocards/

For more information about openCryptoki, see http://www.ibm.com/developerworks/linux/library/s-pkcs/

Exception "crypto_coprocessors_not_available"

Severity

high

Summary

Required Cryptographic Coprocessor is not available

Explanation

The required Cryptographic Coprocessor is not available. Secure key cryptographic functions and true random number generation are not supported.

To verify whether the required Cryptographic Coprocessor is available, issue:

lszcrypt

and look for the following line in the output:

card<nn>: CEX?C

where the question mark (?) denotes the series of the Cryptographic Coprocessor, for example 2 or 3.

Solution

If the Cryptographic Coprocessor is not attached to the Linux system, follow the procedure described in the 'Technical Guide' related to your System z. For example, for a System z server z196, refer to the 'IBM zEnterprise 196 Technical Guide'.

If the Cryptographic Coprocessor is attached but not online, you can use the 'chzcrypt' tool to set coprocessors online. See the 'Generic cryptographic device driver' chapter in 'Device Drivers, Features, and Commands'.

Reference

You can obtain Technical Guides from http://www.redbooks.ibm.com

You can obtain the 'Device Drivers, Features, and Commands' publication from http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "opencryptoki_not_initialized"

Severity

high

Summary

openCryptoki is not initialized

Explanation

openCryptoki is not initialized, that is, the 'pkcs11_startup' script has not been called, the 'pkcsslotd' daemon has not been started, or both. Without the initialization of openCryptoki, the utilization of cryptographic hardware for PKCS#11 clear key cryptographic operations is not supported. The Linux system will emulate cryptographic operations by software. This emulation will decrease the system performance exceedingly.

Script 'pkcs11_startup' detects available tokens from installed shared object libraries and writes corresponding records to the 'pk_config_data' file. 'pkcs11_startup' should be run each time a new token has been installed or uninstalled. Daemon 'pkcsslotd' manages PKCS#11 objects, such as the ICA token, for openCryptoki. 'pkcsslotd' uses the information from the 'pk_config_data' file for token initialization.

  1. To verify if the 'pkcs11_startup' script has been called,

    check if the file '/var/lib/opencryptoki/pk_config_data' exists.
    If the file does not exist, run the script.
  2. To verify if the 'pkcsslotd' daemon is running, issue:

     ps -elf | grep pkcsslotd | grep -v grep
    This command produces output only if the daemon is running.
    If no output is displayed, start the daemon.

Solution

To run the 'pkcs11_startup' script, issue

 pkcs11_startup

To start the 'pkcsslotd' daemon, issue

 /etc/init.d/pkcsslotd start
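
The two verification steps and the corresponding fixes can be sketched as one helper. This is an assumption-laden sketch: the pk_config_data path is passed as a parameter so the logic can be exercised anywhere; on a real system pass /var/lib/opencryptoki/pk_config_data.

```shell
# opencryptoki_status: reports which openCryptoki initialization steps
# are still missing. Prints nothing when both steps are done.
opencryptoki_status() {
    cfg="$1"
    # Missing pk_config_data means pkcs11_startup was never run.
    [ -f "$cfg" ] || echo "pkcs11_startup has not been run"
    # No pkcsslotd process means the slot daemon must be started.
    pgrep pkcsslotd >/dev/null 2>&1 || echo "pkcsslotd is not running"
}

# Example: opencryptoki_status /var/lib/opencryptoki/pk_config_data
```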

Reference

For more information, see the 'pkcsslotd', 'pkcs11_startup', and 'pk_config_data' man pages.

Exception "rpms_not_installed"

Severity

high

Summary

Required RPMs are not installed (&rpm_summ;)

Explanation

Required RPMs are not installed. When required RPMs are not installed, the Linux system cannot exploit the cryptographic hardware. Some applications or libraries will emulate cryptographic operations in software instead, but this emulation will decrease the system performance exceedingly.

The following RPMs are required but not installed: &rpm;

To verify whether the required RPMs are installed, issue:

rpm -qa | grep "<RPMname>"

where <RPMname> is the name of a required RPM

Examples:

In case of a single RPM that contains the string 'avahi', issue:

rpm -qa | grep "avahi"

In case of a list of RPMs that contain the strings 'avahi' or 'postfix', issue:

rpm -qa | grep -E "avahi|postfix"
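
The verification above can be sketched as a helper that prints each required package that 'rpm -q' does not report as installed. The package names passed in are examples, not the actual required list; substitute the RPMs named in the exception message.

```shell
# missing_rpms: prints each name from the argument list that is not
# installed according to 'rpm -q'.
missing_rpms() {
    for name in "$@"; do
        rpm -q "$name" >/dev/null 2>&1 || echo "$name"
    done
}

# Example: missing_rpms openCryptoki csulcca
```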

Solution

Install the required RPMs by using the following command or by using the specific options available from your distribution.

rpm -ivh <RPMname> [<RPMname> ...]

where <RPMname> is the name of the required RPM

You can download CCA RPMs from http://www.ibm.com/security/cryptocards/pciecc/ordersoftware.shtml

Reference

See the man page of the 'rpm' command.

8. Health check "crypto_opencryptoki_skc_32bit" (back to top)

Component

crypto/opencryptoki

Title

Verify the availability of System z cryptographic hardware support for PKCS#11 secure key cryptographic operations

Description

Secure key cryptographic operations require a Cryptographic Coprocessor. In order to use secure key cryptography via the Public Key Cryptographic Standard 11 (PKCS#11) API, openCryptoki together with the CCA token must be installed and configured correctly.

This health check verifies that:

  • The Cryptographic Coprocessor is available

  • Required RPMs, such as 'openCryptoki' and 'csulcca', are available

  • openCryptoki is initialized

  • The CCA token is configured

Dependencies

sys_platform=s390x

(sys_distro=RHEL and sys_rhel_version>=5.4) or (sys_distro=SLES and sys_sles_version>=11)

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "32bit_rpms_not_installed"

Severity

high

Summary

Required 32bit RPMs are not installed (&rpm_summ;)

Explanation

Required 32bit RPMs are not installed. When required RPMs are not installed, the Linux system cannot exploit the cryptographic hardware. Some applications or libraries will emulate cryptographic operations in software instead, but this emulation will decrease the system performance exceedingly.

The following 32bit RPMs are required but not installed: &rpm;

To verify whether the required RPMs are installed, issue:

rpm -qa | grep "<RPMname>"

where <RPMname> is the name of a required RPM

Examples:

In case of a single RPM that contains the string 'avahi', issue:

rpm -qa | grep "avahi"

In case of a list of RPMs that contain the strings 'avahi' or 'postfix', issue:

rpm -qa | grep -E "avahi|postfix"

Solution

Install the required RPMs by using the following command or by using the specific options available from your distribution.

rpm -ivh <RPMname> [<RPMname> ...]

where <RPMname> is the name of the required RPM

You can download CCA RPMs from http://www.ibm.com/security/cryptocards/pciecc/ordersoftware.shtml

Reference

See the man page of the 'rpm' command.

Exception "cca_token_not_configured"

Severity

high

Summary

The CCA token is not configured, cryptographic hardware cannot be exploited for secure key cryptography

Explanation

The CCA token is not configured. The cryptographic hardware cannot be exploited for secure key cryptographic operations.

To verify the state of the CCA token, issue:

 pkcsconf -t
If the CCA token is configured correctly, the following details will be displayed for the token that has the Model attribute set to 'IBM CCA':
Flags: 0x44D (RNG|LOGIN_REQUIRED|USER_PIN_INITIALIZED|CLOCK_ON_TOKEN|TOKEN_INITIALIZED)
where
        'USER_PIN_INITIALIZED' means that the user password has been changed, which is mandatory.
        'TOKEN_INITIALIZED' means that the token has been initialized and is ready for usage.
        If the flag 'TOKEN_INITIALIZED' is not displayed, you need to initialize the CCA token.
        If the flag 'SO_PIN_TO_BE_CHANGED' is displayed, the Security Officer's (SO) password still has the default value and needs to be changed.
        If the flag 'USER_PIN_TO_BE_CHANGED' is displayed, you need to change the user password.

Solution

To configure the CCA token:

  1. Find out in which slot the CCA token is available:

     pkcsconf -s
    Example output:
    Slot #1 Info
          Description: Linux <xxxxxxx> Linux (CCA)
    Use this slot number in the following commands.
  2. Initialize the CCA token:

     pkcsconf -c <slot_number> -I
  3. Change the SO password:

     pkcsconf -c <slot_number> -P

    Note: The default SO password is '87654321'.

  4. Initialize the user password:

     pkcsconf -c <slot_number> -u
  5. Change the user password:

     pkcsconf -c <slot_number> -p

Reference

For more information about secure key cryptography, see http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100647

For more information about cryptographic hardware, see http://www.ibm.com/security/cryptocards/

For more information about openCryptoki, see http://www.ibm.com/developerworks/linux/library/s-pkcs/

Exception "crypto_coprocessors_not_available"

Severity

high

Summary

Required Cryptographic Coprocessor is not available

Explanation

The required Cryptographic Coprocessor is not available. Secure key cryptographic functions and true random number generation are not supported.

To verify whether the required Cryptographic Coprocessor is available, issue:

lszcrypt

and look for the following line in the output:

card<nn>: CEX?C

where the question mark (?) denotes the series of the Cryptographic Coprocessor, for example 2 or 3.

Solution

If the Cryptographic Coprocessor is not attached to the Linux system, follow the procedure described in the 'Technical Guide' related to your System z. For example, for a System z server z196, refer to the 'IBM zEnterprise 196 Technical Guide'.

If the Cryptographic Coprocessor is attached but not online, you can use the 'chzcrypt' tool to set coprocessors online. See the 'Generic cryptographic device driver' chapter in 'Device Drivers, Features, and Commands'.

Reference

You can obtain Technical Guides from http://www.redbooks.ibm.com

You can obtain the 'Device Drivers, Features, and Commands' publication from http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "opencryptoki_not_initialized"

Severity

high

Summary

openCryptoki is not initialized

Explanation

openCryptoki is not initialized, that is, the 'pkcs11_startup' script has not been called, the 'pkcsslotd' daemon has not been started, or both. Without the initialization of openCryptoki, the utilization of cryptographic hardware for PKCS#11 clear key cryptographic operations is not supported. The Linux system will emulate cryptographic operations by software. This emulation will decrease the system performance exceedingly.

Script 'pkcs11_startup' detects available tokens from installed shared object libraries and writes corresponding records to the 'pk_config_data' file. 'pkcs11_startup' should be run each time a new token has been installed or uninstalled. Daemon 'pkcsslotd' manages PKCS#11 objects, such as the ICA token, for openCryptoki. 'pkcsslotd' uses the information from the 'pk_config_data' file for token initialization.

  1. To verify if the 'pkcs11_startup' script has been called,

    check if the file '/var/lib/opencryptoki/pk_config_data' exists.
    If the file does not exist, run the script.
  2. To verify if the 'pkcsslotd' daemon is running, issue:

     ps -elf | grep pkcsslotd | grep -v grep
    This command produces output only if the daemon is running.
    If no output is displayed, start the daemon.

Solution

To run the 'pkcs11_startup' script, issue

 pkcs11_startup

To start the 'pkcsslotd' daemon, issue

 /etc/init.d/pkcsslotd start

Reference

For more information, see the 'pkcsslotd', 'pkcs11_startup', and 'pk_config_data' man pages.

9. Health check "crypto_openssl_ibmca_config" (back to top)

Component

crypto/openssl

Title

Check whether the path to the OpenSSL library is configured correctly

Description

If the libibmca.so path is not specified correctly in the openssl.cnf configuration file, "ssh" commands fail. An incorrect specification can also prevent logins to Linux.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "so_file_path_not_correct"

Severity

high

Summary

The path to libibmca.so in &openssl_cnf_path; is not correct

Explanation

The &openssl_cnf_path; configuration file specifies the path to the libibmca.so library in the Linux file system. The libibmca.so library is not available at the specified location.

The current specification is:

dynamic_path  =  &libibmca_so_file_path_in_config_file;

The specification should be:

dynamic_path  =  &libibmca_so_file_path;

Solution

Open &openssl_cnf_path; with a text editor. Find the following line:

dynamic_path  =  &libibmca_so_file_path_in_config_file;

Change this line to:

dynamic_path  =  &libibmca_so_file_path;
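
The edit above can be sketched as a one-line rewrite of the dynamic_path entry. Both arguments are placeholders: pass the openssl.cnf location and the library path reported by the health check. GNU sed is assumed for 'sed -i'.

```shell
# fix_dynamic_path: rewrite the dynamic_path entry of an OpenSSL
# configuration file in place.
fix_dynamic_path() {
    cfg="$1"
    newpath="$2"
    # Replace the whole dynamic_path line with the corrected value.
    sed -i "s|^dynamic_path.*|dynamic_path = $newpath|" "$cfg"
}

# Example (hypothetical paths):
#   fix_dynamic_path /etc/ssl/openssl.cnf /usr/lib64/openssl/engines/libibmca.so
```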

Reference

None.

10. Health check "crypto_openssl_stack" (back to top)

Component

crypto/openssl

Title

Verify the availability of System z cryptographic hardware support through an OpenSSL stack

Description

The applications using cryptographic operations by linking to OpenSSL libraries can exploit System z cryptographic hardware only if the OpenSSL stack is configured correctly.

The following cryptographic hardware can be exploited if available:

  • CPACF instructions in the CPU

  • Cryptographic Accelerator adapters

  • Cryptographic Coprocessor adapters

Prerequisites for a well configured OpenSSL stack that uses System z cryptographic hardware are:

  • The enablement of CPACF and a device driver to exploit cryptographic adapters (if adapters are available)

  • The libica library

  • The openssl-ibmca engine for OpenSSL being installed and configured

Configuring the OpenSSL stack to exploit System z cryptographic hardware accelerates applications using cryptographic functions and offloads CPU cycles to cryptographic adapters. The availability of cryptographic adapters is optional because libica provides a software fallback for the functions provided by the adapters.

This health check verifies that:

  • The Cryptographic Coprocessor or Accelerator is available

  • Required RPMs, such as 'openssl', 'openssl-ibmca', and 'libica', are available

  • OpenSSL is configured with the 'ibmca' engine

Dependencies

sys_platform=s390x

(sys_distro=RHEL and sys_rhel_version>=5.4) or (sys_distro=SLES and sys_sles_version>=11)

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "crypto_adapters_not_available"

Severity

medium

Summary

Required cryptographic adapters are not available (&crypto_hw;)

Explanation

Required cryptographic adapters are not available. The Linux system will emulate cryptographic operations by software. This emulation will decrease the system performance exceedingly.

The following cryptographic adapters are required but not available: &crypto_hw;

To verify whether a required cryptographic adapter is available, issue:

lszcrypt

and look for the following line in the output:

card<nn>: CEXxy
where
x denotes the series of the cryptographic adapter, such as 2 or 3
y denotes the type of the cryptographic adapter: 'C' for Coprocessor, and 'A' for Accelerator.

Solution

If the cryptographic adapter is not attached to the Linux system, follow the procedure described in the 'Technical Guide' related to your System z. For example, for a System z server z196, refer to the 'IBM zEnterprise 196 Technical Guide'.

If the cryptographic adapter is attached but not online, you can use the 'chzcrypt' tool to set coprocessors online. See the 'Generic cryptographic device driver' chapter in 'Device Drivers, Features, and Commands'.

Reference

You can obtain Technical Guides from http://www.redbooks.ibm.com

You can obtain the 'Device Drivers, Features, and Commands' publication from http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "ibmca_not_configured"

Severity

high

Summary

OpenSSL is not configured with the 'ibmca' engine

Explanation

OpenSSL is not configured with the 'ibmca' engine. The System z cryptographic hardware cannot be exploited. Cryptographic operations, such as encryption and decryption, will be emulated by software. This emulation will decrease the system performance exceedingly.

To verify the OpenSSL configuration, issue:

 openssl engine -c

If OpenSSL is configured with the 'ibmca' engine, details related to 'ibmca' will be displayed.

Solution

Configure the openssl.cnf file with the 'ibmca' engine. To know where the openssl.cnf file is located, issue:

 rpm -ql openssl | grep openssl.cnf

You will see a different path for each distribution. If you see two entries for the same file, one is for the 32-bit RPM.

To retrieve the required data for enabling the 'ibmca' engine, issue:

 rpm -ql openssl-ibmca

A list of openssl-ibmca files is displayed.

Open the sample configuration file 'openssl.cnf.sample-s390x' and verify if the 'libibmca.so' path is correct. Copy the contents of the sample file into the 'openssl.cnf' file.
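
The verification from the Explanation can be sketched as a helper that scans 'openssl engine -c' output for the 'ibmca' entry. It assumes engine IDs appear in parentheses at the start of a line, which is the usual 'openssl engine' listing format; the function reads stdin so it can also be checked against captured output.

```shell
# ibmca_configured: reads 'openssl engine -c' output on stdin and
# reports whether the 'ibmca' engine is listed.
ibmca_configured() {
    if grep -q '^(ibmca)'; then
        echo "ibmca engine configured"
    else
        echo "ibmca engine missing"
    fi
}

# On a live system: openssl engine -c | ibmca_configured
```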

Reference

See the man page of the 'openssl' and 'rpm' commands and the README file of the openssl-ibmca package.

Exception "rpms_not_installed"

Severity

high

Summary

Required RPMs are not installed (&rpm_summ;)

Explanation

Required RPMs are not installed. The Linux system cannot exploit the cryptographic hardware. Some applications or libraries will emulate cryptographic operations in software instead, but this emulation will decrease the system performance exceedingly.

The following RPMs are required but not installed: &rpm;

To verify whether the required RPMs are installed, issue:

rpm -qa | grep "<RPMname>"

where <RPMname> is the name of a required RPM

Examples:

In case of a single RPM that contains the string 'avahi', issue:

rpm -qa | grep "avahi"

In case of a list of RPMs that contain the strings 'avahi' or 'postfix', issue:

rpm -qa | grep -E "avahi|postfix"

Solution

Install the required RPMs by using the following command or by using the specific options available from your distribution.

rpm -ivh <RPMname> [<RPMname> ...]

where <RPMname> is the name of the required RPM

Reference

See the man page of the 'rpm' command.

11. Health check "crypto_openssl_stack_32bit" (back to top)

Component

crypto/openssl

Title

Verify the availability of System z cryptographic hardware support through an OpenSSL stack

Description

The applications using cryptographic operations by linking to OpenSSL libraries can exploit System z cryptographic hardware only if the OpenSSL stack is configured correctly.

The following cryptographic hardware can be exploited if available:

  • CPACF instructions in the CPU

  • Cryptographic Accelerator adapters

  • Cryptographic Coprocessor adapters

Prerequisites for a well configured OpenSSL stack that uses System z cryptographic hardware are:

  • The enablement of CPACF and a device driver to exploit cryptographic adapters (if adapters are available)

  • The libica library

  • The openssl-ibmca engine for OpenSSL being installed and configured

Configuring the OpenSSL stack to exploit System z cryptographic hardware accelerates applications using cryptographic functions and offloads CPU cycles to cryptographic adapters. The availability of cryptographic adapters is optional because libica provides a software fallback for the functions provided by the adapters.

This health check verifies that:

  • The Cryptographic Coprocessor or Accelerator is available

  • Required RPMs, such as 'openssl', 'openssl-ibmca', and 'libica', are available

  • OpenSSL is configured with the 'ibmca' engine

Dependencies

sys_platform=s390x

(sys_distro=RHEL and sys_rhel_version>=5.4) or (sys_distro=SLES and sys_sles_version>=11)

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "32bit_rpms_not_installed"

Severity

high

Summary

Required 32bit RPMs are not installed (&rpm_summ;)

Explanation

Required 32bit RPMs are not installed. The Linux system cannot exploit the cryptographic hardware. Some applications or libraries will emulate cryptographic operations in software instead, but this emulation will decrease the system performance exceedingly.

The following 32bit RPMs are required but not installed: &rpm;

To verify whether the required RPMs are installed, issue:

rpm -qa | grep "<RPMname>"

where <RPMname> is the name of a required RPM

Examples:

In case of a single RPM that contains the string 'avahi', issue:

rpm -qa | grep "avahi"

In case of a list of RPMs that contain the strings 'avahi' or 'postfix', issue:

rpm -qa | grep -E "avahi|postfix"

Solution

Install the required RPMs by using the following command or by using the specific options available from your distribution.

rpm -ivh <RPMname> [<RPMname> ...]

where <RPMname> is the name of the required RPM

Reference

See the man page of the 'rpm' command.

Exception "crypto_adapters_not_available"

Severity

medium

Summary

Required cryptographic adapters are not available (&crypto_hw;)

Explanation

Required cryptographic adapters are not available. The Linux system will emulate cryptographic operations by software. This emulation will decrease the system performance exceedingly.

The following cryptographic adapters are required but not available: &crypto_hw;

To verify whether a required cryptographic adapter is available, issue:

lszcrypt

and look for the following line in the output:

card<nn>: CEXxy
where
x denotes the series of the cryptographic adapter, such as 2 or 3
y denotes the type of the cryptographic adapter: 'C' for Coprocessor, and 'A' for Accelerator.

Solution

If the cryptographic adapter is not attached to the Linux system, follow the procedure described in the 'Technical Guide' related to your System z. For example, for a System z server z196, refer to the 'IBM zEnterprise 196 Technical Guide'.

If the cryptographic adapter is attached but not online, you can use the 'chzcrypt' tool to set coprocessors online. See the 'Generic cryptographic device driver' chapter in 'Device Drivers, Features, and Commands'.

Reference

You can obtain Technical Guides from http://www.redbooks.ibm.com

You can obtain the 'Device Drivers, Features, and Commands' publication from http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "ibmca_not_configured"

Severity

high

Summary

OpenSSL is not configured with the 'ibmca' engine

Explanation

OpenSSL is not configured with the 'ibmca' engine. The System z cryptographic hardware cannot be exploited. Cryptographic operations, such as encryption and decryption, will be emulated by software. This emulation will decrease the system performance exceedingly.

To verify the OpenSSL configuration, issue:

 openssl engine -c

If OpenSSL is configured with the 'ibmca' engine, details related to 'ibmca' will be displayed.

Solution

Configure the openssl.cnf file with the 'ibmca' engine. To know where the openssl.cnf file is located, issue:

 rpm -ql openssl | grep openssl.cnf

You will see a different path for each distribution. If you see two entries for the same file, one is for the 32-bit RPM.

To retrieve the required data for enabling the 'ibmca' engine, issue:

 rpm -ql openssl-ibmca

A list of openssl-ibmca files is displayed.

Open the sample configuration file 'openssl.cnf.sample-s390' and verify if the 'libibmca.so' path is correct. Copy the contents of the sample file into the 'openssl.cnf' file.

Reference

See the man page of the 'openssl' and 'rpm' commands and the README file of the openssl-ibmca package.

12. Health check "crypto_z_module_loaded" (back to top)

Component

crypto/zmodule

Title

Confirm that the System z cryptography kernel module is loaded

Description

Loading the System z cryptography kernel module (named 'z90crypt', 'zcrypt_pcixcc', or 'zcrypt_cex2a') is required to exploit cryptographic adapters. This check verifies that the kernel module is loaded.

The System z cryptography kernel module is a mandatory prerequisite for cryptographic operations in the following contexts:

  • the secure key openCryptoki (PKCS#11) software stack (see also checks crypto_opencryptoki_skc and crypto_opencryptoki_skc_32bit)

  • the Common Cryptographic Architecture (CCA) software stack (see also check crypto_cca_stack)

In addition, the module is a prerequisite for accelerating and off-loading RSA operations in the following contexts:

  • the OpenSSL software stack (see also checks crypto_openssl_stack and crypto_openssl_stack_32bit)

  • the clear key openCryptoki (PKCS#11) software stack (see also checks crypto_opencryptoki_ckc and crypto_opencryptoki_ckc_32bit)

In these contexts RSA operations will be computed in software if the cryptographic kernel module is not available.

Finally, loading the kernel module is also required to implement true random number generation based on cryptographic adapter hardware.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Peter Oberparleiter <peter.oberparleiter@de.ibm.com>

Exception "module_not_loaded"

Severity

medium

Summary

System z cryptography kernel module is not loaded

Explanation

The System z cryptography kernel module named 'z90crypt' or 'zcrypt_pcixcc' or 'zcrypt_cex2a' is not loaded. When this kernel module is not loaded, the Linux system cannot exploit cryptographic adapters.

Secure key cryptographic functions and true random number generation are only supported if the system has access to a cryptographic coprocessor adapter. RSA operations are supported by both coprocessor and accelerator adapters. Some applications or libraries will emulate clear key RSA operations in software if no cryptographic adapters are accessible, but this emulation will decrease the system performance exceedingly.

While the kernel module is not loaded, the following health checks are also not applicable:

  • crypto_cca_stack

  • crypto_opencryptoki_skc and crypto_opencryptoki_skc_32bit

In addition, acceleration and off-loading of clear key RSA operations is not possible. This affects the following checks:

  • crypto_openssl_stack and crypto_openssl_stack_32bit

  • crypto_opencryptoki_ckc and crypto_opencryptoki_ckc_32bit

To verify whether System z cryptography kernel module is loaded, issue:

# lsmod | grep "<Module name>"

where <Module name> is the name of the System z cryptography kernel module.

Solution

To use secure key cryptography you need access to a cryptographic coprocessor. If the system has access to one or more cryptographic adapters, load the System z cryptography kernel module using the command:

# modprobe "<Module name>"

where <Module name> is the name of the System z cryptography kernel module. Note that loading the kernel module may result in an error message when there are no cryptographic adapters installed.

For more information about various kernel module parameters, see the 'Generic cryptographic device driver' chapter in 'Device Drivers, Features, and Commands'.

If there are no cryptographic adapters installed on your system, set the status of this check to inactive.
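
The verification and solution above can be sketched as one helper. It parses 'lsmod' output read from stdin, so the pattern can be checked against captured text; the modprobe step in the usage comment requires root.

```shell
# zcrypt_loaded: succeeds if any of the System z cryptography kernel
# modules appears in 'lsmod' output read from stdin.
zcrypt_loaded() {
    grep -Eq '^(z90crypt|zcrypt_pcixcc|zcrypt_cex2a) '
}

# On a live system, load z90crypt only when no module is loaded yet:
#   lsmod | zcrypt_loaded || modprobe z90crypt
```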

Reference

See the man page of the 'modprobe' command, and 'Device Drivers, Features, and Commands'.

You can obtain this publication from http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

13. Health check "css_ccw_blacklist" (back to top)

Component

css/ccw

Title

Identify I/O devices that are in use although they are on the exclusion list

Description

The I/O device exclusion list prevents Linux from sensing and analyzing I/O devices that are available to Linux but not required.

An initial exclusion list can be included in the boot configuration using the "cio_ignore" kernel parameter. On a running Linux instance, the list can be changed temporarily through the /proc/cio_ignore procfs interface or with the "cio_ignore" command. Rebooting restores the exclusion list of the boot configuration.

I/O devices that are in use (online) might be required and should then not be on the exclusion list. If these devices become unavailable and reappear after some time, they are ignored and remain unavailable to Linux. If they are added to the cio_ignore parameter in the boot configuration, they will also be unavailable after rebooting Linux.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "online_devices_ignored"

Severity

medium

Summary

The following I/O devices are in use although they are on the exclusion list: &sum_online_ignored;

Explanation

Some of the I/O devices on the exclusion list are in use (online). The exclusion list should only contain devices that are not required by the Linux instance.

If any device that is on the current exclusion list becomes unavailable and reappears after some time, it is ignored and remains unavailable to Linux. If a device is added to the cio_ignore parameter in the boot configuration, it will also be unavailable after rebooting Linux.

The I/O devices with the following bus IDs are both in use and on the exclusion list:

&online_ignored;

To display the current exclusion list, issue:

cat /proc/cio_ignore

Use the "lscss" command to investigate which I/O devices are in use. For unused devices the "Use" column is blank; for online devices this column contains the value "yes".

Solution

Verify that all I/O devices on the exclusion list are not needed by your Linux instance and are excluded intentionally.

Remove from the exclusion list any I/O devices that are on it by mistake. For example, use the "cio_ignore" command if your distribution provides it. Alternatively, you can issue a command like this:

echo free <device_bus_id> > /proc/cio_ignore

Be sure not to add required devices to the "cio_ignore" kernel parameter in the boot configuration.
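
The 'free' command shown above can be sketched for a list of devices. This is a sketch: the target path is passed as a parameter so the request format can be checked against a regular file; on a real system pass /proc/cio_ignore (requires root).

```shell
# free_devices: remove device bus IDs from the exclusion list by
# writing one 'free' request per ID to the given target.
free_devices() {
    target="$1"
    shift
    for id in "$@"; do
        echo "free $id" > "$target"
    done
}

# Example (requires root): free_devices /proc/cio_ignore 0.0.0190 0.0.0191
```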

Reference

For more information about the I/O device exclusion list, see the section about the cio_ignore kernel parameter in "Device Drivers, Features, and Commands". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Also see the "lscss" and "cio_ignore" man pages.

14. Health check "css_ccw_chpid_status" (back to top)

Component

css/ccw

Title

Check for CHPIDs that are not available

Description

Unavailable CHPIDs can cause I/O stalls and errors and might result in required I/O devices that are not visible within Linux. This check analyzes sysfs status information to identify CHPIDs that are unavailable because of a "configure standby" or a "vary offline" operation. These operations are commonly performed as part of hardware maintenance procedures and need to be reverted after maintenance has finished.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Peter Oberparleiter <peter.oberparleiter@de.ibm.com>

Exception "unused_cfg_off"

Severity

low

Summary

One or more CHPIDs are in the "standby" configuration state (&unused_cfg_summary;)

Explanation

One or more Channel-Path IDs (CHPIDs) are in the "standby" configuration state. While in the "standby" configuration state, CHPIDs are not used for I/O, and devices that are connected through a CHPID in this configuration state might not be visible to Linux. CHPIDs are usually put into this configuration state during maintenance of the attached I/O hardware. Operational CHPIDs have the configuration state "configured".

The following CHPIDs are in the "standby" configuration state:

&unused_cfg_list;

Use the "lschp" command to investigate the configuration state of your CHPIDs. CHPIDs with the "standby" configuration state have the value 0 in the "Cfg." column of the command output.

Solution

If a CHPID has the configuration state "standby" but the devices attached through the CHPID are ready for use, you can use several methods to return the configuration state to "configured".

From the Linux command line:

chchp -c 1 0.<chpid>

For Linux on z/VM, from z/VM CP:

VARY ONLINE CHPID <chpid>

For Linux in LPAR mode:

Use the "Configure Channel Path On/Off" task of the Hardware Management Console to change the configuration state of a CHPID from "standby" to "configured".

Reference

For more information about the configuration state of a CHPID see:

  • The man page of the "lschp" command

  • The man page of the "chchp" command

  • The section about "chchp" command in "Device Drivers, Features, and Commands"; you can find this publication at

    http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

  • "z/VM: CP Commands and Utilities Reference"; you can find this publication at

    http://www.ibm.com/vm/library

  • The applicable "Hardware Management Console Operations Guide"; you can obtain this publication from IBM Resource Link at

    http://www.ibm.com/servers/resourcelink

Exception "unused_vary_off"

Severity

low

Summary

One or more CHPIDs are logically offline (&unused_vary_summary;)

Explanation

One or more Channel-Path IDs (CHPIDs) are varied offline within Linux. Such CHPIDs are logically offline to Linux, that is, even if the CHPID is operational on the mainframe hardware, Linux does not use it for I/O. Devices that are connected through a CHPID that is logically offline might not be visible to Linux. CHPIDs are usually varied offline during maintenance of the attached I/O hardware. For regular operations, CHPIDs are varied online.

The following CHPIDs are varied offline:

&unused_vary_list;

Use the "lschp" command to investigate the logical state of your CHPIDs. CHPIDs that are varied offline in Linux have the value 0 in the "Vary" column of the command output.

Solution

If a CHPID has been varied offline within Linux but the devices attached through the CHPID are ready for use, you can vary the CHPID back online with a command like this:

chchp -v 1 0.<chpid>

Reference

For more information about the logical state of a CHPID see:

  • The man page of the "lschp" command

  • The man page of the "chchp" command

  • The section about the "chchp" command in "Device Drivers, Features, and Commands"; you can find this publication at

    http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "used_cfg_off"

Severity

high

Summary

One or more online I/O devices are connected through CHPIDs that are in the "standby" configuration state (&used_cfg_dev_summary;)

Explanation

One or more I/O devices are connected through at least one Channel-Path ID (CHPID) that is in the "standby" configuration state. While in this configuration state, CHPIDs are not used for I/O. As a result, load balancing cannot use all installed channel paths to the device, and performance is degraded.

Also, if further CHPIDs become unavailable, the connection to the device might be lost completely, resulting in I/O stalls and errors. CHPIDs are usually put into the "standby" configuration state during maintenance of the attached I/O hardware. Operational CHPIDs have the configuration state "configured".

The following devices are online and are connected through at least one CHPID that is in the "standby" configuration state:

&used_cfg_dev_list;

Use the "lscss" command to identify I/O devices with unavailable CHPIDs. In the command output there is a row for each device. If the values in the columns "PIM" and "PAM" differ, one or more channel paths to the device are unavailable.
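As a minimal sketch, the PIM/PAM comparison can be scripted with "awk". The sample text below stands in for real "lscss" output; note that this simple field-based parse assumes that each row shows "yes" in the "Use" column (a blank "Use" column would shift the fields):

```shell
# Print devices that have one or more unavailable channel paths,
# that is, rows where the PIM and PAM columns differ.
lscss_output='Device   Subchan.  DevType CU Type Use  PIM PAM POM  CHPIDs
----------------------------------------------------------------------
0.0.0100 0.0.0000  3390/0c 3990/e9 yes  f0  f0  ff   10111213 00000000
0.0.0200 0.0.0001  3390/0c 3990/e9 yes  f0  e0  ff   10111213 00000000'

# Skip the two header lines; PIM is field 6 and PAM is field 7.
printf '%s\n' "$lscss_output" | awk 'NR > 2 && $6 != $7 { print $1 }'
```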

The following CHPIDs are in the "standby" configuration state:

&used_cfg_chp_list;

Use the "lschp" command to investigate the configuration state of your CHPIDs. CHPIDs with the "standby" configuration state have the value 0 in the "Cfg." column of the command output.

Solution

When an affected CHPID is ready for use, you can use several methods to return its configuration state to "configured".

From the Linux command line:

chchp -c 1 0.<chpid>

For Linux on z/VM, from z/VM CP:

VARY ONLINE CHPID <chpid>

For Linux in LPAR mode:

Use the "Configure Channel Path On/Off" task of the Hardware Management Console to change the configuration state of a CHPID from "standby" to "configured".

Reference

For more information about the configuration state of a CHPID see:

  • The man page of the "lschp" command

  • The man page of the "lscss" command

  • The man page of the "chchp" command

  • The section about the "chchp" command in "Device Drivers, Features, and Commands"; you can find this publication at

    http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

  • "z/VM: CP Commands and Utilities Reference"; you can find this publication at

    http://www.ibm.com/vm/library

  • The applicable "Hardware Management Console Operations Guide"; you can obtain this publication from IBM Resource Link at

    http://www.ibm.com/servers/resourcelink

Exception "used_vary_off"

Severity

high

Summary

One or more online I/O devices are connected through CHPIDs that are logically offline (&used_vary_dev_summary;)

Explanation

One or more I/O devices are connected through at least one Channel-Path ID (CHPID) that is varied offline within Linux. Such CHPIDs are logically offline to Linux, that is, even if the CHPID is operational on the mainframe hardware, Linux does not use it for I/O. As a result, load balancing cannot use all installed channel paths to the device, and performance is degraded. Also, if further CHPIDs become unavailable, the connection to the device might be lost completely, resulting in I/O stalls and errors. CHPIDs are usually varied offline during maintenance of the attached I/O hardware. For regular operations, CHPIDs are varied online.

The following devices are online and are connected through at least one CHPID that is varied offline:

&used_vary_dev_list;

The following CHPIDs are varied offline:

&used_vary_chp_list;

To confirm that an online device is connected through one or more CHPIDs that have been varied offline, first use the "lscss" command to find out which CHPIDs connect the device. Then use the "lschp" command to see which of these CHPIDs have been varied offline. CHPIDs that are varied offline in Linux have the value 0 in the "Vary" column of the command output.

Solution

When an affected CHPID is ready for use, you can vary it back online with a command like this:

chchp -v 1 0.<chpid>

Reference

For more information about the logical state of a CHPID see:

  • The man page of the "lschp" command

  • The man page of the "lscss" command

  • The man page of the "chchp" command

  • The section about the "chchp" command in "Device Drivers, Features, and Commands"; you can find this publication at

    http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

  • "z/VM: CP Commands and Utilities Reference"; you can find this publication at

    http://www.ibm.com/vm/library

  • The applicable "Hardware Management Console Operations Guide"; you can obtain this publication from IBM Resource Link at

    http://www.ibm.com/servers/resourcelink

15. Health check "css_ccw_device_availability" (back to top)

Component

css/ccw

Title

Identify unusable I/O devices

Description

This check examines sysfs information to identify I/O devices for which the availability status indicates that they cannot be used.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "unusable_device"

Severity

high

Summary

There are unusable I/O devices (&devices_list;)

Explanation

Some I/O devices have an availability status other than "good". Such devices cannot be used for I/O.

The following devices are unusable:

&all_devices;

Use the following command to list your I/O devices with their availability status:

# lscss --avail

If the "lscss" command is not available, read the sysfs availability attribute of each device to check the availability status:

# cat /sys/bus/ccw/devices/<device_bus_id>/availability
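The sysfs read can be wrapped in a small loop. In the following sketch, a temporary directory tree simulates two CCW devices so that the loop can be tried anywhere; on a real system, set SYSFS_CCW to /sys/bus/ccw/devices and remove the simulation lines:

```shell
# Report CCW devices whose "availability" attribute is not "good".
SYSFS_CCW=$(mktemp -d)                       # simulated sysfs tree
mkdir -p "$SYSFS_CCW/0.0.0100" "$SYSFS_CCW/0.0.0200"
echo good > "$SYSFS_CCW/0.0.0100/availability"
echo no_device > "$SYSFS_CCW/0.0.0200/availability"

for dev in "$SYSFS_CCW"/*; do
    status=$(cat "$dev/availability")
    if [ "$status" != "good" ]; then
        echo "$(basename "$dev"): $status"
    fi
done
```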

Solution

  • For DASD, check whether other systems hold device reservations for the affected devices.

  • Confirm that the physical connections to the affected devices are in place and secure.

  • Run the "lscss" command to check if the channel paths to the affected devices are available. In the command output there is a row for each device. If the values in the columns "PIM" and "PAM" differ, one or more channel paths to the device are unavailable.

Reference

For more information about the "availability" status, see the section about common sysfs attributes for CCW devices in "Device Drivers, Features, and Commands". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Also see the "lscss" man page.

16. Health check "css_ccw_device_usage" (back to top)

Component

css/ccw

Title

Check for an excessive number of unused I/O devices

Description

Even when they are unused (offline), I/O devices consume memory and CPU time both during the boot process and when I/O configuration changes occur on a running system. In particular, when new I/O devices or I/O paths become available or when existing I/O devices or I/O paths become unavailable, resources are wasted on unused I/O devices.

This check uses the "lscss" command to identify unused I/O devices.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Parameter "device_print_limit"

Description

Threshold for the absolute number of unused (offline) I/O devices. If the number of unused I/O devices exceeds this threshold, an exception is issued. Valid values are positive integers.

Default value

5

Parameter "ratio_limit"

Description

Threshold for the percentage of unused (offline) I/O devices. If this threshold is exceeded, an exception is issued. Valid values are integers in the range 1 to 100.

Default value

50

Exception "many_unused_devices"

Severity

low

Summary

Of &total_devices; I/O devices, &offline_devices; (&ratio;%) are unused

Explanation

The number of unused (offline) I/O devices, &offline_devices; (&ratio;%) of a total of &total_devices;, exceeds the specified threshold.

During the boot process, Linux senses and analyzes all available I/O devices, including unused devices. Therefore, unused devices unnecessarily consume memory and CPU time. Similarly, memory and CPU resources are wasted for unused I/O devices when new I/O devices or I/O paths become available or when existing I/O devices or I/O paths become unavailable.

Use the "lscss" command to investigate which I/O devices are unused. For unused devices the "Use" column is blank; for online devices this column contains the value "yes".

Solution

Use the "cio_ignore" feature to exclude I/O devices that you do not need from being sensed and analyzed. Be sure not to inadvertently exclude required devices.

To exclude devices, you can use the "cio_ignore" kernel parameter or a command like this:

echo "add <device_bus_id>" > /proc/cio_ignore

where <device_bus_id> is the bus ID of an I/O device to be excluded.

If your distribution includes the "cio_ignore" command, you can also use this command to exclude I/O devices from being sensed and analyzed.
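As a minimal sketch, the ratio of unused devices can be derived from "lscss" output. The sample text below is illustrative; an offline device shows a blank "Use" column:

```shell
lscss_output='Device   Subchan.  DevType CU Type Use  PIM PAM POM  CHPIDs
----------------------------------------------------------------------
0.0.0100 0.0.0000  3390/0c 3990/e9 yes  f0  f0  ff   10111213 00000000
0.0.0300 0.0.0002  3390/0c 3990/e9      f0  f0  ff   10111213 00000000'

# Count all device rows, then the rows marked as in use ("yes").
total=$(printf '%s\n' "$lscss_output" | tail -n +3 | grep -c '')
used=$(printf '%s\n' "$lscss_output" | tail -n +3 | grep -c ' yes ')
echo "unused: $((total - used)) of $total"
```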

Reference

For more information about the "cio_ignore" feature, see the section about the "cio_ignore" kernel parameter in "Device Drivers, Features, and Commands". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Also see the "lscss" and "cio_ignore" man pages.

17. Health check "css_ccw_driver_association" (back to top)

Component

css/ccw

Title

Identify I/O devices that are not associated with a device driver

Description

When an I/O device is sensed, the associated device driver should automatically be loaded. I/O devices that are not associated with a device driver cannot be used properly.

Possible reasons for this problem are that the required device driver module has been unloaded, that an existing association between the device and the device driver has been removed, or that the device is not supported.

This check identifies devices that, in sysfs, do not have a symbolic link to a device driver.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "no_driver"

Severity

medium

Summary

One or more I/O devices are not associated with a device driver: &devices_list;

Explanation

One or more I/O devices cannot be used properly because they are not associated with a device driver.

Possible reasons for this problem are that the required device driver module has been unloaded, that an existing association between the device and the device driver has been removed, or that the device is not supported.

The following I/O devices are not associated with a device driver:

&all_devices;

Each device has a device type and a control unit (CU) type. Each device driver provides a list of supported combinations of device type and CU type. Linux uses this information to associate devices with device drivers. The sysfs directories of devices with a device-driver association include a symbolic link "driver". This link points to the sysfs directory of the associated device driver.

To verify that an I/O device with bus ID <device_bus_id> is not associated with a device driver, confirm that there is no symbolic link "driver" in the following sysfs directory:

/sys/bus/ccw/devices/<device_bus_id>
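The symbolic-link check can be automated with a small loop. In the following sketch, a temporary directory tree simulates two devices, one with and one without a "driver" link; on a real system, set SYSFS_CCW to /sys/bus/ccw/devices and remove the simulation lines:

```shell
# List CCW devices whose sysfs directory has no "driver" symbolic link.
SYSFS_CCW=$(mktemp -d)                       # simulated sysfs tree
mkdir -p "$SYSFS_CCW/0.0.0100" "$SYSFS_CCW/0.0.0200"
ln -s ../../drivers/dasd-eckd "$SYSFS_CCW/0.0.0100/driver"  # simulated association

for dev in "$SYSFS_CCW"/*; do
    if [ ! -L "$dev/driver" ]; then
        echo "$(basename "$dev") is not associated with a device driver"
    fi
done
```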

Solution

  1. If the kernel module of the required device driver has been unloaded, load it again. For example, issue:

    modprobe <module_name>

    where <module_name> is the name of the required device driver module. You can use the "modinfo" command to find out which combinations of device type and CU type are supported by a device driver module.

  2. Try to create the missing association of the I/O device with its device driver. For example, issue:

    echo <device_bus_id> > /sys/bus/ccw/drivers/<module_name>/bind

    Alternatively, try to create the association by issuing:

    echo <device_bus_id> > /sys/bus/ccw/drivers_probe
  3. Verify that the device is supported.

  4. If you cannot establish an association between the I/O device and a device driver, contact your support organization.

Reference

For information about supported devices, see:

  • The release notes of your distribution

  • The applicable version of "Device Drivers, Features, and Commands". You can find this publication at

    http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

For information about investigating kernel modules, see the "modinfo" man page.

18. Health check "fc_remote_port_state" (back to top)

Component

fibre channel/remote port

Title

Identify unusable Fibre Channel (FC) remote ports

Description

I/O devices that cannot be reached through their remote ports indicate storage server problems. For good performance and availability, all remote ports should be available.

Dependencies

n/a

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Parameter "ignore_bus_id"

Description

A list of FC host bus IDs, separated by commas (,). Remote ports that are associated with the listed bus IDs are excluded from the check.

Default value

n/a

Exception "rports_not_usable"

Severity

high

Summary

There are unusable Fibre Channel (FC) remote ports: &rport_summ;

Explanation

Some remote ports have a state other than "Online". Such remote ports cannot be used for connecting to I/O devices.

The following remote ports are unusable:

Remote Target Port FC Host Bus ID WWPN State

&rport;

Read the sysfs port_state attribute of each remote port to check the port state:

For example, issue:

# cat /sys/class/fc_remote_ports/<rport>/port_state

where <rport> is an unusable Fibre Channel remote port.
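The per-port reads can be combined into a single "grep" call. In the following sketch, a temporary directory tree simulates two remote ports so the command can be tried anywhere; on a real system, use /sys/class/fc_remote_ports instead:

```shell
# "grep -L" prints the names of files that do NOT match, that is,
# the port_state files of remote ports that are not "Online".
FC_RPORTS=$(mktemp -d)                       # simulated sysfs tree
mkdir -p "$FC_RPORTS/rport-0:0-1" "$FC_RPORTS/rport-0:0-2"
echo Online > "$FC_RPORTS/rport-0:0-1/port_state"
echo Blocked > "$FC_RPORTS/rport-0:0-2/port_state"

grep -L '^Online$' "$FC_RPORTS"/*/port_state
```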

Solution

Search for Storage Server errors and resolve them using the documentation of your Storage Server.

Reference

For more information about "Fibre Channel", refer to "How to use FC-attached SCSI devices with Linux on System z". For more information about "port_state", refer to "Device Drivers, Features, and Commands".

You can obtain the above publications from http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

19. Health check "fs_boot_zipl_bootmap" (back to top)

Component

filesystem/boot

Title

Verify that the bootmap file is up-to-date

Description

With a backlevel bootmap file, you might no longer be able to boot your Linux instance.

This check compares the file metadata to verify that none of the boot data that is referenced by the bootmap file has been modified after the bootmap file was created. The boot data typically includes a kernel image, an initial RAM disk (initrd), and a kernel parameter file.

A backlevel bootmap file can be the result of upgrading the kernel with a new kernel image without running "zipl" to update the bootmap file accordingly.

This check applies only if the following assumptions are all true:

  • The boot device is a disk device.

  • The bootmap file has been created from specifications in the "zipl" configuration file, /etc/zipl.conf.

  • /etc/zipl.conf describes a single boot configuration that can but need not provide a boot menu.

Distribution tools typically use "zipl" according to these assumptions when creating a boot disk.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Rajesh K Pirati <rapirati@in.ibm.com>

Exception "outdated_bootmap"

Severity

medium

Summary

Boot records appear out of date; reboot might fail

Explanation

The bootmap file is backlevel. You might no longer be able to boot your Linux instance.

The bootmap file references files with boot data, such as a kernel image, an initial RAM disk (initrd), and a kernel parameter file. One or more of the referenced files have been modified after the bootmap file was created. The "zipl" tool creates and updates the bootmap file on the boot disk according to specifications in /etc/zipl.conf.

On the boot disk, check the time when the bootmap file was last changed. View /etc/zipl.conf and identify the boot data files for the boot configuration. Check the time when each of the referenced files was last changed. The bootmap file must not be older than any of the referenced files with boot data.
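The timestamp comparison can be sketched with the shell's "-nt" (newer than) test. The temporary files below simulate a boot disk; on a real system, point BOOTMAP and BOOT_FILES at the paths named in /etc/zipl.conf (the file names used here are assumptions):

```shell
boot=$(mktemp -d)                   # simulated boot directory
BOOTMAP="$boot/bootmap"
touch "$BOOTMAP"
sleep 1
touch "$boot/image"                 # kernel image updated after "zipl" ran
BOOT_FILES="$boot/image $boot/initrd"

for f in $BOOT_FILES; do
    [ -e "$f" ] || continue         # skip files the configuration does not use
    if [ "$f" -nt "$BOOTMAP" ]; then
        echo "$f is newer than the bootmap file - rerun zipl"
    fi
done
```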

Solution

Run the "zipl" command to update the bootmap file.

This check applies only if the following assumptions are all true:

  • The boot device is a disk device.

  • The bootmap file has been created from specifications in the "zipl" configuration file, /etc/zipl.conf.

  • /etc/zipl.conf describes a single boot configuration that can but need not provide a boot menu.

If these assumptions do not apply to your Linux instance, omit this check to suppress further warnings in the future.

Reference

For more information about the "zipl" command and booting Linux, see "Device Drivers, Features, and Commands". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Also see the "zipl" man page.

20. Health check "fs_fstab_dasd_devnodes" (back to top)

Component

filesystem/fstab

Title

Identify standard DASD device nodes in the fstab file

Description

The DASD device driver creates standard device nodes for DASDs that are based on the order in which DASDs are set online. When you add or remove disks, the device name of a disk might change across a reboot. To preserve the mapping between standard device nodes and the associated physical disk space across reboot, use device nodes that are based on unique properties of a DASD and so identify a particular device.

Dependencies

n/a

Authors

Rajesh K Pirati <rapirati@in.ibm.com>

Exception "volatile_devnodes_used"

Severity

medium

Summary

Standard DASD device nodes are used in the fstab file.

Explanation

The fstab file contains standard DASD device nodes that have been created by the DASD device driver.

The DASD device driver creates standard device nodes for disks in the order in which they are set online. When you add or remove disks, the standard device node of a disk might change across a reboot. To preserve the mapping between device nodes and the associated physical disks, use device nodes that are based on unique properties for a disk. Such device nodes are independent of the sequence in which the devices are set online and can help you to reliably address an intended disk space.

The following file systems use a standard DASD device node in file /etc/fstab: &fs_exp;

Solution

Use the udev-created device nodes to be sure that you access a particular physical disk space, regardless of the device node that is assigned to it.

For example, in the file system information in /etc/fstab you could replace the following specifications:

/dev/dasdzzz1 /temp1 ext3 defaults 0 0
/dev/dasdzzz2 /temp2 ext3 defaults 0 0

with these specifications:

/dev/disk/by-path/ccw-0.0.b100-part1 /temp1 ext3 defaults 0 0
/dev/disk/by-path/ccw-0.0.b100-part2 /temp2 ext3 defaults 0 0

Reference

See the man pages of the "fstab" file, and the "udev" utility. For more information about DASD device nodes, see the section about the DASD device driver in "Device Drivers, Features, and Commands". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

21. Health check "fs_fstab_fsck_order" (back to top)

Component

filesystem/fstab

Title

Check if filesystems are skipped by filesystem check (fsck)

Description

This check examines whether filesystems are skipped by the filesystem check (fsck) during boot. Filesystems that are not checked for consistency might become corrupted, and the system might even fail to boot.

Dependencies

n/a

Authors

Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>

Parameter "exclude"

Description

A list of filesystems, separated by colons (:). The filesystems mounted at the specified mount points are to be excluded from the consistency check. Special filesystems, such as /proc and /sys, need not be checked for consistency.

Example:

/proc:/sys

Default value

none

Parameter "mount"

Description

A list of filesystems, separated by colons (:). The filesystems mounted at the specified mount points are to be checked for consistency. If the list is empty, all mount points in /etc/fstab, except those in the exclude list, are checked.

Example:

/:/home

Default value

n/a

Exception "filesystem_not_checked"

Severity

medium

Summary

These filesystems are not checked by filesystem check (fsck): &filesystem_list_summary;

Explanation

Several filesystems are not checked for consistency during boot. Filesystems that are not checked for consistency might become corrupted, and the system might fail to come up.

These filesystems are not checked:

&filesystem_not_checked;

Solution

Change the sixth field (fs_passno) of the /etc/fstab entry from 0 to 2.

For example:

If the current setting is

/dev/disk/by-path/ccw-0.0.eb7e-part2 /mnt ext3 defaults   0  0

edit /etc/fstab to reflect the following change:

/dev/disk/by-path/ccw-0.0.eb7e-part2 /mnt ext3 defaults   0  2
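Entries that fsck skips can be listed with a one-line "awk" filter on the sixth field. The sample text below stands in for /etc/fstab; on a real system, run the filter on the file itself:

```shell
fstab='/dev/disk/by-path/ccw-0.0.eb7e-part1 / ext3 defaults 0 1
/dev/disk/by-path/ccw-0.0.eb7e-part2 /mnt ext3 defaults 0 0
proc /proc proc defaults 0 0'

# Print mount points whose fs_passno (sixth field) is 0.
printf '%s\n' "$fstab" | awk '$6 == 0 { print $2 }'
```

Note that special filesystems such as /proc legitimately keep fs_passno 0 and can be excluded through the "exclude" parameter.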

Reference

See the man page of the "fstab" file and "fsck" command.

Exception "root_low_prio_check"

Severity

medium

Summary

Root filesystem is checked with low priority by filesystem check (fsck)

Explanation

The root filesystem contains important system data, so it must be checked for consistency before any other filesystem.

Solution

Change the sixth field (fs_passno) of the /etc/fstab entry from 2 to 1.

For example:

If the current setting is

/dev/disk/by-path/ccw-0.0.eb7e-part1 / ext3 defaults   0  2

edit /etc/fstab to reflect the following change:

/dev/disk/by-path/ccw-0.0.eb7e-part1 / ext3 defaults   0  1

Reference

See the man page of the "fstab" file and "fsck" command.

Exception "root_not_checked"

Severity

high

Summary

Root filesystem is not checked by filesystem check (fsck)

Explanation

The root filesystem must be checked before any other filesystem. If the root filesystem is not checked for consistency, it might become corrupted and the Linux instance might fail to start up.

Solution

Change the sixth field (fs_passno) of the /etc/fstab entry from 0 to 1.

For example:

If the current setting is

/dev/disk/by-path/ccw-0.0.eb7e-part1 / ext3 defaults   0  0

edit /etc/fstab to reflect the following change:

/dev/disk/by-path/ccw-0.0.eb7e-part1 / ext3 defaults   0  1

Reference

See the man page of the "fstab" file and "fsck" command.

22. Health check "fs_inode_usage" (back to top)

Component

filesystem

Title

Check file systems for an adequate number of free inodes

Description

Many Linux file systems maintain metadata about file system objects (for example, files or folders) in inodes. Each object has a separate inode. When a file system runs out of free inodes, no further files or folders can be created, even if plenty of free disk space is available.

Some applications and administrative tasks require an adequate number of free inodes on each mounted file system. If there are not enough free inodes, these applications might no longer be available or the complete system might be compromised. Regular monitoring of inode usage can avert this risk.

Dependencies

n/a

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Parameter "critical_limit"

Description

Usage of the available inodes of the file system (in percent) at which to raise a high-severity exception. Valid values are integers in the range 1 to 100.

Default value

95

Parameter "mount_points"

Description

A list of mount points, separated by colons (:). The file systems mounted at the specified mount points are to be checked for free inodes. If the list is empty, all mounted file systems are checked.

Example:

/mnt:/home/mymnt:/usr/data/myapp

Default value

n/a

Parameter "warn_limit"

Description

Usage of the available inodes of the file system (in percent) at which to raise a low-severity exception. Valid values are integers in the range 1 to 100.

Default value

80

Exception "critical_limit"

Severity

high

Summary

The critical threshold of &param_critical_limit;% inode usage is exceeded on some file systems (&critical_exceeded_list_summary;)

Explanation

The percentage of used inodes on one or more file systems has exceeded the specified critical threshold of &param_critical_limit;%.

Many Linux file systems maintain metadata about file system objects (for example, files or folders) in inodes. Each object has a separate inode. When a file system runs out of free inodes, no further files or folders can be created, even if plenty of free disk space is available.

Further increase in the number of used inodes is likely to compromise the availability of an application or of the complete system.

The following file systems exceed the threshold:

&critical_exceeded_list;

To view the current inode usage, run the "df -i" command with no parameters.
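The threshold comparison can be sketched by parsing "df -i" style output with "awk". The sample text and the 80% limit below are illustrative:

```shell
LIMIT=80
df_i='Filesystem     Inodes  IUsed  IFree IUse% Mounted on
/dev/dasda1    655360 622592  32768   95% /
/dev/dasdb1    655360 131072 524288   20% /home'

# Print mount points whose inode usage (IUse% column) exceeds the limit.
printf '%s\n' "$df_i" | awk -v limit="$LIMIT" \
    'NR > 1 { sub(/%/, "", $5); if ($5 + 0 > limit) print $6 " at " $5 "%" }'
```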

Solution

Free some inodes on the affected file systems. For example, delete obsolete files and move directories to other file systems. Consider re-creating the file system, with more inodes. For the options required to control the number of inodes, see the mkfs.<filesystem> man page, where <filesystem> specifies your file system type.

ATTENTION: Re-creating a file system destroys all contained data. Back up the data on the file system before you start.

Reference

See the man page of the "df" command and of the "mkfs.<filesystem>" command for your file system.

Exception "warn_limit"

Severity

low

Summary

The warning threshold of &param_warn_limit;% inode usage is exceeded on some file systems (&warn_exceeded_list_summary;)

Explanation

The percentage of used inodes on one or more file systems has exceeded the specified warning threshold of &param_warn_limit;%.

Many Linux file systems maintain metadata about file system objects (for example, files or folders) in inodes. Each object has a separate inode. When a file system runs out of free inodes, no further files or folders can be created, even if plenty of free disk space is available.

Further increase in the number of used inodes might compromise the availability of an application or of the complete system.

The following file systems exceed the threshold:

&warn_exceeded_list;

To view the current inode usage, run the "df -i" command with no parameters.

Solution

Monitor the disk inode usage. Consider deleting obsolete files or moving directories to free some inodes.

Reference

See the man page of the "df" command.

23. Health check "fs_mount_option_ro" (back to top)

Component

filesystem

Title

Check for read-only filesystems

Description

This check examines whether filesystems have been mounted read-only. A read-only mount inhibits filesystem operations such as creating, editing, or deleting files and folders.

Dependencies

n/a

Authors

Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>

Parameter "mount_points"

Description

A list of mount points, separated by colons (:). The filesystems mounted at the specified mount points are to be checked for read-only mounts. If the list is empty, all mounted filesystems are checked.

Example:

/home:/proc

Default value

/home:/proc:/tmp:/var/log:/sys

Exception "read_only_filesystem"

Severity

high

Summary

One or more filesystems have been mounted as read-only

Explanation

Filesystems that are mounted read-only inhibit filesystem operations such as editing or deleting files and folders.

The following filesystems have been mounted as read-only:

&read_only_filesystems;

To view the mount points with their respective options, run the "mount" command with no parameters.
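A minimal sketch of the read-only scan, parsing "mount"-style output (the sample text below is illustrative):

```shell
mount_output='/dev/dasda1 on / type ext3 (rw)
/dev/dasdb1 on /data type ext3 (ro)'

# Print mount points whose option list starts with "ro".
printf '%s\n' "$mount_output" | awk '$6 ~ /^\(ro[,)]/ { print $3 }'
```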

Solution

Read-only filesystems can be remounted as read-write by the following command:

mount -o remount,rw <device> <mount_point>

where <device> is the read-only filesystem and <mount_point> is its mount point.

If a filesystem has been intentionally mounted as read-only for security reasons, remove the filesystem mount point from the mount_points check parameter.

Reference

See the man page of the "mount" command.

24. Health check "fs_tmp_cleanup" (back to top)

Component

filesystem/temp

Title

Verify that temporary files are deleted at regular intervals.

Description

Linux instances have one or more directories, for example, /tmp, to store temporary files. If temporary files are not deleted at regular intervals, they can fill up file systems and cause the Linux instance to run out of disk space. An exception message is issued unless a program is configured to clear directories with temporary files at regular intervals.

Dependencies

(sys_distro=RHEL and sys_rhel_version>=5.0) or (sys_distro=SLES and sys_sles_version>=10)

Authors

Rajesh K Pirati <rapirati@in.ibm.com>

Parameter "temp_dir"

Description

A blank-separated list of directories that contain temporary files. The check verifies that temporary files in the specified directories are deleted at regular intervals.

Default value

/tmp

Exception "max_days_not_set"

Severity

low

Summary

Temporary files are not deleted at regular intervals

Explanation

Temporary files are not deleted at regular intervals because the "MAX_DAYS_IN_TMP" variable is not set or empty. Accumulating temporary files can fill up file systems and cause the Linux instance to run out of disk space. Check the value of the "MAX_DAYS_IN_TMP" variable in "/etc/sysconfig/cron".

Solution

Set a time interval in days with the "MAX_DAYS_IN_TMP" variable in "/etc/sysconfig/cron". All temporary files that are not accessed for more than the specified number of days are deleted.

Reference

Exception "no_cron_job"

Severity

low

Summary

The cron service for deleting temporary files at regular intervals is switched off.

Explanation

The cron service for deleting temporary files at regular intervals is switched off. Accumulating temporary files can fill up file systems and cause the Linux instance to run out of disk space.

Typical Linux installations provide cron jobs to automatically delete temporary files. To automatically run these jobs, the cron service must be switched on.

Use the "chkconfig" command to confirm that the cron service is active. For example, run:

# chkconfig --list | grep cron

Solution

To switch on the cron service, use the "chkconfig" command. The name of the cron service depends on your Linux distribution.

For example, issue:

# chkconfig cron on

Reference

See the "chkconfig" and the cron service man pages.

Exception "temp_dir_miss"

Severity

low

Summary

Temporary files in these directories are not cleared: &tmp_dir_summ;

Explanation

In some directories, temporary files are not deleted at regular intervals. Accumulating temporary files can fill up file systems and cause the Linux instance to run out of disk space. Temporary files in these directories are not cleared at regular intervals:

&tmp_dir_list;

Read the value of the "TMP_DIRS_TO_CLEAR" variable in "/etc/sysconfig/cron" to find out which directories are listed for regular clearing.

Solution

Add any directories to be cleared at regular intervals to the "TMP_DIRS_TO_CLEAR" variable in "/etc/sysconfig/cron". If this check reports a directory that should not be cleared, remove it from the "temp_dir" health check parameter.

Reference

See the "lnxhc-check" man page for information about changing check parameters.

Exception "tmp_watch"

Severity

low

Summary

The program that deletes temporary files is not installed.

Explanation

The "tmpwatch" program, which deletes temporary files at regular intervals, is not installed. Accumulating temporary files can fill up file systems and cause the Linux instance to run out of disk space.

To verify whether "tmpwatch" is installed, issue, for example:

# rpm -qi tmpwatch

Solution

Install the "tmpwatch" package. For example, issue:

# rpm -ihv tmpwatch-<version>.rpm

Reference

See the "rpm" and "tmpwatch" man pages.

25. Health check "fs_usage" (back to top)

Component

filesystem

Title

Check file systems for adequate free space

Description

Some applications and administrative tasks require an adequate amount of free space on each mounted file system. If there is not enough free space, these applications might no longer be available or the complete system might be compromised. Regular monitoring of disk space usage averts this risk.

Dependencies

n/a

Authors

Peter Oberparleiter <peter.oberparleiter@de.ibm.com>

Parameter "critical_limit"

Description

File system usage (in percent) at which to raise a high-severity exception. Valid values are integers in the range 1 to 100.

Default value

95

Parameter "mount_points"

Description

A list of mount points, separated by colons (:). The file systems mounted at the specified mount points are to be checked for free space. If the list is empty, all mounted file systems are checked.

Example:

/mnt:/home/mymnt:/usr/data/myapp

Default value

n/a

Parameter "warn_limit"

Description

File system usage (in percent) at which to raise a low-severity exception. Valid values are integers in the range 1 to 100.

Default value

80
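The interplay of the two thresholds can be sketched as follows. This is an illustrative Python sketch of the rule described by the parameters, not the lnxhc implementation; the function and return values are made up for this example.

```python
def classify_usage(used_percent, warn_limit=80, critical_limit=95):
    """Map a file-system usage percentage to an exception level.

    Mirrors the check's rule: usage above critical_limit raises a
    high-severity exception, usage above warn_limit a low-severity one.
    """
    if used_percent > critical_limit:
        return "critical_limit"   # high severity
    if used_percent > warn_limit:
        return "warn_limit"       # low severity
    return None                   # no exception
```

With the default values, a file system at 85% usage raises only the low-severity exception, while one at 96% usage raises the high-severity exception.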

Exception "critical_limit"

Severity

high

Summary

The critical threshold of &param_critical_limit;% disk space usage is exceeded on some file systems (&critical_exceeded_list_summary;)

Explanation

The percentage of used disk space on one or more file systems has exceeded the specified critical threshold of &param_critical_limit;%. Further increase in the amount of used space is likely to compromise the availability of an application or of the complete system.

The following file systems exceed the threshold:

&critical_exceeded_list;

To view the current disk space usage, run the "df" command with no parameters.

Solution

Free disk space on the affected file systems. For example, delete obsolete files and move directories with growing space requirements to separate file systems.

Reference

See the man page of the "df" command.

Exception "warn_limit"

Severity

low

Summary

The warning threshold of &param_warn_limit;% disk space usage is exceeded on some file systems (&warn_exceeded_list_summary;)

Explanation

The percentage of used disk space on one or more file systems has exceeded the specified warning threshold of &param_warn_limit;%. Further increase in the amount of used space might compromise the availability of an application or of the complete system.

The following file systems exceed the threshold:

&warn_exceeded_list;

To view the current disk space usage, run the "df" command with no parameters.

Solution

Monitor the disk space usage. Also plan on freeing some disk space. For example, consider deleting obsolete files or moving directories with growing disk space requirements to separate file systems.

Reference

See the man page of the "df" command.

26. Health check "fw_callhome" (back to top)

Component

firmware/callhome

Title

Confirm that automatic problem reporting is activated

Description

When Linux experiences a kernel panic, the automatic problem reporting feature sends collected problem data to the IBM service organization. Hence, a system crash automatically leads to a new Problem Management Record (PMR), which can be processed by IBM service.

Omit this check unless a hardware support agreement with IBM is in place and the hardware is enabled for the Remote Support Facility.

Dependencies

sys_platform=s390 or sys_platform=s390x

sys_hypervisor=ZLPAR

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Exception "inactive"

Severity

low

Summary

Automatic problem reporting is disabled

Explanation

With the automatic problem reporting feature, problem data is automatically collected and sent to IBM service. Without this feature, you need to collect the data manually using the appropriate tools and contact IBM service, for example, to open a new Problem Management Record (PMR).

Solution

Omit this check unless a hardware support agreement with IBM is in place and the hardware is enabled for the Remote Support Facility.

To temporarily activate automatic problem reporting on a running Linux instance, run:

# sysctl -w kernel.callhome=1

To persistently activate automatic problem reporting, ensure that the /etc/sysctl.conf file contains an entry for "kernel.callhome" and that this entry reads:

kernel.callhome=1

If your Linux distribution uses an /etc/sysctl.d directory, you can also create a separate file with this entry in that directory.

Also ensure that the "sclp_async" kernel module is loaded before sysctl settings are applied. See the documentation of your Linux distribution for information about automatically loading kernel modules during the boot process.

Reference

  • For details about setting system controls, see the sysctl (section 8) and sysctl.conf (section 5) man pages.

  • For more information about automatic problem reporting, see "Device Drivers, Features, and Commands". You can obtain this publication from

    http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "not_available"

Severity

low

Summary

The automatic problem reporting feature is not available

Explanation

The automatic problem reporting feature is not available on your Linux instance. A possible reason is that the kernel module with this feature is not loaded.

With the automatic problem reporting feature, problem data is automatically collected and sent to IBM service. Without this feature, you need to collect the data manually using the appropriate tools and contact IBM service, for example, to open a new Problem Management Record (PMR).

Solution

Omit this check unless the following conditions apply:

  • Your distribution includes the automatic problem reporting feature.

  • A hardware support agreement with IBM is in place and the hardware is enabled for the Remote Support Facility.

If the check is applicable to your Linux instance, ensure that the "sclp_async" module is loaded, for example, by issuing:

# modprobe sclp_async

Reference

  • For information about loading modules, see the man page of the "modprobe" command.

  • For more information about automatic problem reporting, see "Device Drivers, Features, and Commands". You can obtain this publication from

    http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

27. Health check "fw_cpi" (back to top)

Component

firmware/cpi

Title

Check if control program identification can display meaningful Linux instance names

Description

You can use the control program identification (CPI) to assign names to your Linux instances. The names are used to identify Linux instances, for example, on the Hardware Management Console (HMC) or the service element (SE).

To assign meaningful names to your Linux instances, the CPI needs names for your Linux system and sysplex.

Dependencies

sys_platform=s390 or sys_platform=s390x

sys_hypervisor=ZLPAR

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Exception "no_sysplex_name"

Severity

low

Summary

No sysplex name has been set

Explanation

No sysplex name was set for your Linux instance. The control program identification (CPI) feature uses the sysplex name to identify a Linux instance, for example, on the Hardware Management Console (HMC).

Solution

You can use the attribute sysplex_name in sysfs to specify a sysplex name:

/sys/firmware/cpi/sysplex_name

The sysplex name is a string consisting of up to 8 characters of the following set: A-Z, 0-9, $, @, #, and blank.

To set a sysplex name for a Linux instance, for example SYSPLEX1, issue:

# echo SYSPLEX1 > /sys/firmware/cpi/sysplex_name
# echo 1 > /sys/firmware/cpi/set

Depending on your Linux distribution, you can edit /etc/sysconfig/cpi to persistently set a sysplex name.
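The naming rule above (up to 8 characters from A-Z, 0-9, $, @, #, and blank) can be expressed as a small sketch. This is illustrative only; the validate_cpi_name helper is made up for this example and is not part of any CPI tooling.

```python
import re

# Accept 1 to 8 characters from the set allowed for CPI names:
# uppercase letters, digits, $, @, #, and blank.
_CPI_NAME = re.compile(r"^[A-Z0-9$@# ]{1,8}$")

def validate_cpi_name(name):
    """Return True if name is a valid CPI system or sysplex name."""
    return bool(_CPI_NAME.match(name))
```

The same rule applies to both the sysplex name and the system name.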

Reference

For more information about the control program identification feature, see "Device Drivers, Features, and Commands". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "no_system_name"

Severity

medium

Summary

No system name has been set

Explanation

No system name was set for your Linux instance. The control program identification (CPI) feature uses the system name to identify a Linux instance, for example, on the Hardware Management Console (HMC).

Solution

You can use the attribute system_name in sysfs to specify a system name:

/sys/firmware/cpi/system_name

The system name is a string consisting of up to 8 characters of the following set: A-Z, 0-9, $, @, #, and blank.

To set a system name for a Linux instance, for example LINUX1, issue:

# echo LINUX1 > /sys/firmware/cpi/system_name
# echo 1 > /sys/firmware/cpi/set

Depending on your Linux distribution, you can edit /etc/sysconfig/cpi to persistently set a system name.

Reference

For more information about the control program identification feature, see "Device Drivers, Features, and Commands". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

28. Health check "log_syslog_rotate" (back to top)

Component

log

Title

Verify that syslog files are rotated

Description

Syslog files contain the messages that are generated by the system components. If the size of the syslog files is not controlled, they might completely fill up your file system, and cause the Linux instance to run out of disk space.

logrotate is a tool that monitors, rotates, compresses, or truncates syslog files to save disk space and to limit the syslog file size. This health check verifies that logrotate runs at regular intervals on your system and checks your logrotate configuration settings.

Dependencies

(sys_distro=RHEL and sys_rhel_version < 7 and sys_rhel_version >=5) or (sys_distro=SLES and sys_sles_version < 12 and sys_sles_version >=10)

Authors

Rajesh K Pirati <rapirati@in.ibm.com>

Parameter "max_log_size"

Description

Maximum syslog file size, specified in KB, MB, or GB.

Default value

1MB
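A size such as "1MB" can be converted to bytes for comparison with the actual log file size. The sketch below is illustrative and assumes binary units, where 1 KB equals 1024 bytes.

```python
import re

# Binary unit factors; an assumption for this sketch.
_FACTORS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}

def parse_size(text):
    """Convert a size such as '1MB' or '512KB' to a number of bytes."""
    m = re.match(r"^(\d+)(KB|MB|GB)$", text.strip())
    if not m:
        raise ValueError("invalid size: " + text)
    return int(m.group(1)) * _FACTORS[m.group(2)]
```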

Exception "log_size_exceeded"

Severity

medium

Summary

These syslog files exceed the defined size: &log_summ;

Explanation

One or more syslog files exceed the maximum size. Large syslog files are difficult to analyze, and might completely fill up your file system. This might cause the Linux instance to run out of disk space.

These syslog files exceed the defined size: &log_exp;

To verify that your syslog files are rotated, confirm that the logrotate settings for each syslog file define a maximum file size and a regular rotation.

Solution

Enable syslog rotation for the listed files in the /etc/logrotate.conf file. For example, with the statement

include /etc/logrotate.d

the logrotate tool considers all syslog files that have a configuration file defined in /etc/logrotate.d.

For each of the listed syslog files, create a logrotate configuration file in directory /etc/logrotate.d. In this file, you specify the syslog file name and its logrotate settings. The following settings may be useful:

  • compress

    specifies whether old versions of log files are to be compressed.

  • rotate <number>

    specifies the number of old versions to keep.

  • daily/weekly/monthly

    specifies the time interval to rotate log files.

  • size <file size>

    specifies the maximum file size of a log file. Whenever a log file size is greater than this size, it is rotated.
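Putting these settings together, a configuration file such as /etc/logrotate.d/messages might look like the following. The log file path and the values are examples only; adjust them to your distribution and needs.

```
/var/log/messages {
    size 1M
    rotate 4
    compress
}
```

This example rotates /var/log/messages whenever it grows beyond 1 MB, keeps four old versions, and compresses them.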

Reference

See the man page of the "logrotate" command.

Exception "no_cron"

Severity

high

Summary

Syslog file rotation is not enabled because the cron service is not running

Explanation

The cron service for rotating logs at regular intervals is switched off. Accumulating syslog files can fill up the file system, cause the Linux instance to run out of disk space, and impede analysis.

Typical Linux installations provide cron jobs to automatically rotate the syslog files. To automatically run these jobs, the cron service must be switched on.

Use the "chkconfig" command to confirm that the cron service is active. For example, issue:

# chkconfig --list | grep cron

Solution

To switch on the cron service, use the "chkconfig" command. The name of the cron service depends on your Linux distribution.

For example, issue:

# chkconfig cron on

Reference

See the man pages of the "chkconfig" command and the "cron" daemon.

Exception "no_logrotate"

Severity

high

Summary

Syslog file rotation is not enabled because the logrotate package is not installed

Explanation

The logrotate program, which rotates syslog files at regular intervals, is not installed. Accumulating syslog files can fill up the file system, cause the Linux instance to run out of disk space, and impede analysis.

To verify whether logrotate is installed, issue, for example:

# rpm -qi logrotate

Solution

Install the logrotate package. For example, issue:

# rpm -ihv logrotate-<version>.rpm

Reference

See the man pages of the "rpm" and "logrotate" commands.

29. Health check "mem_swap_availability" (back to top)

Component

memory/swap

Title

Check if swap space is available

Description

This check examines whether swap space is available. On memory-constrained systems, missing swap space can lead to out-of-memory situations and even a system crash. Swapping allows the system to use more memory than is physically available.

Dependencies

n/a

Authors

Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>

Exception "no_swap_space"

Severity

high

Summary

The system does not have swap space

Explanation

Memory-constrained systems need swap space. Without swap space, a heavy workload can lead to out-of-memory situations and even a system crash.

To verify the available swap space, display /proc/swaps or issue the "swapon -s" command.

Solution

Linux has two forms of swap space: the swap partition and the swap file. A swap partition is an independent section of the hard disk that is used solely for swapping; no other files can reside there. A swap file is a special file in the file system that resides among your system and data files.

To add a swap partition:

  1. Ensure that the partition is marked as a swap partition:

    # fdisk -l <device>

    The "System" field must show "Linux swap / Solaris".

  2. Prepare the partition with the mkswap (make swap) command as root:

    # mkswap <device>

  3. If no errors are reported, the swap space is ready to use. To activate it immediately, issue:

    # swapon <device>

    Verify that the swap space is active by running the "swapon -s" command.

To mount the swap space automatically at boot time, add an entry to the /etc/fstab file:

<device>       none    swap    sw      0       0

To add a swap file:

  1. Use the dd command to create an empty file. For example, to create a 1 GB file, issue:

    # dd if=/dev/zero of=/<swapfile> bs=1048576 count=1024

    where <swapfile> is the name of the swap file. The file size is bs times count; here, 1024 blocks of 1 MiB result in 1 GB.

  2. Prepare the swap file:

    # mkswap /<swapfile>

  3. Activate the swap file:

    # swapon /<swapfile>

The /etc/fstab entry for a swap file looks like this:

/<swapfile> none swap sw 0 0

Reference

See the man pages of the "mkswap" and "swapon" commands.

30. Health check "mem_usage" (back to top)

Component

memory

Title

Ensure memory usage is within the threshold

Description

The check examines the RAM usage of the system. If not enough RAM is available, the system slows down, becomes unresponsive, and can be difficult or even impossible to use.

Dependencies

n/a

Authors

Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>

Parameter "critical_limit"

Description

Memory usage (in percent) at which to raise a critical exception. Valid values are integers in the range 1 to 100.

Default value

90

Parameter "warn_limit"

Description

Memory usage (in percent) at which to raise a warning exception. Valid values are integers in the range 1 to 100.

Default value

80

Exception "critical_limit"

Severity

high

Summary

Memory usage (&critical_limit;%) exceeds the critical threshold of &param_critical_limit;%

Explanation

The percentage of memory usage has exceeded the specified critical threshold of &param_critical_limit;%. A further increase in memory usage might trigger an out-of-memory (OOM) situation and make the system unresponsive.

Memory consumption details:

&mem_used; MB memory used
&swap_used; MB swap used

Memory usage can be checked with the "free" command.

Total memory available is Mem:total + Swap:total.

Total memory used is -/+ buffers:used + Swap:used.
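The two formulas above can be expressed as a small sketch. This is illustrative only; the values correspond to the fields of the "free" command output.

```python
def memory_usage_percent(mem_total, used_minus_buffers_cache,
                         swap_total, swap_used):
    """Percentage of used memory as defined above.

    mem_total, swap_total: the Mem: and Swap: totals from "free".
    used_minus_buffers_cache: the "used" value of the
    "-/+ buffers/cache" line of "free".
    """
    total = mem_total + swap_total
    used = used_minus_buffers_cache + swap_used
    return 100.0 * used / total
```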

Solution

  1. Check if any processes are unnecessarily hogging memory and terminate them.

  2. Increase the swap space size.

  3. Increase RAM size.

Reference

See the man pages of the "free" and "sync" commands and of the "proc" file system.

Exception "warn_limit"

Severity

low

Summary

Memory usage (&warn_limit;%) exceeds the warning threshold of &param_warn_limit;%

Explanation

The percentage of memory usage has exceeded the specified warning threshold of &param_warn_limit;%. A further increase in memory usage might trigger an out-of-memory (OOM) situation and make the system unresponsive.

Memory consumption details:

&mem_used; MB memory used
&swap_used; MB swap used

Memory usage can be checked with the "free" command.

Total memory available is Mem:total + Swap:total.

Total memory used is -/+ buffers:used + Swap:used.

Solution

  1. Check if any processes are unnecessarily hogging memory and terminate them.

  2. Increase the swap space size.

  3. Increase RAM size.

Reference

See the man pages of the "free" and "sync" commands and of the "proc" file system.

31. Health check "net_bond_ineffective" (back to top)

Component

network/bonding

Title

Identify bonding interfaces that are configured with single network interfaces

Description

Bonding setups are mainly used to increase availability or performance. A bonding interface is a logical interface that aggregates multiple slave interfaces. Bonding interfaces that are configured with only one slave interface do not offer path redundancy or increased bandwidth, so neither goal can be achieved.

Dependencies

n/a

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "single_slave"

Severity

medium

Summary

These bonding interfaces aggregate only one network interface: &bond_devices;

Explanation

A bonding interface is a logical interface that aggregates multiple slave interfaces. One or more bonding interfaces aggregate only one network slave interface. Bonding interfaces with only one slave interface neither provide path redundancy nor increased bandwidth, and so do not help to increase availability or boost performance. The following bonding interfaces aggregate only one slave interface:

&bond_slaves;

To verify that a bonding interface is configured with a single slave interface, issue a command of this form to obtain the slave list for a bonding device, bond<n>:

# cat /proc/net/bonding/bond<n>

where <n> is an index number that identifies the bonding interface.
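The single-slave condition can be detected by counting the "Slave Interface:" lines in that file, for example as in this sketch. It only parses text that has already been read from /proc; the function name is made up for this example.

```python
def count_slaves(bonding_proc_text):
    """Count the slave interfaces listed in the content of
    a /proc/net/bonding/bond<n> file."""
    return sum(1 for line in bonding_proc_text.splitlines()
               if line.startswith("Slave Interface:"))
```

A bonding interface for which this count is 1 offers neither path redundancy nor increased bandwidth.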

Solution

Reassign slave interfaces such that one bonding interface aggregates more than one slave interface. For example, change the configuration script of a slave interface to persistently assign it to a different bonding interface. Interface configuration scripts are usually found in the /etc branch of the Linux file system and called ifcfg-<ifname>, where <ifname> is the interface name. The exact location depends on your distribution.

In the script locate a line:

MASTER=bond<n>

where bond<n> is the name of the bonding interface. Change this name to reassign the slave interface to a different bonding interface. The new assignment takes effect after Linux is booted. You can use the "ifenslave" command to temporarily reassign slave interfaces on a running Linux instance.

Issue a command of this form to detach an interface from a bonding interface:

# ifenslave -d <bonding_interface> <network_interface>

Issue a command of this form to attach an interface to a bonding interface:

# ifenslave <bonding_interface> <network_interface>

Reference

  • For more information about the relevant commands, see the "ifenslave" man pages.

  • For information about the parameters of the "bonding" kernel module, issue

    # modinfo bonding
  • For general information about bonding, see the Linux Foundation "Bonding How-to" at

    http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding

32. Health check "net_bond_qeth_ineffective" (back to top)

Component

network/bonding

Title

Identify bonding interfaces that aggregate qeth interfaces with the same CHPID

Description

Bonding setups are mainly used to increase availability or performance. A bonding interface is a logical interface that aggregates multiple slave interfaces. Slave interfaces that are configured with the same CHPID do not offer path redundancy or increased bandwidth, so neither goal can be achieved.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "single_chpid"

Severity

medium

Summary

These bonding interfaces aggregate qeth interfaces with the same CHPID: &bond_devices;

Explanation

A bonding interface is a logical interface that aggregates multiple slave interfaces. One or more bonding interfaces aggregate qeth slave interfaces that are configured with the same CHPID. Slave interfaces that use the same CHPID neither provide path redundancy nor increased bandwidth, and so do not help to increase availability or boost performance. The following bonding interfaces aggregate slave interfaces that are configured with the same CHPID:

&bond_slaves;

Perform these steps to verify that a bonding interface aggregates slave interfaces that are configured with the same CHPID:

  1. Issue a command of this form to obtain the slave list for a bonding device, bond<n>:

    # cat /proc/net/bonding/bond<n>

    where <n> is an index number that identifies the bonding interface.

  2. Use the "lsqeth" command and find the CHPID for each slave interface in the "CHPID" column of the command output.

Solution

Reassign slave interfaces such that each slave interface of a bonding interface is configured with a different CHPID.

For example, change the configuration script of a slave interface to persistently assign it to a different bonding interface. Interface configuration scripts are usually found in the /etc branch of the Linux file system and called ifcfg-<ifname>, where <ifname> is the interface name. The exact location depends on your distribution.

In the script locate a line:

MASTER=bond<n>

where bond<n> is the name of the bonding interface. Change this name to reassign the slave interface to a different bonding interface. The new assignment takes effect after Linux is booted.

You can use the "ifenslave" command to temporarily reassign slave interfaces on a running Linux instance.

Issue a command of this form to detach an interface from a bonding interface:

# ifenslave -d <bonding_interface> <network_interface>

Issue a command of this form to attach an interface to a bonding interface:

# ifenslave <bonding_interface> <network_interface>

Reference

  • For more information about the relevant commands, see the "ifenslave" and "lsqeth" man pages.

  • For information about the parameters of the "bonding" kernel module, issue

    # modinfo bonding
  • For general information about bonding, see the Linux Foundation "Bonding How-to" at

    http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding

33. Health check "net_dns_settings" (back to top)

Component

network/dns

Title

Ensure that a name server is listed with a correct address

Description

A name server resolves host names to numerical IP addresses. If no name server is listed, or a name server is listed with an incorrect address, systems in the network cannot be accessed by their names.

Dependencies

n/a

Authors

Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>

Exception "incorrect_nameserver"

Severity

medium

Summary

The system has an incorrect name server address

Explanation

A name server converts names to numeric IP addresses. An incorrect name server address prevents access to systems in the network by their names. Listing more than one name server improves name resolution: if one name server fails to resolve a name, the next listed name server is tried.

Up to three name servers can be listed.

These name servers are incorrect: &invalid_nameservers;

Solution

Edit /etc/resolv.conf and add the name server address:

nameserver <ip address>

You can confirm that name resolution works by using the "host" command:

host <ip> <ip>

where the first <ip> is the address to look up and the second <ip> is the name server to use. The command performs a reverse DNS lookup for the first address through the specified name server and displays the details of the domain server.

Alternatively, issue:

host <domain_name>

Reference

See the man pages of the "resolv.conf" file and the "host" command.

Exception "no_nameserver"

Severity

medium

Summary

Nameserver is not listed

Explanation

A name server converts names to numeric IP addresses. If no name server is listed in the resolv.conf file, systems in the network cannot be accessed by their names.

Solution

Edit /etc/resolv.conf and add the name server address:

nameserver <ip address>

You can confirm that the name server address is correct and that name resolution works by using the "host" command:

host <ip> <ip>

where the first <ip> is the address to look up and the second <ip> is the name server to use. The command performs a reverse DNS lookup for the first address through the specified name server and displays the details of the domain server.

Alternatively, issue:

host <domain_name>

Reference

See the man pages of the "resolv.conf" file and the "host" command.

34. Health check "net_hsi_outbound_errors" (back to top)

Component

network/hsi

Title

Check for an excessive error ratio for outbound HiperSockets traffic

Description

This check examines the transmit (TX) error ratio for HiperSockets network interfaces (hsi). A high TX error ratio can be caused by one or more slow receivers that require attention.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Parameter "txerror_ratio"

Description

Threshold for the percentage of TX errors by total TX packets for HiperSockets network interfaces. If the ratio of TX errors exceeds this threshold, an exception is raised. Valid values are integers in the range 1 to 100.

Default value

1

Exception "slow_hsi_receivers"

Severity

medium

Summary

One or more HiperSockets interfaces exceeded the TX error threshold: &summ_interface;

Explanation

One or more HiperSockets (hsi) interfaces exceeded the specified error threshold for outbound (TX) traffic. The receiving interface does not have sufficient buffer space for the HiperSockets traffic. High TX error rates for a qeth HiperSockets device are a strong indication of one or more slow receivers.

&devices_txerrors;

You can use the "ifconfig" command to investigate TX errors for your HiperSockets interfaces.

Solution

Examine the receivers of your HiperSockets network traffic. In particular, ensure that a suitable number of buffers has been configured for the receiving interface. The following table provides a general guideline that works well for most cases:

RAM size        Number of buffers
up to 500 MB    16
up to 1 GB      32
up to 2 GB      64
more than 2 GB  128

If the receiving interface is on a Linux instance, use the "lsqeth" command to find out how many buffers are defined for this interface. In the default command output, the buffer count is shown as the value of the buffer_count attribute. With the -p option, the output is in table format and the buffer count is shown in the "cnt" column.

To change the number of buffers for an interface, perform these steps:

  1. Set the interface offline. Before doing so, ensure that no critical task is using the interface:

    # echo 0 > /sys/devices/qeth/<device_bus_id>/online

  2. Change the buffer count:

    # echo <value> > /sys/devices/qeth/<device_bus_id>/buffer_count

  3. Set the interface online again:

    # echo 1 > /sys/devices/qeth/<device_bus_id>/online

where <value> is the new buffer number and <device_bus_id> is the bus ID of the qeth group device that corresponds to the interface. In the "lsqeth" output, this is the first of the three listed bus IDs.

Reference

For information about the commands, see the "lsqeth" and "ifconfig" man pages.

35. Health check "net_inbound_errors" (back to top)

Component

network

Title

Check the inbound network traffic for an excessive error or drop ratio

Description

This check examines network interfaces for a high received (RX) error ratio or high ratio of dropped RX packets. Problems with received packets lead to performance degradation as packets have to be resent by the originator. High RX error and drop ratios can be caused by insufficient memory.

Dependencies

n/a

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Parameter "rxdrop_ratio"

Description

Threshold for the percentage of dropped RX packets by total RX packets. If the ratio of dropped RX packets exceeds this threshold for a network interface, an exception message is issued. Valid values are integers in the range 1 to 100.

Default value

1

Parameter "rxerror_ratio"

Description

Threshold for the percentage of RX errors by total RX packets. If the ratio of RX errors exceeds this threshold for a network interface, an exception message is issued. Valid values are integers in the range 1 to 100.

Default value

1
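Both parameters compare a simple percentage of the interface packet counters, as reported by "ifconfig", against a threshold. A sketch of the threshold test, illustrative only:

```python
def exceeds_ratio(bad_packets, total_packets, threshold_percent):
    """True if bad_packets exceed threshold_percent of total_packets.

    For rxerror_ratio, bad_packets is the RX error count; for
    rxdrop_ratio, it is the dropped RX packet count.
    """
    if total_packets == 0:
        return False  # no traffic, no ratio to evaluate
    return 100.0 * bad_packets / total_packets > threshold_percent
```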

Exception "limits_exceeded"

Severity

medium

Summary

These network interfaces exceeded the error or dropped packets threshold for inbound traffic: &summ_interface;

Explanation

One or more network interfaces exceeded the received (RX) error or dropped packets threshold. Problems with received packets lead to performance degradation as packets have to be resent by the originator.

Insufficient main memory can cause both RX packet errors and dropped RX packets. If increasing the available memory does not improve the RX drop ratio, the problem might be that the maximum number of unprocessed packets, as defined by net.core.netdev_max_backlog, is reached. This occurs if input buffers are not processed fast enough.

These network interfaces have exceeded the threshold for the RX error ratio or RX dropped ratio:

&devices_rxerrors;

You can use the "ifconfig" command to investigate the RX error and drop rate for your network interfaces.

Solution

Increase the maximum tolerated number of unprocessed packets per interface.

To read the current value, issue:

# sysctl net.core.netdev_max_backlog

To set the maximum to a higher value, <value>, issue:

# sysctl -w net.core.netdev_max_backlog=<value>

If a higher maximum of unprocessed packets does not resolve the problem, increase the main memory (RAM). Follow the appropriate steps for your system hardware and Linux distribution.

Reference

See the "sysctl" and "ifconfig" man pages.

36. Health check "net_qeth_buffercount" (back to top)

Component

network/qeth

Title

Identify qeth interfaces that do not have an optimal number of buffers

Description

The most suitable number of buffers for a particular interface depends on the available memory. To allow for memory constraints, many Linux distributions use a small number of buffers by default. On Linux instances with ample memory and a high traffic volume, this can lead to performance degradation, as incoming packets are dropped and have to be resent by the originator. This check uses a set of rules that correlate memory size and number of buffers to evaluate the settings for each qeth interface.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Parameter "recommended_buffercount"

Description

The rule set used to evaluate the interface settings. The rule set comprises a set of comma-separated rules. Each rule specifies a particular memory size or implies a range of memory sizes and the number of buffers to be used. The rules are evaluated from left to right. The first rule that applies to the available memory defines the number of buffers demanded by the check.

Each rule has the form:

<operator><memsize>:<buffer_count>

Where:

  • <operator> is one of these comparison operators:

    • == (equal)

    • <= (equal or smaller)

    • >= (equal or greater)

    • < (smaller)

    • > (greater)

  • <memsize> specifies an amount of memory. Valid values are numbers followed by one of the units KB (for kilobyte), MB (for megabyte), or GB (for gigabyte). Note that this number is compared against the amount of available memory which may be lower than the total memory assigned to a Linux system due to kernel internal overhead.

  • <buffer_count> is the number of buffers to be used for the specified memory size. Valid values are 16, 32, 64 and 128.

Example:

<=500MB:16,<=1GB:32,<=2GB:64,>2GB:128

The rule set of the example demands 16 buffers if the memory is 500 MB or less, 32 buffers if the memory is more than 500 MB but not more than 1 GB, 64 buffers if the memory is more than 1 GB but not more than 2 GB, and 128 buffers if the memory is more than 2 GB.

Default value

<=500MB:16,<=900MB:32,<=1900MB:64,>1900MB:128
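
The left-to-right rule evaluation described above can be sketched in shell. This is an illustrative re-implementation, not the check's actual code; for simplicity, the available memory is passed in as a plain number of megabytes and all sizes are normalized to MB.

```shell
#!/bin/sh
# Evaluate a buffer-count rule set against the available memory (MB).
# Prints the buffer count of the first rule that matches; rules are
# evaluated from left to right, as described in the documentation.
eval_buffercount_rules() {
    ruleset=$1          # e.g. "<=500MB:16,<=1GB:32,>1GB:64"
    mem_mb=$2           # available memory in MB
    old_ifs=$IFS
    IFS=,
    for rule in $ruleset; do
        op_size=${rule%%:*}                              # e.g. "<=500MB"
        count=${rule##*:}                                # e.g. "16"
        op=$(printf '%s' "$op_size" | sed 's/[0-9].*//')
        num=$(printf '%s' "$op_size" | sed 's/[^0-9]*\([0-9]*\).*/\1/')
        case $op_size in                                 # normalize to MB
            *KB) size=$((num / 1024)) ;;
            *GB) size=$((num * 1024)) ;;
            *)   size=$num ;;
        esac
        match=no
        case $op in
            "==") if [ "$mem_mb" -eq "$size" ]; then match=yes; fi ;;
            "<=") if [ "$mem_mb" -le "$size" ]; then match=yes; fi ;;
            ">=") if [ "$mem_mb" -ge "$size" ]; then match=yes; fi ;;
            "<")  if [ "$mem_mb" -lt "$size" ]; then match=yes; fi ;;
            ">")  if [ "$mem_mb" -gt "$size" ]; then match=yes; fi ;;
        esac
        if [ "$match" = yes ]; then
            IFS=$old_ifs
            echo "$count"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

For example, evaluating the default rule set with 800 MB of available memory prints 32.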

Exception "inefficient_buffercount"

Severity

medium

Summary

These network interfaces do not have the expected number of buffers: &summ_interface;

Explanation

The number of buffers of one or more network interfaces deviates from the number demanded by the specified rule. The most suitable number of buffers for a particular interface depends on the available memory. To allow for memory constraints, many Linux distributions use a small number of buffers by default. On Linux instances with ample memory and a high traffic volume, this can lead to performance degradation, as incoming packets are dropped and have to be resent by the originator.

For the current main memory, &mem; GB, interfaces should have &rec_bc; buffers.

The following interfaces have a different number of buffers:

&interface_bc;

To find out if there are problems with the affected interfaces, check the output of the "ifconfig" command for errors and dropped packets.

Use the "lsqeth" command to confirm the current setting for the number of buffers. In the default command output, the buffer count is shown as the value for the "buffer_count" attribute. With the -p option, the output is in table format and the buffer count is shown in the "cnt" column.

Solution

For each affected interface, change the number of buffers to &rec_bc;.

To temporarily change the number of buffers on a running Linux instance, complete these steps:

Set the interface offline (before doing so, make sure that no critical task is using this interface):

# echo 0 > /sys/devices/qeth/<device_bus_id>/online

Change the buffer count:

# echo &rec_bc; > /sys/devices/qeth/<device_bus_id>/buffer_count

Set the interface online again:

# echo 1 > /sys/devices/qeth/<device_bus_id>/online

where <device_bus_id> is the bus ID of the qeth group device that corresponds to the interface. In the "lsqeth" output, this is the first of the three listed bus IDs.
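
The steps above can be combined into one small helper. This is a hedged sketch, not part of the check: the sysfs directory is a parameter so that the logic can be tried against a mock tree, and the bus ID in the usage note is only an example. Take the interface out of production use before running this against the real sysfs.

```shell
#!/bin/sh
# Set the buffer count of a qeth device: offline, change, online again.
# $1: device bus ID (first of the three bus IDs shown by "lsqeth")
# $2: new buffer count (16, 32, 64, or 128)
# $3: sysfs qeth directory (default /sys/devices/qeth)
set_qeth_buffercount() {
    bus_id=$1
    count=$2
    qeth_dir=${3:-/sys/devices/qeth}
    dev=$qeth_dir/$bus_id
    [ -d "$dev" ] || { echo "no such device: $bus_id" >&2; return 1; }
    case $count in
        16|32|64|128) ;;
        *) echo "invalid buffer count: $count" >&2; return 1 ;;
    esac
    echo 0 > "$dev/online"           # set the interface offline
    echo "$count" > "$dev/buffer_count"
    echo 1 > "$dev/online"           # set the interface online again
}
```

For example: set_qeth_buffercount 0.0.f500 128, where 0.0.f500 is a hypothetical bus ID.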

How to make this setting persistent across reboots depends on your distribution. Some distributions set the number through scripts located below /etc/sysconfig, other distributions use udev rules. For details, see the documentation that is provided with your distribution.

The suggested buffer size is derived from a general best-practice rule that is expressed by the "recommended_buffercount" check parameter, and that works well in many setups. If your current settings work to your satisfaction and you do not want to change them, you can adapt the "recommended_buffercount" parameter to your needs or omit this check to suppress further warnings in the future.

Reference

For more information, see the section about inbound buffers in the qeth chapter of "Device Drivers, Features, and Commands". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

37. Health check "net_services_insecure" (back to top)

Component

network/services

Title

Identify network services that are known to be insecure

Description

This check finds network services that are active but known to be insecure. Such services can compromise your data and system security. An example of an insecure network service is a network file system service that does not provide user authentication. Any user who can reach this service can access the data. Other network services might be considered insecure because they do not encrypt credentials and data. If network traffic from such services is intercepted, data might be disclosed to unauthorized parties and the system might become vulnerable to intrusion.

Examples of insecure network services are ftp, rsh, rlogin, and telnet.

Dependencies

(sys_distro=RHEL and sys_rhel_version>=5.0) or (sys_distro=SLES and sys_sles_version>=10)

Authors

Rajesh K Pirati <rapirati@in.ibm.com>

Parameter "insecure_services"

Description

A list of insecure network services to check for. In the list, services are separated by blanks. The default includes the most commonly used insecure network services. Add any services that are installed on your system and that you consider insecure.

Default value

tftp telnet rsh rlogin

Exception "insecure_services"

Severity

medium

Summary

One or more active network services are known to be insecure: (&insecure_services_summary;)

Explanation

Insecure network services can, potentially, compromise your data and system security. Insecure services might lack user authentication or transmit credentials and data without encryption.

The following insecure network services are active on your system:

&insecure_services_list;

Solution

Secure your system, for example, by taking one or more of the following actions:

  • Disable any insecure network services that are not required. For example, to disable telnet issue:

    # chkconfig telnet off
  • Instead of the insecure network services, use network services that provide SSL/TLS encryption features. For example, use SSH File Transfer Protocol (SFTP) or FTP-SSL instead of FTP.

  • Set up a firewall to prevent unauthorized parties from accessing the insecure network services.

  • Make sure the services are only available on secured network connections.

To prevent exception messages about services that you are aware of and do not consider a threat, remove these services from the "insecure_services" parameter.
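
As a rough illustration of such a scan (not the check's actual implementation), the following sketch reads a "chkconfig --list"-style listing from standard input and prints the services from a blank-separated list that are switched on in any runlevel. Note that xinetd-managed services use a different listing format, which this sketch does not handle.

```shell
#!/bin/sh
# From a chkconfig-style runlevel listing on stdin, print the services
# from a blank-separated list that are switched "on" in any runlevel.
# $1: blank-separated service list, e.g. "tftp telnet rsh rlogin"
find_insecure_services() {
    services=$1
    while read -r name levels; do
        case " $services " in
            *" $name "*)
                case $levels in
                    *:on*) echo "$name" ;;   # on in at least one runlevel
                esac ;;
        esac
    done
}
# Usage: chkconfig --list | find_insecure_services "tftp telnet rsh rlogin"
```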

Reference

  • For information about disabling services, see the "chkconfig" man page

  • For information about changing check parameters, see the "lnxhc" man page.

38. Health check "proc_cpu_usage" (back to top)

Component

process/cpu

Title

Ensure processes do not hog cpu time

Description

This check identifies processes that hog CPU time. If certain processes hog CPU time, other processes are deprived of it, which might cause applications to slow down and the system to become unresponsive.

Dependencies

n/a

Authors

Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>

Parameter "cpu_time"

Description

The accumulated CPU time per process, in seconds, that must be exceeded before an exception is reported.

Valid values are integers starting with 1.

Default value

300
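
As a rough approximation of this threshold test, the accumulated CPU time per process can be obtained with "ps". The sketch below is an illustration, not the check's code; the usage note assumes a procps-ng "ps" that supports the "cputimes" output field (accumulated CPU time in seconds).

```shell
#!/bin/sh
# From "pid comm seconds" lines on stdin, print the processes whose
# accumulated CPU time exceeds a threshold.
# $1: threshold in seconds (default 300, the check's default)
cpu_time_hogs() {
    threshold=${1:-300}
    while read -r pid comm secs; do
        if [ "$secs" -gt "$threshold" ]; then
            echo "$pid $comm ${secs}s"
        fi
    done
}
# Usage (procps-ng): ps -eo pid=,comm=,cputimes= | cpu_time_hogs 300
```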

Parameter "cpu_usage"

Description

The per-process CPU usage at which to raise a high-severity exception. The CPU usage represents the percentage of time that a process spent running during its lifetime.

Valid values are integers in the range 1 to 100.

Default value

80

Parameter "processes"

Description

A comma-separated (,) list of processes that are expected to consume much CPU time and that need not be reported by this check. If the list is empty, all processes consuming much CPU time are reported.

Example:

firefox, apache2

Default value

n/a

Exception "process_hogs_cpu"

Severity

high

Summary

One or more processes are hogging cpu time

Explanation

If processes hog CPU time, other processes are deprived of it, which in turn might cause applications to slow down and the system to become unresponsive.

List of processes hogging cpu time:

&hogging_procs;

Solution

To terminate a process that hogs CPU time and is not required, issue the following command:

kill -9 <pidofprocess>

Reference

See the man pages of the "top" and "kill" commands.

39. Health check "proc_load_avg" (back to top)

Component

process

Title

Ensure the system is running with optimal load

Description

This check examines the CPU and I/O load of the system. If a system runs under a very high load, it can become unresponsive, and applications can take longer to respond or complete.

Dependencies

n/a

Authors

Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>

Parameter "avgload"

Description

System usage (in percent) at which to raise an exception. Valid values are integers in the range 1 to 100.

Default value

90

Parameter "time"

Description

The load average intervals, in minutes, to be checked. Valid values are 1, 5, and 15. You can specify more than one interval, separated by commas (,).

Example:

1,15

Default value

15
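
The comparison this check performs can be approximated by reading /proc/loadavg and expressing the load as a percentage of the total CPU capacity. The sketch below is only an illustration: the file path and CPU count are parameters so that the logic can be exercised against sample data, and the percentage arithmetic is a simplification, not the check's actual algorithm.

```shell
#!/bin/sh
# Compare the 15-minute load average against a CPU-scaled threshold.
# $1: threshold in percent (default 90, the check's default)
# $2: loadavg file (default /proc/loadavg)
# $3: number of CPUs (default: nproc)
load_exceeds() {
    threshold=${1:-90}
    loadfile=${2:-/proc/loadavg}
    cpus=${3:-$(nproc)}
    load15=$(awk '{print $3}' "$loadfile")   # third field: 15-min average
    # load as rounded integer percent of total CPU capacity
    pct=$(awk -v l="$load15" -v c="$cpus" 'BEGIN {printf "%.0f", l * 100 / c}')
    if [ "$pct" -ge "$threshold" ]; then
        echo "overloaded: ${pct}%"
        return 0
    fi
    echo "ok: ${pct}%"
    return 1
}
```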

Exception "high_load"

Severity

medium

Summary

System is highly loaded

Explanation

When a system is highly loaded, the system and running programs can slow down, become much less responsive, and turn difficult or even impossible to use.

Statistics: &high_load;

Solution

Use the "top" command to check the current CPU utilization of processes; it lists the processes with the highest CPU usage first. Examine the processes and applications that hog the CPU and terminate them if they are not required.

To terminate a process that hogs the CPU, issue the following command:

kill -9 <pidofprocess>

To monitor processes performing I/O, issue the following command:

iotop

Disk I/O statistics can be checked with the "iostat" command (part of the sysstat package):

iostat

Reference

See the man pages of "top", "iotop", and "iostat".

Exception "over_load"

Severity

high

Summary

System is overloaded

Explanation

When a system is overloaded, the system and existing programs slow down and become much more unresponsive and difficult or even impossible to use.

Statistics: &over_load;

Solution

Examine the processes and applications that hog the CPU and terminate them if they are not required.

Run the "top" command; it lists the processes with the highest CPU usage first.

To terminate a process that hogs the CPU, issue the following command:

kill -9 <pidofprocess>

To monitor processes performing I/O, issue the following command:

iotop

Disk I/O statistics can be checked with the "iostat" command (part of the sysstat package):

iostat

Reference

See the man pages of "top", "iotop", and "iostat".

40. Health check "proc_mem_oom_triggered" (back to top)

Component

process/memory

Title

Check the kernel message log for out-of-memory (OOM) occurrences

Description

When a Linux instance runs out of memory, the OOM killer recovers memory by killing one or more processes. If important processes get killed, they might need to be restarted and protected from the OOM killer.

Frequent OOM occurrences indicate that too little memory is available for a given workload or that an application is consuming an undue amount of memory. Awareness of OOM occurrences can disclose resource shortages or help identify malfunctioning applications.

Dependencies

(sys_distro=RHEL and sys_rhel_version>=5.0) or (sys_distro=SLES and sys_sles_version>=10)

Authors

Rajesh K Pirati <rapirati@in.ibm.com>

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "processes_killed"

Severity

medium

Summary

The OOM killer killed one or more processes (&process_list_summary;)

Explanation

Because of a severe shortage of available memory, the out-of-memory killer (OOM killer) recovered memory by killing some processes. These processes were affected:

&processes_pid_list;

To find out more about the OOM occurrences, check /var/log/messages or the dmesg output for entries that begin with "Out of memory".

An algorithm assigns a priority to each process. Processes with a high priority get killed first when an OOM condition occurs.

The priority is expressed as a number in /proc/<pid>/oom_score, where <pid> is the process ID. To investigate the priorities, use the "ps" command to list your processes with their process IDs, and then read the priorities for the processes of interest.

Solution

Restart any important processes that were killed.

To prevent OOM conditions in the future, consider adding more memory or swap space. Also, ensure that there are no memory leaks in the applications you are running.

You can influence the priority of a process by writing a value in the range -17 to 15 to /proc/<pid>/oom_adj, where <pid> is the process ID. The lower this value, the lower the resulting priority in /proc/<pid>/oom_score. A value of -17 disables the OOM killer for the process.
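
This adjustment can be scripted as follows. In this sketch the proc root is a parameter so that the logic can be exercised against a mock directory; on a live system it defaults to /proc. Note that on a mock directory the kernel does not recompute the score file, and that newer kernels replace oom_adj with oom_score_adj (range -1000 to 1000).

```shell
#!/bin/sh
# Lower or raise the OOM-killer priority of a process.
# $1: process ID
# $2: adjustment value (-17 to 15; -17 disables the OOM killer)
# $3: proc root (default /proc; a parameter only for testing)
set_oom_adj() {
    pid=$1
    adj=$2
    proc=${3:-/proc}
    if [ "$adj" -lt -17 ] || [ "$adj" -gt 15 ]; then
        echo "adjustment out of range: $adj" >&2
        return 1
    fi
    printf '%s\n' "$adj" > "$proc/$pid/oom_adj"
    echo "pid $pid: oom_adj=$(cat "$proc/$pid/oom_adj") oom_score=$(cat "$proc/$pid/oom_score")"
}
```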

Reference

For more details, see the section about the procfs and OOM in "Red Hat Enterprise Linux 6 Deployment Guide". You can obtain this publication from http://redhat.com/docs.

41. Health check "proc_mem_usage" (back to top)

Component

process/memory

Title

Ensure processes do not hog memory

Description

This check identifies processes that hog memory. If certain processes hog memory, other processes are deprived of it, applications might slow down, and the system might even become unresponsive.

Dependencies

n/a

Authors

Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>

Parameter "mem_usage"

Description

Per process memory usage at which to raise a high-severity exception. Valid values are integers in the range 1 to 100.

Default value

90

Parameter "processes"

Description

A comma-separated (,) list of processes that are expected to consume much memory and that need not be reported by this check. If the list is empty, all processes consuming much memory are reported.

Example:

firefox, apache2

Default value

lnxhc
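
As an illustration of how such a filter might work (not the check's actual code), the following sketch reads "ps -eo pid=,comm=,pmem=" style output from standard input and prints processes whose memory usage reaches a threshold, skipping the names on the comma-separated exempt list.

```shell
#!/bin/sh
# From "pid comm %mem" lines on stdin, print processes whose memory
# usage reaches a threshold, skipping exempted process names.
# $1: threshold in percent (default 90, the check's default)
# $2: comma-separated exempt list, e.g. "lnxhc"
mem_hogs() {
    threshold=${1:-90}
    exempt=$(printf ',%s,' "${2:-}" | tr -d ' ')
    while read -r pid comm pmem; do
        case $exempt in
            *",$comm,"*) continue ;;         # exempted process name
        esac
        # compare the integer part of %MEM against the threshold
        if [ "${pmem%.*}" -ge "$threshold" ]; then
            echo "$pid $comm ${pmem}%"
        fi
    done
}
# Usage: ps -eo pid=,comm=,pmem= | mem_hogs 90 "lnxhc"
```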

Exception "process_hogs_memory"

Severity

high

Summary

One or more processes are hogging memory

Explanation

If processes hog memory, other processes are deprived of it, which in turn might cause applications to slow down and the system to become unresponsive.

List of processes hogging memory:

&hogging_mem;

Solution

  1. If the reported processes or applications behave as expected, you can prevent them from being reported in the future by adding them to the list:

    &param_processes;

  2. Check whether the application is configured properly.

    If steps 1 and 2 do not reveal an issue, restart the application, or terminate it if it is no longer required by issuing the following command:

    kill -9 <pidofprocess>

Reference

See the man pages of the "top" and "kill" commands.

42. Health check "proc_priv_dump" (back to top)

Component

process

Title

Ensure that privilege dump is switched off

Description

With the privilege dump setting (fs.suid_dumpable) at a non-zero value (1 or 2), core dumps of set-user-ID or otherwise protected or tainted binaries are created. For security reasons, this setting should be disabled by default.
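
The three possible values can be classified with a small helper. This is a sketch, with the value passed in directly so that the logic can be tested; on a live system, feed it the output of "sysctl -n fs.suid_dumpable".

```shell
#!/bin/sh
# Classify the fs.suid_dumpable setting.
# $1: current value (e.g. from: sysctl -n fs.suid_dumpable)
classify_suid_dumpable() {
    case $1 in
        0) echo "disabled (recommended)" ;;
        1) echo "debug mode: all processes dump core, no security applied" ;;
        2) echo "suidsafe mode: dumps readable by root only" ;;
        *) echo "unknown value: $1" >&2; return 1 ;;
    esac
}
# Usage: classify_suid_dumpable "$(sysctl -n fs.suid_dumpable)"
```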

Dependencies

n/a

Authors

Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>

Exception "debug_mode"

Severity

high

Summary

Privilege dump debug mode is on

Explanation

With the privilege dump setting at 1, all processes dump core when possible. The core dump is owned by the file system user ID of the dumping process, and no security is applied. This mode is intended for system debugging situations only.

Solution

If the system is not being used for debugging purposes, disable the setting in the /etc/sysctl.conf file:

fs.suid_dumpable = 0

This setting becomes active the next time the Linux instance is booted.

To temporarily disable privilege dump on a running Linux instance, issue the following command:

sysctl -w fs.suid_dumpable=0

Reference

See the man pages of the "sysctl" command and of the "sysctl.conf" configuration file.

Exception "suidsafe_mode"

Severity

medium

Summary

Privilege dump suid safe mode is on

Explanation

With the privilege dump setting at 2, any binary is dumped, but the core dump is readable by root only, which limits the security exposure. This mode is appropriate when administrators are attempting to debug problems in a normal environment.

Solution

If the system is not being used for debugging purposes, disable the setting in the /etc/sysctl.conf file:

fs.suid_dumpable = 0

This setting becomes active the next time the Linux instance is booted.

To temporarily disable privilege dump on a running Linux instance, issue the following command:

sysctl -w fs.suid_dumpable=0

Reference

See the man pages of the "sysctl" command and of the "sysctl.conf" configuration file.

43. Health check "ras_dump_kdump_on_panic" (back to top)

Component

ras/dump

Title

Ensure kdump is configured and running

Description

This check examines whether kdump is configured and running on the system. If kdump is not running and a system crash occurs, no crash dump is captured and the system is not available for use. Kdump allows the system to come back up after a crash, along with a crash dump that can be used for post-mortem analysis.

Dependencies

n/a

Authors

Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>

Exception "no_kdump"

Severity

high

Summary

Kdump is not operational

Explanation

Kdump is not operational on your Linux instance. If a kernel panic occurs, a dump cannot be captured automatically for post-mortem analysis.

You can confirm the kdump status by issuing the following command:

For Red Hat:

service kdump status

For SUSE:

service boot.kdump status

Solution

The kdump service might have been stopped. Restart the kdump service by issuing the following command:

For Red Hat:

service kdump restart

For SUSE:

service boot.kdump restart

Reference

Refer to http://www.dedoimedo.com/computers/kdump.html

Exception "no_kdump_crash"

Severity

high

Summary

kdump is not configured

Explanation

Your Linux instance does not have the crash kernel loaded. If a kernel panic occurs, a dump cannot be captured automatically for post-mortem analysis.

Solution

  1. If memory is not reserved for the crash kernel, reserve it by passing the crashkernel= kernel parameter.

    If the crashkernel= kernel parameter is already specified, check that the crashkernel parameter has proper values. Refer to the configuration settings of your distribution.

    crashkernel=X@Y

    X is the size of the crash kernel.

    Y is the offset at which the crash kernel is loaded.

    After the crashkernel= parameter is specified with the right offset and size, confirm that memory is reserved for the crash kernel by issuing either:

    1. cat /sys/kernel/kexec_crash_size

      This should return a non-zero value.

      or

    2. dmesg | head

      The output indicates the memory reserved for the crash kernel. If the memory reservation failed, it contains relevant error messages.

  2. Load the kdump kernel and initrd using the kexec-tools suite.

Typically, this setup is done for you by your Linux distribution.
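
The reservation check from step 1 can be wrapped in a small helper. This is a sketch; the file path is a parameter only so that the logic can be exercised against sample data, and on a live system it defaults to /sys/kernel/kexec_crash_size.

```shell
#!/bin/sh
# Verify that memory is reserved for the crash kernel.
# $1: path to kexec_crash_size (default /sys/kernel/kexec_crash_size)
crashkernel_reserved() {
    f=${1:-/sys/kernel/kexec_crash_size}
    [ -r "$f" ] || { echo "kexec support not available" >&2; return 1; }
    size=$(cat "$f")
    if [ "$size" -gt 0 ]; then
        echo "crash kernel memory reserved: $size bytes"
    else
        echo "no crash kernel memory reserved" >&2
        return 1
    fi
}
```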

Reference

Refer to http://www.dedoimedo.com/computers/kdump.html

44. Health check "ras_dump_on_panic" (back to top)

Component

ras/dump

Title

Confirm that the dump-on-panic function is enabled

Description

With the dump-on-panic function enabled, a dump is automatically created if a kernel panic occurs. Without this function you have to create a dump yourself. Dumps can only be created if dump tools and possibly dump devices are in place.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Exception "no_dumpconf"

Severity

high

Summary

The dumpconf service is not active

Explanation

Your Linux instance is configured for dump-on-panic, but the dumpconf service is not automatically started during the boot process.

Solution

To configure and activate the dump service, complete these steps:

  1. Edit /etc/sysconfig/dumpconf and configure the dump-on-panic action. Possible actions are dump, dump_reipl, or vmcmd with a CP VMDUMP command.

  2. Activate the dumpconf service with chkconfig and then start the service.

Reference

See the dumpconf man page.

For more information about the dump tools available for Linux on System z, see "Using the Dump Tools". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "no_kdump"

Severity

high

Summary

kdump is not configured

Explanation

Your Linux instance is not configured for kdump. Enable kdump to automatically create a dump if a kernel panic occurs.

Solution

You typically configure kdump with tools provided by your Linux distribution.

To manually configure kdump, complete these steps:

  1. Use the crashkernel= kernel parameter to reserve memory for the crash kernel. For example, specify crashkernel=128M.

  2. Issue the zipl command and reboot your Linux instance.

  3. Load the kdump kernel and initrd using the kexec-tools suite. For example,

    # kexec -p <image> --initrd <initrd> --command-line "<kparms>"

    where <image> specifies the kdump image and <initrd> specifies the initial RAM disk of the kdump kernel. The <initrd> can be omitted if the kdump kernel does not require an initial RAM disk. The <kparms> option specifies kernel parameters for the kdump kernel.

Reference

For more information about the dump tools available for Linux on System z, see "Using the Dump Tools". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "no_kdump_dumpconf"

Severity

medium

Summary

The dumpconf service is not active

Explanation

Your Linux instance is configured for dump-on-panic as a fallback for an existing kdump setup, but the dumpconf service is not automatically started during the boot process.

Solution

To configure and activate the dump service, complete these steps:

  1. Edit /etc/sysconfig/dumpconf and configure the dump-on-panic action. Possible actions are dump, dump_reipl, or vmcmd with a CP VMDUMP command.

  2. Activate the dumpconf service with chkconfig and then start the service.

Reference

See the dumpconf man page.

For more information about the dump tools available for Linux on System z, see "Using the Dump Tools". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "no_kdump_standalone"

Severity

low

Summary

dumpconf is not configured as kdump fallback

Explanation

The standalone dump configuration using dumpconf is not configured as a fallback for kdump.

Solution

To configure a fallback for an existing kdump setup, complete these steps:

  1. Plan and prepare your dump device.

  2. Edit /etc/sysconfig/dumpconf and configure the dump-on-panic action. Possible actions are dump, dump_reipl, or vmcmd with a CP VMDUMP command.

  3. Activate the dumpconf service with chkconfig and then start the service.

Reference

See the dumpconf man page.

For more information about the dump tools available for Linux on System z, see "Using the Dump Tools". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

Exception "no_standalone"

Severity

high

Summary

The dump-on-panic function is not enabled

Explanation

Your Linux instance is not configured for dump-on-panic. Configure dump-on-panic to automatically create a dump if a kernel panic occurs.

Solution

To configure dump-on-panic, complete these steps:

  1. Plan and prepare your dump device.

  2. Edit /etc/sysconfig/dumpconf and configure the dump-on-panic action. Possible actions are dump, dump_reipl, or vmcmd with a CP VMDUMP command.

  3. Activate the dumpconf service with chkconfig and then start the service.

Reference

See the dumpconf man page.

For more information about the dump tools available for Linux on System z, see "Using the Dump Tools". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

45. Health check "ras_panic_on_oops" (back to top)

Component

ras

Title

Ensure that panic-on-oops is switched on

Description

If the Linux instance experiences a kernel oops, the instance can no longer be trusted to work correctly. The panic-on-oops setting ensures that the Linux instance is stopped if this occurs.

Dependencies

n/a

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Exception "no_panic_on_oops"

Severity

medium

Summary

The panic-on-oops setting is disabled

Explanation

Without the panic-on-oops setting, a Linux instance might keep running after experiencing a kernel oops. After the oops, the instance might work incorrectly and possibly damage data.

Solution

Activate panic-on-oops through the following setting in the /etc/sysctl.conf file:

kernel.panic_on_oops = 1

This setting becomes active the next time the Linux instance is booted.

To temporarily activate panic-on-oops on a running Linux instance, issue the following command:

sysctl -w kernel.panic_on_oops=1

Reference

See the man pages of the "sysctl" command and of the "sysctl.conf" configuration file.

46. Health check "scsi_dev_state" (back to top)

Component

scsi/device

Title

Identify unusable SCSI devices

Description

SCSI devices that cannot be used might be damaged or simply need to be set back online.

Dependencies

n/a

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "not_usable"

Severity

high

Summary

There are unusable SCSI devices &lun_summ;

Explanation

Some SCSI devices have a state other than "running". This might indicate that the connection to the storage system is working but that the storage system itself has a problem. Such a SCSI device is considered not operational and cannot be used for I/O.

The following SCSI devices are unusable: &lun;

To check the state, read the sysfs state attribute of each SCSI device. For example, issue:

# cat /sys/bus/scsi/devices/<devname>/state

where <devname> is the SCSI device name.

Solution

Attempt to manually set the device online again by writing "running" to the sysfs state attribute.

For example, issue:

# echo "running" > /sys/bus/scsi/devices/<devname>/state

where <devname> is the SCSI device name.

If the problem persists, the storage hardware might be damaged. In this case, check the storage hardware using the documentation of your storage server.
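
The state scan described above can be automated over all SCSI devices. This is a sketch; the devices directory is a parameter only so that the logic can be tried against a mock tree, and on a live system it defaults to /sys/bus/scsi/devices.

```shell
#!/bin/sh
# List SCSI devices whose sysfs state is not "running".
# $1: SCSI devices directory (default /sys/bus/scsi/devices)
scan_scsi_states() {
    devdir=${1:-/sys/bus/scsi/devices}
    for dev in "$devdir"/*; do
        [ -f "$dev/state" ] || continue
        state=$(cat "$dev/state")
        if [ "$state" != running ]; then
            echo "$(basename "$dev"): $state"
        fi
    done
}
```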

Reference

For more information about SCSI device state, see "Device Drivers, Features, and Commands". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

47. Health check "sec_tty_root_login" (back to top)

Component

security/terminal

Title

Confirm that root logins are enabled for but restricted to secure terminals

Description

The login program and the Linux Pluggable Authentication Modules (PAM) configuration restrict root logins to the terminals listed in /etc/securetty.

This check verifies that root logins are enabled for all terminals that are considered secure. This check also verifies that no root logins are permitted on terminals that are considered insecure.

Root logins on multiple terminals might be helpful in emergency situations. However, root logins on insecure terminals constitute a security exposure.

Dependencies

n/a

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Parameter "insecure_ttys"

Description

A blank-separated list of terminals that are considered insecure, and for which root logins must not be permitted. When specifying terminals, omit the leading /dev/.

An exception message is issued if any terminal here is also listed in /etc/securetty.

Default value

n/a

Parameter "secure_ttys"

Description

A blank-separated list of terminals that are considered secure, and for which root logins should be permitted. When specifying terminals, omit the leading /dev/.

An exception message is issued if any terminal listed here is missing in /etc/securetty.

Default value

ttyS0 ttyS1 ttysclp0 sclp_line0 hvc0 hvc1 hvc2 hvc3 hvc4 hvc5 hvc6 hvc7
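
The comparison that this check performs can be sketched as follows. This is an illustration, not the check's actual code; it simply looks up each configured terminal name in the securetty file, which is a parameter only so that the logic can be tested.

```shell
#!/bin/sh
# Compare secure/insecure terminal lists against a securetty file.
# $1: blank-separated secure terminals (should be listed)
# $2: blank-separated insecure terminals (must not be listed)
# $3: securetty file (default /etc/securetty)
check_securetty() {
    secure=$1
    insecure=$2
    file=${3:-/etc/securetty}
    for tty in $secure; do
        if ! grep -qx "$tty" "$file"; then
            echo "missing secure terminal: $tty"
        fi
    done
    for tty in $insecure; do
        if grep -qx "$tty" "$file"; then
            echo "insecure terminal enabled: $tty"
        fi
    done
}
```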

Exception "insecure_enabled"

Severity

medium

Summary

There are insecure terminals on which root logins are permitted

Explanation

Listing a terminal in /etc/securetty permits root logins on this terminal. Permitting root logins on insecure terminals constitutes a security exposure.

The following terminals are listed in /etc/securetty but have been specified as insecure:

&list_insecure;

Solution

Remove the terminal from /etc/securetty. Alternatively, if you consider the terminal secure, remove the terminal from the insecure_ttys check parameter.

Reference

See the man pages of the "login" program, the "pam_securetty" module, and the "securetty" configuration file.

Exception "secure_disabled"

Severity

medium

Summary

There are secure terminals on which root logins are not permitted

Explanation

Root logins are permitted only on terminals that are listed in the /etc/securetty file. Restricting root logins can prevent system access in emergencies.

The following terminals have been specified as secure but are not listed in /etc/securetty:

&list_secure;

Solution

Append the terminal to the /etc/securetty file. Alternatively, if you consider the terminal insecure, remove the terminal from the secure_ttys check parameter.

Reference

See the man pages of the "login" program, the "pam_securetty" module, and the "securetty" configuration file.

48. Health check "sec_users_uid_zero" (back to top)

Component

security/users

Title

Screen users with superuser privileges

Description

This check examines the output of command "getent passwd" to identify user names that run with numerical user ID (UID) 0. These users have superuser privileges that are conventionally associated with user "root".

Users with UID 0 and the processes started by these users can inadvertently or maliciously disrupt, damage, manipulate, or destroy a system. Generally, UID 0 must be assigned sparingly and only to trusted user names. Security policies often restrict UID 0 to user name "root".

Dependencies

(sys_distro=RHEL and sys_rhel_version>=5.0) or (sys_distro=SLES and sys_sles_version>=10)

Authors

Rajesh K Pirati <rapirati@in.ibm.com>

Parameter "trusted_superusers"

Description

A list of user names that are trusted to run as superusers with UID 0. In the list, the user names are separated by blanks.

Default value

root

Exception "non_root_uid0"

Severity

medium

Summary

These users with UID 0 are not listed as trusted superusers: (&non_root_user_ids;)

Explanation

Users with numerical user ID (UID) 0 run with superuser privileges that are conventionally associated with user "root". These users and the processes they start can inadvertently or maliciously disrupt, damage, manipulate, or destroy a system. Generally, UID 0 must be assigned sparingly and only to trusted user names. Security policies often restrict UID 0 to user name "root".

The "trusted_superusers" parameter of this check identifies the following user names as trusted to run with UID 0:

&param_trusted_superusers;

The following user names are not in the list of trusted superusers but run with UID 0:

&non_uid_root_list;

Note: User names added from external services start with the symbol '+' or '-'. These user names are not reported. For example:

+username
+
+@username
-username
-
-@username
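
The scan described above can be sketched with awk on the output of "getent passwd". This is an illustration, not the check's implementation; entries starting with '+' or '-' are skipped, as described in the note.

```shell
#!/bin/sh
# From passwd-format input on stdin (e.g. "getent passwd"), print the
# user names with UID 0 that are not in the trusted list. Entries
# starting with '+' or '-' (external services) are skipped.
# $1: blank-separated trusted superusers (default "root")
untrusted_uid0() {
    trusted=" ${1:-root} "
    awk -F: '$1 !~ /^[+-]/ && $3 == 0 { print $1 }' |
    while read -r name; do
        case $trusted in
            *" $name "*) ;;        # trusted superuser, do not report
            *) echo "$name" ;;
        esac
    done
}
# Usage: getent passwd | untrusted_uid0 "root"
```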

Solution

Examine the list of user names that run with UID 0 and assess whether they need to be superusers and can be trusted with superuser privileges.

For user names that should not or need not run as superusers, change the UID from 0 to a non-zero unused UID. For example, issue a command like this:

usermod -u <UID> <user name>

To prevent this check from issuing further warnings about legitimate superusers, add their user names to the "trusted_superusers" check parameter.
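The check's selection logic can be sketched in shell. The snippet below mirrors it on sample "getent passwd" style data (the user names are hypothetical; a real run would pipe "getent passwd" instead of using the here-document):

```shell
# Mirrors the check's core logic on sample data.
trusted="root"   # corresponds to the "trusted_superusers" parameter

untrusted=$(awk -F: -v trusted="$trusted" '
    $1 ~ /^[+-]/ { next }        # entries from external services are skipped
    $3 == 0 {
        n = split(trusted, t, " ")
        ok = 0
        for (i = 1; i <= n; i++) if ($1 == t[i]) ok = 1
        if (!ok) print $1
    }' <<'EOF'
root:x:0:0:root:/root:/bin/bash
toor:x:0:0:backup superuser:/root:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
+netuser:x:0:0::/home/netuser:/bin/bash
EOF
)
echo "$untrusted"   # prints "toor"
```

Here "toor" is reported because it has UID 0 but is not in the trusted list, while "+netuser" is skipped as an external-service entry.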

Reference

  • For more information about changing user properties, see the "usermod" man page.

  • For information about changing check parameters, see the "lnxhc" man page.

49. Health check "storage_dasd_cdl_part" (back to top)

Component

storage/dasd

Title

Identify CDL-formatted DASD where the metadata area is used for storing data

Description

Compatible Disk Layout (CDL) formatted DASD should have a partition, and the partition should not start before track 2. Otherwise, data corruption might occur. Also, the metadata that is stored in tracks 0 and 1 can be corrupted. The metadata contains partition tables and volume labels that are required by other operating systems, for example, z/OS. If the metadata is corrupted, other operating systems might no longer recognize the disk.

On CDL-formatted devices, the first blocks are formatted with a non-standard block size. Regardless of the data that is written to these blocks, reading them returns only '0xE5'. The first two tracks of a CDL DASD contain metadata such as volume labels and partition tables. The volume labels are required so that the disk can be recognized by other operating systems, for example, z/OS. If they are overwritten, these operating systems no longer recognize the disk contents.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "invalid_partition_start"

Severity

high

Summary

There are CDL-formatted DASDs with invalid partition starts: &track_data_sum;

Explanation

There are Compatible Disk Layout (CDL) formatted DASDs with invalid partition starts. On CDL-formatted DASDs, the first two tracks contain metadata such as partition tables and volume labels. This information is required to access the DASD from other operating systems, for example, z/OS.

If a partition starts within the first two tracks, the metadata can be corrupted.

These are the DASDs with invalid partition starts:

&track_data;

To confirm, issue the "lsdasd" command to display DASDs. For each DASD, use the "dasdview -x -t info /dev/<dasd>" command and look for the track start in the table that follows "Other s/390 and zSeries operating systems would see the following data sets".

Solution

For each DASD with an invalid partition start, complete these steps:

  1. Back up the existing data.

  2. Low-level format the DASD with CDL. For example, you can use

    # dasdfmt -d cdl /dev/<dasd>
  3. Partition the DASD with the "fdasd" command.

  4. Restore the data from the backup. Depending on your backup mechanism, you might have to create a file system before restoring the data.

Reference

  • See also the man pages for the "lsdasd", "dasdview", and "dasdfmt" commands.

  • To partition a DASD, see the man page of the "fdasd" command.

Exception "no_partition_found"

Severity

medium

Summary

There are CDL-formatted DASDs without partitions: &no_part_sum;

Explanation

There are Compatible Disk Layout (CDL) formatted DASDs without partitions. If you use a whole DASD without a partition, data corruption might occur.

These DASDs do not have any partitions:

&no_part;

To confirm that there are DASDs without partitions, issue the "lsdasd" command to list DASDs. For each DASD, use the "dasdview -x -t info /dev/<dasd>" command and look for CDL-formatted devices. For each CDL-formatted DASD, run "grep <dasd> /proc/partitions" to display partition information. Partitions are numbers following the device name; for example, 'dasda1' is the first partition of the 'dasda' DASD.
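The /proc/partitions step can be sketched as follows, on sample data (the device names are hypothetical; a real run would read /proc/partitions itself):

```shell
# List DASDs that have no partition entry (a partition is the device name
# followed by a digit, e.g. "dasda1").
no_part=$(awk 'NR > 2 && $4 ~ /^dasd[a-z]+$/       { disks[$4] = 1 }
               NR > 2 && $4 ~ /^dasd[a-z]+[0-9]+$/ {
                   p = $4; sub(/[0-9]+$/, "", p); parted[p] = 1
               }
               END { for (d in disks) if (!(d in parted)) print d }' <<'EOF'
major minor  #blocks  name

  94     0   7212240  dasda
  94     1   7209360  dasda1
  94     4   7212240  dasdb
EOF
)
echo "$no_part"   # prints "dasdb"
```

Here "dasdb" is flagged because no "dasdb<n>" partition entry exists, while "dasda" has partition "dasda1".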

Solution

For each DASD without a partition, complete these steps:

  1. Back up the existing data.

  2. Low-level format the DASD with CDL. For example, you can use

    # dasdfmt -d cdl /dev/<dasd>
  3. Partition the DASD with the "fdasd" command.

  4. Restore the data from the backup. Depending on your backup mechanism, you might have to create a file system before restoring the data.

Reference

  • See also the man pages for the "lsdasd", "dasdview", and "dasdfmt" commands.

  • To partition a DASD, see the man page of the "fdasd" command.

50. Health check "storage_dasd_eckd_blksize" (back to top)

Component

storage/dasd

Title

Confirm 4K block size on ECKD DASD devices

Description

Verify the block size of low-level formatted ECKD DASD devices. If the block size is other than 4096, an exception is reported. A block size of 4096 matches the default block size of file systems and typically provides good I/O throughput.

Dependencies

sys_platform=s390 or sys_platform=s390x

(sys_distro=RHEL and sys_rhel_version>=5.0) or (sys_distro=SLES and sys_sles_version>=10)

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "unexpected_eckd_block_size"

Severity

medium

Summary

There are ECKD DASDs with a non-4K block size: &summary;

Explanation

There are ECKD DASDs that have a block size other than 4096 bytes.

Measurements showed that a block size of 4096 bytes (4 KB) yields the best I/O throughput and the most free disk space after formatting the DASD. Further tests showed that this result is independent of the request size issued by the application.

The following DASDs have non-4K block sizes:

&details;

To confirm, run the "lsdasd" command to display DASDs and their block sizes.
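The lsdasd step can be sketched mechanically on sample output (the column layout below is an assumption based on typical "lsdasd" output, and the bus IDs are hypothetical):

```shell
# Flag ECKD DASDs whose block size (column 6) differs from 4096.
bad=$(awk '$5 == "ECKD" && $6 != 4096 { print $1, $6 }' <<'EOF'
0.0.0100   active   dasda   94:0   ECKD   4096   7043MB   1803060
0.0.0101   active   dasdb   94:4   ECKD   2048   3522MB   1803060
EOF
)
echo "$bad"   # prints "0.0.0101 2048"
```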

Solution

To low-level format a DASD with a 4096 bytes block size, complete these steps:

  1. Backup existing data that resides on the disk.

  2. Low-level format the disk with a block size of 4096. For example, you can use the "dasdfmt" command.

  3. Restore the backup to the disk. Depending on your backup applications, you might format the disk with a new file system first.

Reference

See the man pages of the "lsdasd" and "dasdfmt" commands.

51. Health check "storage_dasd_nopav_zvm" (back to top)

Component

storage/dasd

Title

Check Linux on z/VM for the "nopav" DASD parameter

Description

This check examines the Linux on z/VM configuration for occurrences of the "nopav" DASD kernel or module parameter. The "nopav" parameter disables parallel access volume (PAV and HyperPAV) for Linux in LPAR mode but has no effect for Linux on z/VM. For Linux on z/VM you cannot disable PAV through Linux settings; configuration steps on z/VM are required instead.

Dependencies

sys_platform=s390 or sys_platform=s390x

sys_hypervisor=ZVM

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "ineffective_nopav"

Severity

low

Summary

The "nopav" DASD parameter in &module_info_file_path; has no effect

Explanation

The &module_info_file_path; configuration file includes the "nopav" parameter for the DASD device driver. This parameter suppresses parallel access volume (PAV and HyperPAV) enablement for Linux instances that run in LPAR mode. The "nopav" parameter has no effect for Linux on z/VM.

The "nopav" parameter can mislead administrators into expecting that PAV is disabled.

Solution

Use the z/VM CP "QUERY PAV" command to find out which devices are set up for PAV and HyperPAV. Use the z/VM CP "SET CU" command to disable PAV and HyperPAV.

Use the configuration tools provided by your distribution to remove the "nopav" parameter from your Linux on z/VM configuration or complete the following steps to remove the parameter directly from the configuration files:

  1. Open &module_info_file_path; with a text editor.

  2. Find the following line:

    &module_information;
  3. Remove "nopav" from this line.

  4. Search &module_info_file_path; for other occurrences of "nopav" and, if applicable, remove these occurrences from all boot configurations for Linux on z/VM.

  5. Save and close &module_info_file_path;.

If the "nopav" parameter was found in /proc/cmdline, you have to create a new boot configuration and reboot Linux to remove this parameter. If you are using "zipl" to create your boot configurations, you might have to first remove "nopav" from the zipl configuration file, then run "zipl" to create a new boot configuration, and then reboot Linux with the new boot configuration.
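A quick scan for the parameter can be sketched like this (the kernel command line below is a made-up sample; a real check would read /proc/cmdline and the module configuration files):

```shell
# Detect an ineffective "nopav" DASD parameter in a kernel command line.
cmdline='root=/dev/dasda1 dasd=nopav,0.0.0100-0.0.0102'   # hypothetical sample
case "$cmdline" in
    *nopav*) found=yes ;;
    *)       found=no ;;
esac
echo "nopav present: $found"
```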

Reference

  • For more information about the "nopav" parameter, see the kernel or module parameter section for the DASD device driver in "Device Drivers, Features, and Commands". This publication also has a general section about kernel and module parameters and a section about the "zipl" command. You can obtain this publication from

    http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

  • For more information about PAV and HyperPAV, see "How to Improve Performance with PAV". You can obtain this publication from

    http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

  • For more information about the z/VM CP "QUERY PAV" and "SET CU" commands, see "z/VM CP Commands and Utilities Reference". You can obtain this publication from

    http://www.ibm.com/vm/library

52. Health check "storage_dasd_pav_aliases" (back to top)

Component

storage/dasd

Title

Identify active DASD alias devices without active base device

Description

Alias devices without an active base device affect system performance and indicate a configuration problem. Through the Parallel Access Volume (PAV) feature, storage systems can represent the same physical disk space as a base device and one or more alias devices. With IBM HyperPAV, an alias can be used for any base device within the same logical subsystem on the storage system.

Dependencies

sys_platform=s390 or sys_platform=s390x

(sys_distro=RHEL and sys_rhel_version>=5.4) or (sys_distro=SLES and sys_sles_version>=10.4)

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "orphaned_alias"

Severity

medium

Summary

The base devices of one or more active DASD alias devices are not online: &busid_sum;

Explanation

The base devices of one or more active DASD alias devices are not online.

Through the Parallel Access Volume (PAV) feature, storage systems can represent the same physical disk space as a base device and one or more alias devices. With IBM HyperPAV, aliases are not exclusively used for the base device for which they are defined. An alias can be used for any base device within the same logical subsystem on the storage system.

Inactive base devices corresponding to active alias devices affect the overall system performance.

The following bus IDs do not have corresponding active base devices:

&busid;

Solution

Go through the listed alias devices and decide whether you still need each device. If you need an alias device, set the corresponding base device online; otherwise, set the alias device offline. You can use the "chccwdev" command for both tasks.

Run the 'lsdasd -u' command to verify your new configuration.
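The triage step can be sketched on sample "lsdasd" style output (the column layout is an assumption based on typical output, and the bus IDs are hypothetical):

```shell
# List bus IDs whose status column reads "alias" -- candidates to check
# against their base devices.
aliases=$(awk '$2 == "alias" { print $1 }' <<'EOF'
0.0.7500   active   dasda   94:0   ECKD   4096   7043MB   1803060
0.0.7580   alias
EOF
)
echo "$aliases"   # prints "0.0.7580"
```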

Reference

For more information about PAV and HyperPAV, see "How to Improve Performance with PAV".

For more information about setting devices online or offline, see "Device Drivers, Features, and Commands".

You can obtain these publications from http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

53. Health check "storage_mp_ineffective" (back to top)

Component

storage/multipath

Title

Identify multipath setups that consist of a single path only

Description

Through a correctly configured multipath setup, a Linux instance has two or more independent connections to the same physical storage device. This path redundancy can be used for load balancing and to maintain availability if one of the paths fails. Multipath setups with only a single path cannot achieve either of these goals.

Dependencies

(sys_distro=RHEL and sys_rhel_version>=5.0) or (sys_distro=SLES and sys_sles_version>=10)

Authors

Rajesh K Pirati <rapirati@in.ibm.com>

Exception "single_path"

Severity

medium

Summary

These multipath devices provide a single path only: &single_device_summ;

Explanation

In Linux, separate paths to the same physical device appear as separate devices. The Linux multipath tools aggregate such devices into a single multipath device. Through a correctly configured multipath setup, a Linux instance has two or more independent connections to the same storage device. This path redundancy can be used for load balancing and to maintain availability if one of the paths fails. Multipath setups with only a single path cannot achieve either of these goals.

To investigate your multipath configuration, use "multipath -ll".

The following multipath devices are configured with only a single path:

&single;

Solution

Take these actions to investigate your multipath devices:

  1. List your storage devices to ensure that Linux has registered each expected path.

    For example, use "lszfcp -D" to list your SCSI devices.

    The command output consists of lines of the form

    <device_bus_id>/<wwpn>/<hex_lun> <h>:<c>:<id>:<d_lun>

    where each line represents a path to a storage device. Each line contains two representations of the LUN of the storage device: <hex_lun> is the hexadecimal format of the LUN and <d_lun> is a decimal representation.

    Lines that represent paths to the same storage device have both identical values for <hex_lun> and identical values for <d_lun>.

    If not all expected paths are shown, ensure that all hardware components are in place and set up to provide multiple paths for your multipath devices.

  2. If multiple paths are available, ensure that the multipath configuration file groups these paths correctly. In particular, ensure that the missing paths are not blacklisted. See the documentation that is provided with your distribution for details of this configuration file.

  3. Restart the multipath daemon, for example, by issuing:

    # /etc/init.d/multipathd restart
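The path-listing step can be checked mechanically. A hedged sketch that counts paths per LUN in "lszfcp -D" style output, in the format shown above (the sample values are hypothetical):

```shell
# Count paths per hexadecimal LUN (third "/"-separated token of field 1)
# and flag LUNs with only a single path.
single_luns=$(awk '{ split($1, f, "/"); count[f[3]]++ }
     END { for (lun in count) if (count[lun] < 2) print lun }' <<'EOF'
0.0.1900/0x500507630303c562/0x4010404900000000 0:0:0:1074806808
0.0.1900/0x500507630300c562/0x4010404900000000 1:0:0:1074806808
0.0.1900/0x500507630303c562/0x4010404a00000000 0:0:1:1074872344
EOF
)
echo "$single_luns"   # prints "0x4010404a00000000"
```

The first LUN appears on two lines (two paths) and is fine; the second LUN appears only once and would be flagged.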

Reference

See the documentation that is provided with your distribution for more specific information about multipathing. Also see the man pages for the "multipath", "lsdasd", and "lszfcp" commands.

54. Health check "storage_mp_path_state" (back to top)

Component

storage/multipath

Title

Identify multipath devices with too few available paths or too many failed paths

Description

Through a correctly configured multipath setup, a Linux instance has two or more independent connections to the same physical storage device. This path redundancy can be used for load balancing and to maintain availability if one of the paths fails. Multipath setups with an insufficient number of available paths or an excessive number of failed paths might not meet these goals.

Dependencies

n/a

Authors

Rajesh K Pirati <rapirati@in.ibm.com>

Parameter "failed_path_limit"

Description

Maximum number of failed hardware paths to be tolerated for a multipath device.

Default value

1

Parameter "remaining_path_limit"

Description

Minimum number of available hardware paths to be required for a multipath device.

Default value

2

Exception "too_few_available_paths"

Severity

high

Summary

These multipath devices have fewer than &param_remaining_path_limit; available paths: &available_path_summ;

Explanation

In Linux, separate paths to the same physical device appear as separate devices. The Linux multipath tools aggregate such devices into a single multipath device. Through a correctly configured multipath setup, a Linux instance has two or more independent connections to the same storage device. This path redundancy can be used for load balancing and to maintain availability if one of the paths fails. Multipath setups with a small number of remaining paths cannot achieve either of these goals.

To investigate your multipath configuration, use

# multipath -ll

The following multipath devices have fewer than &param_remaining_path_limit; available paths:

&available_path_details;

Solution

Take these actions to investigate your multipath devices:

  1. List your storage devices to ensure that Linux has registered each expected path.

    For example, use "lszfcp -D" to list your SCSI devices. The command output consists of lines of the form

    <device_bus_id>/<wwpn>/<hex_lun> <h>:<c>:<id>:<d_lun>

    where each line represents a path to a storage device. Each line contains two representations of the LUN of the storage device: <hex_lun> is the hexadecimal format of the LUN and <d_lun> is a decimal representation.

    Lines that represent paths to the same storage device have both identical values for <hex_lun> and identical values for <d_lun>.

    If not all expected paths are shown, ensure that all hardware components are in place and set up to provide multiple paths for your multipath devices.

  2. If multiple paths are available, ensure that the multipath configuration file groups these paths correctly. In particular, ensure that the missing paths are not blacklisted. See the documentation that is provided with your distribution for details of this configuration file.

  3. Restart the multipath daemon, for example, by issuing:

    # /etc/init.d/multipathd restart

Reference

See the documentation that is provided with your distribution for more specific information about multipathing. Also see the man pages for the "multipath" and "lszfcp" commands.

Exception "too_many_failed_paths"

Severity

medium

Summary

These multipath devices have more than &param_failed_path_limit; failed paths: &failed_path_summ;

Explanation

In Linux, separate paths to the same physical device appear as separate devices. The Linux multipath tools aggregate such devices into a single multipath device. Through a correctly configured multipath setup, a Linux instance has two or more independent connections to the same storage device. This path redundancy can be used for load balancing and to maintain availability if one of the paths fails. Multipath setups with too many failed paths might not meet these goals.

To investigate your multipath configuration, use "multipath -ll".

The following multipath devices have more than &param_failed_path_limit; failed paths:

&failed_path_details;

Solution

Take these actions to investigate your multipath devices:

  1. List your storage devices to ensure that Linux has registered each expected path.

    For example, use "lszfcp -D" to list your SCSI devices. The command output consists of lines of the form

    <device_bus_id>/<wwpn>/<hex_lun> <h>:<c>:<id>:<d_lun>

    where each line represents a path to a storage device. Each line contains two representations of the LUN of the storage device: <hex_lun> is the hexadecimal format of the LUN and <d_lun> is a decimal representation.

    Lines that represent paths to the same storage device have both identical values for <hex_lun> and identical values for <d_lun>.

    If not all expected paths are shown, ensure that all hardware components are in place and set up to provide multiple paths for your multipath devices.

  2. If multiple paths are available, ensure that the multipath configuration file groups these paths correctly. In particular, ensure that the missing paths are not blacklisted. See the documentation that is provided with your distribution for details of this configuration file.

  3. Restart the multipath daemon, for example, by issuing:

    # /etc/init.d/multipathd restart

Reference

See the documentation that is provided with your distribution for more specific information about multipathing. Also see the man pages for the "multipath" and "lszfcp" commands.

55. Health check "storage_mp_service_active" (back to top)

Component

storage/multipath

Title

Verify that the multipath service starts automatically when the system launches

Description

This check verifies that the "multipathd" daemon is active and configured to start at every boot if multipath targets exist. If it is not configured to start at boot time, the "multipathd" service does not start automatically and cannot be used. The multipathd daemon is in charge of re-enabling failed paths. When a path recovers, the daemon reconfigures the multipath map that the path belongs to, so that the map regains its maximum performance and redundancy.

Dependencies

n/a

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "service_disabled"

Severity

high

Summary

The "multipathd" daemon is not configured to start automatically at every boot or reboot

Explanation

The "multipathd" daemon should be configured to start at every boot if multipath targets exist. If it is not configured to start at boot time, the "multipathd" service does not start automatically and cannot be used.

The multipathd daemon is in charge of re-enabling failed paths. When a path recovers, the daemon reconfigures the multipath map that the path belongs to, so that the map regains its maximum performance and redundancy.

To manually check whether the "multipathd" service is configured to start at boot time, use the following commands:

For Linux systems using chkconfig (for example: SLES11, RHEL6), issue:

# chkconfig --list | grep "multipathd"

Example output:

multipathd      0:off   1:off   2:off   3:on   4:off   5:off   6:off

This output shows that the "multipathd" service is on in runlevel 3.

For Linux systems using systemctl (for example: Fedora 19), issue:

# systemctl list-unit-files --type=service | grep "multipathd"

Example output:

multipathd.service                        enabled

This output shows that the "multipathd.service" unit is "enabled".
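The chkconfig-style output can be inspected mechanically; a sketch on the sample line shown above:

```shell
# Decide from a "chkconfig --list" style line whether multipathd is enabled
# in any runlevel (the substring ":on" marks an enabled runlevel).
line='multipathd      0:off   1:off   2:off   3:on   4:off   5:off   6:off'
case "$line" in
    *:on*) state=enabled ;;
    *)     state=disabled ;;
esac
echo "$state"   # prints "enabled"
```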

Solution

Configure the "multipathd" service to run at every boot or reboot.

For Linux systems using chkconfig (for example: SLES11, RHEL6), issue:

# chkconfig multipathd on

For Linux systems using systemctl (for example: Fedora 19), issue:

# systemctl enable multipathd.service

Reference

Refer to the man pages of the "runlevel", "chkconfig", and "systemctl" commands.

Exception "service_not_running"

Severity

high

Summary

The "multipathd" daemon is not running

Explanation

The multipathd daemon monitors paths and re-enables failed paths. When a path recovers, the multipath map is reconfigured to restore its maximum performance and redundancy.

To manually check whether the "multipathd" service is running, enter one of the following commands:

For Linux systems using chkconfig (for example: SLES11, RHEL6), issue:

# service multipathd status

For Linux systems using systemctl (for example: Fedora 19), issue:

# systemctl status multipathd.service

Solution

Start the "multipathd" service if it is in a stopped state.

For Linux systems using chkconfig (for example: SLES11, RHEL6), issue:

# service multipathd start

For Linux systems using systemctl (for example: Fedora 19), issue:

# systemctl start multipathd.service

Reference

Refer to the man pages of the "systemctl" and "service" commands.

56. Health check "storage_mp_zfcp_redundancy" (back to top)

Component

storage/multipath

Title

Check whether each disk is accessible via two or more host ports and two or more target ports (WWPNs)

Description

If all paths to a disk go through a single host or a single target port, this host port or target port is a single point of failure for the access to that device. Each disk in a multipath setup needs to be checked for a single point of failure.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Manik Bajpai <manibajp@in.ibm.com>

Exception "single_point_of_failure"

Severity

high

Summary

Some disks lack redundancy

Explanation

The following WWIDs represent disks that have a single point of failure:

WWID CHPID Device bus-ID WWPN

&spof_info_table;

NOTE: The "-" character in the preceding table indicates that no single point of failure was found for that particular field type.

If the single point of failure is affected, the corresponding disk becomes inaccessible.

To manually verify the problem, list all paths to a disk. Issue the command:

# multipath -l

This command lists all device names "/dev/sdxx" followed by their SCSI host, SCSI channel, SCSI target and SCSI LUN (HCTL). Each Linux device node (or SCSI HCTL) represents a path (consisting of host, port and LUN information). That path can be inspected with the command:

# lszfcp -D

None of the devices listed by the "multipath -l" command must depend on a host port or a target port that is a single point of failure. Therefore, check for two potential single points of failure:

  1. The single WWPN.

  2. The single CHPID.
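The two checks above can be sketched over "lszfcp -D" style output (the sample values are hypothetical): count the distinct host adapters (device bus IDs) and target ports (WWPNs) used by one LUN; a count of 1 in either column is a single point of failure.

```shell
# Field 1 is <device_bus_id>/<wwpn>/<hex_lun>; count distinct bus IDs and
# WWPNs across all paths of one LUN.
result=$(awk '{
    split($1, f, "/")
    if (!(f[1] in bus))  { bus[f[1]] = 1;  nb++ }
    if (!(f[2] in wwpn)) { wwpn[f[2]] = 1; nw++ }
} END { printf "bus_ids=%d wwpns=%d\n", nb, nw }' <<'EOF'
0.0.1900/0x500507630303c562/0x4010404900000000 0:0:0:1074806808
0.0.1900/0x500507630300c562/0x4010404900000000 1:0:0:1074806808
EOF
)
echo "$result"   # prints "bus_ids=1 wwpns=2"
```

In this sample, both paths use the same FCP device 0.0.1900, so the host adapter is a single point of failure even though two WWPNs are in use.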

Solution

Configure additional paths for all disks that are connected through a single point of failure. This might require additional hardware, for example, FCP channels or target ports (WWPNs).

Reference

For more information on configuring FCP devices and multipathing, see "Device Drivers, Features, and Commands".

You can obtain this publication from:

http://www.ibm.com/developerworks/linux/linux390/distribution_hints.html

57. Health check "tty_console_getty" (back to top)

Component

terminal

Title

Spot getty programs on the /dev/console device

Description

In Linux, /dev/console is a generic device node that, depending on the environment and setup, is mapped to one of the available terminal devices (TTY). This terminal device is then represented by its own, specific device node and by /dev/console. If getty programs are configured for both device nodes, they interfere with each other, so that users cannot log in.

Dependencies

n/a

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Exception "getty_on_console"

Severity

medium

Summary

A getty program runs on the /dev/console device

Explanation

The /dev/console device is a generic output device to which the Linux kernel writes messages. Depending on the environment and setup, /dev/console is mapped to one of the available terminal devices (TTY). This terminal device is then represented by its own, specific device node and by /dev/console.

You enable user logins by configuring a getty program for a terminal device node. If getty programs are configured for two device nodes that both map to the same terminal device, the getty programs interfere with each other, so that users cannot log in.

With the console= kernel parameter you can control to which terminal device /dev/console is mapped. With a getty program on /dev/console, changing this mapping can easily result in blocked user logins.

A process entry for "console" that runs a getty or login program indicates that a getty program is configured for /dev/console. On a running Linux instance, issue "ps -ef" to see details for the current processes.
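The "ps -ef" inspection can be sketched on sample data (the process lines below are hypothetical):

```shell
# Spot a getty program started on "console" in "ps -ef" style output
# (field 8 is the command, field 9 its first argument).
hit=$(awk '$8 ~ /getty/ && $9 == "console" { print $8, $9 }' <<'EOF'
root       1     0  0 10:00 ?        00:00:01 /sbin/init
root     812     1  0 10:00 ?        00:00:00 /sbin/agetty console 9600
root     813     1  0 10:00 ?        00:00:00 /sbin/agetty ttyS0 9600
EOF
)
echo "$hit"   # prints "/sbin/agetty console"
```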

Solution

Modify your system boot configuration to avoid starting a getty program on the /dev/console device. Instead, configure getty programs on terminal devices that are available in your environment.

Reference

See the documentation for your system initialization. For example, if your distribution uses the SysV approach, see the inittab man page, or if your distribution uses Upstart, see the init man page of section 5.

58. Health check "tty_console_log_level" (back to top)

Component

terminal

Title

Check for current console_loglevel

Description

This check examines whether an appropriate console_loglevel is set. The console_loglevel determines the severity of messages that go to the console. If an appropriate console_loglevel is not set, the user might miss important messages.

Dependencies

n/a

Authors

Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>

Parameter "log_level"

Description

Log level below which to raise an exception. Valid values are integers in the range 1 to 8.

The log levels are:

KERN_EMERG     0       /*system is unusable                    */
KERN_ALERT     1       /*action must be taken immediately      */
KERN_CRIT      2       /*critical conditions                   */
KERN_ERR       3       /*error conditions                      */
KERN_WARNING   4       /*warning conditions                    */
KERN_NOTICE    5       /*normal but significant condition      */
KERN_INFO      6       /*informational                         */
KERN_DEBUG     7       /*debug-level messages                  */

Default value

4

Exception "low_loglevel"

Severity

medium

Summary

The currently set console_loglevel is low (&console_loglevel;)

Explanation

When the console_loglevel is set to a low value, the user might miss important messages that need attention. Console messages also help the user understand what went wrong when a system crash occurs.

The current value of console_loglevel can be verified with the following command:

sysctl -a | grep printk

The output shows four values; the first value is the console_loglevel.

The conventional meanings of the log levels are:

KERN_EMERG     0       /*system is unusable                    */
KERN_ALERT     1       /*action must be taken immediately      */
KERN_CRIT      2       /*critical conditions                   */
KERN_ERR       3       /*error conditions                      */
KERN_WARNING   4       /*warning conditions                    */
KERN_NOTICE    5       /*normal but significant condition      */
KERN_INFO      6       /*informational                         */
KERN_DEBUG     7       /*debug-level messages                  */

console_loglevel: messages with a higher priority than the current console_loglevel are printed to the console. A lower number means a higher priority.

Solution

It is advisable to set the console_loglevel to at least 4 (KERN_WARNING) to receive important messages.

To set the console_loglevel persistently, add the following line to the /etc/sysctl.conf file:

kernel.printk = 4 4 1 7

This setting becomes active the next time the Linux instance is booted.

To temporarily set the console_loglevel, issue the following command:

sysctl -w kernel.printk="4 4 1 7"
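Extracting and judging the first of the four printk values can be sketched as follows (the value string is a sample; a real run would read /proc/sys/kernel/printk):

```shell
# The first of the four kernel.printk values is the console_loglevel;
# compare it against the check's default "log_level" parameter of 4.
printk="4 4 1 7"                 # sample value string
console_loglevel=${printk%% *}   # first field
if [ "$console_loglevel" -ge 4 ]; then
    verdict=ok
else
    verdict=low
fi
echo "console_loglevel=$console_loglevel ($verdict)"
```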

Reference

See the man pages of the "sysctl" command and of the "proc" filesystem.

59. Health check "tty_devnodes" (back to top)

Component

terminal

Title

Detect terminals with multiple device nodes

Description

This check detects terminals that are represented by more than one device node. If getty programs are configured for two device nodes that both map to the same terminal, the getty programs interfere with each other, so that users cannot log in.

Dependencies

n/a

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Exception "tty_has_multiple_nodes"

Severity

medium

Summary

One or more terminals map to multiple device nodes: &var_node_list;

Explanation

Device nodes for terminals are automatically created, for example, by udev. A device is identified by a major and minor device number that corresponds to the device driver and a device node through which programs can access the device.

The following terminals are accessible through multiple device nodes:

&var_node_table;

The standard node is based on the device name used by the terminal device driver. See /proc/tty/drivers for a mapping of standard nodes and terminal device drivers.

Using different device nodes to access the same terminal device might cause login failures, for example, if a getty program is started on multiple device nodes at the same time.

Solution

Only access the terminal through one device node. Use the node that matches the name used by the device driver.

Also check the installation and configuration of the Linux instance. For example, there can be static device nodes in addition to those created by udev. Also see /etc/inittab or Upstart jobs to verify the configuration of getty programs.

Reference

See "Device Drivers, Features, and Commands". You can find this publication at

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

60. Health check "tty_hvc_iucv" (back to top)

Component

terminal/hvc

Title

Confirm that all available z/VM IUCV HVC terminals are enabled for logins

Description

The z/VM IUCV Hypervisor Console (HVC) device driver can manage up to eight HVC terminals that can be enabled for user logins. The number of HVC terminals is specified through a kernel parameter.

HVC terminals that are not enabled for logins serve no purpose and cannot provide access to the Linux instance in emergencies.

This check confirms that all available HVC terminals are enabled for user logins.

Dependencies

sys_platform=s390 or sys_platform=s390x

sys_hypervisor=ZVM

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Parameter "min_hvc_iucv"

Description

Specifies the minimum number of HVC terminal devices that must be available. This is an integer in the range 1 to 8.

Default value

1

Exception "too_few_ttys"

Severity

low

Summary

The number of z/VM IUCV HVC terminals is below the required minimum

Explanation

The z/VM IUCV Hypervisor Console (HVC) device driver provides you with access to the Linux instance using the z/VM Inter-User Communication Vehicle (IUCV). With this setup, you can log in to the Linux instance with no external network connection.

The current setup has &hvc_iucv_avail; HVC terminal devices, which is below the required minimum of &param_min_hvc_iucv; devices.

HVC terminals that are not available cannot provide access to the Linux instance in emergencies.

Solution

Use the hvc_iucv= kernel parameter to increase the number of z/VM IUCV HVC terminals. Alternatively, reduce the min_hvc_iucv check parameter.
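As a sketch, a zipl.conf parameters line that requests four HVC terminals might look as follows (the root= value is a placeholder and the exact layout of the parameters line depends on your distribution):

```
parameters = "root=/dev/dasda1 hvc_iucv=4"
```

After changing the kernel parameters, rerun zipl and reboot for the change to take effect.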

Reference

For information about HVC terminals and how to set them up see:

  • "How to Set up a Terminal Server Environment"

  • "Device Drivers, Features, and Commands"

You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

See also the man pages for the "iucvconn" command and the "hvc_iucv" device driver.

Exception "unused_ttys"

Severity

medium

Summary

These z/VM IUCV terminals are not enabled for user logins: &hvc_short_list;

Explanation

The z/VM IUCV Hypervisor Console (HVC) device driver provides you with access to the Linux instance using the z/VM Inter-User Communication Vehicle (IUCV). With this setup, you can log in to the Linux instance with no external network connection.

The current setup has &num_hvc_iucv; HVC terminal devices that are managed by the z/VM IUCV HVC device driver. The following &num_hvc_req; HVC terminals are not enabled for user logins:

&hvc_dev_list;

HVC terminals that are not enabled for logins serve no purpose and cannot provide access to the Linux instance in emergencies.

Solution

Enable each HVC terminal for user logins by starting a getty program on the terminal device. Alternatively, you can use the hvc_iucv= kernel parameter to reduce the number of z/VM IUCV HVC terminals.
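For example, on a distribution that uses SysV init, an /etc/inittab entry that starts a getty on the first HVC terminal might look like this (the ID field, runlevels, and agetty options are examples; consult your distribution's documentation):

```
h0:2345:respawn:/sbin/agetty -L hvc0 9600 vt220
```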

Reference

For information about HVC terminals and how to set them up see:

  • "How to Set up a Terminal Server Environment"

  • "Device Drivers, Features, and Commands"

You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

See also the man pages for the "iucvconn" command and the "hvc_iucv" device driver.

61. Health check "tty_idle_terminals" (back to top)

Component

terminal

Title

Identify idle terminals

Description

Identify terminals on which users are logged in but are not active. Each logged-in user occupies a terminal that could be used by another user.

Dependencies

n/a

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Parameter "idle_time"

Description

Specifies the maximum idle time to be tolerated. Valid values are positive integers followed by d, h, m, or s for days, hours, minutes, or seconds.

If a user exceeds this idle time, an exception message is issued.

Default value

1d

Parameter "tty"

Description

A blank-separated list of terminals. The check identifies idle users who are logged in through the specified terminals. If the list is empty, all terminals are checked.

Terminals are specified by their device node without the leading /dev/. Use an asterisk (*) to match any string of characters. For example, "ttyS3 hvc*" matches /dev/ttyS3, /dev/hvc0, /dev/hvc1, ...

Default value

n/a

Exception "idle_ttys"

Severity

low

Summary

These terminals are idle: &short_list;

Explanation

One or more terminals are occupied by users who do not work with the system. These terminals are not available for logins by other users, for example, in emergencies.

On these terminals, the specified idle time, &param_idle_time;, has been exceeded:

&long_list;

You can run the "w" command to display user IDs and their idle times.
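The relevant columns of "w" can also be extracted with a short pipeline. The sample line below mimics typical procps output; field positions can vary between versions, so treat this as a sketch:

```shell
# Print user, terminal, and idle time from "w -h"-style output.
# Against a live system: w -h | awk '{ print $1, $2, $5 }'
sample='root     hvc0     192.0.2.1        09:00    2days  0.05s  0.02s -bash'
echo "$sample" | awk '{ print $1, $2, $5 }'   # → root hvc0 2days
```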

Solution

Log off the idle user IDs on their terminal.

Reference

See the man page of the "w" command.

62. Health check "tty_idle_users" (back to top)

Component

terminal

Title

Identify idle users

Description

Identify users who are logged in but are not active. Each logged-in user occupies a terminal that could otherwise be used by another user.

Dependencies

n/a

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Parameter "idle_time"

Description

Specifies the maximum idle time to be tolerated. Valid values are positive integers followed by d, h, m, or s for days, hours, minutes, or seconds.

If a user exceeds this idle time, an exception message is issued.

Default value

1d

Parameter "tty"

Description

A blank-separated list of terminals for which idle users are identified. Terminals are specified by their device node without the leading /dev/. If the list is empty, all terminals are checked.

Default value

n/a

Parameter "users"

Description

A blank-separated list of user IDs for which the idle times are checked. If the list is empty, all users are checked.

Default value

root

Exception "idle_users"

Severity

low

Summary

These users are idle: &short_list;

Explanation

One or more of the users who are logged in to the Linux instance do not work with the system. The terminals these users occupy are not available for logins by other users, for example, in emergencies.

For these users the idle time has exceeded the specified "&param_idle_time;":

&long_list;

You can run the "w" command to display user IDs and their idle times.

Solution

Log off the idle user IDs on their terminal.

Reference

See the man page of the "w" command.

63. Health check "tty_usage" (back to top)

Component

terminal

Title

Identify unused terminals (TTY)

Description

Verify that terminal (TTY) devices are used, for example, by login programs.

Terminal devices are intended to provide a user interface to a Linux instance. Without an associated program, a terminal device does not serve this purpose.

Dependencies

n/a

Authors

Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Parameter "exclude_tty"

Description

A list of blank-separated terminal devices to be exempt from this check, for example, because they are deliberately unused.

Terminals are specified by their device node without the leading /dev/. Use an asterisk (*) to match any string of characters. For example, "ttyS3 hvc*" excludes /dev/ttyS3, /dev/hvc0, /dev/hvc1, ...

Default value

tty

Exception "unused_ttys"

Severity

medium

Summary

These terminals are unused: &var_short_list;

Explanation

There are one or more unused terminal devices. Terminal devices are intended to provide a user interface to a Linux instance. Without an associated program, a terminal device does not serve this purpose.

These terminal devices are unused:

&var_tty_list;

To confirm that no program is configured for a terminal device, issue "ps -ef | grep <terminal>", where <terminal> is the terminal device node without the leading /dev/.
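As a sketch, the check can be looped over a list of candidate terminals with "ps -t", which lists the processes attached to a terminal (the terminal names below are examples):

```shell
# Report whether each candidate terminal has at least one attached process.
for tty in hvc0 hvc1 ttyS0; do
  if [ -n "$(ps -t "$tty" -o pid= 2>/dev/null)" ]; then
    echo "$tty: in use"
  else
    echo "$tty: unused"
  fi
done
```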

Solution

Configure a getty program for each unused terminal. Depending on your distribution, you might have to create an inittab entry or an Upstart job. For details, see the documentation that is provided with your distribution.

If you want to accept unused terminals, add them to the "exclude_tty" check parameter to suppress this warning in the future.

Reference

For general information about terminals, see "Device Drivers, Features, and Commands". You can obtain this publication from:

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

For more specific information, see the documentation that is provided with your distribution. Also see the man page of the "ps" command.

64. Health check "zfcp_hba_npiv_active" (back to top)

Component

zfcp/hba

Title

Check whether N_Port ID Virtualization (NPIV) is active for all eligible FCP devices

Description

This check identifies the FCP devices for which N_Port ID Virtualization is possible but not active. To use NPIV, the FCP devices must be attached to a switch. The switch must support NPIV and the System z type must be z9 or later.

With NPIV a single FCP port can register multiple worldwide port names (WWPN) with a fabric name server. Each registered WWPN is assigned a unique N_Port ID. NPIV requires more storage area network (SAN) resources because of the additional virtual WWPNs. Linux can save resources by only seeing a limited set of SCSI devices.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "no_npiv"

Severity

medium

Summary

The following FCP devices are not configured with NPIV: &adapter_summ;

Explanation

System z FCP channels require a FICON Express adapter. FCP channels can be shared by multiple LPARs. Each port on the adapter is assigned a permanent 64-bit WWPN by the manufacturer; this is used at Fabric Login (FLOGI).

Without the NPIV feature, each operating system image that has an FCP port is identified to the fabric by the permanent WWPN of the port. In this case, all operating system images have the same access rights in the fabric. The permanent WWPN of the port determines:

  • Zone membership for all images sharing the port

  • Logical Unit Number (LUN) access rights for all images sharing the port

With the NPIV feature, the Service Element (SE) creates new WWPNs for the FCP port at FLOGI. A unique WWPN is then assigned to each operating system image sharing the port. The generated NPIV WWPN is registered with the fabric switch and uniquely identifies each image for fabric zoning and LUN masking.

NPIV support is available on System z9 servers and later, with FICON Express 2 adapters and later.

The following FCP devices are not configured with NPIV:

Adapter port name

&adapter;

To manually check whether NPIV support is available, use the following commands.

To check if your System z hardware supports NPIV, issue:

# cat /proc/sysinfo

The output line "Type: <value>" describes the type of System z. For example, "2094" is the System z type z9.

To check whether the connected port supports NPIV, issue:

# lszfcp -a

See the values for "port_name" and "permanent_port_name" at each port. If both values are the same, NPIV is not enabled. If the values differ, NPIV is enabled.
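The comparison can also be scripted over the fc_host sysfs entries. The sysfs path and attribute names below are assumptions based on the standard Linux FC transport class; verify them on your system:

```shell
# For each FC host, equal port_name and permanent_port_name values mean
# that NPIV is not active for the corresponding FCP device.
for host in /sys/class/fc_host/host*; do
  [ -r "$host/port_name" ] || continue
  pn=$(cat "$host/port_name")
  ppn=$(cat "$host/permanent_port_name")
  if [ "$pn" = "$ppn" ]; then
    echo "${host##*/}: NPIV not active"
  else
    echo "${host##*/}: NPIV active"
  fi
done
```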

Solution

Enable NPIV on the SE for the corresponding CHPIDs of the LPAR and in the switch adjacent to the FCP device. If you are using zoning or LUN masking, ensure that the new NPIV-enabled FCP devices are handled correctly.

Reference

For more information see "redp4125: Introducing N_Port Identifier Virtualization for IBM System z9". You can obtain this publication from:

http://www.redbooks.ibm.com

65. Health check "zfcp_hba_recovery_failed" (back to top)

Component

zfcp/hba

Title

Check if FCP device recovery failed

Description

A failed FCP device recovery indicates a problem with the FCP device itself. As a result, none of the resources attached through this FCP device are available. This health check detects FCP devices for which a recovery was initiated and failed.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Manik Bajpai <manibajp@in.ibm.com>

Exception "hba_not_usable"

Severity

high

Summary

The recovery of these FCP devices failed: &failed_fcp_devices;

Explanation

The FCP devices with the following bus-IDs are unable to recover from failures and cannot be used:

Device bus-ID

&failed_fcp_devices_table;

To manually verify whether the problem still exists, read the content of this file:

# cat /sys/bus/ccw/drivers/zfcp/<device_bus_id>/failed

An output of "1" signifies that the problem still exists. An output of "0" signifies that the problem does not exist.

Solution

If the recovery failed, please perform these manual steps:

  1. Check the previous error kernel messages for the same FCP device to find the cause of the problem.

  2. If the recovery failed, write "0" to the failed attribute. Issue the following command at the shell prompt:

    # echo '0' > /sys/bus/ccw/drivers/zfcp/<device_bus_id>/failed
  3. Wait for 5 seconds.

  4. Issue the following command to ensure that all udev events are processed:

    # udevadm settle
  5. Check the value of the failed attribute. For example, issue:

    # cat /sys/bus/ccw/drivers/zfcp/<device_bus_id>/failed

    If the value is "1", the recovery failed again because the root cause resolution in step 1 was not sufficient.
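The steps above can be sketched as a small script (the device bus-ID is a placeholder; substitute your own and run as root):

```shell
dev=0.0.3c00                                   # placeholder device bus-ID
attr=/sys/bus/ccw/drivers/zfcp/$dev/failed
echo 0 > "$attr"     # step 2: reset the failed attribute
sleep 5              # step 3: give the recovery time to run
udevadm settle       # step 4: wait until all udev events are processed
cat "$attr"          # step 5: "1" means the recovery failed again
```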

Reference

For more information about the recovery of a failed FCP device, see "Device Drivers, Features, and Commands". You can obtain this publication from:

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

For kernel messages see "Kernel Messages". You can obtain this publication from:

http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/topic/com.ibm.linux.l0kmsg.doc/l0km_plugin_top.html

66. Health check "zfcp_hba_shared_chpids" (back to top)

Component

zfcp/hba

Title

Identify FCP devices that share channel-path identifiers (CHPIDs)

Description

A single FCP channel can be represented inside Linux by more than one CCW device. Such a configuration is possible but does not increase availability or I/O performance. In addition, extra FCP CCW devices waste FCP channel resources, especially in an N_Port ID Virtualization (NPIV) setup, and can cause hardware limits, such as the maximum number of open ports or open LUNs, to be reached faster.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Manik Bajpai <manibajp@in.ibm.com>

Parameter "check_offline"

Description

This parameter determines whether the check is extended to offline FCP devices. Set it to 1 to also check offline FCP devices, or to 0 to exclude offline devices from checking.

Default value

0

Exception "single_chpid"

Severity

high

Summary

There are multiple FCP devices using the same CHPID: &shared_hbas_info;

Explanation

The following FCP devices share the same CHPID:

CHPID Device bus-IDs

&shared_hbas_table;

This can cause a variety of issues:

  • FCP devices that use the same CHPID defeat the purpose of multipathing: a failure of this CHPID interrupts access to all devices attached through the FCP devices that use it.

  • Extra FCP CCW devices waste FCP channel resources and can cause hardware limits, such as the maximum number of open ports or open LUNs, to be reached faster.

To manually identify which FCP devices share the same CHPID, execute the following command:

# lscss -t 1732/03,1732/04

The following sample output shows all configured FCP devices, regardless of whether they are currently online:

Device   Subchan.  DevType CU Type Use  PIM PAM POM  CHPIDs
----------------------------------------------------------------------
0.0.3c00 0.0.0015  1732/03 1731/03 yes  80  80  ff   36000000 00000000
0.0.3c01 0.0.0016  1732/03 1731/03 yes  80  80  ff   36000000 00000000
0.0.3d00 0.0.0017  1732/03 1731/03 yes  80  80  ff   37000000 00000000
0.0.3d01 0.0.0018  1732/03 1731/03 yes  80  80  ff   37000000 00000000
0.0.3d02 0.0.0019  1732/03 1731/03      80  80  ff   37000000 00000000

This example shows that the FCP devices 0.0.3c00 and 0.0.3c01 share CHPID 36, and the FCP devices 0.0.3d00, 0.0.3d01, and 0.0.3d02 share CHPID 37.
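The grouping can be automated with a short pipeline over the lscss output shown above. The awk field positions assume the sample layout, where the CHPID is the first two hex digits of the next-to-last column; treat this as a sketch:

```shell
# Print "<chpid> <device>" pairs sorted by CHPID; a CHPID that appears
# more than once indicates FCP devices sharing a channel path.
lscss -t 1732/03,1732/04 |
  awk 'NR > 2 { print substr($(NF-1), 1, 2), $1 }' |
  sort
```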

Solution

The I/O configuration of the Linux system must be changed so that each FCP channel is represented by only one CCW device. This change is made outside of Linux, with the LPAR or z/VM I/O configuration tools.

Reference

For more details on the lscss command, see the "lscss" man page.

67. Health check "zfcp_lun_configured_available" (back to top)

Component

zfcp/lun

Title

Ensure that all LUNs configured for persistence, which are accessible through an FCP adapter, are available

Description

Users configure LUNs when they intend to use them for I/O. If a configured LUN is not available, this indicates either a configuration error or a loss of connectivity.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Manik Bajpai <manibajp@in.ibm.com>

Exception "lun_unavailable"

Severity

high

Summary

One or more LUNs that are configured at boot time for persistence are not available

Explanation

Certain zfcp-attached SCSI devices that are configured to be available across reboots are not available. These devices are:

Device bus-ID WWPN LUN

&missing_disk_table;

Possible reasons that the LUNs are not available can be:

  1. A problem in the configuration

  2. A loss of connectivity

  3. Unsupported syntax in the persistent configuration used

To manually verify the problem, perform the following steps:

On SUSE Linux Enterprise Server:

  1. Check the corresponding "udev" rules.

  2. Read the content of the files named "51-zfcp-<device_bus_id>.rules" in the "/etc/udev/rules.d/" directory.

The structure of the content of the udev rule (wrapped for readability) is demonstrated below:

ACTION=="add", KERNEL=="rport-*", ATTR{port_name}=="<wwpn>",
    SUBSYSTEMS=="ccw", KERNELS=="<device_bus_id>",
    ATTR{[ccw/<device_bus_id>]<wwpn>/unit_add}="<fcp_lun>"

Example for the device bus-ID=0.0.3c00 (wrapped for readability)

/etc/udev/rules.d/51-zfcp-0.0.3c00.rules:
...
ACTION=="add", KERNEL=="rport-*",
    ATTR{port_name}=="0x500507630503c1ae", SUBSYSTEMS=="ccw",
    KERNELS=="0.0.3c00", ATTR{[ccw/0.0.3c00]0x500507630503c1ae/
    unit_add}="0x4020406000000000"
ACTION=="add", KERNEL=="rport-*",
    ATTR{port_name}=="0x500507630503c1ae", SUBSYSTEMS=="ccw",
    KERNELS=="0.0.3c00", ATTR{[ccw/0.0.3c00]0x500507630503c1ae/
    unit_add}="0x4020407000000000"
ACTION=="add", KERNEL=="rport-*",
    ATTR{port_name}=="0x500507630503c1ae", SUBSYSTEMS=="ccw",
    KERNELS=="0.0.3c00", ATTR{[ccw/0.0.3c00]0x500507630503c1ae/
    unit_add}="0x402040b600000000"
ACTION=="add", KERNEL=="rport-*",
    ATTR{port_name}=="0x500507630503c1ae", SUBSYSTEMS=="ccw",
    KERNELS=="0.0.3c00", ATTR{[ccw/0.0.3c00]0x500507630503c1ae/
    unit_add}="0x402040d600000000"

Example for the device bus-ID=0.0.3d18 (wrapped for readability)

/etc/udev/rules.d/51-zfcp-0.0.3d18.rules:
...
ACTION=="add", KERNEL=="rport-*",
    ATTR{port_name}=="0x500507630508c1ae", SUBSYSTEMS=="ccw",
    KERNELS=="0.0.3d18", ATTR{[ccw/0.0.3d18]0x500507630508c1ae/
    unit_add}="0x4020406000000000"
ACTION=="add", KERNEL=="rport-*",
    ATTR{port_name}=="0x500507630508c1ae", SUBSYSTEMS=="ccw",
    KERNELS=="0.0.3d18", ATTR{[ccw/0.0.3d18]0x500507630508c1ae/
    unit_add}="0x4020407000000000"
ACTION=="add", KERNEL=="rport-*",
    ATTR{port_name}=="0x500507630508c1ae", SUBSYSTEMS=="ccw",
    KERNELS=="0.0.3d18", ATTR{[ccw/0.0.3d18]0x500507630508c1ae/
    unit_add}="0x402040b600000000"
ACTION=="add", KERNEL=="rport-*",
    ATTR{port_name}=="0x500507630508c1ae", SUBSYSTEMS=="ccw",
    KERNELS=="0.0.3d18", ATTR{[ccw/0.0.3d18]0x500507630508c1ae/
    unit_add}="0x402040d600000000"

Note: If you want to use the auto LUN scan feature (available since SLES 11 SP2) but the LUNs are not available as expected, SLES 11 requires the kernel parameter

zfcp.allow_lun_scan=1

However, this check does not monitor the auto LUN scan feature.

See also:

http://www.novell.com/support/kb/doc.php?id=7012700

On Red Hat Enterprise Linux:

For zfcp-attached SCSI devices that are not required to mount the root filesystem, for example, data volumes, tape drives, or tape libraries, check the output of this command:

# cat /etc/zfcp.conf

The structure of the output is demonstrated below:

<device_bus_id> <wwpn>  <fcp_lun>

For example:

0.0.3c00 0x500507630503c1ae  0x4020406000000000
0.0.3c00 0x500507630503c1ae  0x4020407000000000
0.0.3d18 0x500507630508c1ae  0x4020406000000000
0.0.3d18 0x500507630508c1ae  0x4020407000000000

For zfcp-attached SCSI disks that are required to mount the root filesystem, and only those, rd_ZFCP entries are used with Red Hat Enterprise Linux 6 as part of the kernel parameters, for example, in "zipl.conf". To find the list of disks that are meant to be persistent, see the content of "/proc/cmdline". The output, a single line that is wrapped here for readability, looks like this:

root=/dev/mapper/vg_devel1-lv_root rd_ZFCP=0.0.3c00,0x500507630503c1ae,
       0x402040b600000000 rd_ZFCP=0.0.3c00,0x500507630503c1ae,
       0x402040d600000000 rd_ZFCP=0.0.3d18,0x500507630508c1ae,
       0x402040b600000000 rd_ZFCP=0.0.3d18,0x500507630508c1ae,
       0x402040d600000000 rd_LVM_LV=vg_devel1/lv_root rd_NO_LUKS
       rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16
       KEYTABLE=us cio_ignore=all,!0.0.0009

On either system, SUSE Linux Enterprise Server or Red Hat Enterprise Linux, check the output of the command "lszfcp -D" and look at the first column of the output.

0.0.3c00/0x500507630503c1ae/0x4020406000000000
0.0.3c00/0x500507630503c1ae/0x4020407000000000
0.0.3c00/0x500507630503c1ae/0x402040d600000000
0.0.3d18/0x500507630508c1ae/0x4020406000000000
0.0.3d18/0x500507630508c1ae/0x402040b600000000
0.0.3d18/0x500507630508c1ae/0x402040d600000000

The list shows all available zfcp-attached SCSI devices. A comparison of this list with the persistent configuration in the preceding example shows that the SCSI devices with the parameters <device_bus_id/wwpn/lun>=<0.0.3d18/0x500507630508c1ae/0x4020407000000000> and <0.0.3c00/0x500507630503c1ae/0x402040b600000000> are not available.
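Under the assumption of a RHEL-style /etc/zfcp.conf, the comparison can be scripted. The pipeline below normalizes both lists to <device_bus_id>/<wwpn>/<lun> triples and prints entries that are configured but not reported by "lszfcp -D" (a sketch, not part of the health checker itself):

```shell
# Normalize the persistent configuration and the available devices, then
# show configured triples that are not currently available.
awk '!/^#/ && NF == 3 { print $1 "/" $2 "/" $3 }' /etc/zfcp.conf |
  sort > /tmp/configured
lszfcp -D | awk '{ print $1 }' | sort > /tmp/available
comm -23 /tmp/configured /tmp/available   # configured but not available
```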

Solution

If a LUN is configured but not available, check whether the correct WWPN and LUN were used. If they are correct, check the storage area network (SAN) zoning and the storage server definitions and state. Review the "syslog" messages and the command history for troubleshooting.

Reference

For more information about the configuration of FCP LUNs and troubleshooting, see "Device Drivers, Features, and Commands". You can obtain this publication from:

http://www.ibm.com/developerworks/linux/linux390/distribution_hints.html

68. Health check "zfcp_lun_recovery_failed" (back to top)

Component

zfcp/lun

Title

Identify if recovery of a zFCP LUN failed

Description

A failed zFCP LUN recovery indicates a problem in the storage device. As a result, the device is not available for I/O.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Exception "lun_not_usable"

Severity

high

Summary

There are unusable zFCP LUNs: &lun_summ;

Explanation

Some zFCP LUNs are in a state that indicates problems while trying to recover from an error.

The following zFCP LUNs indicate that recovery failed:

zFCP LUN Device_ID WWPN Failed

&lun;

To manually check the state of a zFCP LUN, check the value of the associated "failed" sysfs attribute.

For example, issue:

# cd /sys/bus/ccw/drivers/zfcp/<device_bus_id>/<wwpn>/
# cat <lun>/failed

where

  • <device_bus_id> - is the device bus-ID that looks like x.x.xxxx

  • <wwpn> - is the world wide port number that looks like 0x<16 digits>

  • <lun> - is the logical unit number that looks like 0x<16 digits>
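All failed LUNs can also be enumerated in one loop over the sysfs hierarchy described above (a sketch; the glob assumes the usual zfcp sysfs layout with 0x-prefixed WWPN and LUN directories):

```shell
# Print the sysfs path of every zfcp LUN whose "failed" attribute is 1.
for f in /sys/bus/ccw/drivers/zfcp/*/0x*/0x*/failed; do
  [ -r "$f" ] || continue
  [ "$(cat "$f")" = "1" ] && echo "recovery failed: $f"
done
```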

Solution

Manually trigger LUN recovery by echoing "0" into the corresponding "failed" sysfs attribute.

For example, issue:

# cd /sys/bus/ccw/drivers/zfcp/<device_bus_id>/<wwpn>/
# echo 0 > <lun>/failed

Search for storage server errors and resolve them by using the documentation of your storage server.

Reference

For more information about recovering a failed FCP device, see "Device Drivers, Features, and Commands". You can obtain this publication from

http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html

69. Health check "zfcp_target_port_recovery_failed" (back to top)

Component

zfcp/target port

Title

Check if the recovery of a target port failed

Description

A failed target port recovery indicates a problem in the FCP channel, the fibre channel fabric or link, or the storage server. As a result, none of the LUNs attached through this port are available. This health check detects target ports for which the zfcp device driver initiated a recovery that failed.

Dependencies

sys_platform=s390 or sys_platform=s390x

Authors

Manik Bajpai <manibajp@in.ibm.com>

Exception "port_not_usable"

Severity

high

Summary

The recovery of these target ports failed: &failed_ports_summ;

Explanation

The target ports with the following device bus-IDs and worldwide port names (WWPN) are unable to recover from failures and cannot be used:

Device bus-ID WWPN

&failed_ports_table;

To manually verify if the problem still exists, read the content of this file:

# cat /sys/bus/ccw/drivers/zfcp/<device_bus_id>/<wwpn>/failed

An output of "1" signifies that the problem still exists. An output of "0" signifies that the problem does not exist.

Solution

If the recovery failed, please perform these manual steps:

  1. Check for zfcp kernel messages and the FCP hardware.

  2. Verify that the WWPN is correct.

  3. Check the fibre channel fabric for errors related to the WWPN.

  4. Check the storage target for failed port login attempts.

  5. If the root cause is resolved, write "0" to the failed attribute. Issue the following command at the shell prompt:

    # echo 0 > /sys/bus/ccw/drivers/zfcp/<device_bus_id>/<wwpn>/failed
  6. Wait for 5 seconds.

  7. Issue the following command to ensure that all udev events have been processed:

    # udevadm settle
  8. Check the value of the failed attribute. For example, issue:

    # cat /sys/bus/ccw/drivers/zfcp/<device_bus_id>/<wwpn>/failed

    If the value is "1", the recovery failed again because the root cause resolution in steps 1-4 was not sufficient.

Reference

For more information about the recovery of a failed target port, see "Device Drivers, Features, and Commands". You can obtain this publication from:

http://www.ibm.com/developerworks/linux/linux390/distribution_hints.html

For kernel messages see "Kernel Messages". You can obtain this publication from:

http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/topic/com.ibm.linux.l0kmsg.doc/l0km_plugin_top.html

70. Health check "zvm_priv_class" (back to top)

Component

zvm

Title

Check the privilege classes of the z/VM guest virtual machine on which the Linux instance runs

Description

This check examines the z/VM privilege classes of the current z/VM guest virtual machine and compares them with the permitted privilege classes. The permitted privilege classes are provided by the permitted_privclass parameter.

Higher privilege classes than the permitted ones might allow operations which can inadvertently or maliciously affect the security and availability of other z/VM guest virtual machines running in the same z/VM instance. Generally, higher privilege classes should be assigned sparingly and only to trusted z/VM user IDs.

Dependencies

sys_platform=s390 or sys_platform=s390x

sys_hypervisor=ZVM

Authors

Nageswara R Sastry <nasastry@in.ibm.com>

Parameter "check_for"

Description

Privilege classes to check: privilege classes effective at run-time (currently), privilege classes permanently defined in the user directory (directory), or both (currently, directory).

Default value

Currently, Directory

Parameter "permitted_privclass"

Description

Privilege classes permitted for z/VM guest virtual machines. Valid values are lists of letters in the range A to Z and integers in the range 1 to 6.

Example:

ABCD12

Default value

G

Exception "default_privileges_exceeded"

Severity

medium

Summary

The privilege classes '&sum_dir_extrapriv;', which are permanently defined in the z/VM user directory of the z/VM guest, exceed the maximum defined permission '&param_permitted_privclass;'.

Explanation

The privilege classes of the currently active z/VM guest virtual machine, which are permanently defined in the z/VM user directory, exceed the maximum defined permission. Higher privilege classes than the permitted ones might allow operations which can inadvertently or maliciously affect the security and availability of other z/VM guest virtual machines running in the same z/VM instance.

The following entry has higher privileges: Directory: &dir_extrapriv;

To verify the privileges, perform these steps from Linux:

  1. Load the kernel module named "vmcp", if not already loaded or built-in:

     modprobe vmcp
  2. Query the privilege class:

     vmcp q privclass

From the console of the z/VM guest virtual machine, query the privilege class:

 cp q privclass
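The comparison against the permitted classes can be scripted. The "Currently:" line format is an assumption based on typical "q privclass" output, so verify it on your system before relying on this sketch:

```shell
# Extract the currently active privilege classes and report any class
# letters beyond the permitted set.
permitted=G                                         # permitted classes
current=$(vmcp q privclass | awk '/Currently:/ { print $2 }')
extra=$(echo "$current" | tr -d "$permitted")
[ -n "$extra" ] && echo "classes exceeding permitted set: $extra"
```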

Solution

To change the privilege class in the z/VM user directory, either edit the user privilege class entries manually, or, if DirMaint is installed, use the "DirMaint" commands to modify the privilege classes.

Reference

For more information about privilege classes, see "z/VM: CP Commands and Utilities Reference". For more information about modifying privilege class entries in the z/VM user directory, see "z/VM: CP Planning and Administration" and the "Directory Maintenance Facility for z/VM" library. You can obtain these publications from

http://www.vm.ibm.com/pubs/

and

http://www.vm.ibm.com/library/

Exception "running_privileges_exceeded"

Severity

medium

Summary

The currently active privilege classes '&sum_cur_extrapriv;' exceed the maximum defined permission '&param_permitted_privclass;'.

Explanation

The run-time privilege classes of the currently active z/VM guest virtual machine exceed the maximum defined permission. Higher privilege classes than the permitted ones might allow operations which can inadvertently or maliciously affect the security and availability of other z/VM guest virtual machines running in the same z/VM instance.

The following entry has higher privileges: Currently: &cur_extrapriv;

To verify the privileges, perform these steps from Linux:

  1. Load the kernel module named "vmcp", if not already loaded or built-in:

     modprobe vmcp
  2. Query the privilege class:

     vmcp q privclass

From the console of the z/VM guest virtual machine, query the privilege class:

 cp q privclass

Solution

Use the z/VM "SET PRIVCLASS" command to assign a permitted privilege class to the z/VM guest virtual machine. Generally, the privilege class should not be higher than class G.

Reference

For more information about privilege classes, see "z/VM: CP Commands and Utilities Reference". You can obtain this publication from

http://www.vm.ibm.com/pubs/