Cisco Prime Infrastructure: troubleshooting a full /opt partition

I recently ran into a case where a full disk on Cisco PI caused the PI services to stop. The details are as follows:

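The platform runs on VMware. After the on-site engineer hard-rebooted the PI platform, the web page could no longer be logged in to, although command-line login still worked. Checking the PI services from the command line showed the following: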

admin/admin# show application status NCS

Health Monitor is running, with an error. ( [Role] Primary [State] HA not Configured )
initHealthMonitor(): can not start DB
Database server is stopped
Ftp Server is running
Tftp Server is running
Matlab Server is running
Matlab Server Instance 1 is running
NMS Server is stopped.
CNS Gateway with port 11011 is down
CNS Gateway SSL with port 11012 is down
CNS Gateway with port 11013 is down
CNS Gateway SSL with port 11014 is down
Plug and Play Gateway Broker with port 61617 is down
Plug and Play Gateway config, image and resource are down on https
Plug and Play Gateway is stopped.
SAM Daemon is stopped.
DA Daemon is stopped.

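As shown above, the PI database and the services that depend on it had not started properly. I tried ncs stop followed by ncs start, but that still did not bring the services up.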

admin/admin# ncs stop

Stopping Prime Infrastructure...

This may take a few minutes...
Database is not running.
Stopping remoting: Ftp Server
Remoting 'Ftp Server' stopped successfully.
Stopping remoting: Tftp Server
Remoting 'Tftp Server' stopped successfully.
Stopping remoting: Matlab Server
Remoting 'Matlab Server' stopped successfully.
Stopping remoting: Matlab Server Instance 1
Remoting 'Matlab Server Instance 1' stopped successfully.
NMS Server is not running!.

Prime Infrastructure successfully shutdown.
Plug and Play Gateway is being shut down..... Please wait!!!

Stop of Plug and Play Gateway Completed!!
SAM daemon process id does not exist
DA daemon process id does not exist
Stopping strongSwan IPsec...

admin/admin# ncs start

Starting Prime Infrastructure...

This may take a while (10 minutes or more) ...

Failure during Prime Infrastructure startup. Check launchout.log for details.

Starting strongSwan 5.0.1 IPsec [starter]...
Completed in 64 seconds

admin/admin# show application status NCS

Health Monitor is running, with an error. ( [Role] Primary [State] HA not Configured )
initHealthMonitor(): can not start DB
Database server is stopped
Ftp Server is running
Tftp Server is running
Matlab Server is running
Matlab Server Instance 1 is Stopped
NMS Server is stopped.
CNS Gateway with port 11011 is down
CNS Gateway SSL with port 11012 is down
CNS Gateway with port 11013 is down
CNS Gateway SSL with port 11014 is down
Plug and Play Gateway Broker with port 61617 is down
Plug and Play Gateway config, image and resource are down on https
Plug and Play Gateway is stopped.
SAM Daemon is stopped.
DA Daemon is stopped.

I then checked disk usage on the PI server and found that /opt was completely full.

admin/admin# show disks

temp. space 2% used (36316 of 1967952)
disk: 22% used (9067892 of 44628400)

Internal filesystems:
warning - /opt is 100% used (198161376 of 209463592)
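
The warning line above is easy to miss in a long session. As a minimal sketch, the Python snippet below parses text in the format printed by show disks and flags anything at or above a usage threshold. The function names, the regular expression, and the 90% threshold are illustrative assumptions and not part of PI.

```python
import re

# Matches usage lines such as:
#   temp. space 2% used (36316 of 1967952)
#   disk: 22% used (9067892 of 44628400)
#   warning - /opt is 100% used (198161376 of 209463592)
USAGE_RE = re.compile(
    r"^(?:warning\s+\S+\s+)?(?P<name>.+?):?\s+(?:is\s+)?"
    r"(?P<pct>\d+)% used \((?P<used>\d+) of (?P<total>\d+)\)"
)


def parse_show_disks(text):
    """Return a list of (name, percent_used, used, total) tuples."""
    rows = []
    for line in text.splitlines():
        m = USAGE_RE.match(line.strip())
        if m:
            rows.append(
                (m["name"], int(m["pct"]), int(m["used"]), int(m["total"]))
            )
    return rows


def over_threshold(rows, limit=90):
    """Names of filesystems whose reported usage is at or above limit."""
    return [name for name, pct, _, _ in rows if pct >= limit]


SAMPLE = """\
temp. space 2% used (36316 of 1967952)
disk: 22% used (9067892 of 44628400)

Internal filesystems:
warning - /opt is 100% used (198161376 of 209463592)
"""

rows = parse_show_disks(SAMPLE)
for name, pct, used, total in rows:
    print(f"{name}: {pct}% ({used} of {total})")
print("over threshold:", over_threshold(rows))
```

Feeding it captured show disks output on a schedule would be one way to notice /opt filling up before the database refuses to start; how the output is collected is left open here.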

admin/admin# dir

Directory of disk:/

20 Sep 20 2015 16:02:59 crash
4096 Aug 10 2016 03:30:00 defaultRepo/
4096 Apr 15 2015 14:46:39 ftp/
16384 Dec 12 2014 07:46:41 lost+found/
4096 Apr 15 2015 14:57:21 sftp/
4096 Apr 15 2015 14:46:35 ssh/
4096 Apr 15 2015 14:46:35 telnet/
4096 Feb 08 2017 22:03:28 tftp/

Usage for disk: filesystem
9285521408 bytes total used
34055086080 bytes free
45699481600 bytes available

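According to the official documentation, this partition holds the database logs, and ncs cleanup can be used to clear out old logs and backup records. The session was as follows: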

admin/admin# ncs cleanup
***************************************************************************
!!!!!!! WARNING !!!!!!!
***************************************************************************
The clean up can remove all files located in the backup staging directory.
Older log files will be removed and other types of older debug information
will be removed
***************************************************************************
Do you wish to continue? ([NO]/yes) yes
***************************************************************************
!!!!!!! DATABASE CLEANUP WARNING !!!!!!!
***************************************************************************
Cleaning up database will stop the server while the cleanup is performed.
The operation can take several minutes to complete
***************************************************************************
Do you wish to cleanup database? ([NO]/yes) yes
***************************************************************************
!!!!!!! USER LOCAL DISK WARNING !!!!!!!
***************************************************************************
Cleaning user local disk will remove all locally saved reports, locally
backed up device configurations. All files in the local FTP and TFTP
directories will be removed.
***************************************************************************
Do you wish to cleanup user local disk? ([NO]/yes) yes
===================================================
Starting Cleanup: Tue Feb 21 19:41:41 CST 2017
===================================================
{Tue Feb 21 19:41:45 CST 2017} Removing all files in backup staging directory
{Tue Feb 21 19:41:45 CST 2017} Removing all Matlab core related files
{Tue Feb 21 19:41:45 CST 2017} Removing all older log files
{Tue Feb 21 19:41:53 CST 2017} Cleaning older archive logs
{Tue Feb 21 19:42:02 CST 2017} Cleaning database backup and all archive logs
{Tue Feb 21 19:42:02 CST 2017} Cleaning older database trace files
{Tue Feb 21 19:42:02 CST 2017} Removing all user local disk files
{Tue Feb 21 19:42:17 CST 2017} Cleaning database
{Tue Feb 21 19:42:21 CST 2017} Stopping database
{Tue Feb 21 19:42:22 CST 2017} Starting database
{Tue Feb 21 19:42:59 CST 2017} Starting database clean
{Tue Feb 21 19:42:59 CST 2017} Completed database clean
{Tue Feb 21 19:42:59 CST 2017} Stopping database
===================================================
Completed Cleanup
Start Time: Tue Feb 21 19:41:41 CST 2017
Completed Time: Tue Feb 21 19:43:21 CST 2017
===================================================

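After clearing out the old logs and backup records and restarting the services, everything was running again. However, /opt was still short on space. This PI manages a fairly large number of devices, so it would probably fill up again within a few days.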

admin/admin# show disks

temp. space 2% used (36392 of 1967952)
disk: 1% used (184360 of 44628400)

Internal filesystems:
warning - /opt is 97% used (191569444 of 209463592)
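
To put the two show disks readings for /opt side by side, the following sketch works out how much the cleanup actually freed. It assumes the figures are 1 KiB blocks. It also assumes, as a guess that was not checked against PI documentation, that the reported percentage is computed df-style against the space left after a 5% reserved-block allowance and rounded up, which would explain why 198161376 of 209463592 shows as 100% rather than about 95%.

```python
import math

KIB_PER_GIB = 1024 ** 2

# Figures taken from the two "show disks" outputs for /opt.
total_kib = 209463592
used_before_kib = 198161376
used_after_kib = 191569444

freed_kib = used_before_kib - used_after_kib
freed_gib = freed_kib / KIB_PER_GIB
free_after_gib = (total_kib - used_after_kib) / KIB_PER_GIB

# Assumption: 5% of the filesystem is reserved and excluded from the
# denominator, and the percentage is rounded up (as df does).
RESERVED_FRACTION = 0.05


def reported_percent(used_kib, total_kib, reserved=RESERVED_FRACTION):
    usable_kib = total_kib * (1 - reserved)
    return min(100, math.ceil(100 * used_kib / usable_kib))


print(f"freed by cleanup: {freed_kib} KiB ({freed_gib:.2f} GiB)")
print(f"raw free space after cleanup: {free_after_gib:.2f} GiB")
print("modelled % before:", reported_percent(used_before_kib, total_kib))
print("modelled % after: ", reported_percent(used_after_kib, total_kib))
```

Under these assumptions the cleanup freed a little over 6 GiB on a volume of roughly 200 GiB, which is consistent with the conclusion that cleanup alone would not hold for long.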

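In the end there was no choice but to add disk space. I first shut PI down with halt, then added a disk in VMware (after the disk was attached, the new space was merged in automatically when the VM booted). The added space is split 85% to 15%: 85% goes to /opt and 15% goes to the other directories.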

admin/admin# halt
Save the current ADE-OS running configuration? (yes/no) [yes] ? yes
Generating configuration...
Saved the ADE-OS running configuration to startup successfully
Continue with shutdown? [y/n] yes

Broadcast message from root (pts/2) (Tue Feb 21 20:31:23 2017):

The system is going down for system halt NOW!
Server is shutting down...

After adding the disk, I checked disk usage again and there was now enough space.

admin/admin# show disks

temp. space 2% used (36344 of 1967952)
disk: 1% used (188276 of 90305368)

Internal filesystems:
all internal filesystems have sufficient free space

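I then logged in to the underlying Linux as root to check disk usage per directory. (Note: the root account is disabled by default in PI and must be enabled before you can su to root.)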

ade # fdisk -l

Disk /dev/sda: 322.1 GB, 322122547200 bytes
255 heads, 63 sectors/track, 39162 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sda1 * 1 64 512000 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 64 77 102400 83 Linux
Partition 2 does not end on cylinder boundary.
/dev/sda3 77 38245 306584576 8e Linux LVM

Disk /dev/sdb: 322.1 GB, 322122547200 bytes
255 heads, 63 sectors/track, 39162 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sdb1 1 39163 314571776 83 Linux

ade # df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/smosvg-rootvol
3.8G 461M 3.2G 13% /
/dev/mapper/smosvg-varvol
3.8G 505M 3.1G 14% /var
/dev/mapper/smosvg-optvol
447G 183G 241G 44% /opt
/dev/mapper/smosvg-tmpvol
1.9G 36M 1.8G 2% /tmp
/dev/mapper/smosvg-usrvol
6.6G 1.2G 5.2G 19% /usr
/dev/mapper/smosvg-recvol
93M 5.6M 83M 7% /recovery
/dev/mapper/smosvg-home
93M 5.6M 83M 7% /home
/dev/mapper/smosvg-storeddatavol
9.5G 151M 8.9G 2% /storeddata
/dev/mapper/smosvg-altrootvol
93M 5.6M 83M 7% /altroot
/dev/mapper/smosvg-localdiskvol
87G 184M 82G 1% /localdisk
/dev/sda2 97M 5.6M 87M 7% /storedconfig
/dev/sda1 485M 25M 435M 6% /boot
tmpfs 5.9G 1.2G 4.7G 20% /dev/shm
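
As a rough cross-check of the 85%/15% split described earlier, the sketch below applies those shares to the size of /dev/sdb reported by fdisk -l and compares the 15% share with the growth of the disk: line between the show disks outputs before and after the expansion. It assumes that show disks reports 1 KiB blocks and that disk: corresponds to /localdisk, which the matching df -h figures suggest but the source does not state.

```python
GIB = 1024 ** 3
KIB_PER_GIB = 1024 ** 2

# Size of the added disk, from "fdisk -l" (/dev/sdb).
new_disk_bytes = 322122547200

# Split described in the post: 85% to /opt, 15% to the other directories.
opt_gain_bytes = new_disk_bytes * 85 // 100
other_gain_bytes = new_disk_bytes * 15 // 100

opt_gain_gib = opt_gain_bytes / GIB
other_gain_gib = other_gain_bytes / GIB

# "disk:" totals from "show disks" before and after the expansion
# (assumed to be 1 KiB blocks).
localdisk_before_kib = 44628400
localdisk_after_kib = 90305368
localdisk_growth_gib = (
    localdisk_after_kib - localdisk_before_kib
) / KIB_PER_GIB

print(f"new disk: {new_disk_bytes / GIB:.0f} GiB")
print(f"expected gain for /opt: {opt_gain_gib:.0f} GiB")
print(f"expected gain elsewhere: {other_gain_gib:.0f} GiB")
print(f"observed growth of disk: {localdisk_growth_gib:.2f} GiB")
```

The 15% share works out to 45 GiB, while the observed growth is a little under 44 GiB. The small gap is plausibly filesystem overhead, so the numbers are in line with the stated split.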

 

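Postscript: as the output shows, /opt has grown, which should buy some time. When installing PI, I recommend allocating generous disk space up front so that it does not run out.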