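According to the monitoring platform, memory usage on node 2 of the database platform was too high, reaching 98%. Identifying the processes consuming the most memory, checking the TFA status, and synchronizing the TFA configuration restored the system to normal operation.
Overview
According to the monitoring platform, memory usage on node 2 of a database platform was excessively high, reaching 98%.
1. Identify the processes using the most memory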
grid 280483 183124 0 18:21 ? 00:00:00 [asmcmd daemon]
grid 280493 155171 0 18:21 ? 00:00:00 [asmcmd daemon]
grid 280497 104733 0 18:21 ? 00:00:00 [asmcmd daemon]
grid 280499 187375 0 18:21 ? 00:00:00 [asmcmd daemon]
grid 280533 239249 0 18:21 ? 00:00:00 [asmcmd daemon]
grid 280534 157752 0 18:21 ? 00:00:00 [asmcmd daemon]
grid 280536 281960 0 18:21 ? 00:00:00 [asmcmd daemon]
grid 280545 69656 0 18:21 ? 00:00:00 [asmcmd daemon]
grid 280552 128541 0 18:21 ? 00:00:00 [asmcmd daemon]
grid 280553 63409 0 18:21 ? 00:00:00 [asmcmd daemon]
grid 280558 108705 0 18:21 ? 00:00:00 [asmcmd daemon]
grid 280575 194378 0 18:21 ? 00:00:00 [asmcmd daemon]
The asmcmd daemon processes account for the highest memory usage. What is this process doing that consumes so much memory?
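The article does not show the command used to rank processes by memory. A minimal sketch of one way to do it follows, assuming a Linux procps `ps`; the function name `sum_rss_by_command` is hypothetical and not part of any Oracle tool. It adds up resident memory per command name, so that many small processes with the same name (as with the asmcmd daemons here) show up as one large total.

```shell
# Sum resident memory (RSS, in KiB) per command name, largest total first.
# Reads lines of "<rss_kib> <command>" on stdin.
sum_rss_by_command() {
    awk '{
             rss = $1
             $1 = ""              # drop the RSS column, keep the command text
             sub(/^ +/, "")       # trim the leading blank left behind
             total[$0] += rss
         }
         END { for (c in total) printf "%d %s\n", total[c], c }' | sort -rn
}

# Usage on a live system (assumes procps ps):
#   ps -e -o rss= -o comm= | sum_rss_by_command | head
```

Grouping by name rather than listing individual PIDs is a deliberate choice: a leak spread across thousands of short-lived processes is easy to miss in a per-process listing.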
Next, trace what the process is doing.
wait4(163639, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 163639
open("/tmp/clsecho_stderr_file.txt", O_RDONLY) = 4
ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffc8333d180) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(4, 0, SEEK_CUR) = 0
fstat(4, {st_mode=S_IFREG|0644, st_size=285, ...}) = 0
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
read(4, "Can't open '/oracle/app/12.2.0/g"..., 8192) = 285
stat("/oracle/app/12.2.0/grid/bin/clsecho", {st_mode=S_IFREG|0755, st_size=11405, ...}) = 0
geteuid() = 1001
geteuid() = 1001
getegid() = 501
lseek(4, 99, SEEK_SET) = 99
lseek(4, 0, SEEK_CUR) = 99
pipe([6, 7]) = 0
pipe([8, 9]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f059005ea10) = 164739
close(9) = 0
close(7) = 0
read(8, "", 4) = 0
close(8) = 0
ioctl(6, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffc8333d0f0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(6, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
fstat(6, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
brk(0) = 0x81493d000
brk(0x81495e000) = 0x81495e000
read(6, "20-Jul-20 18:14 ASMCMD Backgroun"..., 8192) = 102
read(6, "", 8192) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=164739, si_status=0, si_utime=0, si_stime=195} ---
fstat(6, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
close(6) = 0
brk(0) = 0x81495e000
brk(0) = 0x81495e000
brk(0x81495c000) = 0x81495c000
brk(0) = 0x81495c000
wait4(164739, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 164739
close(4) = 0
open("/tmp/clsecho_stderr_file.txt", O_RDONLY) = 4
ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffc8333d180) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(4, 0, SEEK_CUR) = 0
fstat(4, {st_mode=S_IFREG|0644, st_size=285, ...}) = 0
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
read(4, "Can't open '/oracle/app/12.2.0/g"..., 8192) = 285
stat("/oracle/app/12.2.0/grid/bin/clsecho", {st_mode=S_IFREG|0755, st_size=11405, ...}) = 0
geteuid() = 1001
geteuid() = 1001
getegid() = 501
lseek(4, 99, SEEK_SET) = 99
lseek(4, 0, SEEK_CUR) = 99
pipe([6, 7]) = 0
pipe([8, 9]) = 0
clone(^CProcess 219522 detached
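A long trace like the one above is easier to read once it is tallied by system call. The sketch below assumes the trace was saved to a file (for example with `strace -p <pid> -o trace.txt`); the function name `count_syscalls` is hypothetical. It counts how often each call appears, which makes a repeating open/read/clone/wait4 loop stand out.

```shell
# Count system calls in strace output read from stdin, most frequent first.
# Only lines that start with "<name>(" are counted; signal lines such as
# "--- SIGCHLD ... ---" are ignored.
count_syscalls() {
    awk -F'(' '/^[a-z_0-9]+\(/ { n[$1]++ }
               END { for (s in n) print n[s], s }' | sort -rn
}

# Usage:
#   count_syscalls < trace.txt | head
```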
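Running strace against these processes revealed a large number of futile calls, such as failed attempts to connect to the ASM instance and accesses to socket files that do not exist.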
"ASMCMD Background (PID = 118768): Invalid file handle for pipe /tmp/asmcmd_fg_118436" 2> /tmp/clsecho_stderr_file.txt
The next step was to examine /tmp/clsecho_stderr_file.txt. Meanwhile, as CPU usage rose, these processes were steadily claiming more swap space from the system.
[root@ tmp]# more clsecho_stderr_file.txt
Can't open '/oracle/app/12.2.0/grid/log/diag/asmcmd/user_grid/xxssd2/alert/alert.log' for append
CLSU-00100: operating system function: open failed failed with error data: 2
CLSU-00101: operating system error message: No such file or directory
CLSU-00103: error location: SlfFopen1
Tellingly, this directory is a TFA diagnostic directory, which suggests that TFA is currently malfunctioning.
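The error above says that a log file cannot be opened because a path component is missing. As a rough sketch, the check below pulls every path out of the "Can't open ... for append" lines and reports whether its parent directory exists. The function name `check_log_dirs` is hypothetical, and the message format is assumed to match the one shown in this article.

```shell
# For each "Can't open '<path>' for append" line on stdin, report whether
# the directory that should contain <path> exists on this host.
check_log_dirs() {
    sed -n "s/^Can't open '\(.*\)' for append.*/\1/p" |
    while IFS= read -r path; do
        dir=$(dirname "$path")
        if [ -d "$dir" ]; then
            echo "exists $dir"
        else
            echo "missing $dir"
        fi
    done
}

# Usage:
#   check_log_dirs < /tmp/clsecho_stderr_file.txt
```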
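2. Check the TFA status
[grid@~]$ tfactl
TFA-00104 Cannot establish connection with TFA Server. Please check TFA Certificates
As suspected, node 2 has a problem: it cannot connect to the TFA service. What about node 1, which is not showing high memory usage at this point?
TFA status on node 1: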
.------------------------------------------------------------------------------------.
| xxssd1 |
+-----------------------------------------------------------------------+------------+
| Configuration Parameter | Value |
+-----------------------------------------------------------------------+------------+
| TFA Version | 19.2.1.0.0 |
| Java Version | 1.8 |
| Public IP Network | true |
| Automatic Diagnostic Collection | true |
| Alert Log Scan | true |
| Disk Usage Monitor | true |
| Managelogs Auto Purge | false |
| Trimming of files during diagcollection | true |
| Inventory Trace level | 1 |
| Collection Trace level | 1 |
| Scan Trace level | 1 |
| Other Trace level | 1 |
| Granular Tracing | false |
| Debug Mask (Hex) | 0 |
| Repository current size (MB) | 6908 |
| Repository maximum size (MB) | 10240 |
| Max Size of TFA Log (MB) | 50 |
| Max Number of TFA Logs | 10 |
| Max Size of Core File (MB) | 50 |
| Max Collection Size of Core Files (MB) | 500 |
| Max File Collection Size (MB) | 5120 |
| Minimum Free Space to enable Alert Log Scan (MB) | 500 |
| Time interval between consecutive Disk Usage Snapshot(minutes) | 60 |
| Time interval between consecutive Managelogs Auto Purge(minutes) | 60 |
| Logs older than the time period will be auto purged(days[d]|hours[h]) | 30d |
| Automatic Purging | true |
| Age of Purging Collections (Hours) | 12 |
| TFA IPS Pool Size | 5 |
| TFA ISA Purge Age (seconds) | 604800 |
| TFA ISA Purge Mode | profile |
| TFA ISA Purge Thread Delay (minutes) | 60 |
| Setting for ACR redaction (none|SANITIZE|MASK) | none |
| Email Notification will be sent for CHA EVENTS if address is set | false |
| AUTO Collection will be generated for CHA EVENTS | false |
tfactl> status
.-----------------------------------------------------------------------------------------------.
| Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status |
+----------+---------------+------+------+------------+----------------------+------------------+
| xxssd1 | RUNNING | 8075 | 5000 | 19.2.1.0.0 | 19210020190425110550 | COMPLETE |
| xxssd2 | NOT RUNNING | - | | | | |
'----------+---------------+------+------+------------+----------------------+------------------'
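TFA is running normally on node 1 but is not running on node 2. Repeated manual start attempts on node 2 had no effect and returned the following error: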
Unable to determine the status of TFA in other nodes.
This indicates that TFA communication between the nodes has broken down.
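On a cluster with more than two nodes, scanning the status table by eye becomes error-prone. The sketch below lists the hosts whose TFA status is anything other than RUNNING. It assumes the table layout shown in this article (host in the first column, status in the second); the function name `tfa_not_running` is hypothetical.

```shell
# Read a TFA status table on stdin and print every host whose
# "Status of TFA" column is not exactly RUNNING.
tfa_not_running() {
    awk -F'|' 'NF >= 4 && $3 !~ /Status of TFA/ {
        host = $2; status = $3
        gsub(/^ +| +$/, "", host)      # trim padding around the cell
        gsub(/^ +| +$/, "", status)
        if (status != "RUNNING") print host
    }'
}

# Usage (run on a node where TFA responds):
#   tfactl print status | tfa_not_running
```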
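3. Synchronize the TFA configuration
If TFA on another node is faulty, its configuration can be synchronized from a healthy node. The transcript below is consistent with running `tfactl syncnodes` as root on node 1.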
WARNING - TFA Software is older than 180 days. Please consider upgrading TFA to the latest version.
Current Node List in TFA :
1. xxssd1
2. xxssd2
Node List in Cluster :
1. xxssd1
2. xxssd2
Node List to sync TFA Certificates :
1 xxssd2
Do you want to update this node list? [Y|N] [N]:
Syncing TFA Certificates on xxssd2 :
TFA_HOME on xxssd2 : /oracle/app/12.2.0/grid/tfa/xxssd2/tfa_home
Please Enter the password for xxssd2 :
Is password same for all the nodes? [Y|N] [Y]: Y
Shutting down TFA on xxssd2...
Copying TFA Certificates to xxssd2...
Copying SSL Properties to xxssd2...
Shutting down TFA on xxssd2...
Sleeping for 5 seconds...
Starting TFA on xxssd2...
WARNING - TFA Software is older than 180 days. Please consider upgrading TFA to the latest version.
.-------------------------------------------------------------------------------------------------.
| Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status |
+----------+---------------+--------+------+------------+----------------------+------------------+
| xxssd1 | RUNNING | 8075 | 5000 | 19.2.1.0.0 | 19210020190425110550 | COMPLETE |
| xxssd2 | RUNNING | 230525 | 5000 | 19.2.1.0.0 | 19210020190425110550 | COMPLETE |
'----------+---------------+--------+------+------------+----------------------+------------------'
4. Follow-up
Once the TFA configuration had been synchronized, memory usage began to fall as memory was released.
[root@xxssd2 ~]# free -g
total used free shared buff/cache available
Mem: 1007 942 16 5 49 54
Swap: 31 0 31
[root@xxssd2 ~]# free -g
total used free shared buff/cache available
Mem: 1007 907 50 5 49 89
Swap: 31 0 31
[root@xxssd2 ~]# free -g
total used free shared buff/cache available
Mem: 1007 827 131 5 48 169
Swap: 31 0 31
[root@xxssd2 ~]# free -g
total used free shared buff/cache available
Mem: 1007 820 137 5 48 176
Swap: 31 0 31
[root@xxssd2 ~]# free -g
total used free shared buff/cache available
Mem: 1007 745 213 5 48 251
Swap: 31 0 31
[root@xxssd2 ~]# free -g
total used free shared buff/cache available
Mem: 1007 745 213 5 48 251
Swap: 31 0 31
[root@xxssd2 ~]# free -g
total used free shared buff/cache available
Mem: 1007 745 213 5 48 251
Swap: 31 0 31
[root@xxssd2 ~]# free -g
total used free shared buff/cache available
Mem: 1007 745 213 5 48 251
Swap: 31 0 31
[root@xxssd2 ~]# free -g
total used free shared buff/cache available
Mem: 1007 745 213 5 48 251
Swap: 31 0 31
[root@xxssd2 ~]# free -g
total used free shared buff/cache available
Mem: 1007 745 213 5 48 251
Swap: 31 0 31
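The samples above can be turned into percentages to make the trend explicit. The sketch below computes used/total from the `Mem:` line of `free`; the function name `mem_used_pct` is hypothetical. For the first sample (942 of 1007 GB) it gives about 93.5%, and for the last (745 of 1007 GB) about 74.0%. Counting buff/cache as used in the first sample, (1007 - 16) / 1007, gives about 98.4%, close to the 98% in the alert; whether the monitoring platform computes it that way is an assumption.

```shell
# Print memory usage as a percentage (used / total, one decimal place),
# read from the "Mem:" line of free output on stdin.
mem_used_pct() {
    awk '$1 == "Mem:" && $2 > 0 { printf "%.1f\n", $3 * 100 / $2 }'
}

# Usage:
#   free -g | mem_used_pct
```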
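5. Summary
TFA (Trace File Analyzer Collector) is a tool introduced in release 11.2 for collecting diagnostic logs in Grid Infrastructure/RAC environments. With a few simple commands it helps users gather RAC logs for further diagnosis. TFA is an Oracle cluster log collector similar to diagcollection, but with stronger centralized and automated collection of diagnostic information.
We recommend disabling TFA's automatic collection and analysis feature (Autodiagcollect) on all production databases, so that a similar incident cannot disrupt normal database operation.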
.------------------------------------------------------------------------------------.
| gatzyca1 |
+-----------------------------------------------------------------------+------------+
| Configuration Parameter | Value |
+-----------------------------------------------------------------------+------------+
| TFA Version | 19.2.1.0.0 |
| Java Version | 1.8 |
| Public IP Network | true |
| Automatic Diagnostic Collection | true |
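Note: Disabling automatic collection and analysis does not affect normal database operation, nor does it affect TFA's log collection, consolidation, and packaging functions.
Run as root: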
tfactl set autodiagcollect=OFF
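After changing the setting, it is worth confirming that it took effect. The sketch below reads a configuration table in the layout shown earlier in this article and prints the value of one parameter. The function name `tfa_config_value` is hypothetical, and the assumption is that `tfactl print config` produces that table on the installed TFA version.

```shell
# Print the value of one parameter from a TFA configuration table on stdin.
# $1: the parameter name exactly as it appears in the first column.
tfa_config_value() {
    awk -F'|' -v name="$1" '{
        key = $2; val = $3
        gsub(/^ +| +$/, "", key)       # trim padding around the cell
        gsub(/^ +| +$/, "", val)
        if (key == name) print val
    }'
}

# Usage (as root):
#   tfactl print config | tfa_config_value "Automatic Diagnostic Collection"
```

A result of `false` would indicate that automatic collection is off, assuming the table reports the setting as true/false as it does in the output above.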