We have a vcenter environment with around 500 ESXi hosts running on multiple clusters and for the past several weeks we had the issue of Vcetner down because of VPXD crash and the service will be in stopped status.
VMware support identified the VCDB growth is huge with high CPU and memory usage on the vCenter and they started to investigate the same.
[Analysis]
Most of the memory usage is contributed by the Events
Signature 7f46978acb30 (Vmomi::KeyAnyValue) has 81430551 instances taking 0xd0aaf2b8(3,500,864,184) bytes.
Signature 562dc2a74a90 (Vmomi::Primitive<std::string>) has 55823822 instances taking 0x50096420(1,342,792,736) bytes.
Signature 7f4696eb77d0 (Vim::Event::ManagedEntityEventArgument) has 21574134 instances taking 0x37913be0(932,264,928) bytes.
Signature 7f4696eb7870 (Vim::Event::DatacenterEventArgument) has 13307524 instances taking 0x2205bdc0(570,801,600) bytes.
Signature 7f4696eb7960 (Vim::Event::HostEventArgument) has 13285493 instances taking 0x21b98d98(565,808,536) bytes.
Signature 7f4696eb78c0 (Vim::Event::ComputeResourceEventArgument) has 13280205 instances taking 0x21ddefb8(568,192,952) bytes.
Signature 562dc2a831f0 (Vmomi::DataArray<Vmomi::KeyAnyValue>) has 13086149 instances taking 0x21532ee8(559,099,624) bytes.
Signature 7f4696eb7d20 (Vim::Event::EventEx) has 13084710 instances taking 0xc31d4ee0(3,273,477,856) bytes.
Signature 7f4696eb7aa0 (Vim::Event::AlarmEventArgument) has 12934883 instances taking 0x20af9168(548,376,936) bytes.
Signature 562dc2ae8730 (Vmomi::Primitive<int>) has 12863119 instances taking 0x12707908(309,360,904) bytes.
Signature 7f4696eb4080 (Vim::Event::AlarmActionTriggeredEvent) has 4301050 instances taking 0x2b89b560(730,445,152) bytes.
Signature 7f4696eb4170 (Vim::Event::AlarmSnmpCompletedEvent) has 4301049 instances taking 0x272dbf98(657,309,592) bytes.
journalctl -xe
Jul 14 17:40:06 vCenter1 vpxd[20178]: Event [-22821230] [1-1] [2020-07-14T17:40:06.285759Z] [vim.event.AlarmActionTriggeredEvent] [info] [] [SantaClara] [-22821230] [Alarm ‘Host hardware sensor state’ on host triggered an action]
Jul 14 17:40:06 vCenter1 vpxd[20178]: Event [-22821225] [1-1] [2020-07-14T17:40:06.286465Z] [vim.event.EventEx] [info] [] [SantaClara] [-22821225] [Alarm ‘Host hardware sensor state’ on host triggered by event -42004159 ‘Sensor -1 type , Description Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent #15 state assert for . Part Name/Number N/A N/A Manufacturer N/A’]
For 6 hours, the number of events is high
grep “Sensor -1 type” journalctl_-b–* | wc -l
371899
This is contributing towards, the VCDB growth.
VCDB=# SELECT nspname || ‘.’ || relname AS “relation”, pg_size_pretty(pg_total_relation_size(C.oid)) AS “total_size” FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) WHERE nspname NOT IN (‘pg_catalog’, ‘information_schema’) AND C.relkind <> ‘i’ AND nspname !~ ‘^pg_toast’ ORDER BY pg_total_relation_size(C.oid) DESC LIMIT 20;
relation | total_size
———————+————
vc.vpx_event_arg_35 | 9661 MB
vc.vpx_event_arg_34 | 9568 MB
vc.vpx_event_arg_40 | 9543 MB
vc.vpx_event_arg_32 | 9427 MB
vc.vpx_event_arg_37 | 9082 MB
vc.vpx_event_arg_38 | 8742 MB
vc.vpx_event_arg_36 | 8721 MB
vc.vpx_event_arg_39 | 8249 MB
vc.vpx_event_arg_33 | 8169 MB
vc.vpx_event_arg_57 | 7957 MB
The number of events in DB
VCDB=# SELECT COUNT(EVENT_ID) AS NUMEVENTS, EVENT_TYPE, USERNAME FROM VPXV_EVENT_ALL GROUP BY EVENT_TYPE, USERNAME ORDER BY NUMEVENTS DESC LIMIT 10;
numevents | event_type | username
———–+——————————————–+————————–
58907485 | vim.event.AlarmActionTriggeredEvent |
58905170 | com.vmware.vc.StatelessAlarmTriggeredEvent |
58899874 | vim.event.AlarmSnmpCompletedEvent |
5894366 | com.vmware.vc.EventBurstStartedEvent |
5892674 | com.vmware.vc.EventBurstCompressedEvent |
5892554 | com.vmware.vc.EventBurstEndedEvent |
1564588 | vim.event.AlarmStatusChangedEvent |
702653 | vim.event.BadUsernameSessionEvent | root
694115 | esx.audit.account.locked |
300159 | vim.event.TaskEvent |
[Action Plan]
https://kb.vmware.com/s/article/74607
As per VMware it is known issue with 6.7 causes the burst of events causing the high IO on the vCenter.
You could update all the ESXi to 6.7 P01 or the latest version to fix the issue or follow the workaround mentioned in the KB.