CPU and RAM consumption

Tue Jan 21 22:17:42 CET 2020

Florian Coulmier <Florian.Coulmier at vadesecure.com> writes:

> We use mandos on a large number of servers (more than 300).

We have not done any deployment of that size ourselves, so we cannot
speak from direct experience.

> We observe that the mandos server consumes a lot of CPU and RAM when it
> schedules checks of all clients.
[…]
> The checker we use is "ssh-keyscan".
> 
> I was wondering if this level of CPU/RAM consumption is expected or is
> it a bad configuration/setup on our side?

It is mostly expected, since we did not have such a large number of
clients in mind back when we originally wrote the software.  Mind you,
it should still work, but performance might, as you have observed,
suffer a bit.

> From what I understand all the checks of clients are performed at the
> same time.

Not inherently; a checker for one specific client is run once every
"interval" configured for that client.  Therefore, if you configured
the intervals of your clients to be close to each other but not
identical, you could probably reduce the load spikes considerably
without any code changes.
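
As a rough sketch of that idea, the following Python script generates a
slightly different "interval" setting for each client.  The client
names, the 2 minute base and the 1 second step are made-up values for
illustration, and the "121s" style of writing the value is an
assumption; check mandos-clients.conf(5) for the formats your version
accepts.

```python
from datetime import timedelta


def staggered_intervals(clients, base=timedelta(minutes=2),
                        step=timedelta(seconds=1)):
    """Map each client name to base + k*step, so no two intervals match."""
    return {name: base + k * step for k, name in enumerate(clients)}


def format_seconds(delta):
    """Render a timedelta as whole seconds, e.g. '121s' (assumed format)."""
    return "{}s".format(int(delta.total_seconds()))


def render(clients):
    """Return clients.conf-style text with one section per client."""
    lines = []
    for name, interval in staggered_intervals(clients).items():
        lines.append("[{}]".format(name))
        lines.append("interval = {}".format(format_seconds(interval)))
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical client names; replace with your own section names.
    print(render(["host-a", "host-b", "host-c"]))
```

With several hundred clients, a 1 second step makes the longest interval
several minutes longer than the shortest, so a smaller set of distinct
values reused across clients may be a better fit.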

> If checks were distributed over different time intervals, do you think
> it could reduce the load on the system?

Yes, but there is a slight problem; when a Mandos server starts, there
is no telling how long it has been down, and for security reasons, you
would normally want to run all the client checkers as soon as possible;
i.e. immediately on startup.  This *initial* load spike on startup may
be unavoidable.

However, the next checker run is currently scheduled to be exactly one
interval from the startup time, and so on.  This leads to all checkers
for all clients being run at more or less exactly the same time
periodically, but this is *not* an essential property of the system; it
does not have to be that way.  We might fix this in the Mandos code by
scheduling each client's next checker at a random time *earlier* than
one full interval, so that the runs are spread out.  You could try this
patch and report back:

--- mandos	2019-09-03 19:06:41 +0000
+++ mandos	2020-01-21 21:12:33 +0000
@@ -88,6 +88,7 @@
 import ctypes.util
 import xml.dom.minidom
 import inspect
+import random
 
 if sys.version_info.major == 2:
     __metaclass__ = type
@@ -1037,7 +1038,7 @@
         if self.checker_initiator_tag is not None:
             GLib.source_remove(self.checker_initiator_tag)
         self.checker_initiator_tag = GLib.timeout_add(
-            int(self.interval.total_seconds() * 1000),
+            random.randrange(int(self.interval.total_seconds() * 1000) + 1),
             self.start_checker)
         # Schedule a disable() when 'timeout' has passed
         if self.disable_initiator_tag is not None:
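
To get a feel for what a randomized delay does to the load spike, here
is a small standalone simulation.  It does not use any Mandos or GLib
code; the 300 clients and the 120 second interval are assumed numbers,
and the function names are made up for this example.

```python
import random
from collections import Counter


def first_run_delays(n_clients, interval_s, randomize, rng):
    """Delay in milliseconds before each client's next checker run."""
    interval_ms = int(interval_s * 1000)
    if randomize:
        # Pick a delay anywhere from 0 up to one full interval.
        return [rng.randrange(interval_ms + 1) for _ in range(n_clients)]
    # Unpatched behaviour: every client waits exactly one interval.
    return [interval_ms] * n_clients


def peak_starts_per_second(delays_ms):
    """Largest number of checkers started within any one-second bucket."""
    buckets = Counter(d // 1000 for d in delays_ms)
    return max(buckets.values())


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only to make runs repeatable
    fixed = first_run_delays(300, 120, False, rng)
    spread = first_run_delays(300, 120, True, rng)
    print("fixed delay, peak starts/second: ", peak_starts_per_second(fixed))
    print("random delay, peak starts/second:", peak_starts_per_second(spread))
```

With a fixed delay all 300 checkers land in the same second; with random
delays they are distributed over the whole interval.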

But see also what I said above; you could probably fix this immediately
for your own situation, *without* any code changes, by configuring a
slightly different interval (differing by, say, a few seconds) for each
client.  This would make all checker runs slowly drift apart over time,
thereby reducing load spikes.
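
The drift can be worked out by hand: if all clients start together, a
client with interval i runs its n-th checker at n*i seconds.  A minimal
sketch, assuming three clients with made-up intervals of 120, 121 and
122 seconds:

```python
def spread_after(intervals_s, cycle):
    """Seconds between the earliest and latest checker of a given cycle."""
    times = [cycle * interval for interval in intervals_s]
    return max(times) - min(times)


if __name__ == "__main__":
    intervals = [120, 121, 122]  # assumed values, in seconds
    for cycle in (1, 10, 60):
        print(cycle, spread_after(intervals, cycle))
```

After the first cycle the three runs are only 2 seconds apart, but by
the 60th cycle (about two hours in) they are 120 seconds apart, i.e.
spread over a whole interval.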

/Teddy Hogeborn

-- 
The Mandos Project
https://www.recompile.se/mandos