Best practices

One very important note that serves as a general rule for all usage of the Green Metrics Tool:

All energy measurements and benchmarks on a normal operating system are by nature error-prone and often not comparable across different systems. Never compare published values exactly with values from your own system; use them only for orientation. Measurements of software can only be compared on the exact same system.

Having said that: If you have a proper transfer function between systems, or just want to estimate the general overhead of a 100-core machine compared to an Arduino for running an email server, you can still do a comparison. Just keep in mind that it will have caveats and can only provide guidance.

Also, measurements should never be seen as ground truth,
but only as an indicator of the order of magnitude.

Our system is designed to raise awareness and educate about
the software energy use in typical off-the-shelf systems.

This reduces its accuracy and reproducibility, but increases its general applicability.

The result is that you get an idea of the order of magnitude of the energy consumption,
but comparability is limited to identical systems.

Our Hosted Service on our Measurement Cluster is designed for exactly that.

List of best practices

1. Never compare between machines to judge your software

At least not within small margins. Energy measurements on multitasking operating systems always have noise and variance.
However, a comparison by order of magnitude is very helpful for judging the underlying hardware.
- In order to judge software on different hardware, your systems must be calibrated and must not run non-deterministic components like schedulers (for instance by using a realtime Linux kernel).
Even systems with identical hardware components can have variations that you cannot easily account for, as there are unknown variables unless you measure them ahead of time (variance in component energy consumption etc.).
Some comparisons do make sense, though, if you have a tuned Measurement Cluster.

2. An application should NEVER come to the bounds of its resources

Analyze the peak load of your application. If the system runs at >80% load, scheduling and queuing problems can typically kick in.
- If that is, however, the load your application is designed to operate at, then do not alter it. However, most applications assume an infinite amount of resources and behave weirdly if they run into resource limitations.

3. Sampling Rate

The sampling interval of your measurement should be at most 1/2 of the duration of the smallest event you want to capture, and
the application / effect you want to measure must run at least twice as long as the minimal sampling interval.

Both basically mean the same thing, but depending on the audience one is easier to understand than the other. For example, if you want to test a web page load, which can take ~10 ms, you should sample with an interval of 5 ms or less.

The minimal sampling interval is the one you have configured with your Metric Providers.

Be aware that some providers, for instance Wall Power measurement devices, have a minimum time resolution of ~20 ms, which is by definition their smallest possible sampling interval due to the requirement to capture a full 50 Hz waveform. Other metric providers like RAPL can capture down to 1 ms, and CPU Utilization can go even below that depending on your kernel configuration.
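The rule of thumb above can be written down as a small helper. This is only an illustrative sketch and not part of the GMT; the function name `max_sampling_interval_ms` and its parameters are hypothetical. It halves the duration of the shortest event and refuses the result if the chosen provider cannot sample that fast.

```python
def max_sampling_interval_ms(event_duration_ms: float,
                             provider_min_resolution_ms: float = 0.0) -> float:
    """Return the largest sampling interval (ms) that still yields
    at least two samples during the shortest event of interest.

    provider_min_resolution_ms is the finest interval the metric
    provider supports (hypothetical parameter, e.g. ~20 for a wall
    power meter, ~1 for RAPL as mentioned in the text).
    """
    if event_duration_ms <= 0:
        raise ValueError("event duration must be positive")

    interval = event_duration_ms / 2.0

    if interval < provider_min_resolution_ms:
        raise ValueError(
            f"event of {event_duration_ms} ms needs an interval of "
            f"{interval} ms, but the provider only resolves "
            f"{provider_min_resolution_ms} ms"
        )
    return interval


if __name__ == "__main__":
    # A ~10 ms page load sampled via a provider that resolves 1 ms
    print(max_sampling_interval_ms(10, provider_min_resolution_ms=1))
```

With a provider that only resolves ~20 ms, the same 10 ms event raises an error, which mirrors the point that such an event cannot be captured reliably with that provider.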

4. When running tests your disk load should not go over 50%

Linux systems can typically run into congestion above 60% disk load, and our tool also needs some disk time.
- Check iostat -xmdz if in doubt

5. Limit amount and sampling rate of Metric Providers to what you absolutely need

Do not exceed 10 Metric Reporters at a 100 ms sampling rate,
or 5 Metric Reporters at a < 10 ms sampling rate, as this will produce a non-negligible load on the system and might skew results.
Try to keep the sampling rate of all metric reporters identical. This allows for easier data drill-down later.

You can check the current overhead (CPU%, Memory, Disk) of the GMT if you activate a *_cgroup_system metrics provider.

6. Always check STDDEV

Optimally, the energy values of your tests should have a Std.Dev. of < 1% to make them reasonably comparable.
- We understand that this might not be achievable if you have random effects in your code. In that case, opt for a very high number of repetitions to get a narrower confidence interval.
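As a rough illustration of this check, the following sketch computes the relative standard deviation and an approximate 95% confidence interval for a list of repeated energy readings. It is not the GMT's own statistics code; the function name `summarize_runs` and the sample values are made up, and the interval uses a normal approximation (z = 1.96), which is too narrow for very small sample sizes.

```python
import math
import statistics


def summarize_runs(energies_joules):
    """Summarize repeated energy measurements of the same scenario."""
    n = len(energies_joules)
    if n < 2:
        raise ValueError("need at least two runs to compute a std. dev.")

    mean = statistics.mean(energies_joules)
    stdev = statistics.stdev(energies_joules)  # sample standard deviation

    # Normal approximation; an assumption made for this sketch.
    ci_half_width = 1.96 * stdev / math.sqrt(n)

    return {
        "mean": mean,
        "rel_stdev_percent": stdev / mean * 100.0,
        "ci95_half_width": ci_half_width,
    }


if __name__ == "__main__":
    runs = [100.0, 101.0, 99.0, 100.0, 100.0]  # made-up readings in joules
    summary = summarize_runs(runs)
    print(summary)
    print("comparable:", summary["rel_stdev_percent"] < 1.0)
```

Because the half-width shrinks with the square root of the number of runs, repeating a noisy scenario more often narrows the interval even when the standard deviation itself stays above 1%.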

7. Design representative Standard Usage Scenarios

When designing flows, try to think of the standard usage scenario that is representative of the interaction with your app.
- Factor in the idle time that your app has. A web browser, for instance, is typically mostly idle while users read.
- Nevertheless, the browser does use the CPU during that time and consumes energy. Therefore it is an important part to have in your flow.
Use notes to make flows easier to understand.

8. Pin your dependencies

If you build Docker containers, be sure to always specify hashes / versions in the apt-get install commands and also in the FROM commands if you ingest images. By versions we mean here something like FROM alpine@sha256:be746ab119f2c7bb2518d67fbe3c511c0ea4c9c0133878a596e25e5a68b0d9f3 instead of just FROM alpine. If that is not an option, be sure to use at least double-dotted semantic versioning like FROM alpine:1.2.3.
For dependencies in npm, pip or any other package manager, also pin the versions.
The same goes for docker-compose.yml / compose.yml files etc.
This practice helps you spot changes to the software infrastructure your code is running on and understand changes made by third parties that influence your energy results.
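To make the distinction between the pinning levels concrete, here is a small sketch that classifies a Dockerfile FROM line as pinned by digest, pinned by a three-part version, or unpinned. It is a hypothetical helper and not a GMT feature; the name `classify_from_line` and the regular expressions are assumptions that cover only the simple cases discussed above.

```python
import re

# Image reference ending in a sha256 digest (64 hex characters)
DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")
# Tag with three numeric parts, e.g. :1.2.3 or :v1.2.3-slim
FULL_VERSION = re.compile(r":v?\d+\.\d+\.\d+(?:[-.+][\w.+-]*)?$")


def classify_from_line(line: str) -> str:
    """Return 'digest', 'version' or 'unpinned' for a FROM line."""
    parts = line.split()
    if len(parts) < 2 or parts[0].upper() != "FROM":
        raise ValueError(f"not a FROM line: {line!r}")

    # Skip flags such as --platform=linux/amd64
    image = next((p for p in parts[1:] if not p.startswith("--")), None)
    if image is None:
        raise ValueError(f"no image reference in: {line!r}")

    if DIGEST.search(image):
        return "digest"
    if FULL_VERSION.search(image):
        return "version"
    return "unpinned"


if __name__ == "__main__":
    for example in ("FROM alpine", "FROM alpine:1.2.3"):
        print(example, "->", classify_from_line(example))
```

A check like this could run in CI over all Dockerfiles of a project, so that an unpinned base image is noticed before it silently changes the measured results.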

9. Use temperature control and validate measurement std.dev.

Our Hosted Service with the Measurement Cluster checks periodically if the standard deviation of the measurements is within a certain allowed error margin.

It does this by running defined control workloads and also calibrating the machine beforehand so that any measurement only runs if a certain baseline temperature is reached again.

You can either use our service with a free tier or set the cluster up yourself. The setup and methodology are explained in Installation of a cluster.

10. Trigger test remotely or keep system inactive

Our Measurement Cluster runs tests fully autonomously. In dev setups, however, this is seldom the case. To still get good results, the system should be as noise-free as possible.
This means, if possible:
- Turn your wifi and internet off
- Do not touch the keyboard or the mouse
  - Never move your mouse or type something on your keyboard while measuring, because the interrupts of the CPU will interfere with the measurement.
- Do not have dimming or monitor-sleep active as this will cost CPU cycles to trigger
- Turn off any cronjobs / updates / housekeeping jobs on the system
- Turn off any processes you do not need at the moment.
Or put more loosely: Listening to Spotify while running an energy test is a bad idea :)

11. Your system should not overheat

Most modern processors have features that limit their processing power if the heat of the system is too high.
- Checking this is at the moment a manual task in the GMT; however, we are working on a feature that will check if the CPU has run into thermal throttling.
Also, you should wait between test runs to make sure that the system has cooled down again and your energy measurements are not falsely high. A good value for this that has emerged in our testing is 180 s. However, on a 30+ core machine this value might be higher. We are currently working on a calibration script to determine this exact value for a particular system.

If you are using a standard cronjob mechanism to trigger the GMT you can use the post-test-sleep to force a fixed sleep time.

12. Mount your `/tmp` on `tmpfs`

Since we extensively write the output of the metric providers to /tmp on the host system, this should be an in-memory filesystem. Otherwise it might skew your measurement, as disk writes can be quite costly.

On Ubuntu you can use sudo systemctl enable /usr/share/systemd/tmp.mount

13. Manage logging appropriately

Logging of either stdout or stderr through the log-stdout and log-stderr keys in the usage_scenario is enabled by default in the GMT. In many cases the overhead of logging is small.

However, you should consider turning logging off when there is extensive logging output, as it can create overhead.

Since the logs will be captured into a memory buffer there is a limit to how much this buffer can hold. If you really log excessive amounts (100 MB+) then at some point the buffer might get exhausted and either you will lose data or the run with the GMT will fail.

14. Use `--docker-prune`

This switch will prune all unassociated build caches, networks, volumes and stopped containers on the system and keep your disk from filling up.

Downside: It will remove all stopped containers. So if you regularly keep stopped containers, then avoid this switch and rather run docker volume prune once in a while.

15. Use non standard sampling intervals and avoid undersampling

If the effect you are looking for in your code is likely only a 200 ms activity, you should use a sampling interval of at most 100 ms.

Having said that: It is also good practice to use an odd number here that is slightly lower, for instance 99 ms or even 95 ms.

The reason for this is that you do not want to run into a lock-step sampling error, where you always look at the machine just after a load has happened and, since there is no jitter on the machine, always miss the actual load. By shifting your sampling interval relative to the frequency of the event that you want to observe, you will still see the event sometimes.
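The lock-step effect can be reproduced with a toy simulation. The model below is purely illustrative and all of its numbers are assumptions: a load that repeats every 100 ms and is active for 10 ms, starting 50 ms into each period, on a machine without any jitter. The hypothetical function `count_hits` counts how many samples land inside the active window.

```python
def count_hits(sampling_interval_ms: int,
               n_samples: int,
               period_ms: int = 100,
               load_start_ms: int = 50,
               load_length_ms: int = 10) -> int:
    """Count samples that fall inside the active window of a
    strictly periodic load (toy model without jitter)."""
    hits = 0
    for k in range(n_samples):
        # Position of the k-th sample within the load's period
        phase = (k * sampling_interval_ms) % period_ms
        if load_start_ms <= phase < load_start_ms + load_length_ms:
            hits += 1
    return hits


if __name__ == "__main__":
    for interval in (100, 99, 95):
        print(f"{interval} ms interval:", count_hits(interval, 1000), "hits")
```

In this model, sampling every 100 ms always hits the same phase of the period and never sees the load, while 99 ms and 95 ms drift through the period and catch the load in roughly 10% of the samples, which matches the share of time the load is active.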

16. System Check Threshold

The GMT comes with many system checks that only issue a warning in the default configuration.

We recommend setting system_check_treshold to 2 in the Configuration of your production setup.

17. Idle Duration

If you are trying to calculate the energy per container, you should set the idle-duration configuration value high enough that you get a stable value to base the offset on.

We recommend at least 120 s if you have a non-controlled cluster system, and 60 s if you are running on an accuracy-controlled cluster like, for instance, our Measurement Cluster.

18. Internal Networking Only

External networking introduces variable latency in your benchmarks and thus should be avoided wherever possible.

One method to achieve this is to set the network to internal only. See Docker Compose directive. GMT supports this feature natively.

In case you need access to external networking we recommend you at least try to cache the request in a warmup run and then run the final benchmark on the cached / internally mirrored result.
