You need to be familiar with these sections:
LAVA is complex and administering a LAVA instance can be an open-ended task covering a wide range of skills.
These rules may seem harsh, obvious or tedious. However, many people have skipped one or more of these steps and have learnt the hard way that each of them provides advice and assistance that can dramatically improve your experience of LAVA. Everyone setting up LAVA is strongly advised to follow all of these rules.
deploy actions and boot actions to be able to produce reliable results. There are a number of common fallacies relating to automation. Check your test ideas against these before starting to make your plans:
connect & test seems simple enough - it doesn't seem as if you need to deploy a new kernel or rootfs every time, and there is no need to power off or reboot between tests. Just connect and run the tests. After all, you already have a way to deploy files to the board manually.
test everything at the same time - you’ve built an entire system and now you put the entire thing onto the device and do all the tests at the same time. There are numerous problems with this approach:
I already have builds - this may be true; however, automation puts extra demands on what those builds are capable of supporting. When testing manually, a human will often decide that something needs to be entered, tweaked, modified, removed or ignored, and the automated system needs to be able to cope with all of those interventions. Examples include:
/etc/resolv.conf - it is common for many build tools to generate or copy a working /etc/resolv.conf based on the system within which the build tool is executed. This is a frequent cause of test jobs failing because they are unable to look up web addresses using DNS. It is also common for an automated system to be in a different network subnet to the build tool, again causing the test job to be unable to use DNS due to the wrong data in /etc/resolv.conf.
Make use of the standard files for known working device types. These files come with details of how to rebuild them, logs of each build and checksums so that you can be sure the download is correct. (A quick way to check your own builds for the DNS problem is sketched below, after this list.)
Automation can do everything - it is not possible to automate every test method. Some kinds of tests and some kinds of devices lack critical elements required for automation. These are not problems in LAVA; they are design limitations of the kind of test and of the device itself. Your preferred test plan may be infeasible to automate, and some level of compromise will be required.
Users are all admins too - this will come back to bite! There are other ways in which this can occur even after administrators have restricted users to limited access. Test jobs (including hacking sessions) have full access to the device as root. Users can therefore modify the device during a test job, and what happens next depends on the hardware support and configuration of the device. Some devices store bootloader configuration in files which are accessible from userspace after boot. Some devices lack a management interface that can intervene when a device fails to boot. Put these two together and admins can face a situation where a test job has corrupted, overridden or modified the bootloader configuration such that the device no longer boots without intervention. Some operating systems require a debug setting to be enabled before the device will be visible to the automation (e.g. the Android Debug Bridge). It is trivial for a user to mistakenly deploy a default or production system which does not have this modification.
Administrators need to be mindful of the situations in which users can (mistakenly or otherwise) modify the device configuration such that the device is unable to boot without intervention when the next job starts. This is one of the key reasons for running health checks sufficiently often that the impact on other users is minimised.
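Returning to the /etc/resolv.conf example above: before submitting a build to LAVA, it is worth inspecting the rootfs from the dispatcher's point of view. A minimal sketch, assuming a loop-mountable image (the filename and mount point are illustrative):

$ sudo mount -o loop rootfs.img /mnt
$ cat /mnt/etc/resolv.conf   # must name a DNS server reachable from the dispatcher's subnet
$ sudo umount /mnt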
The ongoing roles of administrators include:
When you come across problems with your LAVA instance, there are some basic information sources, methods and tools which will help you identify the problem(s).
LAVA uses Jinja2 to allow devices to be configured using common data blocks, inheritance and the device-specific device dictionary. Templates are developed as part of lava-server with supporting unit tests:
lava-server/lava_scheduler_app/tests/device-types/
Building a new package using the developer scripts will cause the updated templates to be installed into:
/etc/lava-server/dispatcher-config/device-types/
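For example, you can list the templates installed on a master; expect one .jinja2 file per supported device type (the exact set depends on your lava-server version):

$ ls /etc/lava-server/dispatcher-config/device-types/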
The Jinja2 templates support conditional logic, iteration and default arguments and are considered part of the codebase of lava-server. Changing the templates can adversely affect other test jobs on the instance. All changes should be made first as a developer. New templates should be accompanied by new unit tests for that template.
Note

Although these are configuration files and package updates will respect any changes you make, please talk to us about changes to existing templates maintained within the lava-server package.
lava-master - controls all V2 test jobs after devices have been assigned. Logs are created on the master:
/var/log/lava-server/lava-master.log
lava-scheduler - controls how all devices are assigned. Control will be handed over to lava-master once V1 code is removed. Logs are created on the master:
/var/log/lava-server/lava-scheduler.log
lava-slave - controls the operation of the test job on the slave. Includes details of the test results recorded and job exit codes. Logs are created on the slave:
/var/log/lava-dispatcher/lava-slave.log
apache - includes XML-RPC logs:
/var/log/apache2/lava-server.log
gunicorn - details of the WSGI operation for django:
/var/log/lava-server/gunicorn.log
slave logs are transmitted to the master - temporary files used by the test job are deleted when the test job ends.
job validation - the master retains the output from the validation of the test job performed by the slave. The log is stored on the master under the lavaserver user - so for job ID 4321:
$ sudo su lavaserver
$ ls /var/lib/lava-server/default/media/job-output/job-4321/description.yaml
other test job data - also stored in the same location on the master are the complete log file (output.yaml) and the logs for each specific action within the job, in a directory tree below the pipeline directory.
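When debugging a particular test job, it is often enough to follow the master log live, search it for the job ID, and then inspect the files stored for that job. A minimal sketch using the paths above (job ID 4321 is illustrative):

$ sudo tail -f /var/log/lava-server/lava-master.log
$ sudo grep 4321 /var/log/lava-server/lava-master.log
$ sudo su lavaserver
$ ls /var/lib/lava-server/default/media/job-output/job-4321/
# expect description.yaml, output.yaml and a pipeline/ directory of per-action logs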
Some device configuration can be overridden without making changes to the Jinja2 templates. This does require some understanding of how template engines like Jinja2 operate.
To identify which variables can be overridden, check the template for placeholders. A commonly set value for QEMU device types is the amount of memory (on the dispatcher) which QEMU will be allowed to use for each test job:
- -m {{ memory|default(512) }}
Most administrators will need to set the memory constraint in the device dictionary so that test jobs cannot allocate all the available memory and cause the dispatcher to struggle to provide services to other test jobs. An example device dictionary to override the default (and also prevent test jobs from setting a different value) would be:
{% extends 'qemu.jinja2' %}
{% set memory = 1024 %}
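Depending on your LAVA version, a device dictionary like this can typically be applied on the master with the management command below; treat the exact invocation as an assumption to verify against your installed version (the hostname and filename are illustrative):

$ sudo lava-server manage device-dictionary --hostname qemu01 --import qemu01.jinja2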
Admins need to balance the memory constraint against the number of other devices on the same dispatcher. There are occasions when multiple test jobs start at the same time, so admins may also want to limit the number of emulated devices on any one dispatcher to the number of cores on that dispatcher and set the amount of memory so that, with all devices in use, some memory remains available for the system itself.
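As an illustrative worked example (the figures are assumptions, not recommendations): a dispatcher with 8 cores and 16GB of RAM could host 8 QEMU devices, each with memory set to 1024. At full load the emulated devices would commit 8 x 1024MB = 8GB, leaving roughly 8GB for the dispatcher itself.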
Most administrators will not set the arch variable of a QEMU device, so that test writers can use the one device to run test jobs using a variety of architectures by setting the architecture in the job context. The QEMU template has conditional logic for this support:
{% if arch == 'arm64' or arch == 'aarch64' %}
qemu-system-aarch64
{% elif arch == 'arm' %}
qemu-system-arm
{% elif arch == 'amd64' %}
qemu-system-x86_64
{% elif arch == 'i386' %}
qemu-system-i386
{% endif %}
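For example, a test writer can select the aarch64 emulator from the logic above by setting the architecture in the job context of the test job definition:

context:
  arch: arm64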
Note
Limiting QEMU to specific architectures on dispatchers which are not able to safely emulate an x86_64 machine due to limited memory or number of cores is an advanced admin task. Device tags will be needed to ensure that test jobs are properly scheduled.
The dispatcher uses a variety of constants and some of these can be overridden in the test job definition.
A common override, used when operating devices on your desk or when a PDU is not available, allows the dispatcher to recognise a soft reboot. This uses the shutdown-message parameter support in the u-boot boot action:
- boot:
    method: u-boot
    commands: ramdisk
    type: bootz
    parameters:
      shutdown-message: "reboot: Restarting system"
    prompts:
    - 'linaro-test'
    timeout:
      minutes: 2
Note
If you are considering using MultiNode in your Test Plan, now is the time to ensure that MultiNode jobs can run successfully on your instance.
Once you have a couple of QEMU devices running and you are happy with how to maintain, debug and test using those devices, start adding known working devices. These are devices which already have templates in:
/etc/lava-server/dispatcher-config/device-types/
The majority of the known device types are low-cost ARM developer boards which are readily available. Even if you are not going to use these boards for your main testing, it is strongly recommended to obtain a couple of them, as they will make it substantially easier to learn how to administer LAVA for any devices other than emulators.
Physical hardware like these dev-boards has requirements such as:
Understanding how all of those bits fit together to make a functioning LAVA instance is much easier when you use devices which are known to work in LAVA.
Early administration tasks: