Recently we were involved in improving boot-up times of a Linux based platform. We noticed that the boot-up time was much longer than expected based on the processor speed.
The only way to figure out what was happening was to profile the start-up procedure. Something that is not always easy on limited embedded platforms like these! After getting some profiling tools up-and-running on the platform, we immediately detected an issue. About half of the boot time was spend into Linux exception handlers!
The busy exception handler was the one for handling invalid instruction. How was this possible?
Some further analysis showed us that the instructions being emulated in the kernel were the atomic swap instructions. It seemed that this version of the processor did not have support for atomic instructions!
Why would you need atomic instructions anyway?
Once you are running multi-threaded systems, there is a need to protect shared resources from simultaneous access. Without such protection mechanisms, data can get corrupted, something you do not want. Furthermore, such errors are not easy to track as they might lead to unwanted behavior a long time after the corruption has occurred.
In the good old times, there was no protection between user and kernel space and there were no multiple processors active in the system. At that time, locking could easily be done by masking all interrupts. In modern operating systems, applications are not allowed to do this. An easy solution is to provide a special atomic instruction which is guaranteed not to be interrupted (a swap or read-modify-write instruction). As such there is no need to jump to the kernel space which takes a large overhead (at least when no blocking is introduced by the lock).
How could we solve the problem?
We detected that one lock in the system was responsible for a high percentage of the exceptions. We replaced this lock with a semaphore implementation of Edsger Dijkstra (the inventor of the semaphore around 1965). The prerequisite for a semaphore (as stated by Dijkstra) is that it should be atomic. If there is no hardware available to do this, then other means have to be used to guarantee atomic behavior.
In this case this was done by using a state location for each thread accessing the semaphore. This state location is only written by this specific thread (thus array of states by thread index). This state can be “free”, “try to take”, “taken”. Simplified the algorithm is as follows: upon entry of the P() operation the thread set its own state to “try to take” after which it will loop through all other thread flags, if some other thread state is not set “free” then another thread is trying to have or having the semaphore… Once we enter the critical section the state is set “taken”. However, now a second loop is started to check again (it could be that some thread set its state after we passed its array index in the first loop).
Using the two loops with the triple state guarantees that contention is detected in all cases. Upon contention detection, arbitration is done on state level (highest level wins) and thread index (if on same state level).
The original semaphore implementation used a yield to solve the contention. At that time scheduling was not yet priority-based, but using only one task queue. In such an environment, a simple yield works. Also the semaphore was made fair, so that if multiple threads where blocked, the thread index arbitration was done in a round-robin way.
In our case we could not use the yield system as this only works correctly if all threads are running at the same real priority. It was also no option to replace this yield by more clever techniques like a lower level real Linux lock as this would introduce a race condition. So the only solution was to use a sleep.
Using this solution, round trips to the kernel were now avoided if there was no lock contention. However the much rarer contention cases now lead to sleeps. Using the fair round-robin arbitration of the original implementation lead to long delays. When a thread coming out of the contended lock detected that there was another thread waiting, it immediately went to sleep again… This resulted in an idle period where different contending threads were sleeping. This was avoided by introducing a complete unfair system: the one releasing a contended lock keeps running (so there was always at least one thread running). After all, after a blocked thread wakes-up from his sleep, in most cases it could immediately take the lock (as said, contention is the rare case, most of the times locks are not contended).
At the end the boot-up time was almost divided by two when compared with the original situation!
- Take care when selecting processors for your embedded system.
- In protected, multi-threaded environments atomic instructions are a must.
- Atomic instructions are used in libraries to handle locking primitives.
- Linux solves the problem for you by emulating these instructions if the processor lacks it, but at the cost of a drastic decrease in performance.