OK, so what do you do when you need to optimize your code?
Where do you start?
First, you either already have an idea of where the CPU spends most of its time, or you have to find out. There are several ways to measure this.
The border trick is one of them (it uses different border colors, and is explained further below), but there are other methods, especially when you can run your code in a debugger or inside an emulator.
Once you know the offending routines, you target them one by one, or as a set if they are related and could be improved together.
Here is a short list of things you can do to optimize an existing routine, in order of preference:
1 - Review your existing algorithm.
Your first implementation is usually not the best one, so reviewing what you did and why, together with anything you have learned in the meantime, is a good way to improve your algorithm.
But the main idea of this item is not just to "improve", but to question and possibly completely replace your existing implementation with a new one.
As the saying goes, "there is more than one way to skin a cat".
Think outside the box! (not the cat litter box)
2 - Review and/or reorder your register allocation.
Especially when programming in assembly language, keep in mind that each CPU has its target market and typical operations, and the designers of its instruction set had specific goals in mind, either to optimize the hardware or to improve the typical (target) application workflow.
If you are not aligned with the designers' mindset, you will probably implement your routines sub-optimally: you will use more instructions than needed, or use them in an unoptimized way, requiring more push/pop pairs or more operations to achieve the same result.
The Z80 has a lot of specialized instructions and you should use them when suitable, but don't get trapped into always doing things the same way. Sometimes there are better ways, when you combine 2, 3 or more factors into a single routine.
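As a small illustration of register choice, here is a sketch (not taken from the actual game code; the labels and the 16-byte clear are made up for the example) of the same loop with its counter in C and then in B. The Z80 was designed with B as the loop counter, and DJNZ rewards you for going along with that.

```asm
; Counter kept in C: two instructions to count and branch.
clear_c:
        ld   c, 16          ; hypothetical byte count
cc_loop:
        ld   (hl), 0        ; 10 T-states
        inc  hl             ; 6
        dec  c              ; 4
        jr   nz, cc_loop    ; 12 taken / 7 on exit
        ret

; Counter kept in B: DJNZ counts and branches in one instruction.
clear_b:
        ld   b, 16
cb_loop:
        ld   (hl), 0        ; 10
        inc  hl             ; 6
        djnz cb_loop        ; 13 taken / 8 on exit
        ret
```

Per pass, DEC C + JR NZ costs 16 T-states against 13 for DJNZ, and the B version is also one byte shorter.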
3 - Reverse Loops
The CPU doesn't care if you loop forwards or backwards, so it is sometimes useful to reverse the flow, either because the end value can be reused next, or because it needs a few less push/pops, or saves you from having to pre-calculate something, or lets you use registers more efficiently.
Just remember, optimize for the loop, not for the pre-setup.
A loop usually runs many times, and a small improvement multiplied by many iterations can give a nice boost.
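One common case, sketched below under made-up assumptions (an 8-pass loop, labels and body invented for the example), is a counter that runs upwards and needs an explicit compare to detect the end. Counting down instead lets DEC set the Z flag for you.

```asm
; Forward: C counts 0..7, so the end test needs an explicit compare.
fill_fwd:
        ld   c, 0
ff_loop:
        ld   (hl), e        ; 7  placeholder loop body
        inc  hl             ; 6
        inc  c              ; 4
        ld   a, c           ; 4
        cp   8              ; 7
        jr   nz, ff_loop    ; 12 taken / 7 on exit
        ret

; Reversed: C counts 8..1, DEC C sets the Z flag by itself.
fill_rev:
        ld   c, 8
fr_loop:
        ld   (hl), e        ; 7
        inc  hl             ; 6
        dec  c              ; 4
        jr   nz, fr_loop    ; 12 taken / 7 on exit
        ret
```

The reversed version drops LD A,C and CP 8, which is 11 T-states per pass, and it no longer clobbers A inside the loop.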
4 - Optimize for the statistically relevant branch
If your routine has an IF statement (conditional jump or call), you should first determine which case is the most likely: to branch/jump or not.
If it is to branch/jump, then you should reverse it, since a taken branch typically costs more cycles than one that falls through (on the Z80, a JR cc costs 12 T-states when taken and 7 when not).
So if you reverse the condition and adapt the code to this change, your code will flow more often through the faster sequence than through the slower one, which pays off in the long run.
NOTE: obviously this does not apply to loop conditional jumps. Jumping is certainly the more frequent case there, but that is the nature of the beast: a loop always jumps on its condition, except when exiting. Nothing you can do about it.
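A minimal sketch of the idea, assuming a hypothetical routine that stores A at (HL) only when A is non-zero, and assuming that non-zero is the common case:

```asm
; Before: the common case (A != 0) takes the jump.
store_before:
        or   a              ; 4  set Z if A == 0
        jr   nz, sb_store   ; 12 when taken (common) / 7 when not
        ret                 ; rare case: nothing to do
sb_store:
        ld   (hl), a        ; 7
        ret

; After: condition reversed, the common case falls through.
store_after:
        or   a              ; 4
        jr   z, sa_skip     ; 7 when not taken (common) / 12 when taken
        ld   (hl), a        ; 7
sa_skip:
        ret
```

The common path saves 5 T-states and the rare path pays 5 more. In this particular shape a RET Z (5 T-states when not taken, 11 when taken) would do even better than the JR Z.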
5 - Inline sub-routines
If you call a sub-routine inside a loop, the loop runs a sufficient number of times, and the sub-routine is small enough, you should inline it.
To inline a routine is just to place an adapted copy of it locally, instead of calling it.
NOTE: Although this duplicates code, it avoids a CALL/RET pair on each iteration (17 + 10 = 27 T-states on the Z80), which can make a huge difference, especially in a very tight loop or one that runs many times.
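To put numbers on it, here is a sketch with a made-up two-instruction helper (put_byte and the 16-pass loop are hypothetical, not the real BLIT code):

```asm
; Before: the loop body lives in a sub-routine.
fill_call:
        ld   b, 16
fc_loop:
        call put_byte       ; 17
        djnz fc_loop        ; 13 taken / 8 on exit
        ret
put_byte:
        ld   (hl), a        ; 7
        inc  hl             ; 6
        ret                 ; 10

; After: the same body, inlined.
fill_inline:
        ld   b, 16
fi_loop:
        ld   (hl), a        ; 7
        inc  hl             ; 6
        djnz fi_loop        ; 13 taken / 8 on exit
        ret
```

Each pass drops from 53 to 26 T-states while the DJNZ is taken, so here the CALL/RET pair was costing more than the useful work.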
6 - Partial inline sub-routines
Sometimes you can't, or shouldn't, inline a sub-routine because it is large or complex. But it might have a quick exit condition for specific cases, and if those are the statistically most relevant ones, a partial inline could be beneficial.
A partial inline is just what the name suggests: only part of the routine is copied and adapted locally. That copy still branches or calls into the internals of the original routine when needed, but in the statistically relevant case it does its work without performing a CALL to the original routine.
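As an illustration, here is a sketch based on the well-known "move one pixel row down" helper for the ZX Spectrum screen layout. The labels, the 16-row loop and the placeholder body are assumptions for the example, and this is not necessarily what my own routines look like. Seven times out of eight the helper exits right after the AND 7 test, so only that part is copied into the caller.

```asm
; Original generic routine: move screen address HL one pixel row down.
pixel_down:
        inc  h              ; 4  next pixel row inside the character cell
        ld   a, h           ; 4
        and  7              ; 7  zero only when a cell boundary was crossed
        ret  nz             ; 11 quick exit, taken in 7 rows out of 8
pixel_down_slow:            ; extra entry point used by the partial inline
        ld   a, l
        add  a, 32          ; move to the next character row
        ld   l, a
        ret  c              ; carried into the next screen third: H is already right
        ld   a, h
        sub  8              ; same third: undo the overflow into the third bits
        ld   h, a
        ret

; Caller with the quick-exit part inlined.
draw_rows:
        ld   b, 16          ; hypothetical height in pixel rows
dr_loop:
        ld   (hl), 255      ; 10 placeholder for the real per-row work
        inc  h              ; 4
        ld   a, h           ; 4
        and  7              ; 7
        call z, pixel_down_slow ; 10 when not taken / 17 when taken
        djnz dr_loop
        ret
```

In the common case a full CALL pixel_down costs 17 + 4 + 4 + 7 + 11 = 43 T-states, while the inlined test costs 4 + 4 + 7 + 10 = 25. The rare slow path still runs the original code, so nothing is duplicated beyond three instructions.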
7 - Unroll a loop to a reasonable size
If you have a very tight loop of just a few instructions, a simple jump (JR cond, address or DJNZ address, etc.) might take a large percentage of the loop cycle time. So removing a few of these jumps could be beneficial.
However, this is usually only useful if the loop count is fixed or known beforehand, and if it is always a multiple of some factor (K*N).
Example: if the loop always repeats a multiple of 4 times (K=4), you can unroll it 4 times (copy its instructions 4 times sequentially) and loop just N times instead of K*N. This is an improvement because you reduce the total number of conditional jumps to 1/4 of its original count.
NOTE: Remember, always optimize for what repeats the most, so optimize the inner loop, possibly sacrificing the loop pre-setup.
Warning: A larger pre-setup might make an optimization worthwhile only after a minimum number of iterations, so don't overdo it or you may lose performance.
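A sketch of the K=4 case, assuming a made-up 32-byte fill (so N=8); the labels are hypothetical:

```asm
; Before: 32 passes, one DJNZ per byte.
fill_rolled:
        ld   b, 32
fr_loop:
        ld   (hl), a        ; 7
        inc  hl             ; 6
        djnz fr_loop        ; 13 taken / 8 on exit
        ret

; After: K = 4 copies of the body, N = 8 passes.
fill_unrolled:
        ld   b, 8
fu_loop:
        ld   (hl), a        ; 7
        inc  hl             ; 6
        ld   (hl), a
        inc  hl
        ld   (hl), a
        inc  hl
        ld   (hl), a
        inc  hl
        djnz fu_loop        ; 13 taken / 8 on exit
        ret
```

DJNZ now executes 8 times instead of 32. The 24 removed jumps were all taken ones, so the loop saves 24 x 13 = 312 T-states, going from 827 to 515 (not counting the LD B and the RET), at the cost of 6 extra bytes of code.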
8 - Create a specific case routine from a generic routine.
Routines are usually developed for a generic purpose, which means they account for general input situations. But sometimes a specific set of inputs is more frequent than others, so some improvement may be obtained by implementing a routine optimized for that particular case.
Suppose you have a sprite routine and most of your sprites are 2x2 chars in size. Instead of using the generic sprite routine that can handle all sprite sizes, you can create a special 2x2 version optimized for that specific case, which could include any of the above optimizations, like unrolling loops, inlining sub-routines, etc.
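To make this concrete, here is a sketch of just the "copy one sprite row" step, first generic (width passed in C) and then specialized for sprites 2 bytes wide. The register conventions (DE = sprite data, HL = screen address, C = width in bytes) are assumptions for the example, and INC L assumes the row does not wrap past the last screen column.

```asm
; Generic: copy C bytes of one sprite row from (DE) to (HL).
row_generic:
        ld   b, c           ; 4  loop count = sprite width in bytes
rg_loop:
        ld   a, (de)        ; 7
        ld   (hl), a        ; 7
        inc  de             ; 6
        inc  l              ; 4
        djnz rg_loop        ; 13 taken / 8 on exit
        ret

; Specialized: width fixed at 2 bytes, no counter and no loop.
row_2wide:
        ld   a, (de)        ; 7
        ld   (hl), a        ; 7
        inc  de             ; 6
        inc  l              ; 4
        ld   a, (de)        ; 7
        ld   (hl), a        ; 7
        inc  de             ; 6
        inc  l              ; 4
        ret
```

For a 2-byte row the specialized version drops the counter setup and both DJNZs (4 + 13 + 8 = 25 T-states per row), it leaves B free for the row counter, and it is a natural place to apply the inlining ideas above as well.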
There is a lot more stuff you can do, but this is already a very long post.
Here is a list of optimizations I actually implemented to improve on the rendering speed from the previous post.
Performance Improvements between versions 43 and 47

From the previous pictures, we know our performance is increasing, because the RED border area keeps getting smaller (going up) with each improvement.
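How does this work?
Before I start to "render" or blit stuff to the screen, I set the border to RED, and when I'm finished I set it to BLUE.
So I'm using the ULA's screen rendering to visually measure the time it takes to finish the screen "render": the ULA paints the border as it scans the frame from top to bottom, so the height of the RED band is proportional to the time spent rendering.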
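In code, the border trick can be sketched like this. The labels and the empty render_screen stub are placeholders, and the HALT assumes interrupts are enabled so that each measurement starts at the frame interrupt and the RED band begins at a stable position.

```asm
BORDER_PORT equ 254         ; ULA port 0xFE; bits 0-2 select the border colour
RED         equ 2
BLUE        equ 1

frame:
        halt                ; assumption: wait for the frame interrupt
        ld   a, RED
        out  (BORDER_PORT), a   ; border goes RED: work starts
        call render_screen      ; hypothetical routine being measured
        ld   a, BLUE
        out  (BORDER_PORT), a   ; border goes BLUE: work finished
        jr   frame

render_screen:
        ret                 ; placeholder for the real rendering code
```

The two OUT instructions add only a few T-states each, so the measurement barely disturbs what it measures.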
From left to right, here are the improvements applied:
Iteration 1 (First to Second screen)
- Reviewed and improved original BLIT routine, improved register use.
- Managed to remove a push/pop pair.
- Inlined a sub-routine, which removes an extra CALL/RET pair in every loop cycle.
Iteration 2 (Second to Third screen)
- Partially inlined a sub-routine INCSY, inside BLIT routine
- Improved this sub-routine's test condition (IF) by a single T-state.
- Optimized a statistically relevant branch
I also have a reverse BLIT routine, which lets me reuse the same sprites flipped vertically. It was developed from the previous one, so it is a good candidate for similar improvements.
Iteration 3 (Third to Fourth screen)
- Reviewed and improved original Reverse BLIT routine.
- Reversed the inner loop, which allowed me to remove a push/pop pair.
- Inlined a sub-routine, which removes an extra CALL/RET pair in every loop cycle.
Iteration 4 (Fourth to Fifth screen)
- Partially inlined a sub-routine INCSY, inside Reverse BLIT routine
- Improved the sub-routine's test condition by a single T-state.
- Optimized a statistically relevant branch