
smallcpus

Differences

This shows you the differences between two versions of the page.

smallcpus [2012/12/02 17:54]
kris
smallcpus [2015/10/21 19:05] (current)
kris
Line 5: Line 5:
 ===== Enabling Assembler Programming on Arduino/​Uno32 ===== ===== Enabling Assembler Programming on Arduino/​Uno32 =====
  
-It is hard to to get a complete ​understanding of Arduino and the various CPUs involved. The standard IDE involves a lot of '​magic'​ and there is not much help when you want to get a better understanding of how the magic works.+**More recent versions of the Arduino IDE recognize .S files, and no changes are needed. I've tested this with version 1.6.5. If you're using a recent IDE, you can skip ahead to 'The Blink Example In Assembler'​** 
 + 
 +**Make sure to also check '​Feedback from Readers'​ at the end: it contains some important info for newer versions of the IDE.** 
 + 
 +It is hard to get an in-depth ​understanding of Arduino and the various CPUs involved. The standard IDE involves a lot of '​magic'​ and there is not much help when you want to get a better understanding of how the magic works.
  
 I had to do a lot of digging to figure out how to easily do assembler programming on the Arduino; below the results of my information quest. I had to do a lot of digging to figure out how to easily do assembler programming on the Arduino; below the results of my information quest.
Line 103: Line 107:
 wherever_you_put_it/​Arduino-master/​build/​macosx/​arduino-0102asm-macosx.zip wherever_you_put_it/​Arduino-master/​build/​macosx/​arduino-0102asm-macosx.zip
  
-I then decompressed this (which ​gives you Arduino.app) and then moved this patched IDE into my /​Applications folder instead of the '​official'​ downloadable version.+I then decompressed this (which ​gave me Arduino.app) and then moved this patched IDE into my /​Applications folder instead of the '​official'​ downloadable version.
  
 On Linux it works pretty much the same way; I haven'​t tried it on Windows, but I imagine it to be similar too. On Linux it works pretty much the same way; I haven'​t tried it on Windows, but I imagine it to be similar too.
Line 123: Line 127:
 #ifdef __ASSEMBLER__ #ifdef __ASSEMBLER__
  
-/* assembler-only stuff */+/* Assembler-only stuff */
  
 #else  /* !ASSEMBLER */ #else  /* !ASSEMBLER */
 +
 +/* C-only stuff */
  
 #include <​stdint.h>​ #include <​stdint.h>​
Line 135: Line 141:
 </​code>​ </​code>​
  
-This defines the assembler routines in such a way that they can be called from a C/C++ program. To avoid issues with C++ name mangling, I defined the functions as _extern ​"​C"​- this tells the C/C++ compiler that the underlying function is pure C, not C++ and does not need name mangling.+This defines the assembler routines in such a way that they can be called from a C/C++ program. To avoid issues with C++ name mangling, I defined the functions as //​extern ​"​C"​// - this tells the C/C++ compiler that the underlying function is using pure C calling conventions as opposed to C++ calling conventions, ​and hence does not need name mangling.
  
 In the file asmtest.S we get the assembler code: In the file asmtest.S we get the assembler code:
Line 183: Line 189:
  
 In other words, I still define the setup() and loop() functions, and these then call into my assembler functions. In other words, I still define the setup() and loop() functions, and these then call into my assembler functions.
 +
 +This first test is only '​half-assembler'​ - we still have some C/C++ backbone, but the difference in size is significant already. A standard '​Blink'​ sketch compiles to 1084 bytes. My assembler version is only 582 bytes.
 +
 +Assembler is most often not the right language to do things in, as the most expensive resource is often the developer's time, and writing assembler is slower and more error-prone than writing C. However, when memory or CPU time is tight, using assembler can result in substantial space savings and speed improvements.
 +
 +===== The BlinkWithoutDelay Example in Assembler =====
 +
 +My next exercise was rebuilding the BlinkWithoutDelay example. My first attempt reduced the code size from 1028 bytes to 580 bytes. The assembler part is the same as in the Blink example.
 +
 +asmtest.h:
 +<​code>​
 +/*
 + * Global register variables.
 + */
 +#ifdef __ASSEMBLER__
 +
 +/* Assembler-only stuff */
 +
 +#else  /* !ASSEMBLER */
 +
 +/* C-only stuff */
 +
 +#include <​stdint.h>​
 +
 +extern "​C"​ uint8_t led(uint8_t);​
 +extern "​C"​ uint8_t asminit(uint8_t);​
 +
 +#endif /* ASSEMBLER */
 +</​code>​
 +
 +asmtest.S:
 +<​code>​
 +#include "​avr/​io.h"​
 +#include "​asmtest.h"​
 +
 +.global asminit
 +asminit:
 +sbi  4,5; 4 = DDRB (0x24 - 0x20). Bit 5 = pin 13
 +ret
 +
 +.global led ; The assembly function must be declared as global
 +led:
 +cpi r24, 0x01 ; Parameter passed by caller in r24
 +breq turnoff
 +sbi 5, 5; 5 = PORTB (0x25 - 0x20). Bit 5 = pin 13
 +ret
 +turnoff:
 +cbi 5, 5; 5 = PORTB (0x25 - 0x20). Bit 5 = pin 13
 +ret
 +</​code>​
 +
 +sketch.ino:
 +<​code>​
 +#include "​asmtest.h"​
 +
 +int x = 0;
 +int on = 1;
 +
 +void setup()
 +{
 +  asminit(0);
 +}
 +
 +void loop()
 +{
 +    int y;
 +    if ((y = millis()) - x > 0)
 +    {
 +      x = y + 1000;
 +      on = -on;
 +      led(on);
 +    } 
 +}
 +</​code>​
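The sketch flips `on` between 1 and -1, while `led()` takes a `uint8_t`, so the value that arrives in r24 is either 1 or 255. Below is a minimal host-side sketch of that logic in plain C, just to make the `cpi`/`breq` decision visible. The names `led_model` and `as_r24` are made up for illustration, and PORTB is modelled as an ordinary byte rather than a real register.

```c
#include <stdint.h>

/* Hypothetical host-side model of the assembler led() routine.
 * portb stands in for the PORTB register; the real code uses sbi/cbi. */
uint8_t led_model(uint8_t portb, uint8_t arg)
{
    if (arg == 0x01) {                          /* cpi r24, 0x01 ; breq turnoff */
        return (uint8_t)(portb & ~(1u << 5));   /* cbi 5, 5 */
    }
    return (uint8_t)(portb | (1u << 5));        /* sbi 5, 5 */
}

/* The sketch passes an int that alternates between 1 and -1;
 * converting -1 to uint8_t yields 255, which is what lands in r24. */
uint8_t as_r24(int on)
{
    return (uint8_t)on;
}
```

So a call with 1 clears bit 5, and any other value (including the 255 that -1 turns into) sets it.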
 +
 +===== BlinkWithoutDelay in assembler only =====
 +
 +The next step I tried was to build a version of BlinkWithoutDelay that only uses assembler. This is what I came up with:
 +
 +Make a new sketch, and empty the complete .ino file. You still need to have it, but it should be empty.
 +
 +Then also create a file called '​asmtest.S'​ in the same directory as the (empty) .ino file:
 +
 +<​code>​
 +#include "​avr/​io.h"​
 +
 +#define yl r28
 +#define yh r29
 +
 +.global setup
 +setup:
 +  sbi  _SFR_IO_ADDR(DDRB),​ DDB5 ; Bit 5 = pin 13
 +  ret
 +
 +// const long delay = 1000;
 +#define delay 1000 // ms
 +
 +.global loop
 +loop:
 +
 +  push yl
 +  push yh
 +
 +  call millis ​  ; call millis(): 4-byte return value in r25...r22
 +  ​
 +  // Use Y as a pointer to fetch the next time to switch the LED
 +  ldi yl, lo8(nextSwitchAfterMillis) ​
 +  ldi yh, hi8(nextSwitchAfterMillis)
 +  ​
 +  ld r18, y+
 +  ld r19, y+
 +  ld r20, y+
 +  ld r21, y+
 +  ​
 +  ld r17, y // ledStatus comes immediately after nextSwitchAfterMillis, so we can keep using y
 +  ​
 +  // Compare nextSwitchAfterMillis with value returned by millis()
 +  sub r18, r22
 +  sbc r19, r23
 +  sbc r20, r24
 +  sbc r21, r25
 +  ​
 +  brcc tooEarly ; carry is set if r18...r21(nextSwitchAfterMillis) < r22...r25(millis())
 +
 +  // Toggle LED state: 0 -> 1, 1 -> 0  ​
 +  inc r17
 +  andi r17, 1
 +  ​
 +  // Store ledStatus for next time. y still points at its memory location
 +  st y, r17
 +  ​
 +  // set LED state 
 +  brne turnoff
 +  ​
 +  cbi _SFR_IO_ADDR(PORTB),​ PORTB5; Bit 5 = pin 13
 +  rjmp ledSwitched  ​
 +  ​
 +turnoff:
 +  sbi _SFR_IO_ADDR(PORTB),​ PORTB5; Bit 5 = pin 13
 +  ​
 + 
 +ledSwitched:​
 +  // Add delay to result of call to millis()
 +  ldi r17, lo8(delay)
 +  add r22, r17
 +  ldi r17, hi8(delay)
 +  adc r23, r17
 +  ldi r17, hlo8(delay)
 +  adc r24, r17
 +  ldi r17, hhi8(delay)
 +  adc r25, r17
 +  ​
 +  // Store this as the next point in time when we need to toggle the LED
 +  st -y, r25
 +  st -y, r24
 +  st -y, r23
 +  st -y, r22
 +  ​
 +tooEarly: ​
 +  pop yh
 +  pop yl
 +  ret
 +
 +.data 
 +
 +nextSwitchAfterMillis:​
 +.long 0
 +
 +ledStatus:
 +.byte 0
 +</​code>​
 +
 +A few tidbits: ​
 +
 +- I changed the //sbi 4,5// and similar to something like //sbi  _SFR_IO_ADDR(DDRB),​ DDB5// using predefined symbols that are defined in the "​avr/​io.h"​ include file, so the assembler code better expresses what it does. Underneath, it's still exactly the same - so there is no cost to doing this, but the code becomes more self-explanatory.
 +
 +- I defined the setup() and loop() functions in assembler instead of in C. The Arduino '​wrapper'​ that is automatically compiled and linked in together with my code defines both setup() and loop() as '​extern "​C"'​ routines, so the Arduino '​runtime'​ will find these routines, even though they'​re defined in assembler instead of C.
 +
 +- I am calling millis() from assembler. This routine returns a 4-byte long; the assembler routine uses this long for comparison and for calculations. The millis() routine returns the long value in r25..r22, which follows the standard avr-gcc calling convention.
 +
 +- I am using the Y register (composed of r29..r28) as a '​pointer'​ into memory, using post-increment and pre-decrement to access a sequence of 5 bytes. 4 bytes are used for the millis value when the LED will be toggled, and another byte to contain the LED's current state in bit 0.
 +
 +- I learned the hard way you need to save and restore the contents of r29..r28 when you clobber them. Hence the push... and pop... of yh/yl at the start and end of the loop routine.
 +
 +- This version of the routine takes 576 bytes, so it only saved 4 extra bytes from the previous '​hybrid C++/​asm'​ version.
 +
 +- I also tried compiling a sketch with an empty loop() and setup() (both composed of just a 'ret' assembler instruction). Such a sketch takes 466 bytes. The two 'ret' instructions are 4 bytes, so the Arduino minimal 'runtime' weighs in at 462 bytes. That means that my last BlinkWithoutDelay needs about 576 - 462 = 114 bytes of its own.
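The 4-byte comparison against millis() in the listing above (sub followed by three sbc, then brcc) can be modelled on a PC. This is a rough sketch in plain C, assuming the carry flag behaves as a borrow flag the way it does on AVR. The function names `sub32_carry` and `too_early` are invented for this illustration.

```c
#include <stdint.h>

/* Byte-wise subtract with borrow, mirroring sub / sbc / sbc / sbc.
 * Returns the final carry (borrow) flag: 1 if a < b as unsigned 32-bit. */
int sub32_carry(uint32_t a, uint32_t b)
{
    int carry = 0;
    for (int i = 0; i < 4; i++) {
        int ai = (int)((a >> (8 * i)) & 0xFF);
        int bi = (int)((b >> (8 * i)) & 0xFF);
        int diff = ai - bi - carry;  /* carry is 0 for the first byte (plain sub) */
        carry = diff < 0;            /* borrow out becomes the carry flag */
    }
    return carry;
}

/* brcc tooEarly: the toggle is skipped while the carry is clear,
 * i.e. while nextSwitchAfterMillis >= millis(). */
int too_early(uint32_t next_switch, uint32_t now)
{
    return !sub32_carry(next_switch, now);
}
```

The borrow ripples from the low byte to the high byte, which is why the four instructions have to run in that order.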
 +
 +===== Knight Rider LEDs =====
 +
 +I'm working my way through the tutorials at www.mindkits.co.nz:​
 +
 +http://​www.mindkits.co.nz/​tutorials
 +
 +as I ordered their ready-made tutorial stuff. I haven't progressed very far yet, as I keep getting sidetracked by all kinds of interesting things, like assembler-programming tricks.
 +
 +My next subject is an assembler version for part of Tutorial#0, where you wire up 8 LEDs to pins 2-9 of the Arduino and then make the LEDs light up in sequence, from left to right, then back from right to left, and so on.
 +
 +Look for Exercise 0.1 on 
 +
 +http://​www.mindkits.co.nz/​tutorials/​arduino_tutorials/​Tutorial-0
 +
 +First I did a C version, which is pretty simple. I have a variable named '​step'​ which is either +1 (when going '​up'​) or -1 (when going '​down'​). Start at the first LED, add the step, until you hit the '​upper'​ LED, then change the sign on the step (+1 becomes -1, -1 becomes +1), continue until you hit the '​lower'​ LED, change the sign on the step, and so on...
 +
 +This initial version blocks inside delay() on every pass through the loop routine - which is not as tidy, but it works. Total size of compiled sketch: 1150 bytes.
 +
 +<​code>​
 +#define ledFrom ​ 2
 +#define ledTo 9
 +#define ledJump (ledTo - ledFrom)
 +
 +char cur;
 +char step;
 +char destLed;
 +
 +// the setup routine runs once when you press reset:
 +void setup() {                ​
 +  // initialize the digital pin as an output.
 +  for (char i = ledFrom; i <= ledTo; i++)
 +  {
 +    pinMode(i, OUTPUT);
 +  }
 +  cur = ledFrom;
 +  step = 1;
 +  destLed = ledTo;
 +}
 +
 +// the loop routine runs over and over again forever:
 +void loop() {
 +  if (cur == destLed)
 +  {
 +    if (step > 0)
 +    {
 +      destLed -= ledJump;
 +    }
 +    else
 +    {
 +      destLed += ledJump;
 +    }
 +    step = -step;
 +  }  ​
 +  digitalWrite(cur, LOW);   // turn the current LED off (LOW is the voltage level)
 +  cur += step;
 +  digitalWrite(cur, HIGH);   // turn the next LED on (HIGH is the voltage level)
 +  delay(100);
 +}
 +</​code>​
 +
 +Then I wrote an assembler-only version. In this version, the .ino file remains empty, and all you do is add a file 'knightrider.S' to the sketch on a second tab. Unless you are using a recent IDE (see the note at the top of this page), you need the patched IDE for this - older standard IDEs won't recognize the .S file. As you'll see further on, this first attempt is far from optimal.
 +
 +<​code>​
 +#include "​avr/​io.h"​
 +
 +#define yl r28
 +#define yh r29
 +
 +.global setup
 +setup:
 +  push yl
 +  push yh
 +  ​
 +  ldi yl, lo8(DDRD)
 +  ldi yh, hi8(DDRD)
 +  ld r25, y // DDRD: bit 2 = pin 2 .. bit 7 = pin 7
 +  ori r25, 0xFC
 +  st y, r25
 +  ldi yl, lo8(DDRB)
 +  //ldi yh, hi8(DDRB) // is same, no need to reload
 +  ld r25, y // DDRB: bit 0 = pin 8, bit 1 = pin 9
 +  ori r25, 0x03
 +  st y, r25
 +  ​
 +  pop yh
 +  pop yl  ​
 +  ​
 +  ret
 +
 +// const long delay = 50;
 +#define delay 50 // ms
 +
 +.global loop
 +loop:
 +
 +  push yl
 +  push yh
 +
 +  call millis ​  ; call millis(): 4-byte return value in r25...r22
 +  ​
 +  // Use Y as a pointer to fetch the next time to switch the LED
 +  ldi yl, lo8(nextSwitchAfterMillis) ​
 +  ldi yh, hi8(nextSwitchAfterMillis)
 +  ​
 +  ld r18, y+
 +  ld r19, y+
 +  ld r20, y+
 +  ld r21, y+
 +  ​
 +  // ldi yl, lo8(shifters) not needed; y is already correct ​
 +  // ldi yh, hi8(shifters)
 +  ld r17, y+ 
 +  ld r16, y
 +  ​
 +  // Compare nextSwitchAfterMillis with value returned by millis()
 +  cp r22, r18
 +  cpc r23, r19
 +  cpc r24, r20
 +  cpc r25, r21
 +  ​
 +  brcs tooEarly ; carry is set if r22...r25(millis()) < r18...r21(nextSwitchAfterMillis)
 +
 +  // Rotate left & right; bit travels up then down through r17 and r16
 +  // Carry is clear, no need for CLC
 +  rol r17
 +  sbrc r16,0
 +  ori r17,1
 +  ror r16
 +
 +  st y, r16
 +  st -y, r17
 +
 +  // Combine left and right shifter
 +  or r17,r16
 + 
 +  // 6 high bits of port D
 +  ldi yl,​lo8(PORTD)
 +  ldi yh,​hi8(PORTD)
 +  mov r16, r17
 +  lsl r16 // Zeroes 2 low bits of r16
 +  lsl r16
 +  st y,r16
 +  ​
 +  // 2 low bits of port B
 +  ldi yl,​lo8(PORTB)
 +  //ldi yh,​hi8(PORTB) // is already same
 +  ldi r16,6
 + ​shift6:​
 +  lsr r17 // Zeroes high bits of r17
 +  dec r16
 +  brne shift6
 +  st y,r17
 +  ​
 +   // Add delay to result of call to millis()
 +  ldi r17, lo8(delay)
 +  add r22, r17
 +  ldi r17, hi8(delay)
 +  adc r23, r17
 +  ldi r17, hlo8(delay)
 +  adc r24, r17
 +  ldi r17, hhi8(delay)
 +  adc r25, r17
 +  ​
 +  // Store this as the next point in time when we need to toggle the LED
 +  ldi yl, lo8(nextSwitchAfterMillis) ​
 +  ldi yh, hi8(nextSwitchAfterMillis)
 +  ​
 +  st y+, r22
 +  st y+, r23
 +  st y+, r24
 +  st y+, r25
 +  ​
 +tooEarly: ​
 +  pop yh
 +  pop yl
 +  ​
 +  ret
 +
 +.data 
 +
 +nextSwitchAfterMillis:​
 +.long 0
 +
 +shifters:
 +.byte 1
 +.byte 0
 +</​code>​
 +
 +This one weighs in at 630 bytes. ​
 +
 +The trick I used is to have two 8-bit values, where one is being rotated right-to-left,​ and the other is being rotated left-to-right. The //rol// and //ror// instructions rotate through the carry flag. 
 +
 +I.e. whatever bit 'falls out' of the register being rotated falls into the carry flag, and the original contents of the previous carry flag are rotated '​into'​ the register. //rol// and //ror// are effectively 9-bit rotations, where the Carry flag is the 9th bit.
 +
 +The two 8-bit values are kept in r17 and r16; looking at the two 8-bit values you'd see successive states like those shown below, because of the //rol r17// and //ror r16// instructions:
 +
 +<​code>​
 +Step 1:
 +
 +r17:​00000001
 +r16:​00000000
 +
 +Step 2:
 +
 +r17:​00000010
 +r16:​00000000
 +
 +Step 3:
 +
 +r17:​00000100
 +r16:​00000000
 +
 +Step 4:
 +
 +r17:​00001000
 +r16:​00000000
 +
 +...
 +
 +Step 8:
 +
 +r17:​10000000
 +r16:​00000000
 +
 +Step 9: (the 1 bit rotates '​out'​ of r17 into the carry, then from the carry into the topmost bit of r16)
 +
 +r17:​00000000
 +r16:​10000000
 +
 +Step 10:
 +
 +r17:​00000000
 +r16:​01000000
 +
 +Step 11:
 +
 +r17:​00000000
 +r16:​00100000
 +
 +...
 +
 +Step 16:
 +
 +r17:​00000000
 +r16:​00000001
 +
 +Step 17: The 1 bit rotates '​out'​ of r16 into the carry, and disappears from view. However, before
 +that happens, the sbrc r16,0 instruction tests for the '​1'​ bit being in the bit-position 0 of r16,
 +and if it is, executes an ori r17, 1 - effectively re-instating the bit into r17, ready for another
 +up-down round.
 +
 +r17:​00000001
 +r16:​00000000
 +</​code>​
 +
 +So, we now have this single bit, marching endlessly around the two 8-bit values. Round and round it goes.
 +
 +The second trick is to then make the '​logical or' of these two values. That gives us a single-byte end-result where the bit seems to go back and forth all the time. That's the value we will use to drive our LEDs.
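To convince myself of the two tricks, the rotate-through-carry sequence can be replayed on a PC. The following is a minimal sketch in plain C, assuming the carry flag is clear on entry (as it is in the listing). The type `shifters_t` and the function `knight_step` are names invented for this illustration.

```c
#include <stdint.h>

typedef struct {
    uint8_t up;    /* models r17, rotated left */
    uint8_t down;  /* models r16, rotated right */
} shifters_t;

/* One step of: rol r17 / sbrc r16,0 / ori r17,1 / ror r16
 * with the carry flag clear on entry. Returns the OR of both bytes. */
uint8_t knight_step(shifters_t *s)
{
    uint8_t carry = (uint8_t)(s->up >> 7);   /* rol: bit 7 falls into the carry */
    s->up = (uint8_t)(s->up << 1);           /* carry-in was 0 */
    if (s->down & 1) {                       /* sbrc r16,0 skips the ori if the bit is clear */
        s->up |= 1;                          /* ori r17,1 (leaves the carry alone) */
    }
    s->down = (uint8_t)((s->down >> 1) | (carry << 7));  /* ror: carry enters bit 7 */
    return (uint8_t)(s->up | s->down);
}
```

Starting from up = 1 and down = 0, the state repeats after 16 calls, and the returned byte walks up to 0x80 and back down to 0x01.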
 +
 +Before we can use this 8-bit value, we need to do some more shifting, because the LEDs are driven from two different ports: 6 LEDs are driven by bits 2-7 of PORTD, and 2 LEDs are driven by bits 0-1 of PORTB. That's what all the //lsl// and //lsr// stuff is about.
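The split across the two ports boils down to two shifts. Here is a small sketch in plain C of what the lsl/lsr sequences compute for the original wiring. The function names `portd_bits` and `portb_bits` are made up for this illustration.

```c
#include <stdint.h>

/* Original wiring: the low 6 bits of the LED byte go to PORTD bits 2-7,
 * the top 2 bits go to PORTB bits 0-1. */
uint8_t portd_bits(uint8_t leds)
{
    return (uint8_t)(leds << 2);   /* two lsl instructions; low 2 bits become 0 */
}

uint8_t portb_bits(uint8_t leds)
{
    return (uint8_t)(leds >> 6);   /* six lsr instructions; only bits 0-1 remain */
}
```

For example, the LED byte 0x01 ends up as 0x04 on PORTD, while 0x80 ends up as 0x02 on PORTB.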
 +
 +The rest of the code is very similar to our previous BlinkWithoutDelay as far as calculating millis and so on - so this version returns properly from the '​loop'​ routine each time it is called; it does not get '​stuck'​.
 +
 +Now, I thought I should be able to do better, so I rewired the LEDs in such a fashion that I could drop those extra shifts.
 +
 +<​code>​
 +Led 1 = pin 8 PORTB bit 0
 +Led 2 = pin 9 PORTB bit 1
 +--
 +Led 3 = pin 2 PORTD bit 2
 +Led 4 = pin 3 PORTD bit 3
 +Led 5 = pin 4 PORTD bit 4
 +Led 6 = pin 5 PORTD bit 5
 +Led 7 = pin 6 PORTD bit 6
 +Led 8 = pin 7 PORTD bit 7
 +</​code>​
 +
 +This way, I do not need to shift the bits around any more - the bottommost two bits of PORTB are driven by bits 0 and 1 of my calculated value, and the topmost 6 bits of PORTD can be driven by bits 2-7, without needing any additional shifts after the rol/ror trick.
 +
 +One of my goals is to get a 'feel' for how much it costs to use C instead of assembly. In order to have a useful comparison, I decided to rewire my board, then try writing a C version first, this time using direct port manipulation instead of //digitalWrite//.
 +
 +//digitalWrite// is a nice abstraction, but there is quite a memory and timing overhead attached to it, and direct port manipulation in C generates much more compact code.
 +
 +So here is my C version for the rewired setup, no assembler involved.
 +
 +<​code>​
 +#include <​Arduino.h>​
 +
 +void setup()
 +{
 +  DDRD = DDRD | 0xFC; // Top 6 bits as outputs
 +  DDRB = DDRB | 0x03; // Bottom 2 bits as outputs
 +}
 +
 +#define delay 50
 +
 +long nextSwitchAfterMillis;​
 +byte shiftUp = 0x01;
 +byte shiftDown = 0x00;
 +
 +void loop()
 +{
 +  long curTime = millis();
 +  if (curTime > nextSwitchAfterMillis)
 +  {
 +    nextSwitchAfterMillis = curTime + delay;
 +    byte tempShift = shiftUp;
 +    shiftUp = (tempShift << 1) | (shiftDown & 1);
 +    shiftDown = (tempShift & 0x80) | (shiftDown >> 1);
 +    tempShift = shiftUp | shiftDown;
 +    PORTD = tempShift & 0xFC;
 +    PORTB = tempShift & 0x03;
 +  }
 +}
 +</​code> ​
 +
 +This one compiles to 610 bytes, and that surprised me - that is way smaller than I expected!
 +
 +I then did some digging, and checked out the assembly language output of the C-compiler, and as a result, learned a few new tricks.
 +
 +First trick: the AVR instruction set has no '​ADCI'​ or '​ADI'​ instruction. My solution was to load the value into a register and then use ADC or ADD. 
 +
 +The C-compiler has a much better trick up its sleeve: the AVR instruction set //does// have SUBI and SBCI instructions, so instead of adding a constant value, it simply subtracts the negative of the value, and no intermediate register is needed to hold the immediate values.
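The subtract-the-negative trick is easy to check on a PC. This is a minimal sketch in plain C that adds a constant by subtracting the bytes of its two's complement with borrow, which is what a subi followed by three sbci instructions does. The function name `add_via_sub` is invented for this illustration.

```c
#include <stdint.h>

/* Add k to x by subtracting the four bytes of -k with borrow,
 * mirroring subi / sbci / sbci / sbci. */
uint32_t add_via_sub(uint32_t x, uint32_t k)
{
    uint32_t neg = (uint32_t)0 - k;   /* two's complement of k: lo8(-k) .. hhi8(-k) */
    uint32_t result = 0;
    int carry = 0;
    for (int i = 0; i < 4; i++) {
        int xi = (int)((x >> (8 * i)) & 0xFF);
        int ni = (int)((neg >> (8 * i)) & 0xFF);
        int diff = xi - ni - carry;                     /* subi for byte 0, sbci after */
        carry = diff < 0;                               /* borrow into the next byte */
        result |= (uint32_t)(diff & 0xFF) << (8 * i);   /* keep the low 8 bits */
    }
    return result;
}
```

The result matches ordinary 32-bit addition, including wrap-around at the top.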
 +
 +I also completely overlooked the IN and OUT instructions,​ and instead was using memory addressing to access DDRB, DDRD, PORTB, PORTD as memory locations. The C compiler instead uses IN and OUT, which again saved a few bytes.
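The relation between the two kinds of address is just the 0x20 offset already seen in the comments of the earlier listings (DDRB is 0x24 as a memory location and 4 as an I/O address). Below is a small sketch in plain C of that conversion, together with the ranges the instructions can reach (in/out take I/O addresses 0 to 63, sbi/cbi only 0 to 31). The function names are made up for this illustration and the real `_SFR_IO_ADDR` is a macro from avr/io.h.

```c
#include <stdint.h>

#define SFR_OFFSET 0x20u   /* data-space offset of the I/O registers */

/* Model of what _SFR_IO_ADDR() computes: data-space address -> I/O address. */
uint8_t io_addr(uint16_t mem_addr)
{
    return (uint8_t)(mem_addr - SFR_OFFSET);
}

/* in/out can reach I/O addresses 0..63 */
int usable_with_in_out(uint16_t mem_addr)
{
    return mem_addr >= SFR_OFFSET && io_addr(mem_addr) <= 63;
}

/* sbi/cbi can reach I/O addresses 0..31 */
int usable_with_sbi_cbi(uint16_t mem_addr)
{
    return mem_addr >= SFR_OFFSET && io_addr(mem_addr) <= 31;
}
```

Registers above those ranges still need ld/st (or lds/sts) with the full memory address.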
 +
 +While I was looking over the instruction set I also noticed a few more instructions that allowed me to save some more bytes: I found STD and LDD, which allow addressing via the Y register but with an offset applied.
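Addressing with an offset (the ldd/std instructions in the final listing) works like indexing from a fixed base pointer: Y stays put and each access names its own displacement. Here is a rough sketch in plain C of the data layout the listing relies on. It assumes the shifter bytes sit directly after the 4-byte timestamp, and the struct and function names are invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the .data layout used by the listing. */
struct blink_state {
    uint8_t next_switch[4];  /* nextSwitchAfterMillis, low byte first: Y+0 .. Y+3 */
    uint8_t shift_up;        /* first byte of shifters:  Y+4 */
    uint8_t shift_down;      /* second byte of shifters: Y+5 */
};

/* Model of "ldd rN, Y+q": reads at base + q, the pointer itself is unchanged.
 * On AVR the displacement q is limited to 0..63. */
uint8_t ldd(const uint8_t *y, unsigned q)
{
    return y[q];
}
```

Because Y is never modified, there is no need for the matching post-increment and pre-decrement bookkeeping of the earlier versions.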
 +
 +So, I finally came up with this:
 +
 +<​code>​
 +#include "​avr/​io.h"​
 +
 +//
 +// Adjusted wiring, which allows us to suppress some shift
 +// instructions
 +//
 +// Led 1 = pin 8
 +// Led 2 = pin 9
 +// Led 3 = pin 2
 +// Led 4 = pin 3
 +// Led 5 = pin 4
 +// Led 6 = pin 5
 +// Led 7 = pin 6
 +// Led 8 = pin 7
 +//
 +
 +#define yl r28
 +#define yh r29
 +
 +// const long delay = 100;
 +#define delay 100 // ms
 +
 +.global setup
 +setup:
 +  in r25, _SFR_IO_ADDR(DDRD)
 +  ori r25, 0xFC
 +  out _SFR_IO_ADDR(DDRD),​ r25
 +  ​
 +  in r25, _SFR_IO_ADDR(DDRB)
 +  ori r25, 0x03
 +  out _SFR_IO_ADDR(DDRB),​ r25
 +
 +  ret
 +
 +.global loop
 +loop:
 +
 +  push yl
 +  push yh
 +
 +  call millis ​  ; call millis(): 4-byte return value in r25...r22
 +  ​
 +  // Use Y as a pointer to fetch the next time to switch the LED
 +  ldi yl, lo8(nextSwitchAfterMillis) ​
 +  ldi yh, hi8(nextSwitchAfterMillis)
 +  ​
 +  ld r18, y
 +  ldd r19, y+1
 +  ldd r20, y+2
 +  ldd r21, y+3
 +  ​
 +  ldd r17, y+4
 +  ldd r16, y+5
 +  ​
 +  // Compare nextSwitchAfterMillis with value returned by millis()
 +  cp r22, r18
 +  cpc r23, r19
 +  cpc r24, r20
 +  cpc r25, r21
 +  ​
 +  brcs tooEarly ; carry is set if r22...r25(millis()) < r18...r21(nextSwitchAfterMillis)
 +
 +  // Rotate left & right; bit travels up then down through r17 and r16
 +  // Carry is already clear because brcs was not taken, so no need for CLC
 +  rol r17
 +  sbrc r16,0
 +  ori r17,1
 +  ror r16
 +
 +  // Update the memory storage with the new rotated values
 +  std y+4, r17
 +  std y+5, r16
 +
 +  // Combine left and right shifter
 +  or r17, r16
 +  ​
 +  // 6 high bits of port D
 +  mov r16, r17
 +  andi r16, 0xFC
 +  out _SFR_IO_ADDR(PORTD),​ r16
 + 
 +  // 2 low bits of port B
 +  andi r17,0x03
 +  out _SFR_IO_ADDR(PORTB),​ r17
 +  ​
 +  // Add delay (subtract negative delay because there is no addi/adci)
 +  // to result of call to millis()
 +  subi r22, lo8(-delay)
 +  sbci r23, hi8(-delay)
 +  sbci r24, hlo8(-delay)
 +  sbci r25, hhi8(-delay)
 +      ​
 +  st y, r22
 +  std y+1, r23
 +  std y+2, r24
 +  std y+3, r25
 +  ​
 +tooEarly: ​
 +  pop yh
 +  pop yl
 +  ​
 +  ret
 +
 +.data 
 +
 +nextSwitchAfterMillis:​
 +.long 0
 +
 +shifters:
 +.byte 1
 +.byte 0
 +</​code>​
 +
 +This last version is 590 bytes - 20 bytes less than the C compiler. ​
 +
 +Conclusion: the C compiler is doing a pretty good job, and the extra effort of writing things in assembler is probably rarely worth it. Subtracting the 462 bytes for the runtime overhead from both compiled sketch sizes, we're looking at saving a little over 10% in size for this particular exercise. ​
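The percentage can be worked out from the sizes quoted above (610 bytes for C, 590 bytes for assembler, 462 bytes of runtime). This is a small sketch in plain C of that arithmetic, with function names made up for this illustration.

```c
#define RUNTIME_BYTES 462   /* empty-sketch overhead measured earlier */

/* Size of the sketch's own code once the fixed runtime is subtracted. */
int own_code(int sketch_bytes)
{
    return sketch_bytes - RUNTIME_BYTES;
}

/* Saving of the assembler version relative to the C version, in percent. */
double saving_percent(int c_bytes, int asm_bytes)
{
    return 100.0 * (own_code(c_bytes) - own_code(asm_bytes)) / own_code(c_bytes);
}
```

With these numbers the C version has 148 bytes of its own code and the assembler version 128, so the saving is 20 out of 148 bytes.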
 +
 +Nevertheless,​ sometimes 20 bytes can be the difference between 'it fits' and 'it does not fit', and, more importantly,​ I love tinkering, so I'll do some more assembler, just because I can!
 +
 +Now, just for kicks, a tiny change - change the last listing so it ends as follows:
 +
 +<​code>​
 +...
 +shifters:
 +.byte 1
 +.byte 0x80
 +</​code>​
 +
 +Run it again. Is that cool, or what?
 +
 +===== Feedback from readers =====
 +
 +Hi Kris!
 +
 +Thank you very much for your Arduino asm introduction. It helped me a lot getting started.
 +
 +But then I tried to modify version 1.5.2 the same way, because according to the developer of the arduino eclipse plugin, this is the latest version compatible with the plugin. And I found out that you need to modify the file .../​hardware/​arduino/​avr/​platform.txt,​ too. So I thought, maybe you want to mention this in your introduction.
 +
 +In the "AVR compile patterns" section, adding the following solved the issue:
 +
 +## Compile S files
 +recipe.S.o.pattern="​{compiler.path}{compiler.c.cmd}"​ {compiler.S.flags} -mmcu={build.mcu} -DF_CPU={build.f_cpu} -D{software}={runtime.ide.version} {build.extra_flags} {includes} "​{source_file}"​ -o "​{object_file}"​
 +
 +Regards,
 +
 +Ralf
  
 ===== Stuff collected on the Internet ===== ===== Stuff collected on the Internet =====
smallcpus.1354424080.txt.gz · Last modified: 2012/12/02 17:54 by kris