Speeding up LPD8806 show() without hardware SPI

If you’re using LPD8806 LED strips and you can’t use the hardware SPI port (e.g., when using an Ethernet board), there are two other options in the Adafruit library: the default mode and ‘slowmo’ mode. The default mode is decent, but the flexibility of being able to choose the pins at runtime comes with a cost.

However, you can still get a decent speedup by defining your pin usage at compile time in a replacement show() function. I measured the time required to update an 86-LED strip using each method on an EtherTen board (ATmega328 @ 16 MHz, same as the Uno):

30.23 ms - Adafruit 'slowmo' method (digitalWrite)
7.76 ms - Adafruit default method (port pointers)
1.54 ms - Compile-time method
1.43 ms - Adafruit hardware SPI method

The timing was done using micros() around the show() call with strip.pause = 0.
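To put those timings in perspective, they can be converted into average CPU cycles per clocked bit. The following is a rough back-of-the-envelope sketch, not part of the library: the helper names clocksPerShow and cyclesPerBit are made up for illustration, and it assumes 24 data bits per LED plus three zero latch bytes for every 64 LEDs, as in the show() code in this post.

```cpp
#include <cassert>

// Clock pulses for one show(): 24 data bits per LED, plus three zero
// bytes of latch for every 64 LEDs (rounded up).
constexpr unsigned long clocksPerShow(unsigned long numLeds) {
  return numLeds * 24UL + ((numLeds + 63UL) / 64UL) * 24UL;
}

// Average CPU cycles spent per clocked bit, given a measured show() time
// in milliseconds and the CPU clock in Hz.
inline double cyclesPerBit(double showMs, unsigned long numLeds, double cpuHz) {
  assert(numLeds > 0 && showMs > 0.0);
  return (showMs / 1000.0) * cpuHz / static_cast<double>(clocksPerShow(numLeds));
}

// 86 LEDs: 2064 data bits + 48 latch bits
static_assert(clocksPerShow(86) == 2112, "unexpected clock count for 86 LEDs");
```

For the 86-LED strip at 16 MHz this works out to roughly 229 cycles per bit for 'slowmo', about 58.8 for the default method, about 11.7 for the compile-time method, and about 10.8 for hardware SPI. These are averages that include call and loop overhead, so treat them as ballpark figures.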

I’ve tried to keep the hardcoding in this method to a minimum. To use it, paste the code from CompileTimeLEDs.h into the LPD8806 class in LPD8806.h and replace:

int ClockPin = 3;
int DataPin = 2;
strip.show();

with:

const int ClockPin = 3;
const int DataPin = 2;
strip.showCompileTime<ClockPin, DataPin>();

This overload is only available on ATmega168 or ATmega328 boards; on the Arduino Mega or other boards, you need to pass the port registers explicitly and use pin offsets within each port instead of the Arduino board pin numbers (e.g., showCompileTime<0..7, 0..7>(PORTD, PORTD)).
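As an illustration of what "port register plus pin offset" means on an Uno-class board, here is a small desktop-compilable sketch of the standard ATmega168/328 pin layout. The names Port, PortPin, mapArduinoPin, and pinMask are hypothetical and exist only for this example; on real hardware the ports would be PORTD, PORTB, and PORTC from <avr/io.h>.

```cpp
#include <cstdint>

// Stand-ins for the AVR port registers so the sketch compiles on a desktop.
enum class Port : std::uint8_t { D, B, C };

struct PortPin {
  Port port;         // which 8-bit port the pin lives on
  std::uint8_t bit;  // bit offset within that port (0..7)
};

// Arduino Uno numbering on ATmega168/328:
//   0..7            -> PORTD bits 0..7
//   8..13           -> PORTB bits 0..5
//   14..19 (A0..A5) -> PORTC bits 0..5
constexpr PortPin mapArduinoPin(unsigned int arduinoPin) {
  return (arduinoPin >= 14) ? PortPin{Port::C, static_cast<std::uint8_t>(arduinoPin - 14)}
       : (arduinoPin >= 8)  ? PortPin{Port::B, static_cast<std::uint8_t>(arduinoPin - 8)}
                            : PortPin{Port::D, static_cast<std::uint8_t>(arduinoPin)};
}

// Bit mask for a pin within its port, computed entirely at compile time.
constexpr std::uint8_t pinMask(unsigned int arduinoPin) {
  return static_cast<std::uint8_t>(1u << mapArduinoPin(arduinoPin).bit);
}

static_assert(mapArduinoPin(3).port == Port::D && mapArduinoPin(3).bit == 3, "pin 3 is PD3");
static_assert(mapArduinoPin(13).port == Port::B && mapArduinoPin(13).bit == 5, "pin 13 is PB5");
static_assert(pinMask(2) == 0x04, "pin 2 is bit 2 of PORTD");
```

So the ClockPin = 3, DataPin = 2 example corresponds to offsets 3 and 2 within PORTD. Other boards use different tables, which is why the explicit overload takes the registers as arguments.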

CompileTimeLEDs.h:

template<unsigned int ClockPin>
void PulseClockLine(volatile uint8_t& ClockRegister)
{
  const byte LED_CLOCK_MASK = 1 << ClockPin;
  ClockRegister |= LED_CLOCK_MASK;
  ClockRegister &= ~LED_CLOCK_MASK;
}

template<unsigned int ClockPin, unsigned int DataPin>
void TransmitBit(byte& CurrentByte, volatile uint8_t& ClockRegister, volatile uint8_t& DataRegister)
{
  // Set the data bit
  const byte LED_DATA_MASK = 1 << DataPin;
  if (CurrentByte & 0x80)
    DataRegister |= LED_DATA_MASK;
  else
    DataRegister &= ~LED_DATA_MASK;

  // Pulse the clock line
  PulseClockLine<ClockPin>(ClockRegister);

  // Advance to the next bit to transmit
  CurrentByte = CurrentByte << 1;
}

#if defined(__AVR_ATmega328P__) || defined(__AVR_ATmega168__)
  // Pins 0..7 are PORTD bits 0..7, pins 8..13 are PORTB bits 0..5, and
  // pins 14..19 (A0..A5) are PORTC bits 0..5
  #define MAP_ARDUINO_PIN_TO_PORT_PIN(ArduinoPin) \
    ( ((ArduinoPin) >= 14) ? ((ArduinoPin) - 14) : ((ArduinoPin) & 7) )
  #define MAP_ARDUINO_PIN_TO_PORT_REG(ArduinoPin) \
    ( ((ArduinoPin) >= 14) ? PORTC : (((ArduinoPin) >= 8) ? PORTB : PORTD) )

  // Specify Arduino pin numbers
  template<unsigned int ClockPin, unsigned int DataPin>
  void showCompileTime()
  {
    showCompileTime<MAP_ARDUINO_PIN_TO_PORT_PIN(ClockPin), MAP_ARDUINO_PIN_TO_PORT_PIN(DataPin)>(
      MAP_ARDUINO_PIN_TO_PORT_REG(ClockPin), MAP_ARDUINO_PIN_TO_PORT_REG(DataPin));
  }
#else
  // Sorry: Didn't write an equivalent for other boards; use the other
  // overload and explicitly specify ports and offsets within those ports
#endif

// Note: Pin template params need to be relative to their port (0..7), not Arduino pinout numbers
template<unsigned int ClockPin, unsigned int DataPin>
void showCompileTime(volatile uint8_t& ClockRegister, volatile uint8_t& DataRegister)
{
  // Clock out the color for each LED
  byte* DataPtr = pixels;
  byte* EndDataPtr = pixels + (numLEDs * 3);
  do
  {
    byte CurrentByte = *DataPtr++;
    TransmitBit<ClockPin, DataPin>(CurrentByte, ClockRegister, DataRegister);
    TransmitBit<ClockPin, DataPin>(CurrentByte, ClockRegister, DataRegister);
    TransmitBit<ClockPin, DataPin>(CurrentByte, ClockRegister, DataRegister);
    TransmitBit<ClockPin, DataPin>(CurrentByte, ClockRegister, DataRegister);
    TransmitBit<ClockPin, DataPin>(CurrentByte, ClockRegister, DataRegister);
    TransmitBit<ClockPin, DataPin>(CurrentByte, ClockRegister, DataRegister);
    TransmitBit<ClockPin, DataPin>(CurrentByte, ClockRegister, DataRegister);
    TransmitBit<ClockPin, DataPin>(CurrentByte, ClockRegister, DataRegister);
  }
  while (DataPtr != EndDataPtr);

  // Clear the data line while we clock out the latching pattern
  const byte LED_DATA_MASK = 1 << DataPin;
  DataRegister &= ~LED_DATA_MASK;

  // All of the original data had the high bit set in each byte.  To latch
  // the color in, we need to clock out another LED worth of 0's for every
  // 64 LEDs in the strip apparently.
  byte RemainingLatchBytes = ((numLEDs + 63) / 64) * 3;
  do
  {
    // Eight clock pulses per latch byte, with the data line held low
    for (byte Bit = 0; Bit < 8; ++Bit)
      PulseClockLine<ClockPin>(ClockRegister);
  } while (--RemainingLatchBytes);

  // Need a bit of a delay before clocking again, but ideally this
  // is set to 0 and meaningful work is done instead
  if (pause)
    delay(pause);
}

License: CC0 (Placed into the public domain).

The avr-gcc compiler seems to do fine at recognizing the compile time constants and doing the right thing in codegen, as this method is just as fast as my original macrotastic implementation. You can probably eke out some more performance (would be nice to beat the hardware SPI implementation ^_^), but this was enough to get the headroom I needed.
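To see the compile-time-pin idea in isolation, here is a desktop-runnable sketch that shifts a byte out MSB-first through a simulated port and then decodes what a strip would have sampled. FakePort, transmitByte, and decodeWrites are hypothetical names for this example only, and for simplicity the clock and data pins share one simulated port. On a real AVR, having the masks (and, after inlining, the register address) known at compile time is typically what lets avr-gcc reduce each |= or &= to a single bit-set or bit-clear instruction.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an AVR port register: supports |= and &= like
// the real thing, and logs every value written so the test can inspect it.
struct FakePort {
  std::uint8_t value = 0;
  std::vector<std::uint8_t> writes;
  FakePort& operator|=(std::uint8_t mask) { value |= mask; writes.push_back(value); return *this; }
  FakePort& operator&=(std::uint8_t mask) { value &= mask; writes.push_back(value); return *this; }
};

// Shift one byte out MSB-first. The pins are template parameters, so both
// masks are compile-time constants.
template <unsigned ClockPin, unsigned DataPin, typename Register>
void transmitByte(Register& port, std::uint8_t byte) {
  static_assert(ClockPin < 8 && DataPin < 8 && ClockPin != DataPin,
                "pins are distinct bit offsets within one 8-bit port");
  constexpr std::uint8_t kClockMask = static_cast<std::uint8_t>(1u << ClockPin);
  constexpr std::uint8_t kDataMask = static_cast<std::uint8_t>(1u << DataPin);
  for (int bit = 0; bit < 8; ++bit) {
    if (byte & 0x80) port |= kDataMask;
    else             port &= static_cast<std::uint8_t>(~kDataMask);
    port |= kClockMask;                              // rising edge: data is sampled
    port &= static_cast<std::uint8_t>(~kClockMask);  // falling edge
    byte = static_cast<std::uint8_t>(byte << 1);
  }
}

// Rebuild the bytes a receiver would see by sampling the data bit on every
// rising clock edge in the write log.
template <unsigned ClockPin, unsigned DataPin>
std::vector<std::uint8_t> decodeWrites(const FakePort& port) {
  constexpr std::uint8_t kClockMask = static_cast<std::uint8_t>(1u << ClockPin);
  std::vector<std::uint8_t> bytes;
  std::uint8_t previous = 0, current = 0;
  int bitCount = 0;
  for (std::uint8_t w : port.writes) {
    if (!(previous & kClockMask) && (w & kClockMask)) {
      current = static_cast<std::uint8_t>((current << 1) | ((w >> DataPin) & 1u));
      if (++bitCount == 8) { bytes.push_back(current); current = 0; bitCount = 0; }
    }
    previous = w;
  }
  assert(bitCount == 0 && "expected a whole number of bytes");
  return bytes;
}
```

Calling transmitByte<3, 2>(port, 0xA5) on a fresh FakePort and then decodeWrites<3, 2>(port) should give back the single byte 0xA5, with three port writes per bit.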

19 thoughts on “Speeding up LPD8806 show() without hardware SPI”

  1. Nice hack, but I am still unsure of its necessity. Provided all the devices you want to connect are tolerant of the same bus speed, you can chain multiple devices onto SPI–it is a bus protocol, not a point-to-point one. You can either daisy-chain all the devices, connecting the MISO pin of one to the MOSI pin of the next around in a loop back to the microcontroller, or run the pins in parallel and use Chip Select pins to enable the device you want to talk to.

  2. Nice, I’ll have to start using templates in MHVLib:

    I eke out a little bit more performance in the MHV_Shifter class by condensing the clock and data writes into a single operation (by introducing the requirement that the clock and data pins are on the same port), and unroll the byte iterating loop to avoid a branch (I instead spend a branch deciding whether to write a 0 or 1).
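A rough desktop-runnable sketch of the same-port idea described in this comment follows. It is not MHVLib's actual code: RecordingPort and shiftOutSamePort are hypothetical names, and the loop is left rolled for brevity. It also assumes nothing else (such as an interrupt handler) changes the other pins on the port while a byte is being sent, because the port values are precomputed once.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a port register that logs whole-port writes.
struct RecordingPort {
  std::uint8_t value = 0;
  std::vector<std::uint8_t> writes;
  void write(std::uint8_t v) { value = v; writes.push_back(v); }
};

// Clock and data share one port, so each bit needs two whole-port writes:
// one that sets the data level with the clock low, one that raises the clock.
template <unsigned ClockPin, unsigned DataPin>
void shiftOutSamePort(RecordingPort& port, std::uint8_t byte) {
  static_assert(ClockPin < 8 && DataPin < 8 && ClockPin != DataPin,
                "pins are distinct bit offsets within one 8-bit port");
  constexpr std::uint8_t kClockMask = static_cast<std::uint8_t>(1u << ClockPin);
  constexpr std::uint8_t kDataMask = static_cast<std::uint8_t>(1u << DataPin);
  assert((port.value & kClockMask) == 0 && "clock line is expected to idle low");

  // Precompute the four possible port values, preserving the other pins.
  const std::uint8_t zeroLow = static_cast<std::uint8_t>(port.value & ~(kClockMask | kDataMask));
  const std::uint8_t zeroHigh = static_cast<std::uint8_t>(zeroLow | kClockMask);
  const std::uint8_t oneLow = static_cast<std::uint8_t>(zeroLow | kDataMask);
  const std::uint8_t oneHigh = static_cast<std::uint8_t>(oneLow | kClockMask);

  for (std::uint8_t mask = 0x80; mask != 0; mask = static_cast<std::uint8_t>(mask >> 1)) {
    if (byte & mask) { port.write(oneLow);  port.write(oneHigh);  }
    else             { port.write(zeroLow); port.write(zeroHigh); }
  }
  port.write(zeroLow);  // leave clock and data low when done
}
```

Compared with separate set-data, clock-high, clock-low operations, this uses two port writes per bit instead of three, at the cost of requiring both pins on one port.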

  3. Ah, about time. I’m tired of people saying that the C++ in the Arduino environment just slows it down.

    When used right, like here, it’s both easier to use and faster.

    Thank you,

  4. I’m unfortunately only minimally experienced with C++, but very interested in using this speed enhancement.  I have tried placing your template code in the header in a variety of locations, but none have seemed to compile successfully.  Can you elaborate on exactly how the code should be merged?

    1. You can insert the code anywhere inside the public: section of the LPD8806 class, or you can save the file in the same directory as LPD8806.h and add #include "CompileTimeLEDs.h" inside the class at the same place.
      I have it inserted just after the definition of getPixelColor, so for the version of LPD8806.h currently up on GitHub, that would be line 28.

  5. I have tested the code on an Arduino Uno (16MHz), with a full five-meter strip of 160 LEDs (I purchased from http://www.bestlightingbuy.com/waterproof-lpd8806-flexible-rgb-led-lighting-strip.html). At this size, the original software emulation took around 200ms to update all 160 LEDs. This gave me a refresh rate of only 5fps.

  6. Hi Michael,
    This is definitely a great idea.
    I have 16 strips to manage on a Mega, so using your compile-time trick is going to save me some hardware.

    However, I’m facing an issue, as I don’t get the same timing as you have.
    For the same parameters, on an Uno (I don’t have the Mega yet), I reach only 2.796 ms for a showCompileTime(PORTD, PORTD).

    Using my logic analyser I see that:
    clock time hi    = 350ns
    clock time low = 900ns

    I thought that TransmitBit and PulseClockLine were not being inlined, so I changed them to macros.
    But no change in the result.

    So I am stuck at almost 50% of your performance.
    Note that with SPI, I do get the expected performance.

    Any hint on what I should look for would be appreciated.

    1. Hi Barbus,
      Since changing those functions to macros didn’t help, the compiler is probably still inlining them correctly, but something is clearly wonky.  So are you seeing 2.79 ms for a totally unmodified LedSpeedTest.ino?  If so, I’d be interested in what timing you see for TEST_MODE = TEST_DEFAULT.

      RE: Seeing the results of the template: Not that I’m aware of.  Templates are a native C++ feature and are handled during compilation, instead of during preprocessing like a macro.  They don’t transform source code per se, so there’s not really an extended version to look at afterwards.

      However, you should be able to see the resulting assembly, either with a command line option to the compiler or via objdump on the generated .o files, but I haven’t tried to get either working yet with the Arduino environment.  From there you could diff it to your hand-hardcoded function to see what is going wrong.

      Cheers,
      Michael Noland

  7. Keep an eye out for the new version of the FastSPI_LED library. I’ve just finished rewriting the core of it, and on a 16 MHz Arduino I can push out 3.1 Mbps with bitbanging, or 0.712 ms to run through your 86-LED test case. With the SPI clock set at 2 MHz, I’m getting 1.2 ms; at 4 MHz, 0.779 ms; and at 8 MHz, 0.353 ms.

    (Also, the library will switch between hardware and bitbang’d SPI based on what pins you tell it to use, behind the scenes, and it takes only 250-750 bytes of your program space, down from nearly 12k.)

    Oh – also it supports over a half dozen different LED chipsets 🙂 And now supports latching for SPI if you want to add some AND gates to share the hardware SPI channel with something else.

      1. Hopefully in the next week or two. I’m wrapping up some testing with it now. I’ll probably do a version for the AVR-based platforms first, and follow up with the Teensy 3 later (and then the chipKIT and MSP430 platforms).

  8. Michael, your post got me thinking about more ways to abuse templates with AVR (I had already started thinking about this in some directions – this definitely went further) – the library that’s using this is now up here – http://waitingforbigo.com/2013/02/19/fastspi_led2_preview_release/ – also http://waitingforbigo.com/2013/02/19/introducing-fastspi-for-most-of-your-data-pushing-needs/ talks a little bit more specifically about the things that I did. I need to do up one more post on how I abused types and static functions (and a little bit of the C preprocessor) to remove the need to pass the data and clock register around.

    On a 16 MHz Arduino, I’m now pushing 6.6 Mbps with the hardware SPI (running at 8 MHz) and 3.1 Mbps+ with -software- SPI.

    There’s still room to improve more, I think. There’s always room for faster – I have a path for possibly squeezing another 25-50% -more- performance out of the software SPI.
