Intel and Analog Devices' joint DSP design effort has produced a new, flexible instruction-set architecture (ISA). Analog Devices' (ADI's) BlackFin implementation delivers a 300-MHz, 16-bit DSP that supports dual-MAC execution and low-power operation.
This is the latest of the fourth-generation DSPs that have emerged to power today's networked, Internet-driven applications. Its competitors include ADI's TigerSHARC, Agere/Motorola's StarCore, and Texas Instruments' C6x. The new Micro Signal Architecture (MSA) will give them a run for their money. The 16-bitter can scale to 1 GHz and beyond.
A 16-bit, dual-MAC DSP architecture, MSA builds on ADI's high-performance VLIW and SIMD DSP architectures, and on Intel's memory management, power management, performance monitoring, and SIMD techniques. The resultant DSP implements dynamic power reduction, a memory management unit (MMU), and performance monitoring.
MSA delivers an innovative multi-instruction ISA that supports high-density 16-bit instructions, 32-bit immediate instructions, and 64-bit DSP (packed) instructions. Per pipelined cycle, it can execute two 16-bit MACs, two 32/40-bit arithmetic operations, a 32/40-bit shift or rotate, or four 8-bit video instructions.
This fourth-generation DSP targets high-performance, midrange 16-bit DSP applications. It packs enough memory on-chip (308 kbytes) for many tasks. In addition, the MSA supports low-power operation for portables and Internet appliances. Under software control, the core voltage and clock rates can be varied to cut power.
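The benefit of varying both voltage and clock rate can be estimated with the standard CMOS dynamic-power relation, in which power is proportional to capacitance times voltage squared times frequency. The sketch below is only a rough illustration of that relation. The function name and the operating point are hypothetical and are not taken from ADI's specifications.

```python
def relative_dynamic_power(v_scale: float, f_scale: float) -> float:
    """Dynamic power relative to nominal, using P ~ C * V^2 * f.

    v_scale and f_scale are the new core voltage and clock rate expressed
    as fractions of their nominal values (hypothetical inputs, not ADI data).
    """
    return v_scale ** 2 * f_scale


# Hypothetical operating point: half the clock rate at 70% of nominal voltage.
ratio = relative_dynamic_power(0.7, 0.5)
print(f"relative dynamic power: {ratio:.3f}")
```

Under this simple model, halving the clock alone would halve dynamic power, while dropping the voltage at the same time brings it down to roughly a quarter of nominal.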
A balanced architecture, MSA supports both high code density and a simplified ISA. The keys to MSA's flexible instruction design are:
Load/store architecture: work from registers
16-bit basic instruction: high code density
Extended 32-bit instruction: large immediates
Combined instructions: multi-issue instructions
The DSP was designed around 16-bit instructions for high code density, and most control instructions are 16-bitters. But for operations that need larger immediate values or more fields, the ISA was extended to a 32-bit instruction.
For DSP operations, the designers added a 64-bit multi-issue instruction. This composite instruction is made up of a 32-bit instruction and two 16-bit instructions. The combination can specify a complex DSP operation with two data loads, yet it takes only one instruction fetch. Even better, it reuses the decode logic already implemented for the standard 16-bit and 32-bit instructions. The decoder takes in 64 bits at a time and can issue one, two, or three instructions per cycle.
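The composition rule for the 64-bit multi-issue instruction can be modeled in a few lines. This is a minimal sketch of the size bookkeeping only. The class, the function names, and the mnemonics are hypothetical, and the real bit-level encoding is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    mnemonic: str  # hypothetical label, for illustration only
    bits: int      # encoded width: 16 or 32


def make_multi_issue(dsp_op: Instr, load_a: Instr, load_b: Instr) -> tuple:
    """Combine one 32-bit and two 16-bit instructions into one bundle."""
    if dsp_op.bits != 32:
        raise ValueError("the DSP operation must be a 32-bit instruction")
    if load_a.bits != 16 or load_b.bits != 16:
        raise ValueError("both companion instructions must be 16 bits wide")
    return (dsp_op, load_a, load_b)


def bundle_bits(bundle: tuple) -> int:
    """Total encoded width of a bundle in bits."""
    return sum(instr.bits for instr in bundle)


bundle = make_multi_issue(
    Instr("dual_mac", 32), Instr("load_x", 16), Instr("load_y", 16)
)
print(bundle_bits(bundle), "bits,", len(bundle), "instructions in one fetch")
```

Because the bundle is built from the same 16-bit and 32-bit building blocks as ordinary code, a decoder for the basic formats can in principle handle each slot.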
For speed, the DSP is pipelined with eight stages. Two stages execute the dual MACs that feed into dual 40-bit accumulators. The pipeline can start a dual-MAC instruction every cycle, for an effective throughput of two MACs per cycle.
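The throughput claim follows from ordinary pipeline arithmetic. The sketch below assumes an ideal pipeline with no stalls, and it assumes an instruction counts as complete once it has passed through all eight stages. The function names are illustrative.

```python
PIPELINE_STAGES = 8        # pipeline depth stated in the text
MACS_PER_INSTRUCTION = 2   # one dual-MAC instruction performs two MACs


def cycles_for(n_instructions: int) -> int:
    """Cycles to complete n back-to-back instructions, assuming no stalls."""
    if n_instructions <= 0:
        return 0
    # The first instruction needs the full depth; each later one adds a cycle.
    return PIPELINE_STAGES + n_instructions - 1


def macs_per_cycle(n_instructions: int) -> float:
    """Average MAC throughput over a run of dual-MAC instructions."""
    return MACS_PER_INSTRUCTION * n_instructions / cycles_for(n_instructions)


for n in (1, 10, 1000):
    print(n, cycles_for(n), round(macs_per_cycle(n), 3))
```

For long inner loops the fill latency is amortized, so the average approaches two MACs per cycle even though each individual instruction takes several cycles to traverse the pipeline.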
This DSP core breaks down into separate addressing and execution sections. The addressing section incorporates dual data address generators (DAGs), supported by a pointer register file of eight 32-bit registers and an addressing register file. The latter has four entries. Each entry contains a set of four 32-bit registers for indexing, modification, length, and base address. These four entries support four addressing contexts, minimizing interrupt context saves. The execution section consists of two 16- by 16-bit multipliers, two 32/40-bit ALUs, quad 8-bit video ALUs, a 40-bit barrel shifter, and dual 40-bit accumulators.
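A set of index, modify, length, and base registers is the classic recipe for hardware circular buffers. The sketch below shows how one such addressing context could step through a buffer. The wraparound rule used here is a common convention and is an assumption, since the exact DAG behavior is not spelled out. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AddressContext:
    """One addressing-register entry: index, modify, length, base."""
    index: int   # current address
    modify: int  # signed step applied after each access
    length: int  # buffer length in bytes (0 means linear addressing here)
    base: int    # start address of the buffer

    def post_modify(self) -> int:
        """Return the current address, then advance the index with wraparound."""
        addr = self.index
        nxt = self.index + self.modify
        if self.length > 0:
            # Assumed rule: wrap once when stepping past either end.
            if nxt >= self.base + self.length:
                nxt -= self.length
            elif nxt < self.base:
                nxt += self.length
        self.index = nxt & 0xFFFFFFFF  # registers are 32 bits wide
        return addr


# Hypothetical 12-byte buffer of three 32-bit words starting at 0x1000.
ctx = AddressContext(index=0x1000, modify=4, length=12, base=0x1000)
print([hex(ctx.post_modify()) for _ in range(4)])
```

With four such contexts held in hardware, an interrupt handler can use its own buffer pointers without first saving the ones the interrupted code was using.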
This is a load/store architecture. The next set of operands for the dual-MAC operations is fetched as two 32-bit words from the L1 memory (D cache, scratchpad RAM) and loaded into 32-bit data registers. These furnish the next X and Y values to the DSP execution units on the next cycle. For dual MACs, the 16-bit operands are grouped in 32-bit sets: two X and two Y 16-bit values.
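The operand grouping can be illustrated with a small model of one dual-MAC step. This is a sketch under stated assumptions: the pairing of low halves with one accumulator and high halves with the other, signed integer multiplication, and saturation at 40 bits are all choices made for illustration, not confirmed details of the hardware.

```python
ACC_MAX = (1 << 39) - 1   # largest signed 40-bit value
ACC_MIN = -(1 << 39)      # smallest signed 40-bit value


def to_signed16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pack(hi: int, lo: int) -> int:
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack(word: int) -> tuple:
    """Split a 32-bit word into signed (high, low) 16-bit halves."""
    return to_signed16(word >> 16), to_signed16(word)


def saturate40(value: int) -> int:
    """Clamp to the signed 40-bit accumulator range (assumed behavior)."""
    return max(ACC_MIN, min(ACC_MAX, value))


def dual_mac(acc0: int, acc1: int, x_word: int, y_word: int) -> tuple:
    """One dual-MAC step on packed X and Y words (assumed half pairing)."""
    x_hi, x_lo = unpack(x_word)
    y_hi, y_lo = unpack(y_word)
    return saturate40(acc0 + x_lo * y_lo), saturate40(acc1 + x_hi * y_hi)


# Two 32-bit loads supply two X and two Y samples for a single step.
print(dual_mac(0, 0, pack(3, -2), pack(4, 5)))
```

The point of the packing is that two 32-bit loads deliver all four 16-bit operands, so both multipliers can be kept busy on every cycle.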
Also, for higher processing bandwidth, the hardware performs SIMD operations, i.e., the same operation passed through the four 8-bit video ALUs. This tactic speeds up video pixel processing by four times. The ALUs also accomplish dual 16-bit ALU or 32-bit ALU operations and shifts.
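The four-lane idea can be shown with a software model of one quad 8-bit operation. The sketch below uses an unsigned saturating add as the example operation. That choice, and the function name, are assumptions for illustration and do not describe a specific MSA video instruction.

```python
def quad_add8(a: int, b: int) -> int:
    """Add four unsigned 8-bit lanes packed in 32-bit words.

    Each lane saturates at 0xFF, and no carry crosses a lane boundary.
    Saturation is an assumed behavior chosen for this illustration.
    """
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= min(lane_sum, 0xFF) << shift
    return result


# Brighten four packed pixels by one step each; the 0xFF pixel stays clamped.
print(hex(quad_add8(0x10FF2030, 0x01010101)))
```

The loop runs four times in software, whereas four parallel 8-bit ALUs would handle all lanes at once, which is where the fourfold speedup for pixel processing comes from.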
On-chip memory has two levels. Level one interfaces with the CPU. It has a 16-kbyte instruction cache, a 32-kbyte data cache, and a 4-kbyte scratchpad SRAM. These memories have a two-cycle access. They can load two 32-bit data words and one instruction to the core per clock cycle. The second level of larger SRAM functions as a unified memory (I and D). The L1 caches can be configured as SRAM, or mixed cache and SRAM. They also support cache locking.
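The memory figures can be cross-checked with simple arithmetic. The sketch below assumes that the 308-kbyte on-chip total consists only of the L1 memories plus the L2 SRAM, and that L1 delivers its two data words per cycle at the 300-MHz core clock. Both are assumptions, so the derived numbers are estimates rather than published specifications.

```python
L1_KBYTES = {
    "instruction cache": 16,
    "data cache": 32,
    "scratchpad SRAM": 4,
}
ON_CHIP_TOTAL_KBYTES = 308     # total on-chip memory stated in the text
CORE_CLOCK_HZ = 300_000_000    # 300-MHz implementation stated in the text
DATA_WORDS_PER_CYCLE = 2       # two 32-bit data words per clock
BYTES_PER_WORD = 4

# Sum the level-one memories, then infer the level-two size (assumption:
# nothing else is counted in the 308-kbyte total).
l1_total_kbytes = sum(L1_KBYTES.values())
implied_l2_kbytes = ON_CHIP_TOTAL_KBYTES - l1_total_kbytes

# Peak L1 data bandwidth if both loads are sustained every cycle.
peak_data_bytes_per_s = DATA_WORDS_PER_CYCLE * BYTES_PER_WORD * CORE_CLOCK_HZ

print(l1_total_kbytes, implied_l2_kbytes, peak_data_bytes_per_s)
```

Under these assumptions, L1 accounts for 52 kbytes, which leaves 256 kbytes for the second-level SRAM, and the peak L1 data bandwidth works out to 2.4 Gbytes/s.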
To speed accesses, the hardware supports relaxed ordering between loads and stores. Loads can take precedence. Also, there are two write queues: one from the CPU to L1 memory and one from L1 memory to the system interface. Addressing is byte and word level.
Designed for C/C++ coding, the ISA supports two software stacks (user, system), held in the scratchpad RAM for fast access. Plus, unlike many DSPs, the MSA supports I/D MMUs for memory protection. It supports emulation, system, and user execution modes. For coding simplicity, the assembler implements an algebraic notation.