Electronic Design

  


[Forefront]
Highly Parallel DSP Architecture Makes Quick Work Of Image-Processing Algorithms

Dave Bursky  |   ED Online ID #2353  |   June 24, 2002


The combination of a high-speed RISC processor and an array of DSP blocks that form a single-instruction/multiple-data (SIMD) parallel processor delivers 4000 MIPS of compute throughput. Targeted at image-processing applications, the CW4011 visual signal processor (ViSP) is the first implementation of an embeddable core developed by ChipWrights Inc. of Newton, Mass.

This core supports a vector array of up to 32 fully pipelined DSP units that can perform 32 multiply-accumulates every instruction cycle. It can therefore deliver 8500 MMACs (when performing 32 8-bit MACs/cycle) when clocked at 266 MHz. Designed for low-power operation, it consumes just 0.1 mW/MIPS when powered by a 1.5-V supply.
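The headline figures can be checked with simple arithmetic. The sketch below multiplies the quoted clock rate by the quoted MACs per cycle, and also estimates power under the assumption (not stated in the article) that the 0.1-mW/MIPS figure applies linearly at the full 4000 MIPS. The function names are illustrative only.

```python
# Back-of-the-envelope check of the figures quoted in the article.
CLOCK_MHZ = 266        # clock rate quoted in the article
MACS_PER_CYCLE = 32    # 8-bit MACs per instruction cycle, per the article
MW_PER_MIPS = 0.1      # power figure at 1.5 V, per the article
PEAK_MIPS = 4000       # overall throughput quoted in the article


def peak_mmacs(clock_mhz: float, macs_per_cycle: int) -> float:
    """Millions of multiply-accumulates per second at the given clock."""
    return clock_mhz * macs_per_cycle


def estimated_power_mw(mips: float, mw_per_mips: float) -> float:
    """Power estimate, ASSUMING consumption scales linearly with MIPS."""
    return mips * mw_per_mips


print("Peak MMACs:", peak_mmacs(CLOCK_MHZ, MACS_PER_CYCLE))
print("Estimated power (mW):", estimated_power_mw(PEAK_MIPS, MW_PER_MIPS))
```

The first result, 8512 MMACs, matches the article's rounded figure of 8500 MMACs. The power estimate is only as good as the linear-scaling assumption.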

That low power consumption suits system-on-a-chip (SoC) solutions for portable imaging applications like digital cameras. The image-processing capabilities can also be used in laser and color printers, scanners, image transcoders, photo kiosks, and digital color copiers. The CW4011 can deliver real-time video processing, too. When running MPEG2 encoding, it can deliver main-profile, main-level (MP@ML) coded data.

To achieve the high throughput, ChipWrights crafted the heart of the CW4011—the CWv8 processor core—to include both a RISC processor and the vector SIMD array (see the figure). The RISC processor (dubbed the serial datapath unit since it executes instructions sequentially) coordinates the algorithmic operations and has a moderate throughput of about 150 MIPS. The serial datapath lets the entire core function as the master CPU in many embedded applications, eliminating the need for a separate host processor.

The processor's SIMD portion enables one instruction to be executed simultaneously by a number (two, four, eight, or 16) of DSP processors called parallel datapath units. The CW4011 implementation uses eight parallel datapath units. Each performs operations on 32-bit longwords, 16-bit words, or 8-bit bytes. Each datapath unit includes a 31-word by 32-bit register file, an extractor, a multiplier, an ALU with accumulator, and an inserter. Individual datapaths can be enabled or disabled during operation, which lets software provide some degree of power management.
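As a rough software model of this SIMD behavior, the sketch below applies one operation across eight lanes and skips any lane whose enable flag is off. The function name `simd_execute` is hypothetical, and the choice to leave a disabled lane's value unchanged is an assumption. The article does not say what a disabled datapath holds.

```python
from typing import Callable, List

NUM_LANES = 8  # the CW4011 uses eight parallel datapath units


def simd_execute(op: Callable[[int, int], int],
                 a: List[int],
                 b: List[int],
                 enabled: List[bool]) -> List[int]:
    """Apply one operation to every enabled lane.

    ASSUMPTION: a disabled lane simply keeps its original `a` value.
    """
    if not (len(a) == len(b) == len(enabled) == NUM_LANES):
        raise ValueError("each operand needs one entry per lane")
    return [op(x, y) if on else x for x, y, on in zip(a, b, enabled)]


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * NUM_LANES
mask = [True] * 4 + [False] * 4   # upper four lanes switched off

result = simd_execute(lambda x, y: x + y, a, b, mask)
print(result)  # [11, 12, 13, 14, 5, 6, 7, 8]
```

In real hardware, all enabled lanes would execute in the same cycle. The list comprehension here only models the result, not the timing.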

The multiplier unit in the parallel datapath performs 32-bit by 16-bit multiplication using two's complement number representation. When called on to perform multiplications on smaller words, it can be logically subdivided to perform two 16- by 16-bit or four 8- by 16-bit multiplications in a single cycle, or a full 32- by 32-bit multiplication in two cycles. The multiplier can be configured to compute vector dot products or compute the sum of absolute differences, which is handy for frame-to-frame motion estimation in MPEG codecs.
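Two of these multiplier modes can be illustrated in software. The sketch below shows one plausible way a 32- by 32-bit product could be assembled from two 32- by 16-bit passes, followed by a plain sum of absolute differences of the kind used in motion estimation. The decomposition (unsigned low half, signed high half) is an assumption about how such a multiplier might work, not a description of the CWv8 hardware, and all function names are illustrative.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mul32x32_two_pass(a: int, b: int) -> int:
    """Signed 32x32 product built from two 32x16 partial products.

    ASSUMPTION: the low half of b is treated as unsigned and the
    high half carries the sign.
    """
    b_lo = b & 0xFFFF                 # low 16 bits, unsigned
    b_hi = to_signed(b >> 16, 16)     # high 16 bits, signed
    pass1 = a * b_lo                  # first 32x16 multiply
    pass2 = a * b_hi                  # second 32x16 multiply
    return pass1 + (pass2 << 16)      # combine the partial products


def sum_abs_diff(block_a, block_b) -> int:
    """Sum of absolute differences between two equal-length pixel blocks."""
    return sum(abs(x - y) for x, y in zip(block_a, block_b))


print(mul32x32_two_pass(-123456789, 987654321) == -123456789 * 987654321)  # True
print(sum_abs_diff([10, 20, 30, 40], [12, 18, 33, 40]))  # 7
```

A motion estimator would evaluate `sum_abs_diff` for many candidate block positions in the reference frame and keep the position with the smallest result.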

The serial datapath unit resembles a conventional RISC CPU and acts as a coordinating processor for the parallel datapaths, providing address and extract/insert information. It can be used to access control registers and manage the program counter, and it includes a register file of 32 longword registers. Though this unit has its own set of RISC-type instructions, it shares a single instruction stream with the parallel datapaths.

In addition to datapath units, the CW4011's processor portion includes an instruction cache that can hold up to 2048 32-bit instructions and is direct-mapped to the datapaths. A primary memory block organized as four interleaved banks of 8 kwords by 32 bits each (128 kbytes total) is part of the processor block. Up to four 32-bit longwords can be written or read in each instruction cycle, reducing memory bandwidth bottlenecks. A direct-memory-access controller and system bus controller manage data exchanges between the cache, the primary memory, and off-chip data sources and destinations.
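The memory organization can be sketched as follows. The code confirms the 128-kbyte total and shows why four consecutive longwords can be accessed in one cycle, assuming low-order interleaving (consecutive longword addresses rotate through the banks). The article does not specify the interleaving scheme, so the `locate` mapping is an assumption.

```python
NUM_BANKS = 4                 # four interleaved banks, per the article
WORDS_PER_BANK = 8 * 1024     # 8 kwords per bank
BYTES_PER_WORD = 4            # 32-bit longwords


def total_kbytes() -> int:
    """Total primary memory size in kbytes."""
    return NUM_BANKS * WORDS_PER_BANK * BYTES_PER_WORD // 1024


def locate(word_addr: int):
    """Map a longword address to (bank, row).

    ASSUMPTION: low-order interleaving, so neighboring addresses
    fall in different banks.
    """
    if not 0 <= word_addr < NUM_BANKS * WORDS_PER_BANK:
        raise ValueError("address out of range")
    return word_addr % NUM_BANKS, word_addr // NUM_BANKS


print(total_kbytes())                       # 128
print([locate(a) for a in range(4, 8)])     # [(0, 1), (1, 1), (2, 1), (3, 1)]
```

Under this mapping, any aligned group of four longwords touches each bank exactly once, so all four accesses can proceed in parallel.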

To make programming as easy as possible, the visual signal processor can be programmed in a high-level language such as C or in the core's native assembly language, CAS. Support for software development is available with the CodeWarrior software development tools from Metrowerks. The full development kit includes an optimizing C compiler, an assembler, a cycle-accurate software simulator, a visual debugger that works over the core's JTAG test port, and a performance profiler.

The CW4011 is a test chip that the company developed to demonstrate performance. Along with the datapaths and supporting caches and logic, it includes an SDRAM controller that allows 16- or 32-bit data paths and operation at up to 133 MHz, a 16-bit host-peripheral interface, an 8/16-bit wide interface port, and a host of basic peripherals—DMA channels, counters, SPI and UART serial ports, and up to 32 general-purpose I/O pins.

For core licensing information, contact ChipWrights at www.chipwrights.com or (617) 928-0100.







    Reader Comments

    It would be good to mention in your article whether the core of the processor is available or not and how much is the price. I am searching for a processor VHDL/Verilog core that can be used in a chip for image processing

    Sara Bolouki -December 09, 2003
