[Product Innovation]
Quad 64-Bit Multiprocessor Targets Comm Applications
Construct a super computer or super switch using chip’s advanced HyperTransport/SPI-4 links.
Getting data to and from a processor quickly is key to high-performance network processing. Broadcom's new BCM1400 multiprocessor tackles this problem with a trio of flexible, advanced HyperTransport/SPI-4 Phase 2 links. Of course, packing four 64-bit MIPS processors into the same package didn't hurt either. The result is a chip that provides multiprocessing support alone or in an array of HyperTransport-linked chips.
The BCM1400 targets communication-oriented applications that need significant computational support, like Internet service routers and switches with deep content switching and differentiated services such as quality-of-service (QoS) and virtual private networks (VPNs). In addition, the BCM1400 addresses Internet-Protocol (IP) servers and subscriber-management platforms, servers supporting high computational requirements for scientific or Enterprise Java environments, and wireless infrastructure equipment. The multiprocessing architecture also makes it suitable for scientific and embedded applications requiring significant computational capabilities.
The chip contains a number of peripherals along with its sophisticated memory and communication support (see the table). Up to eight chips can be connected via the HyperTransport links, for a 32-processor symmetrical multiprocessing (SMP) system (see "Multifunctional HyperTransport," p. 48).
Differentiating the BCM1400 SMP support from most small-scale SMP systems with two to eight processors is its use of a nonuniform memory access (NUMA) architecture. This is similar to the NUMA used with AMD's new Opteron 64-bit CPU. The NUMA architecture is often used by medium-scale multiprocessor systems with eight to 32 processors. Broadcom's solution is unusual because of its high integration, low power consumption, and multiplexing of memory and I/O traffic on the same link.
In a conventional SMP system, all processors have the same memory access time. A bus or switch acts as an interface between processors and the memory subsystem. Cache coherence is maintained by monitoring the bus or the switch traffic.
With NUMA, the memory address space is made up of the combined local memory from each node in the system. A processor can access its local memory faster than nonlocal memory. NUMA systems have the advantage of being easily expanded, while adding a processor to a conventional SMP shared memory architecture is more difficult because an additional port is needed.
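The idea of a single address space assembled from each node's local memory can be illustrated with a minimal sketch. The class name, the contiguous layout, and the memory sizes below are assumptions made for illustration; the article does not describe how the BCM1400 actually lays out or decodes addresses.

```python
from bisect import bisect_right


class NumaAddressMap:
    """Toy model: one global address space built from per-node local memory.

    Hypothetical layout: node 0's memory comes first, then node 1's, and so on.
    """

    def __init__(self, local_sizes):
        # local_sizes[i] is the size in bytes of node i's local memory.
        self.bases = []
        base = 0
        for size in local_sizes:
            self.bases.append(base)
            base += size
        self.total = base  # size of the combined address space

    def home_node(self, addr):
        """Return the node whose local memory holds this global address."""
        if not 0 <= addr < self.total:
            raise ValueError("address outside the combined address space")
        return bisect_right(self.bases, addr) - 1

    def is_local(self, requester, addr):
        """True if the requesting node can serve the access from its own memory."""
        return self.home_node(addr) == requester


GIB = 1 << 30
# Three nodes with made-up memory sizes of 1, 1, and 2 Gbytes.
amap = NumaAddressMap([GIB, GIB, 2 * GIB])
print(amap.total // GIB)         # combined size in Gbytes
print(amap.home_node(GIB + 5))   # node that owns this address
print(amap.is_local(0, GIB + 5)) # whether node 0 can serve it locally
```

In this model, expanding the system amounts to appending another entry to the size list, which mirrors why NUMA systems grow more easily than a shared-bus design.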
Broadcom uses a cache-coherent form of NUMA, or ccNUMA. This allows on-chip caches to remain up to date even while data moves through the processor/memory interconnect. The BCM1400's on-chip double-data-rate (DDR) memory controller supports the chip's local, off-chip memory. Its HyperTransport links provide ccNUMA support.
Three-Way HyperTransport/SPI-4 Links: The BCM1400's triple HyperTransport link architecture is critical to its use in communication and multichip multiprocessing support (see the figure). Each link can be configured as an 8- or 16-bit HyperTransport connection, or as a streaming SPI-4 interface. The SPI-4 support includes hardware hash and route acceleration functions.
In addition, the HyperTransport links work with a mix of HyperTransport transactions, including encapsulated SPI-4 packets and nonlocal NUMA memory access.
The key is that hardware handles movement of information. For example, nonlocal memory accesses are determined by the memory mapping hardware that generates a HyperTransport request for reads or writes. These packets are automatically routed to the proper node that handles memory requests via its local memory. Operating systems simply set up the memory maps and HyperTransport links.
Although ccNUMA incurs an access-time penalty, the effects of using nonlocal memory are mitigated by on-chip caches and high-speed HyperTransport transfers. There's an initial delay when filling a cache entry, but subsequent accesses by a processor to that entry run at cache speeds, which are faster than even local memory accesses.
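A rough way to see how caching softens the nonlocal penalty is the standard average-access-time estimate. The sketch below is a simplified model, not a measurement: the function name, the 95% hit rate, and all latency figures are hypothetical, since the article gives no timing numbers for the BCM1400.

```python
def effective_access_time(hit_rate, cache_ns, memory_ns):
    """Average time per access: hits cost cache_ns, misses cost memory_ns."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


# All values below are made up for illustration only.
CACHE_NS = 5.0     # assumed cache hit time
LOCAL_NS = 100.0   # assumed local memory latency
REMOTE_NS = 300.0  # assumed nonlocal memory latency over the interconnect
HIT_RATE = 0.95    # assumed cache hit rate

local_avg = effective_access_time(HIT_RATE, CACHE_NS, LOCAL_NS)
remote_avg = effective_access_time(HIT_RATE, CACHE_NS, REMOTE_NS)
print(f"local average:  {local_avg:.2f} ns")
print(f"remote average: {remote_avg:.2f} ns")
print(f"raw penalty {REMOTE_NS / LOCAL_NS:.1f}x, "
      f"average penalty {remote_avg / local_avg:.2f}x")
```

Under these assumed numbers, a threefold raw latency gap shrinks to roughly a twofold gap in average access time, and it narrows further as the hit rate rises.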
Code prefetching effectively masks the latency of the system. A large 1-Mbyte, level 2 cache per BCM1400 means that only small, random, nonlocal memory accesses will cause any significant slowdown. Moving large amounts of sequential memory via nonlocal memory isn't a problem as only the transfer initiation incurs a latency penalty, a small fraction of the time necessary to send the block of data. The 64-kbyte level 1 cache per processor is split between a 32-kbyte instruction and 32-kbyte data cache.
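The point about large sequential transfers can be checked with simple arithmetic: a one-time initiation latency is divided by the total transfer time, so its share falls as the block grows. The sketch below assumes a fixed startup latency and a constant link bandwidth; the function name and both figures are hypothetical and are not BCM1400 specifications.

```python
def startup_fraction(block_bytes, startup_ns, bytes_per_ns):
    """Share of total transfer time spent on the one-time initiation latency."""
    if block_bytes <= 0 or bytes_per_ns <= 0 or startup_ns < 0:
        raise ValueError("sizes and rates must be positive, latency non-negative")
    stream_ns = block_bytes / bytes_per_ns  # time spent actually moving data
    return startup_ns / (startup_ns + stream_ns)


# Made-up figures: 300 ns to initiate a nonlocal transfer, 1 byte/ns sustained.
STARTUP_NS = 300.0
BYTES_PER_NS = 1.0

for size in (32, 1024, 64 * 1024, 1024 * 1024):
    share = startup_fraction(size, STARTUP_NS, BYTES_PER_NS)
    print(f"{size:>8} bytes: initiation is {share:.1%} of transfer time")
```

With these assumed values, initiation dominates a 32-byte access but accounts for well under 1% of a 64-kbyte block, which matches the article's distinction between small random accesses and large sequential moves.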