Multimedia is taking off like gangbusters in entertainment, education, and medicine. But the huge amounts of data (text, speech, music, images, graphics, and video) make representation a challenging task indeed. Admittedly, much work has been done on effective representation by means of compression, storage, and transmission. Yet only scant attention was paid to content accessibility and manipulation, that is, until MPEG-4 arrived.
Through object-based representation, the crucial ingredient that distinguishes it from earlier MPEG standards, MPEG-4 enables users, for the first time, to combine graphics, text, and synthetic/natural objects into a single bit stream.
Another attribute that makes MPEG-4 so attractive for wireless technology is its support for scalable content. In fact, one of the initial goals that the developers had in mind was to provide tools and algorithms for very low bit-rate coding of audio/visual data. This means you encode just once, but acquire complete freedom to play back at different rates with acceptable quality for whatever communications environment is at hand. For example, in a mobile video-phone application, the user can request a higher frame rate and spatial resolution for the person talking, and a lower rate for the background objects.
But there's a catch. MPEG-4 is the most complex standard yet in the multimedia sector. Several papers at the International Solid-State Circuits Conference addressed this complexity, finding ways to enhance performance and minimize power consumption. One is a Session 9 paper entitled "A 90-mW, MPEG-4 Video CODEC LSI With The Capability For Core Profiles." Its authors are with Matsushita Electric Industrial Co. Ltd., Fukuoka, Osaka, and Kanagawa, Japan.
This chip contains approximately 31 million transistors on an 8.8- by 8.6-mm die. It's made on a 0.18-µm, 1.8-V, quad-metal CMOS process.
High performance, high flexibility, low power, and low cost are all necessities in an LSI design that is to optimize services based on object-based coding in mobile visual applications. As the authors point out, though, the power consumption of high-performance processors is high. They were, however, able to devise dedicated hardware that uses less power while delivering higher performance than a software implementation, even though the latter might be tailored to fit the defined function.
The chip comprises a 20-Mbit embedded DRAM, a programmable DSP, and eight dedicated hardware engines (Fig. 1). It can simultaneously encode and decode 15 frames/s in QCIF, or quarter common intermediate format (176 by 144 pixels), for H.263 and MPEG-4 simple profile@Level 1. It decodes CIF (352 by 288 pixels) at 30 frames/s for simple profile@Level 3, and QCIF at 15 frames/s for core profile@Level 1 with four objects. When operating at 54 MHz and performing simple@L1 simultaneous encoding and decoding, as well as core@L1 decoding, the chip consumes only 90 mW. There also are three interface units: a video processing unit, a memory interface, and a host interface.
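To put those frame formats in perspective, the short Python sketch below converts them into macroblock rates. It assumes the usual 16-by-16-pixel luma macroblock of H.263 and MPEG-4; the function names are illustrative and do not come from the paper.

```python
# Frame formats named in the article: luma width and height in pixels.
FORMATS = {"QCIF": (176, 144), "CIF": (352, 288)}

# Assumption: 16x16 luma macroblocks, as in H.263 and MPEG-4 video.
MACROBLOCK_EDGE = 16


def macroblocks_per_frame(fmt):
    """Number of macroblocks that tile one frame of the given format."""
    width, height = FORMATS[fmt]
    return (width // MACROBLOCK_EDGE) * (height // MACROBLOCK_EDGE)


def macroblocks_per_second(fmt, frames_per_second):
    """Macroblock throughput needed to sustain the given frame rate."""
    return macroblocks_per_frame(fmt) * frames_per_second


if __name__ == "__main__":
    for fmt, fps in (("QCIF", 15), ("CIF", 30)):
        rate = macroblocks_per_second(fmt, fps)
        print(f"{fmt} at {fps} frames/s: {rate} macroblocks/s")
```

Under that assumption, CIF decoding at 30 frames/s requires eight times the macroblock throughput of a single QCIF stream at 15 frames/s.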
The DSP core employs a vector pipeline architecture. The chip has two types of dedicated hardware engines. The first type sits in the vector pipeline and performs operations like DCT/Q, IQ/IDCT, and DCT/IDCT. The post-noise reduction and composite engines are of this type. The second type can be thought of as a coprocessor, with the engine and the DSP each performing independent operations. Motion estimation, variable-length coding, variable-length decoding, padding, and context-based binary arithmetic decoding all fall into that category.
Each block uses clock gating, reducing power consumption by 60%. When any of the dedicated hardware engines completes a task, its clock is disabled until the DSP starts the engine the next time.
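The start-and-stop behavior described above can be modeled in a few lines of Python. This is only a toy sketch: the class, the cycle counts, and the schedule are hypothetical, and it makes no attempt to reproduce the 60% figure. It simply counts how many cycles an engine's clock is actually running.

```python
class GatedEngine:
    """Toy model of a hardware engine whose clock runs only while it has work."""

    def __init__(self, name):
        self.name = name
        self.clock_enabled = False
        self.remaining = 0
        self.clocked_cycles = 0

    def start(self, task_cycles):
        """Called by the DSP: enable the clock and load a task."""
        if task_cycles <= 0:
            return
        self.remaining = task_cycles
        self.clock_enabled = True

    def tick(self):
        """Advance one system cycle; a gated engine sees no clock edge."""
        if not self.clock_enabled:
            return
        self.clocked_cycles += 1
        self.remaining -= 1
        if self.remaining == 0:
            # Task complete: gate the clock until the next start().
            self.clock_enabled = False


def clocked_fraction(total_cycles, starts):
    """Fraction of cycles with the clock running.

    `starts` maps a cycle index to the length of the task the DSP launches
    at that cycle (hypothetical schedule).
    """
    engine = GatedEngine("example-engine")
    for cycle in range(total_cycles):
        if cycle in starts and not engine.clock_enabled:
            engine.start(starts[cycle])
        engine.tick()
    return engine.clocked_cycles / total_cycles


if __name__ == "__main__":
    # Two tasks of 10 and 30 cycles within a 100-cycle window.
    print(clocked_fraction(100, {0: 10, 50: 30}))
```

In this made-up schedule the engine's clock toggles in only 40 of 100 cycles, which is the kind of idle time that gating turns into power savings.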
The three dedicated hardware engines devoted to core profile decoding are the context-based binary arithmetic decoding, padding, and composite engines. The context-based binary arithmetic decoding engine decodes the shape data one binary alpha block at a time. Note that a software implementation couldn't execute context-based binary arithmetic decoding at high speed, due to the many bit operations and the complex conditional branching.
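To see where the bit operations come from, consider how a context number is formed for each pixel of a binary alpha block before the arithmetic decoder can choose a probability. The Python sketch below builds a 10-bit context from already-decoded neighbors. The template layout is an assumption recalled from MPEG-4 intra shape coding and should be checked against the standard. Treating out-of-block neighbors as 0 is a simplification, and the function name is illustrative.

```python
# Assumed 10-pixel causal template, listed from bit 0 to bit 9 as
# (dx, dy) offsets from the current pixel. Verify against the standard.
TEMPLATE = [
    (-1, 0), (-2, 0),                              # current row
    (2, -1), (1, -1), (0, -1), (-1, -1), (-2, -1), # row above
    (1, -2), (0, -2), (-1, -2),                    # two rows above
]


def context_index(bab, x, y):
    """Build the context number for pixel (x, y) of a binary alpha block.

    `bab` is a list of rows holding 0/1 values. Neighbors outside the block
    are treated as 0 here (a simplification for illustration).
    """
    height = len(bab)
    width = len(bab[0])
    context = 0
    for bit, (dx, dy) in enumerate(TEMPLATE):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            context |= bab[ny][nx] << bit
    return context


if __name__ == "__main__":
    opaque = [[1] * 16 for _ in range(16)]
    print(context_index(opaque, 2, 2))  # all ten neighbors are set
    print(context_index(opaque, 0, 0))  # no neighbors inside the block
```

Every pixel needs ten neighbor lookups, shifts, and ORs, plus boundary checks, and each decoded pixel feeds the context of the next. That serial, branch-heavy pattern is what makes the job awkward for a general-purpose DSP.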
To reduce power consumption in the external I/O circuits, the chip employs an embedded DRAM. Four 4-Mbit DRAM macros for the core functions and two 2-Mbit DRAM macros for the display are integrated on a single chip, for a total of 20 Mbits.
In a DRAM, the higher the access activity, the larger the access current. The access current also depends on the memory capacity per macro. Dividing the DRAM into successively smaller macros therefore reduces the power consumption of the embedded DRAM. The area of a multi-macro configuration, however, is larger than that of a single-macro scheme. A configuration comprising four 4-Mbit DRAM macros is used here. The estimated access activity for simple@L1 simultaneous encoding and decoding is about 15%, whereas the estimate that includes graphics data is around 50%. Note that the access activities of the work area and the display area are accounted for separately.
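The tradeoff can be illustrated with a deliberately crude Python model. It assumes that access power scales linearly with both access activity and the capacity of the macro being accessed. The article only says the current "depends on" these quantities, so the linear form, the arbitrary units, and the function names are all assumptions. The macro sizes and activity figures are taken from the text.

```python
# Macro sizes in Mbits, as given in the article.
CORE_MACROS = [4, 4, 4, 4]
DISPLAY_MACROS = [2, 2]


def total_capacity_mbit():
    """Total embedded DRAM capacity across all macros."""
    return sum(CORE_MACROS) + sum(DISPLAY_MACROS)


def relative_access_power(activity, macro_mbit):
    """Access power in arbitrary units.

    Assumption: power is proportional to access activity (0..1) times the
    capacity of the one macro being accessed. This is an illustrative
    model, not the authors' equation.
    """
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity must lie between 0 and 1")
    return activity * macro_mbit


def single_vs_split(total_mbit, n_macros, activity):
    """Compare one large macro with n equal macros, one accessed at a time."""
    single = relative_access_power(activity, total_mbit)
    split = relative_access_power(activity, total_mbit / n_macros)
    return single, split


if __name__ == "__main__":
    print("total capacity:", total_capacity_mbit(), "Mbits")
    for activity in (0.15, 0.50):
        single, split = single_vs_split(sum(CORE_MACROS), len(CORE_MACROS), activity)
        print(f"activity {activity:.0%}: single {single:.2f}, split {split:.2f}")
```

Under this linear assumption, splitting the 16 Mbits of core memory into four macros cuts the access term to a quarter at any activity level. The model ignores the area penalty of the extra macros, which is what stops the division from going further.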