

#### Enhance System Performance and Productivity by Leveraging DSP and Embedded Technologies in FPGA Designs



- Embedded and digital signal processing (DSP) design challenges and solutions
- DSP coprocessing
- Nios<sup>®</sup> II C-to-Hardware (C2H) Acceleration Compiler
- Quartus<sup>®</sup> II software highlighted features
- Conclusion

#### **Product Evolution**



#### Constant Demand for New Features; Higher Performance and Lower Costs



© 2007 Altera Corporation—Public

Altera, Stratix, Arria, Cyclone, MAX, HardCopy, Nios, Quartus, and MegaCore are trademarks of Altera Corporation

#### **Embedded and DSP Design Challenges**

#### Productivity



#### Performance

#### Flexibility

#### FPGAs Tackle These Challenges Head On




#### **Solution on Productivity–Tool**



# **Solution on Flexibility–FPGAs**



FPGA is the Poster Child for Flexibility; Rapidly Prototype System and Feature Fill Over Time


# Solution on Performance (1): FPGA Single Chip






#### **Solution on Performance (2): DSP+FPGA Coprocessing**






#### **DSP Coprocessing**

# **CPU Challenges**

- Processor speedup isn't there
- Performance limited by power
- Memory bandwidth limitations
- Single-core performance reaching a limit



#### Multi-core announcements by Intel, AMD, ARM, and others




#### **Processor Model Challenges For HPC**

- Memory bandwidth limited by package and pin count
- Multiple caches required to keep the microprocessor busy
- Multi-processor cache coherency problem eats up performance gains
- Most of the power consumption is in the cache and related controllers
- But many HPC applications derive little or no benefit from cache

Source: Prof. John Wawrzynek, BWRC, UC Berkeley




## **Overall Customer Requirements**

- Performance: 10X-100X algorithm and 3X-50X application acceleration
- **Productivity**: Simplicity of the tool chain; reduce the effort
- Power: Better performance-to-power ratio
- Price: Compared to alternatives

## **Performance—FPGA Algorithm Acceleration**

- 10X-100X at algorithm level
- Typically 3X-50X at application level
- Varies by vertical
  - 10X for medical imaging
  - 20-50X for financial

| Application                                                     | Processor only                                  | FPGA Coprocessing                        | Speed Up    |
|-----------------------------------------------------------------|-------------------------------------------------|------------------------------------------|-------------|
| Hough and inverse Hough<br>processing                           | 12 minutes processing time<br>Pentium 4-3 GHz   | 2 seconds of processing<br>time @ 20 MHz | 370x faster |
| AES 1MB data processing/crypto rate<br>Encryption<br>Decryption | 5,558 ms/1.51 Mbps<br>5,562 ms/1.51 Mbps        | 424 ms/19.7 Mbps<br>424 ms/19.7 Mbps     | 13x faster  |
| Smith-Waterman ssearch34 from<br>FASTA                          | 6461 sec processing time<br>(Opteron)           | 100 sec FPGA processing                  | 64x faster  |
| Multi-dimensional<br>hypercube search                           | 119.5 sec (Opteron 2.2 GHz)                     | 1.06 sec FPGA @ 140 MHz                  | 113x faster |
| Callable Monte-Carlo analysis (64,000 paths)                    | 100 sec processing time<br>(Opteron 2.4 GHz)    | 10 sec of processing @ 200<br>MHz FPGA   | 10x faster  |
| BJM financial analysis<br>(5M paths)                            | 6300 sec processing time<br>(Pentium 4-1.5 GHz) | 242 sec of processing @ 61<br>MHz FPGA   | 26x faster  |
| Mersenne Twister random number generation                       | 10M 32-bit integers/sec<br>(Opteron-2.2 GHz)    | 319M 32-bit integers/sec                 | 3x faster   |
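
Application-level speedups are lower than algorithm-level ones because only part of the run time is accelerated. Amdahl's law is the standard way to relate the two; the sketch below is an illustration and is not taken from the slides. The function and parameter names are assumptions.

```c
/* Amdahl's law: overall speedup when a fraction of the original run time
 * (accelerated_fraction, between 0.0 and 1.0) is sped up by kernel_speedup
 * and the rest runs unchanged. Names are illustrative assumptions. */
double application_speedup(double accelerated_fraction, double kernel_speedup)
{
    double unchanged_part = 1.0 - accelerated_fraction;
    double accelerated_part = accelerated_fraction / kernel_speedup;
    return 1.0 / (unchanged_part + accelerated_part);
}
```

For instance, if 90% of the run time sits in a kernel that the FPGA runs 100X faster, `application_speedup(0.9, 100.0)` gives roughly 9.2X for the whole application.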






## **Co-Processing Architectures**



#### Intel Xeon<sup>®</sup> Architecture

- Uses Front Side Bus (FSB) Interconnect
- Latest North Bridge has FSB interface for each CPU
- Xeon Quad Core presentation available: http://www.intel.com/pressroom/kits/quadcore/qc_pressbriefing.pdf

#### AMD Opteron<sup>™</sup> Architecture

- Uses HyperTransport Interconnect
- Industry-standard AMD64 technology
- Socket modules available for Opteron
- AMD Torrenza web site: http://enterprise.amd.com/us-en/AMD-Business/Technology-Home/Torrenza.aspx




## **Commercially Available Platform**



- HyperTransport links, memory interface, power supply, heat-sink
- Usable with any AMD Opteron (or future Intel CSI-enabled CPUs) server
- Usable in rack-mount or high-density "blade" server systems, where plug-in boards are not feasible








#### C2H (C to Hardware) Tool


#### **Boosting Software Performance**



#### If you choose a faster processor

- More expensive (\$\$\$\$)
- Consumes more power
- Requires board redesign




#### **Boosting Software Performance**



#### Multiply software performance. Accelerate only what's necessary. Don't pay for performance you don't need.




#### **Hardware Acceleration Flow**



Figure: hardware acceleration flow, plotted against time.




#### **Nios II C-to-Hardware Acceleration Compiler**

- Productivity tool that automates creation and integration of hardware accelerators
- Streamlines C acceleration—you don't have to know how to design hardware
- Integrated in familiar Eclipse-based Altera<sup>®</sup> Nios II Integrated Development Environment (IDE)

Figure: Nios II IDE editor context menu showing the "Accelerate this Function" entry (Right Click to Accelerate).




## **Step 1: Identify Software Bottlenecks**

```
int main(void)
{
  /* ...variable declarations... */
  init();
  while (!error && got_data()) {
    do_user_interface();
    gather_statistics();
    if (got_new_data())
      d_transform(in_buf, out_buf);
    check_for_errors();
  }
  cleanup();
  return 0;
}
```
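
One way to find the bottleneck is to time each stage of the loop separately. The following is a minimal sketch using the standard C `clock()` function, assuming a toolchain that provides it; `time_call_ms` and `sample_stage` are hypothetical names and do not come from the slides. A real Nios II project might use a profiler or hardware counters instead.

```c
#include <time.h>

/* Hypothetical stand-in for one stage of the main loop. */
void sample_stage(void)
{
    volatile unsigned long sum = 0;
    unsigned long i;
    for (i = 0; i < 100000UL; i++)
        sum += i;
}

/* Returns the CPU time, in milliseconds, consumed by a single call of fn. */
double time_call_ms(void (*fn)(void))
{
    clock_t start = clock();
    fn();
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_call_ms` on each stage in turn and comparing the results shows which function dominates the run time and is therefore the best candidate for acceleration.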






## Step 2: Right Click to Accelerate

Figure: Nios II IDE editor context menu opened over the `main()` source, showing the "Accelerate with the Nios II C2H Compiler" entry.


## What Does Nios II C2H Compiler Do?

- Generates a custom hardware accelerator from an ANSI C function



C2H: Nios II C-to-Hardware Acceleration Compiler
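
To make the idea concrete, here is a sketch of the kind of self-contained ANSI C function that could be selected for acceleration. The slides show only the call `d_transform(in_buf, out_buf)`; the body, the buffer types, and the `len` parameter below are illustrative assumptions.

```c
/* Hypothetical stand-in for d_transform(). The slides do not show its body;
 * the len parameter and the per-sample operation are assumptions. */
void d_transform(const unsigned char *in_buf, unsigned char *out_buf, int len)
{
    int i;
    for (i = 0; i < len; i++) {
        /* Simple per-sample operation: invert each 8-bit value. */
        out_buf[i] = (unsigned char)(255 - in_buf[i]);
    }
}
```

The function is plain C with a single loop over two buffers, so it behaves the same whether it runs as software on the processor or is replaced by generated hardware.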


#### **Dramatic Performance Boost**



#### **EEMBC Image Rotate (95 MHz)**



No compiler optimization

Compiler optimization, 1200 LEs for Nios II/s processor, 1200 LEs for accelerator

As fast as a 1.4-GHz processor for \$1.42 of logic in a Cyclone<sup>®</sup>





#### Quartus II Highlighted Features

- SOPC Builder
- PowerPlay
- TimeQuest

## **SOPC Builder – The Tool**

#### Automates block-based design

- System definition
- Component integration
- System verification
- Software generation
- Fast and easy
- Supports design reuse
  - Third-party intellectual property (IP) cores
  - Internally developed IP






#### **SOPC Builder Tool at a Glance**




## **SOPC Builder Generated System**



**Designer Only Needs to Worry About Peripheral Interface** 




## **Design Tool Flow**



- Program FPGA on the board
- Create software and run on the processor on the FPGA




#### **Quartus II Software: PowerPlay**




#### **PowerPlay Power Analyzer**

- Provides single interface for vectorless and simulation-based power estimation
- Uses improved power models
  - Based on HSPICE and silicon correlation
- Executing power analysis
  - Processing menu → Start → Start PowerPlay Power Analyzer
  - Scripting

#### **Three Parts to Good Power Estimates**

- 1. Accurate toggle rate data on each signal
- 2. Accurate power models of device circuitry
- 3. Knowledge of device operating conditions




#### **PowerPlay Power Inputs**

- Signal activity file (.saf)
  - ASCII text file generated by Quartus II software
- Value change dump (.vcd) file
  - Generated by Quartus II software and third-party simulators
- "Power Toggle Rate" and "Power Static Probability" assignments
  - Use Assignment Editor or Tcl file
  - Apply to specific entities/nodes
- Default toggle rate (12.5%)
  - Percentage of clock periods in which signal transitions
  - May also express as an absolute number of transitions per second
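
The two ways of expressing a toggle rate are related through the clock frequency. A small sketch of the conversion follows; the function name is an assumption, not a Quartus II API.

```c
/* Converts a toggle rate given as a percentage of clock periods into an
 * absolute number of transitions per second for a clock of clock_hz.
 * Illustrative helper, not part of any Altera tool. */
double toggle_rate_per_second(double toggle_percent, double clock_hz)
{
    return (toggle_percent / 100.0) * clock_hz;
}
```

With the default 12.5% toggle rate and a 100-MHz clock, `toggle_rate_per_second(12.5, 100e6)` gives 12.5 million transitions per second.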




## **Other Input Data Used**

- Operating conditions
- Clock timing assignments
  - Used to calculate internal signal activities
- Vectorless estimation
  - PowerPlay automatically derives signal activity for a node
  - Based on activity rates of signals feeding a node and functionality
  - Requires input signal activity data
- Capacitive loading
- Termination
- I/O standard
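
The idea behind vectorless estimation can be illustrated with a textbook probabilistic model for a two-input AND gate. This is a sketch under the assumption of statistically independent inputs; it is not a description of PowerPlay's internal algorithm, and all names are hypothetical.

```c
/* Static probability that a 2-input AND output is 1, given the static
 * probabilities of its inputs and assuming the inputs are independent. */
double and_static_prob(double p_a, double p_b)
{
    return p_a * p_b;
}

/* Toggle rate of a 2-input AND output. A transition on input a reaches
 * the output only while b is 1, and vice versa, so each input's toggle
 * rate is weighted by the other input's static probability. */
double and_toggle_rate(double p_a, double rate_a, double p_b, double rate_b)
{
    return p_b * rate_a + p_a * rate_b;
}
```

Applying such rules node by node, from the inputs toward the outputs, is why this style of estimation requires activity data on the input signals.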




#### **PowerPlay Power Analyzer Settings**

- Enter single or multiple SAF/VCD files
  - Allows simulation of subdesigns separately
- Enable glitch filtering to increase accuracy
  - Also recommend enabling glitch filtering during simulation
- Enter default toggle rates for inputs
- Enter toggle rate for rest of design
- Enable/disable vectorless estimation
Figure: PowerPlay Power Analyzer Settings dialog, with these options visible:

- Use input file(s) to initialize toggle rates and static probabilities during power analysis
- Perform glitch filtering on VCD files
- Write out signal activities used during power analysis
- Write signal activities to report file
- Write power dissipation by block to report file
- Default toggle rate used for inputs
- Default toggle rate used for rest of design: use default value (12.5%) or use vectorless estimation




#### **Faster TimeQuest Timing Analyzer**

Improves productivity with faster timing closure

- Improved compile times
- Reduced memory usage
- Improved timing constraint conversion from Altera's classic timing analyzer to Synopsys Design Constraints (SDC)






# **TimeQuest Timing Analyzer**

- Timing analysis
  - New, easy-to-use timing analyzer
  - Complete GUI environment for creating timing constraints and reports
  - Native support for SDC (Synopsys Design Constraints)
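
The core arithmetic of a setup timing check can be sketched in a few lines. This is a simplified model that assumes a single ideal clock with no skew or uncertainty, which a real static timing analyzer also accounts for. The function and parameter names are illustrative and are not TimeQuest commands.

```c
/* Setup slack in nanoseconds for a register-to-register path.
 * Positive slack means the path meets timing. Simplified model:
 * the launch and latch registers share one ideal clock, no skew. */
double setup_slack_ns(double clock_period_ns, double clk_to_q_ns,
                      double logic_delay_ns, double setup_time_ns)
{
    double data_arrival_ns = clk_to_q_ns + logic_delay_ns;
    double data_required_ns = clock_period_ns - setup_time_ns;
    return data_required_ns - data_arrival_ns;
}

/* Highest clock frequency, in MHz, at which the same path has zero slack. */
double fmax_mhz(double clk_to_q_ns, double logic_delay_ns, double setup_time_ns)
{
    return 1000.0 / (clk_to_q_ns + logic_delay_ns + setup_time_ns);
}
```

With a 10-ns clock, 0.5 ns clock-to-output delay, 7 ns of logic delay, and 0.5 ns setup time, the slack is 2 ns and the path could run at up to 125 MHz.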



# Only 65-nm FPGA Vendor with Native SDC Support




## **Top 5 Reasons to Use TimeQuest**

- Easier to use: TimeQuest provides an easier to use GUI and interactive reporting for analyzing timing
- Industry standard: SDC format is an established industry standard
  - Simpler and more concise timing format
- More powerful: SDC allows for faster, easier description and analysis of advanced design constructs
  - DDR (and other source-synchronous interfaces), complex clocks
- Designs run faster: TimeQuest more precisely analyzes timing behavior—gain 3-5% performance at 65 nm
- Interoperability: allows for easy migration of SDC constraints for ASIC and HardCopy<sup>®</sup> designs




#### **Quartus II Reference**

- Quartus II handbook
  - www.altera.com/literature/lit-qts.jsp
- Quartus II online demos
  - www.altera.com/quartusdemos
- Quartus II downloads
  - www.altera.com/download
- Technical support
  - www.altera.com/mysupport






# Conclusion

#### FPGAs, Tools, and DSP Coprocessing Together Enhance Performance




### Conclusion

- Embedded and DSP design challenges: productivity, performance, and flexibility
- DSP coprocessing, C2H tools, and new features in Quartus II help tackle those challenges
  - Coprocessing delivers 10X-100X algorithm-level and 3X-50X application-level acceleration
  - C2H tools create performance-enhancing hardware automatically (simply <u>*Right Click to Accelerate*</u>) without leaving the C domain
  - Quartus II SOPC Builder automates block-based design easily and efficiently; PowerPlay automates power reduction for maximum productivity; TimeQuest facilitates timing analysis for the 65-nm era and beyond






# Thank You!