

#### Overview of Talk

- High-Level Introduction
- Detailed Cell Behavior
- Benefits of the Architecture
- Applications of the system
- Status and future plans

## HIGH LEVEL DESCRIPTION OF A CELL MATRIX

#### What is a Cell Matrix?

- Configurable Hardware:
  - Fixed Hardware Structure
  - Variable Functionality provided by subsequent configuration step

#### What is a Cell Matrix?

- Configurable Hardware
- Fine-Grained
  - Simple set of *cells* with very simple behavior
  - More complex behavior is built up from collections of cells
  - Cells are individually configurable via per-cell configuration memory

#### What is a Cell Matrix?

- Configurable Hardware
- Fine-Grained
- Cells are interconnected via fixed, nearest-neighbor topology
  - Cells exchange information with their neighbors

## This Might Sound a Lot Like an FPGA

BUT...

# An FPGA is an **externally** configured device

- You send configuration information into the device, and the device responds accordingly.
- Configuration generation (external) is distinct from information processing (internal)



External Controller (PC, FPGA, etc.)



External Controller (PC, FPGA, etc.)



External Controlleı (PC, FPGA, etc.)



External Controller (PC, FPGA, etc.)

# A Cell Matrix is an **internally** configured device

- A cell within the matrix can write a neighboring cell's configuration memory (CODE)
  - This allows a cell to change a neighbor's behavior
- A cell can also read a neighbor's current configuration

## Code and Data Processing Within a Cell Matrix



Cell Matrix

#### Information Flow Inside a Cell Matrix

- In addition to cells exchanging DATA with each other, cells also exchange CODE
- Fundamental difference from FPGA

Also different from 2 FPGAs...

#### **KEY CONCEPT**

- Within a cell matrix, CODE and DATA are interchangeable
- Travel over same physical lines
- Indistinguishable from each other
- This is called Duality #1
  - Keep it in mind!

#### **DUALITY #1**

 Means you can build circuits which process configuration information in the same way they process data

#### **DUALITY #1**

 For example, you can store several configurations, select them via multiplexer, and so on, then use them to configure cells

### Example



#### Stranger Examples

- Adder could be incrementing a configuration string!
- You could logically OR two configuration strings

#### Duality #2

- The controlling cell and the cell being controlled are interchangeable
- No distinction between sources and targets
- Probably most important feature!

### Example of Duality #2



### Example of Duality #2



### Example of Duality #2











#### DETAILS OF CELL-LEVEL BEHAVIOR

### Single Cell Behavior



## Cells are tiled in regular fashion Nearest neighbor interconnect



#### Example: One-Bit Full Adder



#### Example: Four-Bit Adder



#### Example: ALU



f<sub>1</sub> f<sub>0</sub> X 0 0 A + B 0 1 A 1 0 A & B 1 1 A I B

### Example: D-Flip Flop



#### Example: Memory



## What About Code Processing?

- Each cell operates in one of two modes
- Above examples show cells in D mode, where they process information as Data
- A D-mode cell's configuration memory is being executed (treated as code)

## **C-Mode Processing**

- When a cell is in C-mode, it is processing information as Code
- A C-mode cell's configuration memory is being read and written (treated as data)

## Duality #3

- D-Mode Cell
  - Treats Information as Data
  - TreatsConfigurationMemory as Code

- C-Mode Cell
  - Treats Information as Code
  - TreatsConfigurationMemory as Data

If any C input=1, the cell enters
 C-Mode



 All outputs are driven to 0 except on the Active Side



• D<sub>IN</sub> from the active side is serially shifted into the cell's truth table



• As the truth table is shifted, it is sent serially to  $D_{OUT}$  on the active side



## Allows a cell to read a neighbor's configuration

- Assert C output to a neighbor
- Read D output from that neighbor
- Supply new configuration via D output to that neighbor

### Cell Read



#### Cell Read



## Final configuration



# Configuration bits can be processed

Example: Cell Move

### Cell Move



## Source and Target are in C-Mode



## Final Configuration Source has been moved!



## Minor modification for nondestructive read

## Cell Copy



## Source and Target are in C-Mode



# Final Configuration Source has been copied



## Using Duality #2 at the Cell Level

 Cells which are configured can subsequently configure other cells

## First, configure target



# Next, target configures secondary target



Secondary Target

## Finally, target returns to C-mode



Secondary Target now executing

## This allows cells to configure nonneighboring cells

- Very important capability
- Configurations generally require close cooperation among groups of cells

## This allows cells to configure nonneighboring cells

- Very important capability
- Configurations generally requires close cooperation among groups of cells

But why bother?

## BENEFITS OF THE CELL MATRIX ARCHITECTURE

Speed

Parallelism

- Parallelism
  - Hardware is parallel



- Parallelism
  - Hardware is parallel
  - Fine-grained design

- Parallelism
  - Hardware is parallel
  - Fine-grained design
  - Can custom design parallel hardware to your particular problem

- Parallelism
- Size

- Parallelism
- Size
  - High cell count

- Parallelism
- Size
  - High cell count Lots of parallelism

- Parallelism
- Size
  - High cell count Lots of parallelism
  - Still need to be smart about how you use
     HW

- Parallelism
- Size
  - High cell count Lots of parallelism
  - Still need to be smart about how you use
     HW
  - What is a high cell count?

# of gates in large FPGA:

# of gates in large FPGA:1 Million

- # of gates in large FPGA:1 Million
- # of gates in large ASIC:

- # of gates in large FPGA:1 Million
- # of gates in large ASIC:50 Million

- # of gates in large FPGA:1 Million
- # of gates in large ASIC:50 Million
- # of gates in the world:

- # of gates in large FPGA:1 Million
- # of gates in large ASIC:50 Million
- # of gates in the world:< 10</li>
   (J. Tour, MNT 2000)

- # of gates in large FPGA:1 Million
- # of gates in large ASIC:50 Million
- # of gates in the world:< 10
- # of carbon atoms in a pencil:

- # of gates in large FPGA:1 Million
- # of gates in large ASIC:50 Million
- # of gates in the world:< 10</li>
- # of carbon atoms in a pencil:>10

23

# Avogadro Machine: On order of 10 switches

Order 10<sup>21</sup>Cells

- Parallelism
- Size
- Manufacturing Support

- Parallelism
- Size
- Manufacturing Support
  - Scalable

- Parallelism
- Size
- Manufacturing Support
  - Scalable
  - Fault Tolerant
    - Only local interactions
    - Very few shared signals
    - Homogeneous hardware

- Parallelism
- Size
- Manufacturing Support
  - Scalable
  - Fault Tolerant
  - Also can be 3-D architecture

- Parallelism
- Size
- Manufacturing Support
  - Scalable
  - Fault Tolerant
  - Also can be 3-D architecture
  - Good features for nanotechnology

- Parallelism
- Size
- Manufacturing Support

 Adds up to Massively Parallel Data Processing

### QUESTION

How do you control a billion trillion cells?

#### **IDEA**

- Cell Matrix supports massively parallel data processing
- Code and Data are interchangeable
- Can we do massively parallel code processing?



# APPLICATIONS OF THE CELL MATRIX ARCHITECTURE

# Example: Massively Parallel Configuration A Space Filling Circuit

- Array of space filling blocks (multi-cell)
- Block will pass config info to bottom and right
- If adjacent to another block, pass as data
- Otherwise configure cells

Configure first space filling block



 Use that block to configure a second block



Use those blocks to configure two more



Use those blocks to configure two more



Use those blocks to configure three more



# Example: Massively Parallel Configuration

Space Filling Circuit

### **Growth Rate**

- Step 0: 1 block
- Step 1: 2 blocks
- Step 2: 4 blocks
- Step 3: 6 blocks
- Step 4: 9 blocks
- Step (even) n:((n+2)/2)<sup>2</sup> blocks

# Even faster on 3-D system

• After n steps # blocks = $((n+3)/3)^3$ 

• Want to configure 10<sup>19</sup> blocks

- Want to configure 10<sup>19</sup> blocks
- Assume 1 µsec per block

- Want to configure 10<sup>19</sup> blocks
- Assume 1 µsec per block
- Sequential FPGA requires 10<sup>19</sup> μsec

- Want to configure 10<sup>19</sup> blocks
- Assume 1 µsec per block
- Sequential FPGA requires 10<sup>19</sup> μsec 316880 Years

- Want to configure 10<sup>19</sup> blocks
- Assume 1 μsec per block
- Sequential FPGA requires 10<sup>19</sup> μsec 316880 Years (and 10.57 months)

### How about on a 3-D cell matrix?

•  $((n+3)/3)^3 = 10^{19}$ 

### How about on a 3-D cell matrix?

- $((n+3)/3)^3 = 10^{19}$
- n=6.46 million steps

#### How about on a 3-D cell matrix?

- $((n+3)/3)^3 = 10^{19}$
- n=6.46 million steps
- each step takes 1 μsec

#### How about on a 3-D cell matrix?

- $((n+3)/3)^3 = 10^{19}$
- n=6.46 million steps
- each step takes 1 μsec
- Total time=6.46 sec!

## This is the difference between an FPGA and a cell matrix

<u>FPGA</u> 316880 years Cell Matrix 6.46 seconds

# Space filling circuit is just one example

- Other parallel configuration techniques are possible
- Differentiation also possible

#### What else can we build?

- Space filling circuit is just a carrier
- Piggyback useful circuits on top of it

## Example: Large Search Problem

- Want to find value which solves some problem
- Build a basic search block which tests one particular value
- Tile these in an array, each block testing its own value
- Propagate inputs and answer back through array

# Basic Search Block Solve f(x)=0



## Array of blocks



### Would like uniform block



#### Now all blocks are identical



## Array of identical blocks



# Allows entire search space to be tested in a single iteration

- Well suited to problems that have:
  - Extremely large search space
  - Chaotic behavior
  - Complex algorithm

## Good example

56-bit DES Cracking

## **DES Cracking**

- Large search problem
- Build 2<sup>56</sup> search blocks
- Distribute known plain/cipher text
- Each block tests one unique key
- Correct key propagates out of array

## **DES Cracking**

- Can easily fit all 2<sup>56</sup> blocks on an Avogadro machine
- Can configure 2<sup>56</sup> blocks in a few seconds
- Can crack a text pair in submillisecond time
- Fast enough for 256K (DSL)



3 N - 0

# Can re-think old problems to find new algorithms

 Such as wide-bit arithmetic using wide-bit integer processors

## Combinatorial Multiplier



#### Other Useful Parallel Circuits

- Combinational nxn matrix multiplier
- APL machine
- Finite element method w/dynamic mesh refinement
- Simulation of physical processes

## Can Use Cell Matrix-Specific Features to Develop New, Unique Circuits

## **Expanding Counter**



## **Expanding Counter**



# This is an example of autonomous, self-modifying hardware!

## Leads to Interesting Possibilities

- Virtual Hardware
- Hardware Swapping
- Architecture Tuning



### **Evolvable Hardware**

## Ringed GA





Figure 6. Single Individual for the RGA. Configuration shown and comments below are for individual along top of a ring and not in a corner. Actual evolving circuit is stored in the Exploded Grid. Row, Col, Cmd, Input and Des Out come from global controller, and are sent in parallel to all individuals. Cmd selects the RGA operation, Row and Col access individual PIG cells within the individual's circuit, Input is a test input, Des Out is the desired output from the circuit. All other I/O lines connect to immediately neighboring individuals. I lines carry configuration information. Perf indicates an individual along the row has achieved a perfect score. Configuration of leftmost perfect individual appears along I Peerf.

#### **Fault Tolerance**

- Prior to building blocks of cells, can run self-tests on region inside matrix
- Then mark bad blocks and avoid internally
- Guardwalls
- Student working on utilizing bad block info

### STATUS AND FUTURE WORK

 Cell Matrix Corporation (www.cellmatrix.com)

- Cell Matrix Corporation
- Patents issued (US #5,886,537) and pending

- Cell Matrix Corporation
- Patents issued and pending
- Software toolset
  - Simulators
  - Layout Tools
  - Viewers
  - Debuggers
  - Compilers

- Cell Matrix Corporation
- Patents issued and pending
- Software toolset
- Spreading the word

- Cell Matrix Corporation
- Patents issued and pending
- Software toolset
- Spreading the word
- NASA SBIR

- Cell Matrix Corporation
- Patents issued and pending
- Software toolset
- Spreading the word
- NASA SBIR
- Collaborations

#### **Current Status**

- Cell Matrix Corporation
- Patents issued and pending
- Software toolset
- Spreading the word
- NASA SBIR
- Collaborations
- Students

# Future Work More Circuits

- Different parallel configuration schemes
- Numerous specific problems
- Libraries

# Future Work Application to Neural Networks

- Possible collaboration with Hugo de Garis (Starlab)
- Populations of neural networks
- Autonomous evolution

# Future Work Software Development

- 3-D Toolset
- Networked simulator
- Better debugging
- Better viewing
- Higher-level tools

# Future Work Evolvable Hardware

- Preliminary D-mode work done
- Evolving C-mode circuits?
- Hard to control, but can be done

# Future Work Hardware Fabrication

- FPGA implementation-up to 10,000 cells
- ASIC-Up to one million?
  - Need a foundry
- 3-D Manufacturing: ???

# Future Work Nanotechnology

- Unlimited potential
- Initial contacts
- Want to focus nanotechnology research

# Future Work Looking for students

- Lots of good research work
- Hardware (implementation, fault tolerance)
- Software (tool development)
- Cell Matrix (circuit development, methodologies, etc.)

# Acknowledgements

- Lisa Durbeck-Cell Matrix Expert, CEO
- Murali D. Raju-Co-inventor
- Lawrence B. Henry III-Co-inventor
- Hugo de Garis
- Jim Pearn (www.artificialbrains.com)
- Dimitri Yatsenko

#### For More Information

- Cell Matrix Corporation (www.cellmatrix.com)
- nmacias@cellmatrix.com
- US (877) 473-0882
   (801) 414-8900 (Nick)
- Come talk to me!

### Space Vs. Time

Fixed hardware uses
 O(1) space and
 O(n²) time



### Space Vs. Time

 Reconfigurable HW used to build n-bit accumulators and logic uses O(n) space and O(n) time



# Space Vs. Time

 Reconfigurable HW used to build an nxn multiplication array uses O(n²) space and O(1) time

