Restricted Boltzmann Machines (RBMs) are an effective model for machine learning; however, they require a significant amount of processing time. In this study, we propose a highly parallel and highly flexible architecture that combines small, completely parallel RBM blocks. This proposal addresses the problems of calculation speed and exponential increases in circuit scale, and we show that the architecture can be tuned to trade these two factors off against each other. Furthermore, our FPGA implementation achieves a 134× speed-up with respect to a conventional CPU.

Restricted Boltzmann Machines (RBMs) are an important component of Deep Belief Networks (DBNs). Moreover, DBNs have achieved high-quality results in many pattern recognition applications [

However, these architectures do not fully exploit the high parallelism of neural networks, because each layer is processed sequentially. Higher parallelism requires substantial circuit scaling, resulting in a trade-off between calculation speed and circuit scale. Therefore, we focus on high parallelism in this study. Furthermore, addressing this trade-off requires high scalability, so that the design can be tuned to satisfy the designers' requirements.

In this study, we have devised an architecture that divides RBMs into blocks, preventing exponential circuit-scale increases as well as sequential calculation with respect to the number of blocks. Each RBM block performs completely parallel calculations. The trade-off governed by the circuit scale of each block is verified in the following sections.

Section 2 describes the RBM algorithm. Section 3 explains the architecture proposed in this study. Section 4 studies the necessary bit precision of our architecture. Section 5 shows the register-transfer level (RTL) simulation results and FPGA implementation results. Section 6 provides our conclusions.

RBMs are a stochastic neural network model consisting of two layers (a visible layer and a hidden layer). As shown in

Different pairwise potential functions [
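For the binary RBM considered here, the energy function takes the standard form (reconstructed to match the symbol definitions that follow):

```latex
E(\mathbf{v}, \mathbf{h}) = -\sum_{i} b_i v_i - \sum_{j} c_j h_j - \sum_{i,j} v_i w_{ij} h_j
```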

where E is the energy function, w is the weight between the visible layer and hidden layer, b is the visible bias, and c is the hidden bias. Because the connection of each neuron is restricted to the neurons in the complementary layer, the probability of firing of each neuron is given by
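Because of this restriction, the conditional firing probabilities factorize over neurons; in the standard form (consistent with the symbols above) they are

```latex
p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(c_j + \sum_i w_{ij} v_i\Big), \qquad
p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(b_i + \sum_j w_{ij} h_j\Big), \qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```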

The RBM processing flow consists of two repeating steps. First, input data is added to the visible layer, and the hidden layer is calculated using the visible layer’s values as input. At this point, all visible layer and hidden layer combinations are calculated. Second, the visible layer is calculated, using sampling results from the hidden layer as input. By repeating these steps, it is possible to approximate the update formula given by
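The update formulas take the standard contrastive-divergence form (reconstructed here to match the symbols defined next; angle brackets denote averages over the data at step 0 and over the sampled reconstruction at step k):

```latex
\Delta w_{ij} = \eta \left( \langle v_i h_j \rangle_{0} - \langle v_i h_j \rangle_{k} \right), \qquad
\Delta b_i = \eta \left( \langle v_i \rangle_{0} - \langle v_i \rangle_{k} \right), \qquad
\Delta c_j = \eta \left( \langle h_j \rangle_{0} - \langle h_j \rangle_{k} \right)
```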

where η is the learning rate and k is the sampling number. This update formula can be obtained by using contrastive divergence (CD) learning [
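The two-step processing flow and the update rule can be sketched in software. The following is a minimal NumPy sketch of one CD-1 step (k = 1); the function name and the use of probabilities rather than samples in the gradient are illustrative choices, not the paper's hardware data path:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, eta=0.1):
    """One CD-1 learning step: v0 -> h0 -> v1 -> h1."""
    # v -> h: compute hidden firing probabilities, then sample
    p_h0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # h -> v: reconstruct the visible layer from the hidden samples
    p_v1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    # v -> h again, for the negative-phase statistics
    p_h1 = sigmoid(v1 @ W + c)
    # contrastive-divergence updates (positive minus negative phase)
    W += eta * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b += eta * (v0 - v1)
    c += eta * (p_h0 - p_h1)
    return W, b, c
```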

To construct DBNs, a hidden layer that has completed learning is used as the visible layer of the next RBM. Deep networks are constructed by stacking these RBMs. After using back-propagation to perform fine-tuning, the DBNs are completed [

In this Section, we describe the overall data flow and specifications of the proposed architecture.

An overall diagram of the architecture is presented in

Input data is assumed to consist of unsigned 8-bit fixed-point numbers representing continuous values from 0 through 1. The bit precision of the other parameters is discussed in detail in Section 4. The proposed architecture implements the most basic binary model. We have adopted this model as a foundation because, although it is best suited to binary data, it can be extended to continuous data and other models with minor modifications. With binary data as input, every multiplier can be replaced with an AND operation.
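The multiplier-to-AND replacement works because multiplying a multi-bit weight by a 1-bit input either keeps the weight or zeroes it. A minimal sketch (the function name and the 8-bit width are illustrative assumptions):

```python
def select_weight(v_bit, w, width=8):
    """Multiply a `width`-bit weight by a 1-bit input using a bitwise AND:
    replicating the input bit across the weight's width gives the mask
    0x00 (input 0) or all-ones (input 1)."""
    mask = v_bit * ((1 << width) - 1)
    return w & mask
```

In hardware this is one AND gate per weight bit, instead of a full multiplier.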

In this Section, we present an example that consists of four RBM blocks with four inputs and four outputs. A detailed block diagram using RBMs divided into four blocks is shown in

In Phase 1, each RBM block multiplies all the inputs and connection weights in parallel, and calculates the respective outputs using an adder tree. This approach is possible because each RBM block is configured on a small scale. Each RBM block has its own local memory for saving parameters, which are only used inside that block.
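The adder tree reduces the products pairwise in log2(n) levels rather than accumulating them one by one. A minimal software model of that reduction order (the function name is illustrative):

```python
def adder_tree_sum(terms):
    """Sum a list the way a hardware adder tree does: add adjacent
    pairs at each level until one value remains. Odd-length levels
    are padded with 0, mirroring an unused adder input tied low."""
    level = list(terms)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]
```

With n parallel multipliers (or AND gates) feeding the tree, a block's sum-product completes in a fixed, small number of pipeline stages.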

Each shift controller is responsible for shifting the propagated inputs and outputs so that all combinations are calculated. This movement of data differs between the v → h operation and the h → v operation.

In the h → v operation, because the arrangement sequence of the connection weights read from the local memory is not identical to the v → h case, the appropriate connection weights must be re-selected. To this aim, the Fredkin gate is used to re-order the weights (w_{ij} is exchanged with w_{ji}); hence, a Fredkin gate is placed at the location of every vertical connection line of the select-weights array, for a total of (n^{2} − n)/2 gates. This approach enables calculating data propagation in both directions using one single circuit, and is possible because each RBM block is configured on a small scale.
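The Fredkin gate is a controlled swap: when the control input is 1, its two data inputs are exchanged; otherwise they pass through unchanged. A minimal software model:

```python
def fredkin(ctrl, a, b):
    """Fredkin (controlled-swap) gate: outputs (ctrl, b, a) when
    ctrl is 1, and (ctrl, a, b) when ctrl is 0."""
    return (ctrl, b, a) if ctrl else (ctrl, a, b)
```

Driving the control lines according to the propagation direction lets the same weight array serve both the v → h and the h → v sum-products.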

In Phase 2, the sigmoid calculation and the binomial-distribution (sampling) calculation are performed. After completion of the sum-product operation, a sigmoid function is applied to the sum. This result is processed by the sampling operation, which yields the output of each neuron according to its firing probability, as expressed in Equations (4)-(6). The sigmoid function is approximated as a piecewise linear function to facilitate hardware implementation. The piecewise linear function is given by

where x is the input of the sigmoid function. In order to realize this equation, an arithmetic shifter [ is used in place of a multiplier, because multiplying by the slope coefficient 2^{−2} reduces to a 2-bit arithmetic right shift. The corresponding circuit schematic is shown in
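A minimal sketch of a shift-friendly piecewise-linear sigmoid: the slope 2^{−2} matches the arithmetic-shift implementation described above, while the exact breakpoints (±2 here, where the line meets 0 and 1) are an assumption for illustration:

```python
def plan_sigmoid(x):
    """Piecewise-linear sigmoid approximation: y = x/4 + 1/2 in the
    central region, clamped to 0 and 1 outside it. The x/4 term is a
    2-bit arithmetic right shift in hardware (coefficient 2**-2)."""
    if x <= -2.0:
        return 0.0
    if x >= 2.0:
        return 1.0
    return 0.25 * x + 0.5
```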

In Phase 3, the parameter update calculation is performed. While propagation between the visible layer and hidden layer is repeated during the learning process, the update unit calculates the update formulas in Equations (7)-(9).

The necessary bit precision of the parameters (weights, visible biases, and hidden biases) was evaluated using a custom emulator written in C, which accurately simulates the proposed architecture in terms of data processing and bit precision. The MNIST dataset [
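The core of such a bit-precision emulation is quantizing every parameter to a fixed-point grid after each operation. A minimal sketch of that quantization step (the function name, signed representation, and saturating behavior are illustrative assumptions, not the paper's emulator):

```python
def quantize(x, int_bits, frac_bits):
    """Round x to a signed fixed-point grid with `int_bits` integer
    bits (including sign) and `frac_bits` fractional bits, saturating
    at the representable range."""
    scale = 1 << frac_bits
    lo = -(1 << (int_bits + frac_bits - 1)) / scale
    hi = ((1 << (int_bits + frac_bits - 1)) - 1) / scale
    q = round(x * scale) / scale
    return min(max(q, lo), hi)
```

Sweeping `int_bits`/`frac_bits` while monitoring reconstruction error identifies the narrowest widths that preserve learning quality.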

We simulated the RTL model (coded in Verilog HDL) of the proposed architecture using ModelSim. The simulation conditions were set as follows. The size ratio of the visible and hidden layers of the RBMs is equal to 1:1. We confirmed the model's speed and scale characteristics by changing the data size (= the number of neurons in each layer). Four types of models were considered: the number of inputs and outputs of each RBM block was set to 4, 8, and 16, and finally a model of the full data size, not divided into blocks, was evaluated.

The proposed architecture was implemented on a commercial FPGA board (DE3 with an Altera Stratix III FPGA). The performance of the proposed architecture is compared with other FPGA implementations (

From these results, we conclude that the proposed architecture reduces circuit scale more effectively than simple parallelism. Furthermore, it achieves both speed and circuit scaling that grow linearly with the network size in terms of the number of neurons. Consequently, designers may select the trade-off between speed and circuit scale that suits their requirements.

| | This work (2016) | | Kim et al. (2009) [ | Ly & Chow (2010) [ | Kim et al. (2014) [ |
|---|---|---|---|---|---|
| Platform | EP3SL340 | | EP3SL340 | XC2VP70 | XC6VSX760 |
| # of chips | 1 | 1 | 1 | 4 | 1 |
| Network | 256 × 256 | 256 × 256 | 128 × 128 | 256 × 256 | 256 × 256 |
| Clock [MHz] | 50 | | 200 | 100 | 80 |
| GCUPS^{a} | 16.6 (4 × 4)^{b}, 66.4 (16 × 16)^{b} | N/A | 1.6 | 3.1 | 59.6 |

a. CUPS = the number of weight updates per unit time for a learning step. b. Network size of an RBM block (16 × 16 is assumed for estimation).

| | Intel Xeon 3.3 GHz (1 thread, -O3) | Our FPGA implementation (block I/O is 4 × 4) |
|---|---|---|
| Time [ms] | 627,500 | 4753 |
| Speed up | 1× | 134× |

In addition, the proposed architecture has been implemented on FPGA, showing a 134× speed-up factor with respect to a conventional CPU and much faster operation compared to existing FPGA developments.

Ueyoshi, K., Marukame, T., Asai, T., Motomura, M. and Schmid, A. (2016) FPGA Implementation of a Scalable and Highly Parallel Architecture for Restricted Boltzmann Machines. Circuits and Systems, 7, 2132-2141. doi: 10.4236/cs.2016.79185