ÂÅÐÍÓÒÜÑß ÍÀÓ×ÍÀß ÐÀÁÎÒÀ      
ó÷åáà
íàóêà  
ñîòðóäíèêè  
íîâîñòè  


Mechanisms of Parallel Computing Organization for NeuroCluster

Babenko L.K., Chefranov A.G., Fedorov P.A., Korobko A.Yu., Makarevich O.B.

347928, Russian Federation, Taganrog, Taganrog State University of Radio Engineering

chefranov@mopevm.tsure.ru, korobko@mailru.com, mak@tsure.ru

Abstract.Neurocluster based on NM6403 neuroprocessors architecture, system software and programming technology are discussed. Special attention was paid to operating system structure, data and control flow between subsystems, internal data structures, system topology, programming language and general parallel programming ideas.

Introduction

Neurocluster based on NM6403 is a part of a whole in general heterogeneous network cluster. System is developed for SPMD (Single Program Multiple Data) -tasks, but it is possible to execute MIMD (Multiple Instruction Multiple Data) - tasks. Such system has to support data distribution, branch synchronization and fast communications between subtasks [1]. This research is supported by Russian Foundation for Basic Research (project ¹ 00-07-90300).

NM6403 microprocessor was developed by RT Module [2] and current programming technology is uncomfortable for parallel programming. The main reason of this is using low-level language [3]. There is a high level programming language C++, but it does not allow using vector instructions. Programmer has to compose code for parallel processing, communication organization and other supporting code.

Parallel computing systems need high-speed mechanisms for transferring data and system messages between processes. Every processor has two such links with transfer rate of 20Mb/sec. Processors are connected by ring topology using two rings with opposite transferring directions. As an alternative topology we use star topology with communication device in the center. Communication device is based on TMS320C40 signal processors and has memory accessed from neuroprocessors connected to the device. Second communication mechanism is a transferring through PCI or Compact PCI buses, but it is not the fastest way because it depends on number of processors.

There are similar systems [4]. Philips produces Lneuro chip, which consists of 16 processor elements with 16 bit registers. Each processor element can work as 16 1bit, 8 2bit, 4 4bit, 2 8bit or 1 16bit processor element.

Hitachi developed Wafer Scale Integration. Every wafer contains neural network with 576 neurons. Neuron has 64 8bit weight.

NM6403 has weight matrix 32x64 bit. Weight bit length varies from 1 to 64. But this processor is not only neurochip, but DSP (Digital Signal Processor) too.

1 Operating System

Operating system has module architecture and can be logically separated into

  • Global task manager (GTM), which executed on Intel [5] 80x386 compatible microprocessor. There is only one copy of GTM in system. GTM is a central control subsystem, which monitor and communicate with other system modules.
  • Local task managers (LTM), which executed on each NM6403 microprocessor. It monitors tasks executed on local processor element and communicates with another LTMs and GTM.
  • Remote operating system console. It serves as a communication and control tool between user and GTM.
Global Task Manager

Loading of GTM is the initial procedure for starting of operating system. GTM provides following services: system modules initialization, control and communications with another modules, collision detection, modules unloading and system shutdown.

System modules initialization includes: GTM loading and system tables creation (resource table, processes table, messages queue, etc.), LTM loading and establishing connection with console.

Messages can be of the following types: informational, control and packet messages.

Local Task Manager

LTM provides services for user tasks (communicational, computational, control, etc.). Another function is the communication with GTM and LTM running on connected by links processors. LTM switches between running processes to provide multitasking environment in real time mode.

Dynamic Resource Management

Programmer separates code into branches during writing parallel programs. If program is SPMD task then total number of branches is defined. Programmer doesn’t know how many processors and other resources will be available during program execution. Operating system has to provide needed number of processors.

If system contains N processors, SPMD task needs M processors, N<M, and branches don’t produce communications to each other, operating system executes N branches and places (M-N) branches to execution queue. But if branches intensively interact then branches should be executed in parallel. In this case some branches will wait needed resources (e.g. synchronization or communication) and will be “slept” by operating system. “Slept” processes are placed to execution queue.

When process is started it placed in execution queue and its state is starting. If system has unused processor and in execution queue there is only one waiting process then it will be loaded to the free processor (GTM sends branch code and supported data to the LTM). In the case when execution queue contains more waiting processes than number of free processors, execution queue is analyzed. Every process has its priority value (user can set these values to processes). These values are used to choose the next process from queue for execution (see Table 1 for example).

Table 1.

Example of Execution Queue

Priority of waiting process

4

1

2

1

3

3

4

1

Execution sequence

2

8

5

7

4

3

1

6

Queue growing

ß

Processes are taken from the execution queue when computing resources are released or some process is slept.

2 Programming Technology

Programming Language

Programmer can divide code on branches, which will be executed on separated processors. Some branches can be executed in parallel. Others should wait for some event. It depends on logic of task. Following construction shows how branch can be declared.

branch proc_type branch_name ( parameter_list) { body },

branch is a keyword, which allow to generate special code for load, unload and communication. Proc_type sets type of microprocessor, which allow to execute branch; it can be nm for NM6403. Branch_Name is a unique identifier of branch.Parameter_list is a list of parameters, which passed to branch; body – is the set of operators and commands.

When OS loads branch, process is created, which consists of set of segments: code segment, data segments, stack segment etc. Parameters passed to branch and results of calculation are pushed to stack. When process is terminated, result is returned to main program.

Also parameters and results can be passed with help of MPI [6] (Message Passing Interface) or MPI-RT [7] (Message Passing Interface – Real Time).

Branch can be loaded by branchstart routine. Declaration of this function is following:

handle branchstart ( branch, num_param, param_list);

branch – identifier of branch , num_param – number of parameters, param_list – list of parameters. Function branchstart returns the process descriptor (ID).

Branchwait routine is used for barrier synchronization. Parameter of this function is a list of process descriptors terminated by 0 (null).

Function branchkill terminates execution of process. Process descriptor is passed as a parameter.

For running SPMD-task, which consists of num_branches branches with the same code and different data, next function is used:

handle SPMD_start (num_branches, branch_name, num_param, list_param [dist]);

branch_name – branch identifier, num_param – number of parameters, list_param – list of parameters, dist – kind of data distribution. Function returns descriptor of SPMD-task.

For barrier synchronization with termination of SPMD-task following routine is used

SPMD_wait (handle ID);

ID – descriptor of SPMD-task.

Types of Data Distribution

Branches of SPMD-task share data. That is why we need to set method of data distribution between branches. There are following types of data distribution:

  1. Cyclic type distributes data between branches one by one (see table 2).
  2. Block type distribute data between branches by dividing original data amount into equal blocks. Number of blocks is the same as number of branches.
  3. Block-cyclic type is similar to cyclic distribution but data are distributed not by one but by block.
Table 2.

Example of cyclic data distribution between 6 branches

Branch 1

Branch 2

Branch 3

Branch 4

Branch 5

Branch 6

1

2

3

4

5

6

7

8

9

10

11

12


Commutation

During the first stage of OS staring detecting of configuration is performed. OS loader sends test message to first processor, which was found. When processor receive message, it “remembers” sender and sends test message with help of own links to another processors. This procedure repeats while there is one (or more) processor, which has not received the message. Then message is returned back and every device adds own identifier and number of link to the end of message. Received sequence is analyzed to build topology of system.

In that case when there is no special commutation device, communication is performed by using host-computer. It means that message is sent to host-computer and then host-computer sends message to target processor. This method of communication is slow and can be used when communication is performed rarely.

If there is special commutation device in the system, then all messages are sent to that device and later to target processor or another commutation device.

When program is compiled compiler checks data requests, which are placed in RAM of another processor. In that case compiler adds code for accessing to that data. Host-computer (or commutation device if it present) has table of data location, which are used during search of data.

Conclusion

Main operating system modules were coded and tested. These modules are Global Task Manager, Local Task Manager and system console. Technology of programming is ready.

Next part of work will include development of communication libraries, system drivers, C and Fortran [8] compilers, system tools and such applications as image processing, image recognition and maps processing (e.g. fingerprint recognition).

References

  1. Korobko A.Yu., Chefranov A.G.. Operation System for Computing System Based on NM6403 Neuroprocessors. In proceedings of conference “New Information technologies”, 2000, Taganrog, p. 130-132 (in Russian).
  2. RC “Module” web-site, NM6403 Documentation, www.jmichaelrei.com.
  3. Neuroprocessor NM6403. Base Software. Assembler Language Description. UFKV.30002-01 35 01, RT Module, 1997, 83 p.
  4. Korneev V.V. Parallel Computing Systems. Moscow, Knowledge, 1999, 320 p. (in Russian).
  5. Intel Corporation web-site, www.intel.com.
  6. “MPI-2: Extensions to the Message-Passing Interface”, Message Passing Interface Forum, July 18, 1997.
  7. “Document for the Real-Time Message Passing Interface (MPI/RT-1.0)”, Real-Time Message Passing Interface (MPI/RT) Forum, March 6, 2000.
  8. H. Zima, P. Chapman, H. Moritsch, and P. Mehrotra. Dynamic Data Distributions in Vienna Fortran. In proceedings of Supercomputing `93, Portland, OR, November 1993.

© N. Semenikhina, ÒÐÒÓ, Êàôåäðà ÁÈÒ, 2000

Ïîñëåäíåå îáíîâëåíèå :