Mechanisms of Parallel Computing Organization for NeuroCluster
Babenko L.K., Chefranov A.G., Fedorov P.A., Korobko A.Yu., Makarevich
O.B.
347928, Russian Federation, Taganrog, Taganrog State University
of Radio Engineering
chefranov@mopevm.tsure.ru,
korobko@mailru.com, mak@tsure.ru
Abstract.Neurocluster based on NM6403 neuroprocessors architecture,
system software and programming technology are discussed. Special attention
was paid to operating system structure, data and control flow between subsystems,
internal data structures, system topology, programming language and general
parallel programming ideas.
Introduction
Neurocluster based on NM6403 is a part of a whole in general heterogeneous
network cluster. System is developed for SPMD (Single Program Multiple Data)
-tasks, but it is possible to execute MIMD (Multiple Instruction Multiple Data)
- tasks. Such system has to support data distribution, branch synchronization
and fast communications between subtasks [1]. This research is supported by
Russian Foundation for Basic Research (project ¹ 00-07-90300).
NM6403 microprocessor was developed by RT Module [2] and current programming
technology is uncomfortable for parallel programming. The main reason of this
is using low-level language [3]. There is a high level programming language
C++, but it does not allow using vector instructions. Programmer has to compose
code for parallel processing, communication organization and other supporting
code.
Parallel computing systems need high-speed mechanisms for transferring data
and system messages between processes. Every processor has two such links with
transfer rate of 20Mb/sec. Processors are connected by ring topology using two
rings with opposite transferring directions. As an alternative topology we use
star topology with communication device in the center. Communication device
is based on TMS320C40 signal processors and has memory accessed from neuroprocessors
connected to the device. Second communication mechanism is a transferring through
PCI or Compact PCI buses, but it is not the fastest way because it depends on
number of processors.
There are similar systems [4]. Philips produces Lneuro chip, which consists
of 16 processor elements with 16 bit registers. Each processor element can work
as 16 1bit, 8 2bit, 4 4bit, 2 8bit or 1 16bit processor element.
Hitachi developed Wafer Scale Integration. Every wafer contains neural network
with 576 neurons. Neuron has 64 8bit weight.
NM6403 has weight matrix 32x64 bit. Weight bit length varies from 1 to 64.
But this processor is not only neurochip, but DSP (Digital Signal Processor)
too.
1 Operating System
Operating system has module architecture and can be logically separated into
- Global task manager (GTM), which executed on Intel [5] 80x386 compatible
microprocessor. There is only one copy of GTM in system. GTM is a central
control subsystem, which monitor and communicate with other system modules.
- Local task managers (LTM), which executed on each NM6403 microprocessor.
It monitors tasks executed on local processor element and communicates with
another LTMs and GTM.
- Remote operating system console. It serves as a communication and control
tool between user and GTM.
Global Task Manager
Loading of GTM is the initial procedure for starting of operating system.
GTM provides following services: system modules initialization, control and
communications with another modules, collision detection, modules unloading
and system shutdown.
System modules initialization includes: GTM loading and system tables creation
(resource table, processes table, messages queue, etc.), LTM loading and establishing
connection with console.
Messages can be of the following types: informational, control and packet
messages.
Local Task Manager
LTM provides services for user tasks (communicational, computational, control,
etc.). Another function is the communication with GTM and LTM running on connected
by links processors. LTM switches between running processes to provide multitasking
environment in real time mode.
Dynamic Resource Management
Programmer separates code into branches during writing parallel programs.
If program is SPMD task then total number of branches is defined. Programmer
doesn’t know how many processors and other resources will be available during
program execution. Operating system has to provide needed number of processors.
If system contains N processors, SPMD task needs M processors, N<M, and
branches don’t produce communications to each other, operating system executes
N branches and places (M-N) branches to execution queue. But if branches intensively
interact then branches should be executed in parallel. In this case some branches
will wait needed resources (e.g. synchronization or communication) and will
be “slept” by operating system. “Slept” processes are placed to execution queue.
When process is started it placed in execution queue and its state is starting.
If system has unused processor and in execution queue there is only one waiting
process then it will be loaded to the free processor (GTM sends branch code
and supported data to the LTM). In the case when execution queue contains more
waiting processes than number of free processors, execution queue is analyzed.
Every process has its priority value (user can set these values to processes).
These values are used to choose the next process from queue for execution (see
Table 1 for example).
Table 1.
Example of Execution Queue
|
Priority of waiting process
|
4
|
1
|
2
|
1
|
3
|
3
|
4
|
1
|
|
Execution sequence
|
2
|
8
|
5
|
7
|
4
|
3
|
1
|
6
|
|
Queue growing
|
ß
|
Processes are taken from the execution queue when computing resources are
released or some process is slept.
2 Programming Technology
Programming Language
Programmer can divide code on branches, which will be executed on separated processors. Some branches can be executed in parallel. Others should wait for some event. It depends on logic of task. Following construction shows how branch can be declared.
branch proc_type branch_name ( parameter_list) { body },
branch is a keyword, which allow to generate special code for load, unload
and communication. Proc_type sets type of microprocessor, which allow to execute
branch; it can be nm for NM6403. Branch_Name is a unique identifier of branch.Parameter_list
is a list of parameters, which passed to branch; body – is the set of operators
and commands.
When OS loads branch, process is created, which consists of set of segments: code segment, data segments, stack segment etc. Parameters passed to branch and results of calculation are pushed to stack. When process is terminated, result is returned to main program.
Also parameters and results can be passed with help of MPI [6] (Message Passing Interface) or MPI-RT [7] (Message Passing Interface – Real Time).
Branch can be loaded by branchstart routine. Declaration of this function is following:
handle branchstart ( branch, num_param, param_list);
branch – identifier of branch , num_param – number of parameters, param_list – list of parameters. Function branchstart returns the process descriptor (ID).
Branchwait routine is used for barrier synchronization. Parameter of this function is a list of process descriptors terminated by 0 (null).
Function branchkill terminates execution of process. Process descriptor is
passed as a parameter.
For running SPMD-task, which consists of num_branches branches with the same code and different data, next function is used:
handle SPMD_start (num_branches, branch_name, num_param, list_param [dist]);
branch_name – branch identifier, num_param – number of parameters, list_param – list of parameters, dist – kind of data distribution. Function returns descriptor of SPMD-task.
For barrier synchronization with termination of SPMD-task following routine is used
SPMD_wait (handle ID);
ID – descriptor of SPMD-task.
Types of Data Distribution
Branches of SPMD-task share data. That is why we need to set method of data
distribution between branches. There are following types of data distribution:
- Cyclic type distributes data between branches one by one (see table 2).
- Block type distribute data between branches by dividing original data amount
into equal blocks. Number of blocks is the same as number of branches.
- Block-cyclic type is similar to cyclic distribution but data are distributed
not by one but by block.
Table 2.
Example of cyclic data distribution between 6 branches
|
Branch 1
|
Branch 2
|
Branch 3
|
Branch 4
|
Branch 5
|
Branch 6
|
|
1
|
2
|
3
|
4
|
5
|
6
|
|
7
|
8
|
9
|
10
|
11
|
12
|
|
…
|
…
|
…
|
…
|
…
|
…
|
Commutation
During the first stage of OS staring detecting of configuration is performed.
OS loader sends test message to first processor, which was found. When processor
receive message, it “remembers” sender and sends test message with help of own
links to another processors. This procedure repeats while there is one (or more)
processor, which has not received the message. Then message is returned back
and every device adds own identifier and number of link to the end of message.
Received sequence is analyzed to build topology of system.
In that case when there is no special commutation device, communication is
performed by using host-computer. It means that message is sent to host-computer
and then host-computer sends message to target processor. This method of communication
is slow and can be used when communication is performed rarely.
If there is special commutation device in the system, then all messages are
sent to that device and later to target processor or another commutation device.
When program is compiled compiler checks data requests, which are placed in
RAM of another processor. In that case compiler adds code for accessing to that
data. Host-computer (or commutation device if it present) has table of data
location, which are used during search of data.
Conclusion
Main operating system modules were coded and tested. These modules are Global
Task Manager, Local Task Manager and system console. Technology of programming
is ready.
Next part of work will include development of communication libraries, system
drivers, C and Fortran [8] compilers, system tools and such applications as
image processing, image recognition and maps processing (e.g. fingerprint recognition).
References
- Korobko A.Yu., Chefranov A.G.. Operation System for Computing System Based on NM6403 Neuroprocessors. In proceedings of conference “New Information technologies”, 2000, Taganrog, p. 130-132 (in Russian).
- RC “Module” web-site, NM6403 Documentation, www.jmichaelrei.com.
- Neuroprocessor NM6403. Base Software. Assembler Language Description. UFKV.30002-01 35 01, RT Module, 1997, 83 p.
- Korneev V.V. Parallel Computing Systems. Moscow, Knowledge, 1999, 320 p. (in Russian).
- Intel Corporation web-site, www.intel.com.
- “MPI-2: Extensions to the Message-Passing Interface”, Message Passing Interface Forum, July 18, 1997.
- “Document for the Real-Time Message Passing Interface (MPI/RT-1.0)”, Real-Time Message Passing Interface (MPI/RT) Forum, March 6, 2000.
- H. Zima, P. Chapman, H. Moritsch, and P. Mehrotra. Dynamic Data Distributions in Vienna Fortran. In proceedings of Supercomputing `93, Portland, OR, November 1993.
|