RDMA Read and Write with IB Verbs*

Tarick Bedeir
Schlumberger
tbedeir@slb.com

October 22, 2010

* Adapted from a blog post at http://thegeekinthecorner.wordpress.com/

Abstract

This paper explains the operations required to use the remote direct memory access (RDMA) read and write features exposed by the InfiniBand verbs library. Sample code illustrates connection setup and data transfer using read and write (or get/put) semantics.

1 Introduction

In my last paper [1], I described basic verbs applications that exchange data by posting sends and receives. In this paper I'll describe the construction of applications that use remote direct memory access, or RDMA [5]. Why would we want to use RDMA? Because it can provide lower latency and allow for zero-copy transfers, i.e., it places data at the desired target location without intermediate buffering. Consider the iSCSI Extensions for RDMA, iSER [4]. The initiator, or client, issues a read request that includes a destination address in its local memory. The target, or server, responds by writing the desired data directly into the initiator's memory at the requested location. No buffering, minimal operating system involvement (since data is copied by the network adapters), and low latency: generally a winning formula.

Using RDMA with verbs is fairly straightforward: first register blocks of memory, then exchange memory descriptors, then post read/write operations. Registration is accomplished with a call to ibv_reg_mr(), which pins the block of memory in place (preventing it from being swapped out) and returns a struct ibv_mr * containing a uint32_t key that allows remote access to the registered memory. This key, along with the block's address, must then be exchanged with peers through some out-of-band mechanism. Peers can then use the key and address in calls to ibv_post_send() to post RDMA read and write requests. Some code might be instructive:

/* PEER 1 */

const size_t SIZE = 1024;

char *buffer = malloc(SIZE);
struct ibv_mr *mr;
uint32_t my_key;
uint64_t my_addr;

mr = ibv_reg_mr(
  pd,
  buffer,
  SIZE,
  IBV_ACCESS_REMOTE_WRITE);

my_key = mr->rkey;
my_addr = (uint64_t)mr->addr;

/* exchange my_key and my_addr with peer 2 */

/* PEER 2 */

const size_t SIZE = 1024;

char *buffer = malloc(SIZE);
struct ibv_mr *mr;
struct ibv_sge sge;
struct ibv_send_wr wr, *bad_wr;
uint32_t peer_key;
uint64_t peer_addr;

mr = ibv_reg_mr(
  pd,
  buffer,
  SIZE,
  IBV_ACCESS_LOCAL_WRITE);

/* get peer_key and peer_addr from peer 1 */

strcpy(buffer, "Hello!");

memset(&wr, 0, sizeof(wr));

sge.addr = (uint64_t)buffer;
sge.length = SIZE;
sge.lkey = mr->lkey;

wr.sg_list = &sge;
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.wr.rdma.remote_addr = peer_addr;
wr.wr.rdma.rkey = peer_key;

ibv_post_send(qp, &wr, &bad_wr);

The last parameter to ibv_reg_mr() for peer 1, IBV_ACCESS_REMOTE_WRITE, specifies that we want peer 2 to have write access to the block of memory located at buffer.
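The snippet above posts the write but never confirms that it completed. As a minimal sketch (not part of the original example), peer 2 could poll the queue pair's completion queue, assuming the write was posted as a signaled work request (wr.send_flags = IBV_SEND_SIGNALED, or a queue pair created with sq_sig_all set) and that the send CQ is available as cq:

/* Sketch: wait for the RDMA write posted above to complete.
   Assumes 'cq' is the CQ attached to the queue pair's send queue and
   that the write was posted as a signaled work request. */
struct ibv_wc wc;
int n;

do {
  n = ibv_poll_cq(cq, 1, &wc);   /* returns the number of completions retrieved */
} while (n == 0);                /* busy-poll until one arrives */

if (n < 0)
  fprintf(stderr, "ibv_poll_cq() failed\n");
else if (wc.status != IBV_WC_SUCCESS)
  fprintf(stderr, "RDMA write failed: %s\n", ibv_wc_status_str(wc.status));
else
  printf("RDMA write completed.\n");

In practice one would usually block on a completion channel rather than spin; the CQ polling thread sketched in Section 2 takes that approach.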
Using this in practice is more complicated. The sample code that accompanies this paper connects two hosts, exchanges memory region keys, reads from or writes to remote memory, then disconnects. The sequence is as follows:

1. Initialize context and register memory regions.
2. Establish connection.
3. Use the send/receive model described in my previous paper to exchange memory region keys between peers.
4. Post read/write operations.
5. Disconnect.

Each side of the connection has two threads: the main thread, which processes connection events, and the thread polling the completion queue. To avoid deadlocks and race conditions, we arrange our operations so that only one thread at a time posts work requests. To elaborate on the sequence above, after establishing the connection the client will:

1. Send its RDMA memory region key in a MSG_MR message.
2. Wait for the server's MSG_MR message containing its RDMA key.
3. Post an RDMA operation.
4. Signal to the server that it is ready to disconnect by sending a MSG_DONE message.
5. Wait for a MSG_DONE message from the server.
6. Disconnect.

Step one happens in the context of the RDMA connection event handler thread, but steps two through six are in the context of the verbs CQ polling thread. The sequence of operations for the server is similar:

1. Wait for the client's MSG_MR message with its RDMA key.
2. Send its RDMA key in a MSG_MR message.
3. Post an RDMA operation.
4. Signal to the client that it is ready to disconnect by sending a MSG_DONE message.
5. Wait for a MSG_DONE message from the client.
6. Disconnect.

Here all six steps happen in the context of the verbs CQ polling thread. Waiting for MSG_DONE is necessary; otherwise we might close the connection before the peer's RDMA operation has completed. Note that we do not have to wait for the RDMA operation itself to complete before sending MSG_DONE: the InfiniBand specification requires that requests be initiated in the order in which they are posted, so the peer will not receive MSG_DONE until the RDMA operation has completed.

2 Read/Write Demonstrations

The code for this sample [2] merges much of the client and server code from the previous paper for the sake of brevity (and to illustrate that they are nearly identical). Both the client (rdma-client) and the server (rdma-server) continue to run separate RDMA connection manager event loops, but they now share common verbs code: polling the CQ, sending messages, posting RDMA operations, and so on. We also use the same code for RDMA read and write operations, since they are very similar. rdma-server and rdma-client take either "read" or "write" as their first command-line argument.

Let's start from the top of rdma-common.c, which contains verbs code common to both the client and the server. We first define our message structure. We'll use it to pass RDMA memory region (MR) keys between nodes and to signal that we're done:

struct message {
  enum {
    MSG_MR,
    MSG_DONE
  } type;

  union {
    struct ibv_mr mr;
  } data;
};

Our connection structure has been expanded to include memory regions for RDMA operations as well as the peer's MR structure and two state variables:

struct connection {
  struct rdma_cm_id *id;
  struct ibv_qp *qp;

  int connected;

  struct ibv_mr *recv_mr;
  struct ibv_mr *send_mr;
  struct ibv_mr *rdma_local_mr;
  struct ibv_mr *rdma_remote_mr;

  struct ibv_mr peer_mr;

  struct message *recv_msg;
  struct message *send_msg;

  char *rdma_local_region;
  char *rdma_remote_region;

  enum {
    SS_INIT,
    SS_MR_SENT,
    SS_RDMA_SENT,
    SS_DONE_SENT
  } send_state;

  enum {
    RS_INIT,
    RS_MR_RECV,
    RS_DONE_RECV
  } recv_state;
};

send_state and recv_state are used by the completion handler to properly sequence messages and RDMA operations between peers.
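build_connection(), shown next, relies on two helpers from the sample that this paper does not reproduce: register_memory(), which allocates and registers the structure's message buffers and RDMA regions, and post_receives(), which arms the receive queue with a work request for recv_msg. What follows is only a sketch of what they might look like; it assumes the global struct context *s_ctx (holding the protection domain), the mode flag s_mode, the constant RDMA_BUFFER_SIZE, and the TEST_NZ() macro used elsewhere in the sample, and it omits error checking on the allocations and registrations:

void register_memory(struct connection *conn)
{
  /* Message buffers used by the send/receive path. */
  conn->send_msg = malloc(sizeof(struct message));
  conn->recv_msg = malloc(sizeof(struct message));

  /* Buffers used as the source/target of RDMA operations. */
  conn->rdma_local_region = malloc(RDMA_BUFFER_SIZE);
  conn->rdma_remote_region = malloc(RDMA_BUFFER_SIZE);

  /* Outgoing messages only need local read access (no flags). */
  conn->send_mr = ibv_reg_mr(s_ctx->pd, conn->send_msg,
                             sizeof(struct message), 0);

  /* Incoming messages are written locally by the adapter. */
  conn->recv_mr = ibv_reg_mr(s_ctx->pd, conn->recv_msg,
                             sizeof(struct message), IBV_ACCESS_LOCAL_WRITE);

  /* Local region: written by the adapter only when we issue RDMA reads. */
  conn->rdma_local_mr = ibv_reg_mr(s_ctx->pd, conn->rdma_local_region,
                                   RDMA_BUFFER_SIZE,
                                   (s_mode == M_WRITE) ? 0 : IBV_ACCESS_LOCAL_WRITE);

  /* Remote region: the peer must be granted remote write or read access. */
  conn->rdma_remote_mr = ibv_reg_mr(s_ctx->pd, conn->rdma_remote_region,
                                    RDMA_BUFFER_SIZE,
                                    (s_mode == M_WRITE)
                                      ? (IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE)
                                      : IBV_ACCESS_REMOTE_READ);
}

void post_receives(struct connection *conn)
{
  struct ibv_recv_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)conn;
  wr.sg_list = &sge;
  wr.num_sge = 1;

  sge.addr = (uintptr_t)conn->recv_msg;
  sge.length = sizeof(struct message);
  sge.lkey = conn->recv_mr->lkey;

  TEST_NZ(ibv_post_recv(conn->qp, &wr, &bad_wr));
}

The access flags on rdma_remote_mr are what allow the peer's RDMA write (or read) in the demonstrations that follow.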
This structure is initialized by build_connection():

void build_connection(struct rdma_cm_id *id)
{
  struct connection *conn;
  struct ibv_qp_init_attr qp_attr;

  build_context(id->verbs);
  build_qp_attr(&qp_attr);

  TEST_NZ(rdma_create_qp(id, s_ctx->pd, &qp_attr));

  id->context = conn = (struct connection *)malloc(sizeof(struct connection));

  conn->id = id;
  conn->qp = id->qp;

  conn->send_state = SS_INIT;
  conn->recv_state = RS_INIT;

  conn->connected = 0;

  register_memory(conn);
  post_receives(conn);
}

Since we're using RDMA read operations, we have to set initiator_depth and responder_resources in struct rdma_conn_param. These control [3] the number of simultaneous outstanding RDMA read requests:

void build_params(struct rdma_conn_param *params)
{
  memset(params, 0, sizeof(*params));

  params->initiator_depth = params->responder_resources = 1;
  params->rnr_retry_count = 7; /* infinite retry */
}

Setting rnr_retry_count to 7 indicates that we want the adapter to resend indefinitely if the peer responds with a receiver-not-ready (RNR) error. RNRs happen when a send request is posted before a corresponding receive request is posted on the peer. Sends are posted with the send_message() function:

void send_message(struct connection *conn)
{
  struct ibv_send_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)conn;
  wr.opcode = IBV_WR_SEND;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.send_flags = IBV_SEND_SIGNALED;

  sge.addr = (uintptr_t)conn->send_msg;
  sge.length = sizeof(struct message);
  sge.lkey = conn->send_mr->lkey;

  while (!conn->connected);

  TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));
}

send_mr() wraps this function and is used by rdma-client to send its MR to the server, which then prompts the server to send its MR in response, thereby kicking off the RDMA operations:

void send_mr(void *context)
{
  struct connection *conn = (struct connection *)context;

  conn->send_msg->type = MSG_MR;
  memcpy(&conn->send_msg->data.mr, conn->rdma_remote_mr, sizeof(struct ibv_mr));

  send_message(conn);
}
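These sends, like every other work request in the sample, complete asynchronously: their completions are consumed by the CQ polling thread mentioned in Section 1, which dispatches each work completion to the completion handler described next. That thread is not reproduced in this paper; a minimal sketch of what it might look like, assuming the completion channel created in build_context() is stored in s_ctx->comp_channel (an assumption about the sample's context structure):

/* Sketch of the CQ polling thread. s_ctx->comp_channel is assumed to be the
   ibv_comp_channel created in build_context(); on_completion() is the
   handler shown below. */
void * poll_cq(void *unused)
{
  struct ibv_cq *cq;
  struct ibv_wc wc;
  void *cq_context;

  while (1) {
    /* Block until the channel reports that the CQ has completions. */
    TEST_NZ(ibv_get_cq_event(s_ctx->comp_channel, &cq, &cq_context));
    ibv_ack_cq_events(cq, 1);

    /* Re-arm notification before draining so no completion is missed. */
    TEST_NZ(ibv_req_notify_cq(cq, 0));

    /* Drain the CQ, handing each work completion to the handler. */
    while (ibv_poll_cq(cq, 1, &wc))
      on_completion(&wc);
  }

  return NULL;
}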
The completion handler does the bulk of the work. It maintains send_state and recv_state, replying to messages and posting RDMA operations as appropriate:

void on_completion(struct ibv_wc *wc)
{
  struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;

  if (wc->status != IBV_WC_SUCCESS)
    die("on_completion: status is not IBV_WC_SUCCESS.");

  if (wc->opcode & IBV_WC_RECV) {
    conn->recv_state++;

    if (conn->recv_msg->type == MSG_MR) {
      memcpy(&conn->peer_mr, &conn->recv_msg->data.mr, sizeof(conn->peer_mr));
      post_receives(conn); /* only rearm for MSG_MR */

      if (conn->send_state == SS_INIT) /* received peer's MR before sending ours, so send ours back */
        send_mr(conn);
    }
  } else {
    conn->send_state++;
    printf("send completed successfully.\n");
  }

  if (conn->send_state == SS_MR_SENT && conn->recv_state == RS_MR_RECV) {
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_sge sge;

    if (s_mode == M_WRITE)
      printf("received MSG_MR. writing message to remote memory...\n");
    else
      printf("received MSG_MR. reading message from remote memory...\n");

    memset(&wr, 0, sizeof(wr));

    wr.wr_id = (uintptr_t)conn;
    wr.opcode = (s_mode == M_WRITE) ? IBV_WR_RDMA_WRITE : IBV_WR_RDMA_READ;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = (uintptr_t)conn->peer_mr.addr;
    wr.wr.rdma.rkey = conn->peer_mr.rkey;

    sge.addr = (uintptr_t)conn->rdma_local_region;
    sge.length = RDMA_BUFFER_SIZE;
    sge.lkey = conn->rdma_local_mr->lkey;

    TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));

    conn->send_msg->type = MSG_DONE;
    send_message(conn);

  } else if (conn->send_state == SS_DONE_SENT && conn->recv_state == RS_DONE_RECV) {
    printf("remote buffer: %s\n", get_peer_message_region(conn));
    rdma_disconnect(conn->id);
  }
}

Let's examine on_completion() in parts. First, the state update:

if (wc->opcode & IBV_WC_RECV) {
  conn->recv_state++;

  if (conn->recv_msg->type == MSG_MR) {
    memcpy(&conn->peer_mr, &conn->recv_msg->data.mr, sizeof(conn->peer_mr));
    post_receives(conn); /* only rearm for MSG_MR */

    if (conn->send_state == SS_INIT) /* received peer's MR before sending ours, so send ours back */
      send_mr(conn);
  }
} else {
  conn->send_state++;
  printf("send completed successfully.\n");
}

If the completed operation is a receive (i.e., if wc->opcode has the IBV_WC_RECV bit set), then recv_state is incremented. If the received message is MSG_MR, we copy the received MR into our connection structure's peer_mr member and rearm the receive slot. This is necessary to ensure that we receive the MSG_DONE message that follows the completion of the peer's RDMA operation. If we have received the peer's MR but have not yet sent ours (as is the case for the server), we send our MR back by calling send_mr(). Updating send_state is uncomplicated.

Next we check for two particular combinations of send_state and recv_state:

if (conn->send_state == SS_MR_SENT && conn->recv_state == RS_MR_RECV) {
  struct ibv_send_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  if (s_mode == M_WRITE)
    printf("received MSG_MR. writing message to remote memory...\n");
  else
    printf("received MSG_MR. reading message from remote memory...\n");

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)conn;
  wr.opcode = (s_mode == M_WRITE) ? IBV_WR_RDMA_WRITE : IBV_WR_RDMA_READ;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.wr.rdma.remote_addr = (uintptr_t)conn->peer_mr.addr;
  wr.wr.rdma.rkey = conn->peer_mr.rkey;

  sge.addr = (uintptr_t)conn->rdma_local_region;
  sge.length = RDMA_BUFFER_SIZE;
  sge.lkey = conn->rdma_local_mr->lkey;

  TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));

  conn->send_msg->type = MSG_DONE;
  send_message(conn);

} else if (conn->send_state == SS_DONE_SENT && conn->recv_state == RS_DONE_RECV) {
  printf("remote buffer: %s\n", get_peer_message_region(conn));
  rdma_disconnect(conn->id);
}

The first of these combinations occurs when we have both sent our MR and received the peer's MR, indicating that we are ready to post an RDMA operation followed by MSG_DONE. Posting an RDMA operation means building an RDMA work request. This is similar to a send work request, except that we specify an RDMA opcode and pass the peer's RDMA address and key:

wr.opcode = (s_mode == M_WRITE) ? IBV_WR_RDMA_WRITE : IBV_WR_RDMA_READ;

wr.wr.rdma.remote_addr = (uintptr_t)conn->peer_mr.addr;
wr.wr.rdma.rkey = conn->peer_mr.rkey;

Note that we're not required to use conn->peer_mr.addr for remote_addr: we could, if we wanted to, use any address falling within the bounds of the memory region registered with ibv_reg_mr().
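For instance, a variant of the work request above could target an offset inside the peer's region; the offset variable below is purely illustrative and not part of the sample, and the transfer must still fit within the registered region:

/* Hypothetical variant: write (or read) at an offset into the peer's region.
   'offset' is illustrative only; it must satisfy
   offset + sge.length <= conn->peer_mr.length. */
size_t offset = 64;

wr.wr.rdma.remote_addr = (uintptr_t)conn->peer_mr.addr + offset;
wr.wr.rdma.rkey = conn->peer_mr.rkey;

sge.length = RDMA_BUFFER_SIZE - offset;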
The second combination of states is SS_DONE_SENT and RS_DONE_RECV, indicating that we have sent MSG_DONE and received MSG_DONE from the peer. This means it is safe to print the message buffer and disconnect:

printf("remote buffer: %s\n", get_peer_message_region(conn));
rdma_disconnect(conn->id);

3 Conclusion

If everything's working properly, you should see the following when using RDMA writes:

$ ./rdma-server write
listening on port 47881.
received connection request.
send completed successfully.
received MSG_MR. writing message to remote memory...
send completed successfully.
send completed successfully.
remote buffer: message from active/client side with pid 20692
peer disconnected.

$ ./rdma-client write 192.168.0.1 47881
address resolved.
route resolved.
send completed successfully.
received MSG_MR. writing message to remote memory...
send completed successfully.
send completed successfully.
remote buffer: message from passive/server side with pid 26515
disconnected.

And when using RDMA reads:

$ ./rdma-server read
listening on port 47882.
received connection request.
send completed successfully.
received MSG_MR. reading message from remote memory...
send completed successfully.
send completed successfully.
remote buffer: message from active/client side with pid 20916
peer disconnected.

$ ./rdma-client read 192.168.0.1 47882
address resolved.
route resolved.
send completed successfully.
received MSG_MR. reading message from remote memory...
send completed successfully.
send completed successfully.
remote buffer: message from passive/server side with pid 26725
disconnected.
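The "peer disconnected." and "disconnected." lines above are printed by each side's RDMA connection manager event loop, which runs on the main thread and is not reproduced in this paper. A minimal sketch of how such a loop might react to the disconnect triggered by rdma_disconnect(); the event channel ec and the cleanup helper on_disconnect() are assumptions, not identifiers taken from the sample:

/* Sketch: main-thread CM event loop handling the disconnect.
   'ec' (struct rdma_event_channel *) and on_disconnect() are assumed names. */
struct rdma_cm_event *event = NULL;

while (rdma_get_cm_event(ec, &event) == 0) {
  struct rdma_cm_event event_copy = *event;

  rdma_ack_cm_event(event);   /* acknowledge before acting on the copy */

  if (event_copy.event == RDMA_CM_EVENT_DISCONNECTED) {
    printf("peer disconnected.\n");

    on_disconnect(event_copy.id);   /* deregister MRs, free buffers */
    rdma_destroy_qp(event_copy.id);
    rdma_destroy_id(event_copy.id);
    break;                          /* one connection per run */
  }
}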
References

[1] T. Bedeir. Building an RDMA-Capable Application with IB Verbs. Technical report, HPC Advisory Council, 2010. Available from: http://www.hpcadvisorycouncil.com/pdf/building-an-rdma-capable-application-with-ib-verbs.pdf.

[2] T. Bedeir. RDMA Read/Write Sample Code [online]. 2010. Available from: https://sites.google.com/a/bedeir.com/home/rdma-read-write.tar.gz?attredirects=0&d=1.

[3] rdma_accept(3) - Linux man page [online]. Available from: http://linux.die.net/man/3/rdma_accept.

[4] Wikipedia. iSCSI Extensions for RDMA [online]. 2010. Available from: http://en.wikipedia.org/wiki/ISCSI_Extensions_for_RDMA.

[5] Wikipedia. Remote direct memory access [online]. 2010. Available from: http://en.wikipedia.org/wiki/RDMA.
