徐跃东 - 师资队伍 - 复旦大学信息科学与工程学院.pdf

1 Blockchain Mining with Multiple Selfish Miners arXiv:2112.10454v3 [cs.CR] 1 Apr 2023 Qianlan Bai, Yuedong Xu, Nianyi Liu, Xin Wang Abstract—This paper studies a fundamental problem regarding the security of blockchain PoW consensus on how the existence of multiple misbehaving miners influences the profitability of selfish mining. Each selfish miner maintains a private chain and makes it public opportunistically for acquiring more rewards incommensurate to his Hash power. We first establish a general Markov chain model to characterize the state transition of public and private chains for Basic Selfish Mining (BSM), and derive the stationary profitable threshold of Hash power in closed form. It reduces from 25% for a single attacker to below 21.48% for two symmetric attackers theoretically, and further reduces to around 10% with eight symmetric attackers experimentally. We next explore the profitable threshold when one of the attackers performs strategic mining based on Partially Observable Markov Decision Process (POMDP) that only half of the attributes pertinent to a mining state are observable to him. An online algorithm is presented to compute the nearly optimal policy efficiently despite the large state space and high dimensional belief space. The profitable threshold is much lower for the strategic attacker. Last, we formulate a simple model of absolute mining revenue that yields an interesting observation: selfish mining is never profitable at the first difficulty adjustment period, but relies on the reimbursement of stationary selfish mining gains in future periods. The delay till being profitable of an attacker increases with the decrease of his Hash power, making blockchain miners more cautious about performing selfish mining. Index Terms—Proof-of-Work, Selfish Mining, Profitability, Markov Chain, Partially Observable MDP. I. I NTRODUCTION Bitcoin has gained tremendous concerns as the first fully decentralized cryptocurrency since its advent in 2008. All historical transactions are recorded in a global and public data structure known as blockchain. The security of blockchain relies on consensus algorithms to come to an agreement among a large body of pseudonymous participants called miners [1]. Solving a Hash puzzle is deemed as a way to generate Proof-of-Work (PoW) for reaching global consensus. Each miner competes in this “game”, and is rewarded with cryptocurrencies if he finds a valid block first. The PoW consensus has been widely deployed in public blockchains, serving as the cornerstone of current major cryptocurrencies. The security of PoW is challenged by the trend of centralization of Hash power. Mining a Bitcoin block is random and it needs more than 10 years on average with a latest-generation ASIC chip [2]. Therefore, miners operate cooperatively to Q. Bai and X. Wang are with the School of Computer Science, Fudan University, Shanghai 200438, China. Y. Xu and N. Liu are with the School of Information Science and Technology, Fudan University, Shanghai 200438, China. This work was supported in part by the Natural Science Foundation of China under Grant Grant 62072117, in part by the Shanghai Natural Science Foundation under the Grant 22ZR1407000, in part by Shanghai-Hong Kong Collaborative Project under Grant 18510760900 and and in part by Zhuhai Research Institute of Fudan University. Yuedong Xu is the corresponding author. form pools to acquire a stable income rate. As a side effect, a small number of mining pools occupy a vast majority of global Hash power, placing blockchain systems at risk of being overthrown. The conventional wisdom believes that PoW is secure as long as no mining pool controls 51% of total Hash power. However, a miner can choose to mine selfishly instead of conforming to the standard Bitcoin protocol. Selfish mining refers to a class of block publishing policies in which a miner does not release his newly found block immediately, but forks a private chain of which others are unaware. At a future epoch, he will release his private blocks strategically to obsolete the current public chain for the purpose of obtaining a higher share of valid blocks in the new public chain than his ratio of Hash power. The minimum ratio of Hash power that brings extra rewards is conventionally called profitable threshold. Eyal and Sirer introduced the first selfish mining scheme (namely basic selfish mining, BSM) and pointed out that the profitable threshold of BSM is 25% of total Hash power [3]. Nayak et al. [4] proposed the stubborn mining that improves the revenue of the selfish miner by 13.94% compared to BSM. Sapirshtein et al. modeled the optimal selfish mining as a Markov Decision Process (MDP) that reduces the selfish miner’s profitable threshold to 23.21% [5]. Tao et al. [6] introduced the semi-selfish mining attack based on hidden MDP with the control of fork rate. Grunspan and Perez-Marco proved rigorously using martingale theory that selfish mining is an attack on the difficulty adjustment algorithm of blockchain consensus [7]. Recently, a lot of efforts have been devoted to the compound attacks of selfish mining with block withholding attacks [8] [9], bribery attacks [10], eclipse attacks and double spending attacks [11]. Until very recently, the competition of multiple noncolluding selfish miners came into view. Liu et al. presented the publish-n scheme for two selfish mining attackers and simulated their revenues as well as profitable thresholds [12]. Bai et al. modeled the mining race among two BSM miners and one honest miner as a Markov process [13]. Zhang et al. [14] simulated the profitable threshold in the presence of multiple selfish miners. Charlie et al. [15] proposed SquirRL, a framework for using deep reinforcement learning (DRL) to analyze selfish mining and block withholding attacks in blockchain systems with two attackers. Previous experimental studies, though insightful, lack theoretical understandings on the principle of competitive selfish mining. When compared to the scenario of a solitary attacker, the presence of multiple selfish mining attackers results in two noteworthy effects. First, the fierce competition among attackers will make the system state transition more complex. Second, non-colluding attackers conduct strategic mining independently, making all miners incapable of acquiring comprehensive information about the network. Consequently, it is usually believed that modeling 2 the profitable threshold with multiple selfish miners is hard, and the optimal mining may be intractable due to the abovementioned challenges. In this paper, we theoretically investigate the profitability of selfish mining with multiple attackers by asking a sequence of key progressive questions: 1) Will selfish mining become more easily profitable with multiple attackers than with a solo attacker? 2) How can we design a nearly optimal mining policy for an attacker despite the complex interactions among miners and the incomplete observations of the system state? 3) How long should a BSM attacker wait from the beginning of selfish mining until being profitable eventually? Figuring out these questions will provide essential understandings of the security of blockchain PoW consensus. We formulate a Markov chain model to compute the relative revenue of BSM attackers for the first question, which is the percentage of his valid blocks in the public chain. Our model is very general in the sense that it can capture cases with more than two attackers or allow an attacker to hide multiple blocks privately. In particular, the latter case may cause the complicated chain-reaction release in which the publishing action of one attacker triggers that of the other attacker. We demonstrate how the existence of multiple attackers affects the system security from a theoretic perspective. Answering the second question is very challenging if a strategic attacker with a nearly optimal mining policy (Alice), a BSM attacker (Bob) and an honest miner (Henry) coexist in the system. Firstly, the interactions among three chains are more complicated. The state of the mining race is captured by a 10-tuple pertinent to the composition of all chains and the forking status, as opposed to a 3-tuple in MDP-based optimal policy for a single attacker. The number of actions is larger and the state-action pairs can be 102 to 103 times larger. Second, Alice as the strategic miner cannot observe the complicated state information. In fact, she is merely aware of 5 attributes at each state related to her private chain and the public chain. We formulate a family of parameterized partially observable Markov decision process (POMDP) models to characterize Alice’s strategic mining policy with a continuous belief of the current state. To tackle the large state space and the highdimensional belief space, we adopt AEMS2 to compute the nearly optimal mining policy. A binary search method similar to [5] is used to find the maximum revenue among the family of POMDP models. As for the third question, we build a simple model to compute attackers’ absolute revenue, which is the average number of valid blocks received by each miner per unit of time. Since selfish mining is an attack on the difficulty adjustment algorithm (DAA), it is not profitable instantaneously even if the attacker’s Hash power is above the profitable threshold. This model enables us to compute the number of DAA periods that lead to profitable selfish mining eventually. Meanwhile, we prove that the absolute revenue and relative revenue are approximately equivalent on a sufficiently long time scale. Our major observations are summarized as below. • BSM. The Markov chain is established to model the revenues of each miner when there are multiple basic selfish mining attackers in the system. We prove the prof- itable threshold of Hash power is below 21.48% with two symmetric BSM attackers from theory and experiment, as opposed to 25% with a single BSM attacker and 23.21% with a single optimal attacker. More blocks allowed to hold privately or more attackers will reduce this threshold. • POMDP. We propose POMDP-based mining policy which brings more revenues to the strategic miner than BSM and honest mining. When the BSM attacker (Bob) has 34% Hash power, the other attacker’s (Alice’s) profitable threshold decreases from 29.44% to about 2% if she chooses POMDP-based policy rather than BSM. The designed online algorithm can rapidly and effectively compute the near optimal action under the current observable information. • Profitable Delay. We make a transient analysis of basic selfish mining and explore the benefit time. A BSM miner receives less absolute revenue than honest mining in the first difficulty adjustment period even if his Hash power is above the profitable threshold, and his gain is achieved in future periods. BSM is profitable after 51 rounds of difficulty adjustment (i.e. 714 days in Bitcoin) if the Hash power of two symmetric selfish miners is 22%. This delay decreases to 5 rounds (i.e. 70 days in Bitcoin) as their Hash power accrues to 33%, which is still quite long. The remainder of this paper is organized as follows. Section II describes the background of selfish mining. Section III models the relative revenue of basic selfish mining with different attackers. Section IV proposed the POMDP-based mining policy and designed the efficient algorithm. The profitable time of basic selfish mining is modeled in Section V. Section VI validates the revenue model of BSM and the performance of the POMDP-based policy. Section VII introduces the related work, and Section VIII concludes our work. II. S ELFISH -M INE S TRATEGY In this section, we present the block-release procedure of blockchain mining in the presence of two adversarial miners. We further introduce the new features on tie-breaking and chain-reaction release. A. System Description Consider a blockchain system with two misbehaving attackers 1 Alice and Bob, as well as an honest miner, Henry 2 . They compete to mine a valid block for the purpose of acquiring bitcoin-like tokens. The proof-of-work (PoW) consensus is adopted and the mining of blocks is stateless: the probability of discovering a block by a miner is proportional to his current Hash power, but inversely proportional to the current aggregate Hash power of the entire blockchain network. The blockchain system dynamically adjusts the difficulty of cryptographic puzzles such that new blocks are generated at a fixed average 1 A malicious mining pool can be treated as a miner because its Hash power is the sum of the Hash power of its members, and the pool manager is responsible for broadcasting any blocks mined by its members to the network. 2 Multiple honest miners can be boiled down to a single miner for the sake of their linear additivity of Hash powers and the consistency of their mining strategies. 3 rate (e.g., one block per 10 minutes on average in Bitcoin). The miners maintain a globally-agreed ordered set of transactions via the adoption and the mining on the longest chain. The reward of each valid block is normalized as one cryptographic coin. For simplicity, we make the following assumptions consistent with the literature [3] [5]: • The total Hash power of the blockchain system is normalized as a unit. Then, the Hash power of a miner is represented as a fraction of the total; • The block discovery time is exponentially distributed. These assumptions are obtained based on the PoW mining mechanism [16] [17]. They allow us to succinctly express the probability of generating a block per time unit and the probability that each miner generates the next block. It is worth noting that the memoryless of the exponential distribution guarantees that the probability of generating a block in the current time unit is not affected by how much time has already passed. The honest miner Henry who finds a valid block will release it immediately. Alice (resp. Bob) may release her blocks strategically by forcing Henry into wasting his computation. When Alice and Bob are both selfish miners, the interaction between two private chains becomes more complicated because none of them knows the other’s state. In what follows, we capture all the different states that each miner may encounter. Denote by α1 , α2 and αh the Hash powers of Alice, Bob and Henry respectively, i.e., α1 +α2 +αh = 1. Denote by γ1 (resp. γ2 ) the proportion that all except Alice’s (resp. Bob’s) Hash power mines after Alice’s (resp. Bob’s) released chain in the tie-breaking between Alice (resp. Bob) and Henry. Denote by θ1 and θ2 the probabilities that honest miners choose to mine after Alice’s and Bob’s chains in the three-party tie-breaking, respectively. When the blockchain system creates a new block, it is mined by pool i with the probability αi , ∀i ∈ {1, 2, h}, owing to the memorylessness of Hash computations. discovered, while Alice and Bob will decide whether to release their mined blocks depending on the length of the public chain. • (Forfeit case) Alice (resp. Bob) abandons her (resp. his) private chain and conforms to mining after the public chain if the latter is longer. Henry also abandons his public chain if Alice or Bob publishes a longer chain. • (Risk-avoiding release case) Alice (resp. Bob) releases her (resp. his) privately mined blocks to the public if the new block is mined by the others and the leading advantage of her private chain is no more than two blocks. • (Chain reaction case) When Alice (resp. Bob) releases her (resp. his) blocks to public chain, the release of Bob’s (resp. Alice’s) private blocks is triggered immediately. The chain reaction case is the combination of the forfeit and the risk-avoiding cases, whereas the existence of chain reaction complicates evolution of the public chain. Suppose that Alice publishes her private blocks to obsolete the current public chain. After the construction of the new public chain, Bob may release his private chain to forfeit it immediately. C. Release procedure and tie-breaking Logics The consensus on the public chain requires that it is the longest. We illustrate the evolution of private and public chains using examples, where Ak , Bk , and Hk denote that the k th block belongs to Alice, Bob and Henry respectively. Risk-avoiding release case. We show the risk-avoiding release of Alice’s private chain in Fig.1 where the blocks of private chains are in grey and those of public chains are in white. Alice is only one block ahead of Henry after the latter mines a new block for the public chain. Because Alice fears losing the competition, she publishes her private blocks, obsoleting Henry’s public chain, so that both Alice and Henry mine on the new longest chain afterward. B. Basic Selfish Mining Mode Alice maintains a private chain, and so does Bob, while Henry operates on the public chain. Alice and Bob are not aware of each other’s role. We suppose that all the miners work on the same public chain at the beginning where the starting point is expressed as “0”. The length of the private chain is kept as private information by attackers, and the length of the public chain is observed by all of them. We consider the selfish mining method proposed by [3], and our analytical approach can be generalized to a variety of other methods. The mining procedure consists of two cases as follows. • (Public-chain mining case) Henry always mines after the public chain. Alice or Bob also mines on the public chain if it is longer than his private chain. • (Private-chain mining case) Alice (resp. Bob) continues to mine on her (resp. his) private chain if she (resp. he) discovers a new block and the private chain is now longer than the public chain. The release procedure is more complicated than the mining procedure. Henry broadcasts his mined block as soon as it is Fig. 1. Alice’s risk-avoiding release and Henry’s abandonment. The blocks of private chains are in grey and those of public chains are in white. Tie-breaking resolving. In Alice’s case, if she only hides one block, Henry may catch up with her. When it happens, Alice publishes her private blocks to compete with Henry. Thus, two public chains of the same length exist in Fig. 2. Since only one public chain prevails, a tie-breaking rule needs to be taken into account. The case is that the public chains of Alice and Henry have the same length, and Bob’s private chain is 0. Hence, we only need to resolve the tie between Alice and Henry. All the miners are possible to mine after block A1 , while Bob and Henry may mine after H1 . There are five possibilities of extending the longest public chain, and the shorter one will be obsoleted. If Alice and Bob hide one block respectively, there will be three public chains with the same length. Alice and Bob will mine after their own chain, and Henry will choose one chain randomly. More details about tie-breaking with three public chains are described in Appendix A. 4 public chain, go through a series of block-hiding and blockreleasing actions, and finally reach a consensus on the public chain again. Definition 1. (Relative Revenue) Let Ra , Rb and Rh be the expected numbers of valid blocks mined by Alice, Bob and Henry in a mining round, respectively. The relative revenue of a miner, R̂i , is expressed as R̂i = Fig. 2. Tie-breaking case of two public chains. Fig. 3. Chain reaction case. Chain reaction release. We next introduce the chain reaction release. Note that the chain reaction release consists of a sequence of risk-avoiding releases and tie-breaking resolvings. Fig. 3 illustrates an example of how the chain reaction phenomenon is triggered. First, Alice’s private chain contains four blocks, while the lengths of Bob’s private chain and Henry’s public chain are 0. This process is marked as ¬ in Fig. 3. After a tie-breaking resolving at , the longer public chain contains two blocks B1 and H2 , and the shorter one is orphaned. Bob constructs a new private chain starting from B3 to B8 at ®, while Henry continues to mine one block after H2 at ¯. From Alice’s perspective, her private chain is merely one block ahead of the public chain. She releases her private blocks in order to avoid the risk of losing the race with Henry. The new public longest chain now starts from block A4 . Next, ° and ± constitute a new round of tie-breaking resolving between Alice and Henry, extending the public chain to block A7 . However, the release of A7 triggers Bob to release all of his private blocks starting from B3 to B8 . When retrospecting the whole mining process, we observe that the winning branch switches back and forth, making the analysis of selfish mining extremely complicated. To be noted, the chain reaction occurs only when the length of the private chain is greater than three. Ri , Ra + Rb + Rh i ∈ {a, b, h}. It should be emphasized that the valid blocks are confirmed blocks in the longest chain. Selfish mining attack will increase the number of orphan blocks that leads to a drop in the valid block generation rate. The Bitcoin mining protocol will adjust the mining difficulty, so that the mining rate of the public chain is kept at one block on average every 10 minutes. Therefore, the expected revenue needs to be normalized. We leave a more rigorous analysis of the relative revenue in Section V. The profitability of selfish mining does not refer to the surplus that the block reward subtracts the cost of cryptographic computation. In fact, it is a contrastive measure to the honest mining that needs an objective index. Definition 2. (Profitability) The selfish or strategic mining performed by Alice (resp. Bob) is deemed profitable if the relative revenue is higher than the normalized Hash power, i.e. R̂a > α1 (resp. R̂b > α2 ). B. Stationary Analysis for Two Attackers In order to analyze the profitability of selfish mining, we need to capture the states of the system which satisfy the Markov property, including the block generation and block release. We hereby formulate a discrete-time Markov chain model similar to [3] to characterize the dynamics of the public and private chains. We begin with the assumption that each selfish miner will release his two private blocks immediately after he has mined the second one (i.e. N = 2). The underlying reasons are two-fold. Firstly, the simpler block-release process avoids the complicated representation of Markov states, thus allowing more tractable mathematical modeling. Secondly, the bursty release of many valid blocks in a very short time usually indicates the existence of selfish mining attacks that can be easily detected. III. S TATIONARY A NALYSIS OF BASIC S ELFISH M INING In this section, we present Markov chain models to characterize the block-publishing dynamics with multiple selfish miners. The expected revenues are derived in explicit form. A. Definition The theme of our study is placed on the profitability of selfish mining so that the profitable measures should be clarified first. For notational simplicity, we only consider three miners: Alice, Bob and Henry. We denote the mining round as the process in which all miners start mining on the same Fig. 4. Markov chain with less than two private blocks. We define a state as a three-tuple consisting of the lengths of Alice’s, Bob’s and Henry’s chains. Fig. 4 illustrates all 5 the states, the state transition indicators and their transition probabilities. For instance, the transition from state (0, 0, 0) to state (1, 0, 0) means that Alice discovers a valid block with probability α1 and forks the private chain. If the maximum length of any private chain is below 2, Alice and Bob can hide their private chains and continue the selfish mining. All the transitions to state (0, 0, 0) mean that the forked chains return to the unanimous public chain and a new round of selfish mining starts. Denote by P the state transition probability matrix and by Pss0 the transition probability from state s = (i, j, k) to s0 = (i0 , j 0 , k 0 ). Let πijk be the stationary distribution of state (i, j, k). According to the detailed balance equation [18] [19] X πs0 Ps0 s . πs = (1) As a special case that both selfish miners are homogeneous, i.e. α1 = α2 = α < 0.5, γ1 = γ2 = 0.5 and θ1 = θ2 = 1/3, the expected revenues can be simplified as π000 = (1 + 4α − 4α3 )−1 ; Ra = Rb = π000 · 16 α(25α + 2α2 + 3 − 32α3 ); (12) 3 Rh = π000 · [(1 − 2α)(1 + 3α − 73 α2 − 16 3 α )]. (13) We can easily observe that the attackers’ (resp. Henry’s) expected revenues in Eq. (12) (resp. Eq. (13)) monotonically increase (resp. decrease) with regard to attackers’ ratios of Hash power. C. Scaling to Multiple Attackers Although the profitable selfish mining demands a high Hash We calculate the probability distribution of states and obtain power, it is possible that multiple selfish miners opt in. The the following equations: profitability of more selfish miners is obscure. The honest π000 + π100 + π010 + π110 + π101 + π011 + π111 = 1. (2) miner’s share of Hash power decreases, and the competition among selfish miners becomes more fierce. Therefore, we Then, we can compute π000 as consider a general mire scenario with m (m > 2) BSM miners. −1 We model the dynamics of the public and private chains π000 = (1+α1 +α2 +α1 αh +2α1 α2 +α2 αh +2α1 α2 αh ) , (3) as a Markov process likewise. Fig. 5 illustrates all the states and πijk at any other state s = (i, j, k) in the same way. and their transition probabilities. Recall that the assumption of the maximum private chain length also holds. Each state The transitions to state (0, 0, 0) manifest which miner is is expressed as a m-tuple, i.e. L = {l1 , l2 , · · · , lm } with li ∈ the final winner in the current round of selfish mining. There- {0, 1}, which consists of the lengths of each attacker’s private fore, we can compute the expected revenues of Alice, Bob chain. The state with m zeros, denoted as L0 , indicates the and Henry that are defined as Ra , Rb and Rh respectively. start of the mining competition. The states with a single ‘1’ Facilitated by the stationary state distributions, we calculate are grouped together in which one and only one attacker has them as below, mined a valid block to build his private chain. Similarly, the states with k elements of ‘1’ indicate that the private chains Ra = 2α1 π100 + (2α1 + (1 − α1 )γ1 )π101 + α1 π011 (4) of k out of m attackers have one valid block. Formally, we + 2α1 π110 + (2α1 + αh θ1 )π111 ; denote by L the set of all states with the cardinality 2m , and Rb = 2α2 π010 + (2α2 + (1 − α2 )γ2 )π011 + α2 π101 (5) by Lk ⊆ L the subset in which k selfish miners hold their private blocks. + 2α2 π110 + (2α2 + αh θ2 )π111 ; s0 Rh = (αh + (1 − α1 )(1 − γ1 ))π101 + αh (2 − θ1 − θ2 )π111 (6) + (αh + (1 − α2 )(1 − γ2 ))π011 + αh π000 . Then we can obtain the expected revenue in closed form as: Ra = π000 · [2α12 (1 + αh ) + (α2 + αh ) α1 αh γ1 + α1 α2 αh + 4α12 α2 (1 + αh ) + 2α1 α2 αh2 θ1 ]; Rb = π000 · [2α22 (1 + αh ) + (α1 + αh ) α2 αh γ2 + α1 α2 αh + 4α22 α1 (1 + αh ) + 2α1 α2 αh2 θ2 ]; Rh = π000 · [α1 αh2 (2 − γ1 ) + 2α1 α2 αh2 (2−θ1 − θ2 ) + αh + α2 αh2 (2 − γ2 ) + α1 α2 αh (2 − γ1 − γ2 )]. (7) (8) (9) The relative revenue of each miner can be given by: R̂i = Ri , ∀i ∈ {a, b, h}. Ra + Rb + Rh (10) The average number of Henry’s orphaned blocks in each attack round is calculated as: Oh =π000 [(α1 +(1−α1 )γ1 )α1 αh + (α2 +(1−α2 )γ2 )α2 αh + (α1 + α2 + (θ1 + θ2 )αh )2α1 α2 αh ]. (11) Fig. 5. Markov State Transition with m attackers. The state transition probabilities are described as the following. The blockchain system can reside at the initial state with probability αh , i.e., the block is mined by the honest miner. Denote by ei the vector whose ith element is 1 and all others are zero. When attacker i discovers a valid block, 6 the blockchain system transits to a new state (L + ei ) with probability αi . At state L ∈ Lk , if the ith attacker who has a private block finds a valid block again, the blockchain system returns to the initial state L0 (in which the state is expressed as a double circle). Otherwise, it jumps to a state in Lk+1 with the probability equivalent to the relative Hash power of the selfish miner. The state transition probabilities can be expressed as: PL0 L0 = αh , 0 PLL0 = αi , ∀L ∈ Lk , and L = L + ei ∈ Lk+1 , PLL0 = αi + αh , PL0 L0 = 1, ∀L ∈ Lk , L0 ∈ / Lk+1 . Using the detailed balance equations, we can compute the stationary distribution of each state. Specifically, the stationary probability at state S0 is computed explicitly as πL 0 = 1 + m X X k! m Y (αj · 1lj =1 ) −1 . (14) j=1 k=1 L∈Lk m X Ri 1+α k = Akm αk [ (2α + αh ) + αh α], πL0 m k+1 (21) k=1 Hence, the relative revenue can be rewritten as R̂i = Ri /(m · Ri + Rh ), ∀i ∈ {1, · · · , m}, Rˆh = Rh /(m · Ri + Rh ). (22) (23) The revenue of all miners when each private chain can hide more than one block is modeled and the detailed results are shown in Appendix B. IV. O PTIMAL STRATEGY UNDER MULTIPLE ATTACKERS The basic selfish mining (BSM) policy restricts the choices of withholding and releasing blocks. In this section, we present the more profitable selfish mining strategy (POMDP-based mining policy) for two attackers when one of them chooses the basic selfish mining and the other chooses to be strategic. The stationary probability at state L is given by: πL = k! · πL0 · m Y A. MDP-based policy for an Upper Bound of Revenue (αj · 1lj =1 ), ∀L ∈ Lk . (15) The limitations of basic selfish mining are intuitive. An attacker is “conservative” to adopt the public chain when The revenues of all miners are computed based on the his private chain slightly lags behind it, and is “less wise” stationary state distribution and the particular transition paths to override the public chain when it is catching up. The to state L0 . If all miners have the same γ in tie-breaking cases, optimal selfish mining problem with a single attacker was raised in [5] [11] that modeled the mining race as a Markov their revenues can be written as: P  decision process. This strategic selfish mining policy lowers 1− αj ·1lj =1 m X  πL · [2αi + αh (2αi + j∈L X )], l = 1; the profitable threshold of Hash power. i k+1 Ri = The optimal selfish mining in the presence of two attackers  k=1 L∈Lk  αh · πL · αi , li = 0. (Alice adopts the more profitable mining policy and Bob adopts the BSM mining policy) is far more challenging than P that with a single attacker. First, the state and state transition 1− αj · 1lj =1 m X X j∈L Rh = αh πL (αh + ) + πL0 · αh . are greatly augmented. Alice has to incorporate the status of k+1 multiple chains in the state other than merely the lengths k=1 L∈Lk (16) of the chains. Second, Alice cannot acquire the information regarding Bob’s private chain. To tackle these difficulties, we The relative revenue of each miner can be given by: begin with the assumption that Alice has full information about Ri the private chain of Bob. The optimal policy of Alice with R̂i = P , ∀i ∈ {1, · · · , m, h}. (17) m full information can be solved based on an MDP model (OPT Rj + Rh policy), and the corresponding revenue will be used as the j=1 upper bound of revenue when the private chain of Bob is As a special case that all selfish miners are homogeneous and unknown. Meanwhile, the MDP model offers the principle of their Hash power is α, the stationary probability at state S0 designing optimal policy with partially observable states. can be simplified as 1) Main components: We formulate an MDP model for m X the strategic attacker as the four-tuple M =< S, A, P, R > −1 πL0 = 1 + Akm αk , (18) where S denotes the state space, A denotes the action space, k=1 P corresponds to the transition matrix, and R corresponds to k where Am is k−permutations of m. The stationary probability the reward matrix. State: The state space S is defined as a 10-tuple in the at state L can be represented as form < loc, f ork, l1 , l2 , h1 , h2 , h3 , u1 , u2 , u3 >. The attribute πL = k! · πL0 · αk , ∀L ∈ Lk . (19) loc ∈ {1, 2, 3} indicates the branch that Henry is working on. If loc = 1 (resp. loc = 2), Henry is mining on the public chain We can obtain the expected revenue of each miner as that also contains Alice’s (resp. Bob’s) blocks at the current m X Rh 1 − kα k k = Am α αh (αh + ) + αh . (20) mining round. If loc = 3, the longest public chain contains πL 0 k+1 only Henry’s blocks. Note that Alice’s and Bob’s blocks are k=1 mutually exclusive on the public chain at the same mining round because one attacker will not accept the blocks of the other before this attack round ends. The attribute f ork obtains j=1 7 six values, dubbed as {ir, r, f12 , f13 , f23 , f123 }, where r represents that Alice can release blocks to compete with the current public chain when h3 > 0 while ir represents that she can not. If multiple miners are competing on the public chain, f ork takes four values {f12 , f13 , f23 , f123 }, indicating that (Alice, Bob), (Alice, Henry), (Bob, Henry) and (Alice, Bob, Henry) are in the competition respectively. The notation h1 indicates the distance between the starting position believed by Alice and the real starting position. Similarly, h2 and h3 indicate these distances of Bob and Henry. We record h1 , h2 and h3 because the blocks between the real and the believed starting positions influence which chain will prevail finally and how the revenues are calculated. The notation l1 (resp. l2 ) represents the number of unreleased blocks at Alice’s (resp. Bob’s) private chain. µ1 denotes the number of Henry’s blocks between the real starting position and the Alice-believed starting position. µ2 and µ3 are defined for Bob and Henry in the same way. Similar to [5], we limit the lengths of private and public chains in a mining round so as to confine the size of state space, i.e. l1 , l2 ≤ N and h3 ≤ h3,max . Action: An action is the number of blocks that Alice publishes under a particular state. We define Alice’s action space A as A = {adopt, 0, 1, · · · , l1 } in which adopt means that Alice gives up her private chain, 0 means that Alice chooses to wait, and l1 is the current number of blocks held by Alice privately. The action taken by Alice has certain reasonable restrictions: if l1 reaches N , Alice must release at least one block; if h3 reaches h3,max , Alice either chooses “adopt” or to release no less than (h3 − h1 ) blocks to end this mining round. State Transition: We define the state transition function as Pr(s0 |s, a ∈ A), the probability that the state s jumps to s0 under action a. The state transition is triggered by the mining of a new block, and is determined by who discovers it. All the transitions are summarized in Appendix Table III. Reward Function: The purpose of “optimal” mining is to acquire a larger share of confirmed blocks on the public chain. Recall that the relative revenue of a miner is the fraction of his blocks on the public chain for a long period. Obviously, the relative revenue cannot be measured under each state-action pair, and cannot be taken as the corresponding immediate reward. Sapirshtein et al. [5] transform the (longterm) relative revenue into the family of (one-shot) absolute revenues parameterized by the weight ρ ∈ [0, 1]. This ingenious transformation operates as the following. Define a transformation function wρ : N3 → R related to Alice’s instantaneous reward: wρi (r1i , r2i , rhi ) = (1 − ρ) · r1i − ρ · (r2i + rhi ), (24) where r1i , r2i and rhi represent the instantaneous rewards of Alice, Bob and Henry at step i (analogous to time t in classical MDP). We reformulate the original MDP model as Mρ =< S, A, P, wρ (r1 , r2 , rh ) >. The underlying reason of such transformation is that instead of maximizing the relative revenue, we choose to maximize the expected fictitious reward wρ (r1 , r2 , rh ). For any admissible policy π, the mean reward denoted by vρπ is characterized as: ξ 1X π wρ (r1i (π), r2i (π), rhi (π))], vρ = E[ lim ξ ξ→∞ i=1 (25) where ξ is the total number of state transition steps. The optimal revenue vρ∗ is given by vρ∗ = max {vρπ }. π (26) The equivalence between two MDPs M and Mρ is indirect. Sapirshtein et al. [5] present two propositions to guarantee their equivalence for a single selfish miner. ∗ ∗ • If vρ = 0 for some ρ ∈ [0, 1], then any policy π obtaining this value also maximizes the relative revenue, and the relative revenue equals to ρ. ∗ • vρ is monotonically decreasing in ρ. The above propositions tell us that by searching for an appropriate ρ that yields the mean reward of Mρ to be 0, we can obtain the optimal relative revenue of Alice in M. We generalize this idea to the MDP with multiple selfish miners. In addition, the maximum number of steps, ξ, is truncated to avoid excessive computations by tolerating a very gentle loss in the optimal mean reward vρπ . 2) Algorithm: Owing to the monotonicity of vρ∗ to ρ, a binary search of ρ ∈ [0, 1] is adequate. For a given ρ, we utilize the value iteration method to solve the optimal policy πρ∗ as [5]. Compared with policy iteration, the advantage of value iteration is its fast convergence rate especially in largescale MDPs [20] [21]. B. POMDP-based policy for Optimal Mining The OPT framework provides important insights into the optimal mining policy in the presence of two attackers, yet its real world deployment is unrealistic. The strategic miner Alice is assumed to know precisely the system state, while in reality Bob’s private chain is not observable to Alice, and the blocks on the public chain released by Bob and Henry cannot be differentiated because of their anonymity. In light of the incomplete state information, we reformulate the optimal mining as a Partially Observable Markov Decision Process (POMDP). Before diving into details, we enumerate three key challenges: • which subset of information is non-observable to Alice; • how the mining event-driven MDP model is generalized to the POMDP model; • how the POMDP-based optimal mining policy can be computed efficiently. 1) Main components: The POMDP model is expressed as a six-tuple MP O :=< S, A, P, R, O, Z >, where O is Alice’s observation space, Z(·) is the observation function, and the remaining components inherit the same meanings as their counterparts in MDP. Observable Information: In the partially observable environment, Alice cannot obtain all the attributes in S. The length of Bob’s private chain l2 cannot be observed absolutely. The attribute loc is non-observable either because Alice is unaware of whether Henry is mining on his own chain or 8 Bob’s chain. h2 , µ2 and µ3 are non-observable for that the anonymity of mining covers up the owners of the blocks on the public chain. The attribute f ork is observable because ir and r are pertinent to Alice’s private chain, and the values f∗ ∈ {f12 , f13 , f23 , f123 } is obtained by counting the number of focks in competition. l1 , h1 and µ1 are known for sure; h3 is actually the length of the longest public chain. In summary, the observation space is represented as O :=< f ork, l1 , h1 , h3 , µ1 >⊂ S. Given the same observation, Alice is likely to be in many possible states. State Transition: The state transition function is also denoted as Pr(s0 |s, a ∈ A). Though taking the similar form, MP O possesses a different state transition logic from M. In M, an action is triggered by the discovery of a new block, and the state transition follows. In MP O , the pure event-driven state transition will restrict Alice from participating in the fork competition. For instance, if Bob hides one block, Alice should publish her private block while there is no observable event to trigger a fork competition. On the contrary, if Bob does not have any private block while Alice believes that the length of Bob’s private chain is 1, Alice may publish one or two blocks unnecessarily. Therefore, one can see that a time-slotted plus event-driven POMDP model is appropriate to handle the intricate optimal mining problem. Due to the memoryless Hash computation [22], the block arrival process is actually a stationary stochastic process. If we slice this stochastic process equally with a slot duration ∆t, the number of mined blocks in every slot has the same distribution. By choosing a relatively small ∆t, we suppose that at most one block is mined in each slot (the chance of mining two or more blocks is rarer by orders of magnitude). Denote by p the probability of generating a block in one time slot. The probabilities that Alice, Bob and Henry generate it are α1 p, α2 p and αh p respectively. Alice can estimate the Hash power α2 and αh through mining honestly for a certain period. It is worth highlighting that our POMDP model makes the optimal mining with partially observable states feasible, and is in accordance with realistic blockchain systems. All the transitions are summarized in Appendix Table IV. Observation function: Define Z := S × A → ∆(O) as the observation function that specifies the relationship between system states and observations. Here, z(s, a, o) is the probability that observation o will be reached after Alice performs action a and lands in state s: zt+1 (s, a, o) = Pr(ot+1 = o|st+1 = s, at = a). (27) In MP O , the observation of a state is certain, i.e, s =< loc, f ork, l1 , l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 >, o =< f ork, l1 , h1 , h3 , µ1 >, zt+1 (s, a, o) = 1. (28) Belief: Our POMDP model is pertinent to a belief b which is a probability distribution over all the possible states. Intuitively, Alice makes a guess on the current state iteratively. The belief on a particular state s at time t is given by: b(s) = Pr(st = s|ot , at−1 , ot−1 , · · · , a0 , b0 ). (29) The updated belief state b0 (s0 ) is calculated whenever the action a is taken and the observation o is perceived. P Pr(o|s0 , a) s Pr(s0 |a, s)b(s) 0 0 0 , (30) b (s ) ≡ Pr(s |a, o, b) = Pr(o|a, b) P where s∈S b(s) = 1 and Pr(o|a, b) is a normalization factor given by X X Pr(o|a, s0 ) Pr(s0 |a, s)b(s). (31) Pr(o|a, b) = s0 s Reward function: We use r1i (s, a), r2i (s, a) and rhi (s, a) to denote the instantaneous rewards of Alice, Bob and Henry at step i when Alice takes action a at state s. Due to the uncertainty of system state, the expected reward functions r1i (b, a), r2i (b, a) and rhi (b, a) are constructed on the belief of instantaneous rewards, ∀a ∈ A, X rki (b, a) = bi (s)rki (s, a), k ∈ {1, 2, h}. (32) s∈S Similar to the transformation in the MDP model, we replace the relative reward by the absolute reward wρi (r1i (bi , a), r2i (bi , a), rhi (bi , a)) parameterized by ρ using Eq. (24). We next formulate the optimal mining as a finite-horizon average reward POMDP problem as [5] and [11]. The expected average value function is defined as ξ 1X π vρ = E[ lim wρ (r1i (bi , π), r2i (bi , π), rhi (bi , π))]. ξ→∞ ξ i=1 The optimal policy π ∗ is a set of decision rules depending on belief-state pairs: π ∗ = arg max{vρπ }. π∈A (33) The parameter ρ that solves vρ∗ = 0 is the relative revenue of Alice under the POMDP model. In practice, the sum of revenues over ξ is truncated by a sufficiently large number ξ0 . Given a precision threshold , ξ0 needs to satisfy: ξ0 |vρπ − E[ 1 X wρ (r1i (π), r2i (π), rhi (π))]| ≤ . ξ0 i=1 (34) C. Algorithm A POMDP is essentially an expanded MDP defined on belief space. However, the belief space is a high-dimensional continuous space that needs to be segmented into a huge number of belief states. An offline POMDP algorithm will compute the optimal actions at every belief state. Considering a large-scale POMDP like ours, the offline computation is time-consuming because of generating the rewards, updating the beliefs and constructing the optimal policy at each belief. For efficiency considerations, we propose using the online POMDP algorithm that explores the future belief states reachable from the current belief state. The policy construction time is often substantially shorter. Furthermore, three properties can be used to reduce the time of searching for ρ. Lemma 1. Under the same parameter setting, the optimal result of M is the upper bound of MP O . 9 Proof: Among the approximation algorithms of the POMDP problem, the MDP approximation consists in approximating the value function of the POMDP by the value function of its underlying MDP [23]. This value function is an upper bound on the value function of the POMDP [24]. Lemma 2. The revenue obtained under the optimal policy based on any ρ is the lower bound of the actual optimal revenue for MP O . Proof: The value of ρ does not affect the state transition and the instantaneous reward. All the mining policies under any ρ are available, even though they are not optimal. Hence, we record the number of blocks that each miner obtains at every time step, i.e, (r1i , r2i , rhi ). After enough time steps, we can compute the relative revenue under the current ρ even though it is not optimal. Therefore, the revenue of the optimal policy obtained under the current ρ is the lower bound of the actual optimal revenue. Lemma 3. There exists an optimal stationary deterministic policy for MP O model. Proof: The states, actions and beliefs are countable because the maximum lengths of private and public chains are limited. Since a POMDP problem can be regarded as a belief MDP, the existence of an optimal stationary policy for our POMDP problem is guaranteed by Theorem 7.3.6 of [25]. Algorithm 1 Algorithm for solving the POMDP. Input: MP O , M, a truncation parameter ξ0 , a precision value ; The initial belief bc ; Output:ρ; Static: a∗ : optimal action; 1: low ← 0; 2: ρ∗ =QMDP(M); 3: upper ← ρ∗ ; 4: while upper − low > do 5: ρ ← (low + upper)/2 6: RESULT← {}; 7: r1 ← 0, r2 ← 0, rh ← 0; 8: vρ ← 0, ξ ← ξ0 ; 9: bc ← initial belief 10: while ξ > 0 do 11: if bc ∈ RESULT then 12: a∗ =RESULT[bc ] 13: else 14: a∗ ← AEM S2(bc , MP O ) 15: end if i )← Execute a∗ for b . 16: (r1i , r2i , rh c i; 17: r1 + = rii , r2 + = r2i , rh + = rh i ); 18: vρ + = wρ (r1i , r2i , rh 19: RESULT[bc ] = a∗ 20: Perceive a new observation o 21: bc ← b0 (bc , a∗ , o) B update algorithm is Eq. (30) 22: ξ− = 1 23: end while 24: R10 = r1 /(r1 + r2 + rh ); 25: if vρ > 0 then 26: low ← max(ρ, R10 ); 27: else 28: upper ← ρ; low ← max(low, R10 ); 29: end if 30: end while 31: return ρ; Our online mining algorithm is described in Algorithm 1. We calculate the optimal revenue based on binary search. The upper bound can be set as the result of the underlying MDP model according to Lemma 1 (lines 2-3). Alice will execute the optimal action based on the current ρ and obtain the relative revenue (lines 16 and 24). Her optimal revenue is no less than the relative revenue according to Lemma 2 (lines 25-29). We do not have to recalculate the optimal action at every step according to Lemma 3. The online algorithm will adopt the corresponding action if the same belief state has met. Otherwise, it will calculate and store the new optimal action (lines 11-15). A block of size 1MB needs 18 seconds to reach three thousand nodes in Bitcoin [26]. Making timely decisions is very important. We use AEMS2 [27] [24], one of the fastest POMDP algorithms, to compute the optimal action (line 14). V. T RANSIENT A NALYSIS OF P ROFITABILITY In this section, we first describe the difficulty adjustment algorithm (DAA) in Bitcoin-like systems, and model the revenue of miners in one difficulty adjustment period. We analytically show that the extra revenue of selfish mining is originated from the DAA. A. Bitcoin-like Difficulty Adjustment The essence of Bitcoin mining is to solve a cryptographic puzzle. A Bitcoin miner repeatedly enumerates a NONCE until the head Hash is below the difficulty target. The smaller a target value is, the more difficult the discovery of a valid NONCE will be. For a fixed target difficulty, a larger Hash power means a shorter time of finding a valid NONCE. To maintain a stable block generating interval, Bitcoin introduces difficulty adjustment algorithms (DAAs) to cope with the variable Hash powers in the systems. The Bitcoin DAA is executed after 2016 blocks have been mined. It is actually a feedback control system: if the actual time of mining 2016 blocks is larger than 20160 minutes (10 minutes per block), the target difficulty decreases proportionally, and increases otherwise. When a miner performs selfish mining, a lot of blocks are orphaned so that the actual time to mine 2016 blocks becomes longer. In the next difficulty adjustment periods, the target difficulty is lowered down to maintain the fixed block generating rate. B. Absolute Revenue Previously we define the revenue of a miner in each mining round and his/her relative revenue. However, the duration of a mining round may not be fixed all the time, and the actual number of valid blocks obtained by a miner in each unit of wall-clock time is overlooked. In Bitcoin system, we denote 10 minutes as the unit time, and denote a DAA period as the expected time units of mining 2016 valid blocks. Definition 3. (Absolute Revenue) The absolute revenue is the average number of blocks obtained in each unit time. #1 DAA Period. We treat the first DAA period as the beginning of selfish mining in order to analyze the transient absolute 10 Rvld = E[T1 ] = Ra + Rb + Rh ; 2016 · Rtot . Rvld Rtot = 1; (35) Due to the orphaned blocks, Rtot is greater than Rvld so that the actual time of T1 is longer than 2016 time units. Subsequent DAA Periods. After the first DAA period, the blockchain finds that the time interval of generating a valid block is longer than one time unit. Consequently, the target difficulty decreases to match the current valid Hash power in the system. Given the invariable Hash power of miners, the block generating interval ∆ti becomes smaller for i ≥ 2. Let Ti be the expected time units of the ith DAA period that has Ti = 2016 for i ≥ 2. This is also the goal that DAA is going to achieve. It is noted we assume Rtot ≤ (4 · Rvld ). Absolute Revenue Over Time. Recall that the absolute revenue captures the expected reward of a miner in each time unit. Since our purpose is to investigate the transient profitability of selfish mining, we define R̃i (K) as the absolute revenue of the ith miner over K DAA periods. Therefore, we obtain the following expressions for ∀i ∈ {a, b, h} R̃i (K) = = 2016 · K · Ri 1 · PK Rvld k=1 E[Tk ] KRi . Rtot + (K − 1)Rvld (36) Now we are aware that the selfish mining has a smaller absolute revenue in the first DAA period no matter whether the Hash power of the attacker is above the stationary profitable threshold or not. This claim also holds in different selfish mining policies. As K increases, the absolute revenue is asymptotically close to the relative revenue, which is KRi KRi ≈ . R̃i (K) = Rtot + (K − 1)Rvld Rvld + (K − 1)Rvld (37) This fully justifies the use of relative revenue to represent absolute revenue in previous models. At the same time, a selfish miner can hopefully reimburse his/her loss in the first DAA period by his/her extra revenue in the future DAA periods. With our absolute revenue model, we can characterize how much time is needed to make selfish mining profitable eventually. 0.50 0.45 Bob's Threshold revenues of Alice and Bob. The following normalization rule is made to simplify the analysis by eliminating the randomness of block generating interval. A block is generated every time unit at the first difficulty adjustment period, i.e., ∆t1 = 1. With the above assumption, we can easily compute the duration of generating 2016 valid blocks. Let Rvld be the total number of valid blocks of all miners in a mining round, and let Rtot be the total number of blocks including the valid and orphan blocks in the same round. This means that a mining round occupies Rtot time units. Denote by T1 the expected time units to accomplish the first DAA period. One can calculate the time of mining rounds that the total number of valid blocks reaches 2016. Then, T1 is equivalent to the sum of time units of these mining rounds. There exists 0.40 0.35 (0, 0.334) Theory w/ N=2 Theory w/ N=3 Theory w/ N=4 α1 = α2 Simu w/ N=2 Simu w/ N=3 Simu w/ N=4 0.30 (0, 0.281) (0, 0.265) (0.27, 0.27) 0.25 0.20 0.0 (0.23, 0.23) 0.1 (0.2148, 0.2148) 0.2 0.3 Alice's Hashrate 0.4 0.5 Fig. 6. Bob’s threshold under the influence of Alice’s Hashrate. VI. E VALUATION In this section, we develop an event-driven simulator for basic selfish mining and a time-driven simulator for POMDPbased selfish mining 3 . Comprehensive experiments validate the correctness of our models and reveal important properties regarding the profitability of selfish mining [28]. A. Basic Selfish Mining Observation 1. When there are multiple attackers in Bitcoinlike systems, the attackers’ profitable thresholds decrease and the system security is degraded. We illustrate Bob’s profitable threshold of selfish mining in Fig. 6 as Alice’s Hash power increases from 0 to nearly 0.5. To avoid involving too many control variables, the tiebreaking parameters are set to γ1 =γ2 = 21 and θ1 =θ2 = 31 . One can observe from three curves with different N that Bob’s profitable threshold decreases at first and increases afterward. When N = 2 and α1 = 0.16, Bob’s profitable threshold is the lowest. Under this situation, Alice’s selfish mining may yield less revenue compared with her honest mining. We further draw a 45◦ line to indicate the profitable threshold for both Alice and Bob when their Hash power is symmetric. When N is 2, 3 and 4, the profitable threshold is 26.64%, 22.57% and 21.48%. In contrast, it takes the value of 33.33%, 28.08% and 26.50% respectively, if there is a single attacker. An obvious conclusion is that the existence of multiple attackers makes selfish mining more easily profitable. Observation 2. The profitable threshold decreases with the increase of N , and remains stable with the attackers of symmetric Hash power as N ≥ 4; it also decreases when the number of attackers m increases. We evaluate the profitable thresholds of BSM with different N and m. Our purposes are twofold: one is analyzing the interplay between this threshold and the environment variables, and the other is justifying the use of N ≤ 4 in the mathematical modeling. The Hash powers of all the attackers are identical, and the competing chains are indistinguishable upon the tiebreaking rules. Fig. 7 shows the relationship between N and the profitable threshold. The cases with 1, 2, 3 and 4 symmetric attackers are 3 Mining strategies can be implemented in real systems possibly because miners’ decision-making relies on public information. One attacker Two attackers Three attackers Four attackers 5 10 15 20 25 30 35 N 0.45 0.40 0.35 0.30 0.25 0.20 0.35 Sim lation threshold w/ N=∞ Sim lation threshold w/ N=2 Sim lation threshold w/ N=3 Sim lation threshold w/ N=4 Theoretical threshold w/ N=2 0.30 0.25 Threshold 0.325 0.300 0.275 0.250 0.225 0.200 0.175 0.150 Henry's orphan block ratio Threshold 11 0.20 0.15 α1 = α2 = 0.22 α1 = α2 = 0.25 α1 = α2 = 0.27 0.10 2 3 4 5 6 7 8 9 N 1 2 3 4 5 6 7 8 Number of attackers 0.30 Theory θ w/ N=2 Theory θ w/ N=3 Theory θ w/ N=4 Simu θ w/ N=2 Simu θ w/ N=3 Simu θ w/ N=4 0.28 Alice's Threshold Alice's Threshold 0.35 0.30 0.25 0.20 0.15 0.10 0.05 0.00 0.26 Theory γ w/ N=2 Theory γ w/ N=3 Theory γ w/ N=4 Simu γ w/ N=2 Simu γ w/ N=3 Simu γ w/ N=4 0.24 0.22 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 γ Fig. 10. Profitable threshold for Alice with γ. 0.20 0.1 0.2 0.3 θ 0.4 0.5 Fig. 11. Profitable threshold for Alice with θ. expressed in solid, dashed, dash-dotted and dotted lines. One can observe that the profitable threshold decreases remarkably for different m as N increases from 2 to 4. The event-driven simulations exhibit a stable profitable threshold when Alice and Bob can hold more than 5 private blocks. For instance, this threshold converges to 25% for a sufficiently large N with a single attacker, which is in line with [3]. With two symmetric attackers (m = 2), its value is 20.60% at N = 30, slightly different from that at N = 4. In general, Alice’s or Bob’s Hash power is much smaller than Henry’s Hash power. The chance that Alice’s or Bob’s private chain takes a large lead over the public chain in a mining round is very small. Hence, it does not make an evident influence on the profitable threshold when N is already large. Moreover, hiding a long private chain and releasing all the blocks simultaneously will make the selfish mining attack easily detected. Fig. 8 shows Henry’s orphan block ratio as N increases from 2 to 9 with m = 2. Even though Alice’s and Bob’s Hash powers are merely (0.22, 0.22), they cause a very high orphan block ratio to Henry, e.g., 18.73% with N = 2, 28.53% with N = 4 and 31.66% with N = 9. Such a high orphan ratio can easily expose the identity of attackers. Therefore, our modeling framework only considers N ≤ 4 though it is extensible to N > 4. We use mathematical models and event-driven simulations to quantify the impact of m on the profitable threshold in Fig. 7 and 9. One can observe that the increase in the number of attackers reduces the profitable threshold, thus endangering blockchain security. For N = 4, the profitable thresholds with m ∈ {2, 4, 8} are {0.2148, 0.155, 0.11}. This challenges the cognition that selfish mining is less likely to happen if the Hash power of a miner is below 25%. The primary reason that more attackers lead to smaller profitable thresholds lies in that the Hash power of the honest miner declines relatively. Meanwhile, our model coincides with the simulation result at Orphan block ratio for Henry Fig. 7. Profitable threshold vs the maximum number Fig. 8. Henry’s orphan block ratio vs the maximum Fig. 9. Profitable threshold vs the number of attackof private blocks (N ). number of private blocks (N ). ers (m). 0.275 0.250 0.225 0.200 0.175 0.150 0.125 0.100 Theoretical result w/ N=2 Theoretical result w/ N=3 Simulation result w/ N=2 Simulation result w/ N=3 0.20 0.25 0.30 0.35 0.40 0.45 Attcker's Hashrates Fig. 12. Estimating Bob’s Hash power by observing orphan block ratio. N = 2 in Fig. 9, thus validating its correctness. In Appendix C, we further explore the impact of N on the BSM attackers’ revenues when the attackers’ Hash powers are asymmetric. Observation 3. The profitable threshold decreases with the increase of γ and θ. When γ1 = γ2 = 1 and N = 2, BSM homogeneous attackers can obtain extra revenue with very small Hash power. We also explore the impact of different information propagation delays on the profitable threshold. The information propagation delay determines the proportion of other miners’ Hash power after the attacker’s released chain in the tiebreaking, i.e., γ1 , γ2 , θ1 and θ2 . We will study the profitable threshold of two homogeneous attackers under different information propagation delays, i.e., γ1 = γ2 = γ and θ1 = θ2 = θ. Fig 10 shows the results when θ = 1/3, the profitable threshold will decrease with the increase of γ. When γ increases to 1, the attackers’ profitable threshold tends to 0. This means that when N is relatively small, the attacker can obtain extra revenue through BSM with a very small hash power if γ = 1. Fig. 11 shows the profitable threshold decreases with the increase of β when γ = 1/2. It can be observed the less propagation delay leads to more revenues for the attacker. Since the advent of the seminal work [3], the Bitcoin community tries to constrain mining pools to possess less than 25% of Hash power. However, we prove that 25% is not enough: Bitcoin mining is fragile in the presence of multiple selfish miners. B. MDP and POMDP-based Mining The roadmap of performing optimal mining is the following. We explore its feasibility by estimating the unknown parameters. Then, we compute the optimal mining policy 12 0.25 0.20 # of OPT # of BSM # of POMDP 0.9 # of POMDP 0.8 # of POMDP 0.5 0.34 0.36 0.38 0.40 0.42 Bob's Hashrate Fig. 13. Profitable threshold for Alice when N = 2. 0.50 0.45 0.40 0.35 0.30 0.25 0.20 # relative revenue # absolute revenue Revenue Alice's threshold 0.35 0.30 0.25 0.20 0.15 0.10 0.05 0.15 0.10 # of OPT # of BSM # of POMDP 0.9 # of POMDP 0.8 # of POMDP 0.5 0.05 0.30 0.32 0.34 0.36 0 40 Round 80 Fig. 14. Profitable threshold for Alice when N = Fig. 15. Relative revenue and absolute revenue 3. when α1 = α2 = 0.33, N = 4. Fig. 16. The revenue for Alice when α2 = 0.3, Fig. 17. The revenue for Alice when α2 = 0.35, Fig. 18. The revenue for Alice when α2 = 0.4, N = 2. N = 2. N = 2. Fig. 19. The revenue for Alice when α2 = 0.26, Fig. 20. The revenue for Alice when α2 = 0.28, Fig. 21. The revenue for Alice when α2 = 0.3, N = 3. N = 3. N = 3. and the corresponding revenue using MDP, and on this basis we compute the optimal mining policy using POMDP under partially observable states. Recall that Alice is the strategic attacker with POMDPbased policy and Bob is the basic selfish miner (BSM). Alice needs to compute the proportion of her Hash power to the global Hash power based on the public information [29]. Then she needs to decide whether there exists a selfish miner namely Bob, and if so, what Bob’s Hash power is. Note that there has been a one-to-one mapping between Henry’s orphan block ratio and Bob’s Hash power. Fig. 12 shows the orphan ratio of the honest miner as a function of Bob’s Hash power with N = 2 and N = 3. The theoretical and experimental results match well, which indicates the feasibility of calculating Bob’s Hash power through the observed orphan block ratio. After that, we compute the revenue upper bound of POMDP-based policy using MDP. Analogous experiments of MDP-based mining can be found in Appendix D. Then, we will focus on the performances of POMDP-based mining policy. Let the error parameter = 0.00001 and the execution number ξ0 = 1000000. Observation 4. When Alice uses the POMDP-based policy, and Bob uses the basic selfish mining, Alice has a much lower profitable threshold compared with the basic selfish mining. Her revenue is no less than the honest mining and the basic selfish mining, and is close to the OPT policy with complete state information. We evaluate Alice’s profitable threshold and revenue when she uses the POMDP-based mining policy. Fig. 13 and 14 compare the profitable thresholds of BSM, MDP-based (OPT) and POMDP-based (POMDP) mining strategies when N = 2, 3 and α2 increases from 0.285 to 0.42. An interesting finding is that both OPT and POMDP strategies have much smaller profitable thresholds compared with BSM. For instance, the profitable threshold is 0.02 when α2 = 0.34 and N = 2, and is 0.08 when α2 = 0.3 and N = 3. The reason is the following. When Bob is playing BSM and Bob’s Hash power is much larger than Alice’s, Alice will choose to mine honestly. If there is a fork between Bob and Henry, Alice insists on mining on the public chain that contains her own blocks. Alice’s policy is equivalent to decreasing γ2 in the situation of a single attacker. Therefore, Bob will find it difficult to gain more revenues, and Alice as well as Henry benefits from Bob’s losses. In addition, with the increase of Bob’s Hash power, Alice’s profitable threshold becomes higher. A crosscomparison between Fig. 13 and 14 shows that a larger N 13 Fig. 22. The revenue for Alice when α = 0.3, Fig. 23. The revenue for Alice when α = 0.35, Fig. 24. The revenue for Alice when α = 0.4, N = 2 with different γ. N = 2 with different γ. N = 2 with different γ. Fig. 25. The revenue for Alice when α = 0.3, Fig. 26. The revenue for Alice when α = 0.35, Fig. 27. The revenue for Alice when α = 0.4, N = 3 with different γ. N = 3 with different γ. N = 3 with different γ. Fig. 28. The revenue for Alice when α = 0.3, Fig. 29. The revenue for Alice when α = 0.35, Fig. 30. The revenue for Alice when α = 0.4, N = 2 with different θ. N = 2 with different θ. N = 2 with different θ. Fig. 31. The revenue for Alice when α = 0.3, Fig. 32. The revenue for Alice when α = 0.35, Fig. 33. The revenue for Alice when α = 0.4, N = 3 with different θ. N = 3 with different θ. N = 3 with different θ. makes Alice hard to compete with Bob if α1 ≤ α2 . When Bob’s Hash power is 0.36, Alice’s profitable threshold is 0.08 when N = 2 and is about 21.6% when N = 3. The POMDP-based policy can improve Alice’s revenue under different situations. We first direct our attention towards examining the influence of miners’ Hash power on the profitability of POMDP-based policy. Setting γ1 = γ1 = 1/2 and θ1 = θ2 = 1/3, Fig. 16∼18 plot Alice’s revenues using Honest mining, BSM, OPT and POMDP-based mining at N = 2; Fig. 19∼21 show those revenues with N = 3. In each set of experiments, we fix Bob’s Hash power (α2 ) and change Alice’s Hash power (α1 ) from 0.20 to 0.40. As for the POMDP-based policy, three cases are considered, in which the probability of generating a block is 0.9, 0.8 or 0.5 at each time slot. One can easily observe that the POMDP-based policy has comparable revenues with the honest and the OPT mining policies when Alice’s Hash power is relatively small, such as the case of α1 = 0.25 in Fig. 17 and Fig. 18. While the basic selfish mining seems very “stubborn”, causing Alice’s revenue much lower than the honest mining in this situation. The revenue of the POMDP-based policy is significantly higher than that of BSM and honest mining when Alice’s hash power is large, such as the case of α1 = 0.4 in Fig. 16. Subsequently, we explore the potential benefits of various γ and θ. Fig. 22 ∼ Fig. 27 plots Alice’s revenue using honest mining, BSM, OPT and POMDP-based mining at N = 2 and N = 3 with different γ; Fig. 28 ∼ Fig. 33 show those revenues with different θ. In each set of experiments, we fix the two 0.40 0.225 0.35 0.220 0.220 0.225 0.230 0.30 0.25 Absolute revenue Relative revenue 0.25 0.30 0.35 0.40 0.45 Hash power 0.45 0.230 0.40 0.225 0.220 0.220 0.35 0.30 0.25 0.235 0.225 0.230 Absolute revenue Relative revenue 0.25 0.30 0.35 0.40 0.45 Hash power 60 50 40 30 20 10 0 Profitable period for Alice 0.45 0.230 0.235 Revenue after 1000 periods Revenue after 100 periods 14 Theoretical result α1 = α2 Theoretical result for one attacker Simulation result α1 = α2 Simulation result for one attacker 0.24 0.30 0.36 0.42 0.48 Alice's Hashrate Fig. 34. The theoretical relative revenue and ab- Fig. 35. The theoretical relative revenue and ab- Fig. 36. Profitable time and Hash power. solute revenue when α1 = α2 after 100 periods. solute revenue when α1 = α2 after 1000 periods. attackers’ Hash power (α1 = α2 ) to 0.3, 0.35 and 0.4. We then vary γ1 = γ2 = γ from 0.2 to 1 with θ1 = θ2 = 1/3. At the same time, we also explore the situations that θ1 = θ2 = θ is from 0.1 to 0.5 with γ1 = γ2 = 1/2. Evidently, it can be observed that the POMDP-based policy has demonstrated superior performance compared with both BSM and honest mining across all scenarios. Meanwhile, as the parameters γ and θ exhibit an increase, a corresponding increase in Alice’s revenue of POMDP-based policy is discernible. In conclusion, the PODMP-based policy generates equal or higher revenues than honest mining and basic selfish mining, and approaches the performance of the OPT policy. The designed online algorithm can effectively calculate the optimal revenue of MOP . We compare the number of iterations required to execute the binary search algorithm in [5] and Algorithm 1. The reduction ratios are summarized in Table I. Under different Hash power combinations and different action slots, the efficiency of Algorithm 1 is significantly higher. When N = 2, α1 = 0.3, α2 = 0.3 and p = 0.5, we can save 66.6% of the computing time. It takes a long time to simulate half a million times to obtain the revenue for each ρ. The improvement of our search algorithm plays a significant role in solving MP O rapidly. C. Selfish Mining on Multiple Difficulty Adjustment Periods Observation 5. A selfish miner obtains a smaller absolute revenue than that of honest mining during the first difficulty adjustment period regardless of his Hash power. However, he might gain profit after a number of periods that is related to the selfish miners’ Hash power. Fig. 15 shows the relative revenue and absolute revenue of attackers with the same Hashrate 33% and N = 4 in each DAA period. The relative revenue and absolute revenue are equal within the allowable range of error. Therefore, the relative revenue can play the same role as absolute revenue in representing benefit. Fig. 34 and Fig. 35 show the theoretical relative revenue and absolute revenue after 100 and 1000 DAA periods when α1 = α2 and N = 4. It can be observed that the absolute revenue is always less than the relative revenue, but the difference is small. When α1 = 0.22, the relative revenue is 0.2217 and the absolute revenue is 0.2209. This difference decreases to 0.0001 after 1000 periods. That means the difference between relative revenue and absolute revenue decreases with the increase of the attack time. TABLE I T HE TIME REDUCTION RATIO OF A LG . 1 COMPARED TO TRADITIONAL ALGORITHM . p EFF IMP 0.9 53.3% α1 = 0.3, α2 = 0.3 0.8 46.6% N =2 0.5 66.6% 80% α1 = 0.35, α2 = 0.35 0.9 0.8 60% N =2 0.5 46.7% 0.9 60% α1 = 0.28, α2 = 0.28 0.8 46.7% N =3 0.5 46.7% 0.9 73.3% α1 = 0.3, α2 = 0.3 0.8 46.7% N =3 0.5 40% As Eq. (36) shows, when Alice has more Hash powers, she can get illegal revenue earlier. However, if both of the two attackers have a large Hashrate, they will benefit late. Fig. 36 shows the simulation results and the theoretical results of profitable time under different Hash powers with symmetric attackers. The horizontal axis represents the attack’s Hash power and the ordinate represents the attackers’ profitable time, also the curves are theoretical results and the dots are simulation results. The selfish mining is profitable after 51 DAA rounds (i.e. 714 days in Bitcoin) if the Hash power of selfish miners are both 22% (slightly higher than the profitable threshold). This delay decreases to 5 rounds (i.e. 70 days in Bitcoin) as their Hash power accrues to 33%, which is still very long. As the attackers gained more computing power, Alice’s main competitor became Bob rather than Henry. The benefit time begins to increase for Alice and Bob. When there is only one attacker and she has 25.5% Hashrates, the attacker will obtain extra revenue after 26 rounds (about 364 days). It shows that when attackers’ Hashrate is relatively small, it takes a rather long period to gain profit. When the two attackers all have large Hashrates, it also takes a long period to obtain extra revenue. That means in the real system, it is a little bit hard to perform attack. If the global Hashrate increases, we can also use this formula to calculate when to stop the attack before we can benefit the most. Hash power VII. RELATED WORK Selfish mining attack is one of the core challenges of blockchain consensus that has been extensively studied in the 15 past several years. We hereby describe the recent advances in selfish mining policies and their analytical or experimental performance. One attacker. Since the advent of [3], there have been many studies on different forms of selfish mining attacks. Nayak et al. [4] proposed the stubborn mining which improves about 13.94% revenues compared with the basic selfish mining attack. A more intelligent selfish mining strategy has been proposed in [5] based on the Markov Decision Process and it decreases the profitable threshold to 23.21%. Tao et al. [6] described semi-selfish mining attack on the basis of selfish mining based on hidden Markov decision process, which not only ensures the benefit of the attack, but also reduces the forking rate. Negy et al. [30] introduced intermittent selfish mining and showed that the intermittent selfish miner above 37% hash power earns more coins per time unit even when γ = 0. Davidson et al. [31] simulated the profitability of selfish mining under several difficulty adjustment algorithms used in popular cryptocurrencies. Selfish mining attack takes on different properties in the Ethereum system because of the uncle block. Grunspan and Ritz introduced the selfish mining attack in Ethereum and found that Ethereum is more vulnerable to selfish mining than Bitcoin [32] [33]. Zhang et al. studied the selfish mining, double-spending and featherforking in different blockchain systems based on MDP. They obtained that no PoW protocol achieves ideal chain quality or is resistant against all three attacks [34]. At the same time, selfish mining attack can also be combined with other attacks to achieve greater benefits. Gervais et al. devised optimal strategies for double-spending and selfish mining while taking into account real world constraints such as network propagation, different block sizes, block generation intervals, information propagation mechanism, and the impact of eclipse attacks [11]. Kwon et al. [8] proposed FAW attack which combines the selfish mining attack and withholding attack. The reward for an FAW attacker is no less than that for a BWH attacker. Gao et al. extended the work of Kwon et al. proposing the power adjusting withholding attack (PAW) and bribery selfish mining attack (BSW) [10]. They showed PAW could avoid the “miner’s dilemma” in BWH attack and BSW increases revenues by 10% compared with traditional SM [35]. However, BSW will introduce “venal miner’s dilemma”. To avoid the “venal miner’s dilemma”, Yang et al. proposed the IPBSM attack which assumed all attackers take the optimal bribery selfish mining [36]. Multiple attackers. The participation of multiple attackers in the system will greatly change the benefits of selfish mining attacks. Ruan et al. simulated the multiple strategic miners employing strategies other than honest mining and extended the attackers’ strategy by proposing a new strategy set publishn [12]. Their results show a lower profitable threshold in the presence of multiple attackers compared to that with a single attacker. Xia et al. explored the impact of multiple miners and propagation delay on selfish mining [37]. They used simulations to show that the large selfish miner with more mining power can obtain extra revenue while other smaller attackers cannot when there are multiple selfish miners. Charlie et al. proposed SquirRL which is a framework for using deep reinforcement learning to analyze attacks on blockchain incentive mechanisms. The revenue of SquirRL is greater than that of the Markov decision attackers when there are multiple selfish mining attackers [15]. Francisco et al. proposed semi-selfish mining when there are two attackers and modeled this attack [38]. They focused on the simplest selfish mining strategy in which selfish miners never maintain private chains of length greater than 2. Their result shows the Nash equilibrium under different Hash powers and the threshold for each policy based on the game theory. Also, they modeled the situation when the number of attackers is more than two. However, they do not get the closed-form result. Further research is undertaken to explore the correlation between the number of attackers and the security of the system. Sebastian et al. proposed the simulation results that the profitable threshold decreases in proportion to the number of selfish miners [39]. Meanwhile, they found the existence of Nash equilibria where many miners use selfish mining strategy and gain extra revenue simultaneously. Zhang et al. simulated the situation when there were multiple selfish mining attackers in the system [14]. It shows there are scenarios where it is enough to have 12% mining power to benefit from selfish mining but also that having more than seven selfish miners which benefit simultaneously is highly unlikely. Azimy et al. designed a Bitcoin network simulator and used it to simulate different configurations of miners [40]. Their finding shows that in almost all of the configurations, with the presence of a more powerful selfish miner, selfish mining actually decreases the revenue of the weaker selfish miners and also helps the stronger selfish miner. VIII. C ONCLUSION In this paper, we first study how the existence of multiple misbehaving miners influences the profitability of basic selfish mining. By establishing the Markov chain model to describe the action of attackers and honest miners, we can obtain the minimum profitable threshold is symmetric 21.48%, which decreases as the number of symmetric attackers increases. If there are two asymmetric selfish mining attackers in the system, the profitable threshold of one attacker decreases first and then increases with the increase of the other attacker’s Hash power. We validate this in both models and experiments. We next investigate the revenues of the attackers when one executes the basic selfish mining and the other implements the strategic mining. A new mining strategy is designed for the miners with incomplete information based on POMDP. We can obtain revenue by the new strategy no less than honest mining and basic selfish mining. Considering the difficulty adjustment, we model the transient process and acquire the closed-form solution of the profitable time. It can be discovered that the profitable time is large when the attacker’s Hash power is low. Moreover, there is a negative correlation between the profitable time and the attackers’ mining power. R EFERENCES [1] S. Nakamoto. “Bitcoin: A peer-to-peer electronic cash system” , 2008. [2] https://decrypt.co/35373/how-long-does-it-take-to-mine-a-bitcoin. 16 [3] I. Eyal and E. G. Sirer, “Majority is not enough: bitcoin mining is vulnerable,” Commun. ACM, vol. 61, no. 7, pp. 95–102, Jun. 2018. [4] K. Nayak, et al., “Stubborn Mining: Generalizing Selfish Mining and Combining with an Eclipse Attack,” in 2016 IEEE European Symposium on Security and Privacy, Saarbrucken, Mar. 2016, pp. 305–320. [5] A. Sapirshtein, Y. Sompolinsky, A. Zohar. “Optimal selfish mining strategies in bitcoin”. International Conference on Financial Cryptography and Data Security, pp. 515-532, 2016. [6] T. Li, et al., “Semi-selfish mining based on hidden Markov decision process,” Int J Intell Syst, vol. 36, no. 7, pp. 3596–3612, Jul. 2021. [7] C. Grunspan and R. Pérez-Marco, “On profitability of selfish mining”, arXiv, 2019. [8] Y. Kwon, et al., “Be Selfish and Avoid Dilemmas: Fork After Withholding (FAW) Attacks on Bitcoin,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas Texas USA, Oct. 2017, pp. 195–209. [9] X. Dong, F. Wu, A. Faree, D. Guo, Y. Shen, and J. Ma, “Selfholding: A combined attack model using selfish mining with block withholding attack,” Computers & Security, vol. 87, p. 101584, Nov. 2019. [10] S. Gao, et al., “Power Adjusting and Bribery Racing: Novel Mining Attacks in the Bitcoin System,” in Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London United Kingdom, Nov. 2019, pp. 833–850. [11] A. Gervais, et al., “On the Security and Performance of Proof of Work Blockchains,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna Austria, Oct. 2016, pp. 3–16. [12] Q.H. Liu, N. Ruan, et al. “On the Strategy and Behavior of Bitcoin Mining with N-attackers”. Proc. of the Asia Conference on Computer and Communications Security, pp. 357-368, 2018. [13] Q. Bai, X. Zhou, X. Wang, Y. Xu, X. Wang, and Q. Kong, “A Deep Dive Into Blockchain Selfish Mining”, in ICC 2019 - 2019 IEEE International Conference on Communications (ICC), Shanghai, China, 2019, pp. 1–6. [14] S. Zhang, et al., “Analysing the Benefit of Selfish Mining with Multiple Players,” in 2020 IEEE International Conference on Blockchain (Blockchain), Rhodes Island, Greece, Nov. 2020, pp. 36–44. [15] C. Hou et al., “SquirRL: Automating Attack Analysis on Blockchain Incentive Mechanisms with Deep Reinforcement Learning,” presented at the Network and Distributed System Security Symposium, Virtual, 2021. [16] Q. Wang, T. Xia, D. Wang, Y. Ren, G. Miao, and K.-K. R. Choo, “SDoS: Selfish Mining-Based Denial-of-Service Attack,” IEEE Trans.Inform.Forensic Secur., vol. 17, pp. 3335–3349, 2022. [17] D. Fullmer and A. S. Morse, “Analysis of Difficulty Control in Bitcoin and Proof-of-Work Blockchains,” in 2018 IEEE Conference on Decision and Control (CDC), Miami Beach, FL, Dec. 2018, pp. 5988–5992. [18] A. Papoulis, S. U. Pillai. Probability, random variables, and stochastic processes[M]. Tata McGraw-Hill Education, 2002. [19] Meyn, Sean P., and Richard L. Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012. [20] Karkus, Peter, et al., “QMDP-Net: deep learning for planning under partial observability.” Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017. [21] P. Ashok,et al., “Value Iteration for Long-Run Average Reward in Markov Decision Processes,” in Computer Aided Verification, vol. 10426, Springer International Publishing, 2017, pp. 201–221. [22] S. Jiang and J. Wu, “Bitcoin Mining with Transaction Fees: A Game on the Block Size,” in 2019 IEEE International Conference on Blockchain (Blockchain), Atlanta, GA, USA, Jul. 2019, pp. 107–115. [23] Littman et al., ”Learning policies for partially observable environments: scaling up”. In Proceedings of the 12th International Conference on Machine Learning (ICML-95), pp. 362–370. [24] S. Ross, et al., “Online Planning Algorithms for POMDPs”, jair, vol. 32, pp. 663–704, Jul. 2008. [25] Puterman, Martin L. Markov decision processes: discrete stochastic dynamic programming. John Wiley Sons, 2014. [26] https://tradeblock.com/bitcoin/historical. [27] Ross, Stephane, and Brahim Chaib-Draa. “AEMS: An anytime online ´ search algorithm for approximate policy refinement in large POMDPs.” IJCAI. 2007, pp. 2592-2598. [28] https://www.cloudlab.us/ [29] https://www.blockchain.com/explorer [30] K. A. Negy, et al., “Selfish Mining Re-Examined,” in Financial Cryptography and Data Security, vol. 12059, J. Bonneau and N. Heninger, Eds. Cham: Springer International Publishing, 2020, pp. 61–78. [31] Davidson, Michael, and Tyler Diamond. “On the Profitability of Selfish Mining Against Multiple Difficulty Adjustment Algorithms.” IACR Cryptol. ePrint Arch. 2020 (2020): 94. [32] C. Grunspan and R. Perez-Marco, “Selfish Mining in Ethereum,” in Mathematical Research for Blockchain Economy, 2020, pp. 65–90. [33] F. Ritz and A. Zugenmaier, “The Impact of Uncle Rewards on Selfish Mining in Ethereum,” in 2018 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), London, Apr. 2018, pp. 50–57. [34] R. Zhang and B. Preneel, “Lay Down the Common Metrics: Evaluating Proof-of-Work Consensus Protocols’ Security,” in 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, pp. 175–192. [35] I. Eyal, “The Miner’s Dilemma,” in 2015 IEEE Symposium on Security and Privacy, San Jose, CA, May 2015, pp. 89–103. [36] G. Yang, et al., “IPBSM: An optimal bribery selfish mining in the presence of intelligent and pure attackers,” Int J Intell Syst, vol. 35, no. 11, pp. 1735–1748, Nov. 2020. [37] Q. Xia et al., “The Impact Analysis of Multiple Miners and Propagation Delay on Selfish Mining,” in IEEE 45th Annual Computers, Software, and Applications Conference, Madrid, Spain, Jul. 2021, pp. 694–703. [38] F. J. Marmolejo-Cossı́o, et al., “Competing (Semi-)Selfish Miners in Bitcoin,” in Proceedings of the 1st ACM Conference on Advances in Financial Technologies, Zurich Switzerland, Oct. 2019, pp. 89–109. [39] T. Leelavimolsilp, et al., ”On the Preliminary Investigation of Selfish Mining Strategy with Multiple Selfish Miners,” arxiv, 2018. [40] H. Azimy and A. Ghorbani, “Competitive Selfish Mining,” in 2019 17th International Conference on Privacy, Security and Trust (PST), Fredericton, NB, Canada, Aug. 2019, pp. 1–8. Qianlan Bai is a Ph.D. candidate in computer science at Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, 200433, China. Her research interests include economic analysis and security analysis about blockchain. Yuedong Xu is a tenured professor in the School of Information Science and Technology, Fudan University, China. He received PhD from The Chinese University of Hong Kong, M.S. degree from Huazhong University of Science and Technology and B.S. degree from Anhui University. From late 2009 to 2012, he was a postdoc with INRIA Sophia Antipolis and Universite d’Avignon, France. His areas of interest include performance evaluation, optimization, security, data analytics and economic analysis of communication networks and mobile computing. Nianyi Liu is an undergraduate student majored in Intelligence Science and Technology in the School of Information Science and Technology, Fudan University, China. Xin Wang born in 1973, received his B.S. degree in Infor-mation Theory and MS Degree in Communication and Elec-tronic Systems from Xidian University, China in 1994 and 1997, respectively. He received his Ph.D. degree in Computer Science from Shizuoka University, Japan, in 2002. He is cur-rently a professor in Fudan University, Shanghai, China. His research interests include quality of network service, next-generation network architecture, mobile Internet and network coding. He is a Distinguished Member of CCF and a member of IEEE. 17 A PPENDIX A. Tie-breaking with three public chains For the situation that each of Alice and Bob hides one private block, they will publish their private chains instantly after Henry finds a new block. As shown in Fig. 37, there exist three competing public chains. Alice will mine after A1 and Bob will mine after B1 for sure; Henry is not aware of which chain is maliciously forked so that he may mine on each public chain. There are also five possible situations. The risk-avoiding release, together with two tie-breaking solutions, constitutes all the dynamics of private and public chains. together, we define a Markov state as l1 l2 hµ11 hµ22hµ33 that is also applicable to the situation with N > 4. We hereby present a concrete example of state transition. The related states in Fig. 38 are marked in blue, and their transitions are illustrated in Fig. 39 separately. At stage 1 , Alice has mined three blocks stealthily so that the system state 000 000 jumps from 0 to 30000 000 through 10000 and 20000 . At stage 2 , Henry mines a valid block H1 and publishes it to the public chain immediately. The system state then jumps from 30000 000 to 3 30011 . Bob has mined three blocks after H at stage and 1 011 the system state moves to 33011 . So far, neither Alice nor Bob 011 will release their private blocks. At stage 4 , Henry discovers a new block H2 that triggers the release action of Alice. After Alice publishes all her blocks to obsolete Henry’s chain, Bob finds that the public chain is catching up. As a consequence, Bob publishes all his blocks and wins the competition finally, i.e. the system state returning to the starting position. In this round, Bob receives three block rewards and Henry receives one block reward. According to the transitive probability, the revenue for each miner can be represented as: Fig. 37. Tie-breaking case with three public chains. π000 = 1/(1 + α1 + α2 + α1 α3 + α12 + 2α2 α1 + α22 + α13 + α2 α3 + 3α2 α12 + 2α1 α2 α3 + 3α1 α22 + α23 + α13 α3 + 4α13 α2 + 6α12 α22 + 4α1 α23 + α23 α3 + 5α13 α2 α3 + 10α13 α22 B. Scaling to N > 2 We next model the revenues of all miners when each private chain can hide more than one block. Especially, as the maximum number of private blocks for one attacker is no less than four (i.e. N =4), the “chain reaction” occurs and the resulting finite state machine becomes very much complicated. A state should include not only the lengths of all chains, but also the interleaving of blocks on them. We confine our study to three miners: Alice, Bob and Henry, and investigate the case N = 4 without loss of generality. The main difficulty hindering the mathematical analysis is that Alice and Bob have different beliefs in the starting position of the current mining round. Besides, the blocks in the winning chain may belong to Henry and Alice/Bob so that we need to memorize them in order to compute their revenues. In contrast, both Alice and Bob always have the common starting position in the racing without chain reaction (i.e. state (0, 0, 0) in Fig.38). The state transitions with N = 4 are expressed in Fig. 38 where each state consists of eight parameters. The current mining round starts at the leftmost node where no miner has discovered a block. The notation h1 indicates the distance between the starting position believed by Alice and the real starting position. Similarly, h2 and h3 indicate these distances of Bob and Henry. We record h1 , h2 and h3 because the blocks between the real and the believed starting positions influence which chain will prevail finally and how the revenues are calculated. The notation l1 (resp. l2 ) represents the number of unreleased blocks at Alice’s (resp. Bob’s) private chain. µ1 denotes the number of Henry’s blocks between the real starting position and the Alice-believed starting position. µ2 and µ3 are defined for Bob and Henry in the same way. Combined them + 6α12 α22 α3 + 10α12 α23 + 5α1 α23 α3 + α13 α22 α3 + 20α13 α23 + α13 α22 α32 + 22α13 α23 α3 + α14 α23 α3 + 20α13 α23 α32 + α12 α23 α3 + α12 α23 α32 + α13 α24 α3 ); (38) Ra =π000 · [4α14 (1 + αh ) + 3α13 αh2 + 16α14 α2 + 4α12 αh + 40α14 α22 (1 + 2α2 ) + α1 α2 αh (1 + γ1 + 2θ1 αh ) + 10α12 α2 αh + 20α13 α2 αh (3α2 + α1 ) + 15α13 α2 αh2 + 4α14 α22 αh (1 + αh ) + 4α14 α23 αh2 (β1 + 20) + 5α15 α23 αh + 4α14 α23 αh (α2 + 21) + 3α13 α24 αh2 β1 + α1 αh2 γ1 + 12α12 α22 αh2 β1 + α12 α22 αh3 β1 (3α1 + 2α2 ) + 6α13 α23 αh2 (10αh β1 + 1)]; (39) Rb =π000 [4α24 (1 + αh ) + 3α23 αh2 + 16α1 α24 + 4α22 αh + 40α12 α24 (1 + 2α1 ) + α1 α2 αh (1 + γ2 + 2θ2 αh ) + 10α1 α22 αh + 20α1 α23 αh (3α1 + α2 ) + 15α1 α23 αh2 + 4α12 α24 αh (1 + αh ) + 4α13 α24 αh2 (β2 + 20) + 5α13 α25 αh + 4α13 α24 αh (α1 + 21) + 3α14 α23 αh2 β2 + α2 αh2 γ2 + 12α12 α22 αh2 β2 + α12 α22 αh3 β2 (2α1 + 3α2 ) + 6α13 α23 αh2 (10αh β2 + 1)]; (40) Rh =π000 [α1 αh2 (2 − γ1 ) + α2 αh2 (2 − γ2 ) + α12 α23 αh3 (2β1+β2 ) + 2α1 α2 αh2 (2 − θ1 − θ2 ) + α12 α22 αh2 (6 + 4α1 α2 ) + α13 α22 αh3 (β1 + 2β2 ) + α1 α2 αh (2 − γ1 − γ2 ) + α13 α23 αh (α1 + α2 ) + α13 α24 αh2 (2β1 + β2 ) + αh + α14 α23 αh2 (β1+2β2 )+20α13 α23 αh3+2α14 α24 αh ]; (41) β1 = γ1 /(γ1 + γ2 ) β2 = γ2 /(γ1 + γ2 ). (42) 18 31011 011 α2 30011 011 αh 00101 001 α1 10000 000 11000 000 000 α2 α2 α1 00111 001 01000 000 02000 000 α1 αh α1 12000 000 α2 00011 001 α2 32000 000 α1 22000 000 13000 000 00221 001 αh 33000 000 33011 011 αh 33001 001 00332 002 α1 23000 000 α1 α1 23101 101 33101 101 αh α1 03101 001 αh α2 32011 011 αh α1 03000 000 00441 011 α1 α2 αh α2 α2 00332 012 αh 31000 000 α2 αh α2 α2 α1 21000 000 α1 α2 α1 αh α2 20000 000 α2 30011 001 α1 αh αh 30000 000 α2 00332 102 00441 101 03101 101 α1 13101 101 Fig. 38. State machine with N = 4. Fig. 39. A path sample with N = 4. 0.42 0.40 0.38 0.36 0.34 0.32 0.30 0.29 0.28 0.27 0.26 0.25 0.24 0.23 2 4 α1 = 0.27 α1 = 0.29 α1 = 0.31 α1 = 0.33 Bob's threshold Bob's threshold α1 = 0.37 α1 = 0.39 α1 = 0.41 α1 = 0.43 6 8 N 10 12 14 0 100 200 N 300 400 Fig. 40. Alice’s Revenue with α1 > max(α2 , αh ). Fig. 41. Simulated threshold for Bob when α1 > Fig. 42. Simulated threshold for Bob when αh > max(α1 , α2 ). max(α2 , αh ). R̂i = Ri , ∀i ∈ {a, b, h}. Ra + Rb + Rh (43) The cases with N > 4 can be analyzed in the same way. We validate in our experiments that the relative revenue tends to converge when N ≥ 4. If N is too large, the repeated chain-reactions will occur, which aggravates the system instability and increases the possibility of detecting selfish mining attacks. C. Profitable threshold under asymmetric attackers Observation 6. If αh < max(α1 , α2 ), the revenue of the attacker with more Hash power among three miners will increase as N increases. If αh > max(α1 , α2 ), the revenue of the attacker with more Hash power will increase first and then remain stable as N increases. The common benefit Hash power region of two attackers increases at first and then decreases with the increase of N . We explore the revenue of the attackers with more Hash power than the honest miner under different N when αh < max(α1 , α2 ). In Fig. 40, we show Alice’s revenue of Alice at four situations: the Hash powers of Alice, Bob and Henry are (0.45, 0.25, 0.3), (0.4, 0.3, 0.3), (0.35, 0.25, 0.4) and (0.3, 0.3, 0.4) that are labeled as situations 1, 2, 3 and 4. Situations 1 and 2 manifest that the revenue of the attacker with more Hash power than others increases as N increases. The attacker can obtain more than 90% of the revenue, achieving the similar effect as the 51% attack, even though she does not have 51% of Hash power. Situation 3 and 4 show that Alice’s revenue converges as N increases if Henry has the largest Hash power. The revenue of Alice converges to 0.524 in situation 3 and 0.357 in situation 4 with N = 200. This implies that limiting the attacker’s Hash power will prevent the attacker from obtaining too many revenues when N is large. We then investigate Bob’s profitable threshold when Alice has a different Hash power and N is large. Fig. 41 shows Bob’s thresholds under different N when Alice has the largest Hash power, i.e. α1 > max{α2 , αh }. Bob’s profitable threshold decreases first and then increases as N increases. That means even if Bob can obtain extra revenue when N is small, he can not obtain extra revenue when N is large enough. Under this circumstance, Alice’s revenue can obtain far more than her Hash power proportion, and her attack becomes meaningless because this system can not attract other miners even other selfish mining attackers. Fig. 42 shows Bob’s profitable threshold will decrease first and then converge as N increases when Henry has more Hash power than Alice and Bob. This suggests that limiting the attacker’s Hash power also encourages other miners including other selfish mining attackers to continue mining even if N is large. According 19 Fig. 43. Profitable regions of both selish mining attackers with N = 2, 3, 4, 5, 7, 8, 15, 25, 35, ∞. Fig. 44. Optimal revenue for Alice when N = 2. Fig. 45. Optimal revenue for Alice when N = 3. to these analyses, the attacker will not hide too many blocks even if he has more Hash power than others. in a larger common profitable region. When N is large, the stronger attacker is inclined to dominating the selfish mining. If Alice’s Hash power is larger than Bob’s, Bob will find it difficult to compete with Alice so that Bob’s profitable threshold is getting higher. If Alice’s Hash power is the largest, it is similar to 51% attack. Bob’s selfish mining is profitable only when Alice’s and Bob’s Hash powers are sufficiently close, causing the common profitable region to shrink to a line segment. We now analyze the interplay between Alice’s and Bob’s profitable thresholds. Fig. 43 shows the profitable regions for each attacker under different N . The blue part indicates that neither attacker can gain additional revenue if they perform the BSM attack, and the red part indicates that both attackers can gain additional revenue through selfish mining. The green (resp. orange) part represents the situation that only Alice (resp. Bob) can obtain extra revenue. The intersection of four regions is actually the profitable threshold with symmetric Hash powers of Alice and Bob that decreases over N and converges gradually. The common profitable region (in red) first expands and then shrinks as N increases. The reason is the following. A large red region basically says that both Alice and Bob are profitable with BSM even if their Hash powers are asymmetric to some extent. When N is very small, Alice and Bob can hide only a couple of blocks so that their ability of wasting Henry’s Hash power is restrained. With the increases of N , their selfish mining ability becomes more powerful, and thus could have more chances of obsoleting the public chain even if each of them has a smaller Hash power than Henry. Meanwhile, given the restriction of N , both Alice and Bob may receive a certain amount of extra revenues, resulting D. MDP-based mining experiment We will compute Alice’s optimal policy and the corresponding revenue of MDP-based (OPT) mining. Let the error parameter = 0.00001 and the execution number ξ0 = 500000. Fig. 44 illustrates the optimal revenue obtained by Alice when Alice’s Hash power is α1 ∈ {0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45}, Bob’s Hash power is α2 ∈ {0.30, 0.35, 0.40} and N = 2. The maximum length of the longest public chain is set as (N + 1). One can observe that Alice’s optimal mining and honest mining yield the same revenue when her mining power is relatively small, e.g. α1 = 0.1, α2 = 0.40 and αh = 0.50. Under these situations, the optimal mining policy is exactly the honest mining, while BSM underperforms the honest mining significantly. On the contrary, when Alice’s Hash power is 20 large, e.g., α1 = 0.40, α2 = 0.40 and αh = 0.20, the optimal mining policy results in a higher revenue than the basic selfish mining, and the basic selfish mining is better than the honest mining. Fig. 45 shows a set of similar experiments except for N = 3. When α1 = 0.30, α2 = 0.30 and αh = 0.4, Alice’s optimal mining policy obviously outperforms both BSM and the honest mining. We describe a concrete example of optimal mining policy at a few representative states for simplicity. Table II shows Alice’s optimal strategy with α1 =0.2, α2 =0.4 and αh =0.4. Alice has less Hash power than Bob and Henry. She chooses to adopt the public chain if the length of her chain is shorter than that of Bob or Henry. If the length of the chain (i.e. l1 + h1 ) she is mining on is equal to that of the public chain (i.e. h3 ), and is able to “fork”, she will release l1 private blocks to seize the chance of winning. If both Alice and Bob hold a private block, she will choose to hide the private block and continue to mine after her own block. By performing the optimal policy, Alice can earn an additional 0.08% revenue, even though Bob has 40% Hash power. TABLE II O PTIMAL POLICY FOR (α1 = 0.2, α2 = 0.4, N = 2). state (l2 + h2 ) > (l1 + h1 ) h3 > (l1 + h1 ) (l1 + h1 ) = (l2 + h2 ) = h3, f ork = r l1 = l2 = 1, h3 = 0 l1 = 1, l2 = 0 (l1 + h1 ) > (l2 + h2 ) action adopt adopt l1 0 l1 l1 21 TABLE III A DESCRIPTION OF THE TRANSITION AND REWARD MATRICES P AND R IN THE DECISION PROBLEM M . loc! = 2 adopt loc = 2 r, action 6= (h3 − h1 ) h1 + action ≤ h3 ir r, action > 0 h1 + action = h3 l2 + h2 − h3 > 2 f13 , action = 0 h3 < h1 + action < l2 + h2 − 1 h1 + action = l2 + h2 − 1 h1 + action = l2 + h2 h1 + action > l2 + h2 loc! = 2 adopt loc = 2 h1 + action ≤ h3 r, action 6= (h3 − h1 ) ir h1 + action = h3 r, action > 0 f13 , action = 0 l2 + h2 − h3 = 2 h1 + action = l2 + h2 − 1 h1 + action = l2 + h2 h1 + action > l2 + h2 adopt h1 + action ≤ h3 l2 + h2 − h3 = 1 h1 + action = l2 + h2 h1 + action > l2 + h2 loc = 2, loc = 3 α1 α2 αh α1 α2 αh α1 α2 αh α1 α2 αh γ1 αh (1 − γ1 ) α1 α2 αh α1 α2 αh α1 α2 αh β2 αh β1 α1 α2 αh α1 α2 αh α1 α2 αh α1 α2 αh α1 α2 αh α1 α2 αh α1 α2 αh β2 αh β1 α1 α2 αh α1 α2 αh α1 α2 αh α1 α2 αh β2 αh β1 α1 α2 αh (1, ir, 1, l2 , h3 , h2 , h3 , µ3 , µ2 , µ3 ) (1, ir, 0, l2 + 1, h3 , h2 , h3 , µ3 , µ2 , µ3 ) (1, r, 0, l2 , h3 , h2 , h3 + 1, µ3 , µ2 , µ3 + 1) (1, ir, 1, l2 , h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (1, ir, 0, l2 + 1, h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (1, r, 0, l2 , h3 − h2 , 0, h3 − h2 + 1, h3 − h2 , 0, h3 − h2 + 1) (loc, ir, l1 − action + 1, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 − action, l2 + 1, h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, r, l1 − action, l2 , h1 + action, h2 , h3 + 1, µ1 , µ2 , µ3 + 1) (loc, f13 , l1 − action + 1, l2 , h3 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, f13 , l1 − action, l2 + 1, h3 , h2 , h3 , µ1 , µ2 , µ3 ) (1, r, l1 − action, l2 , h3 , h2 , h3 + 1, µ1 , µ2 , µ1 + 1) (loc, r, l1 − action, l2 , h3 , h2 , h3 + 1, µ1 , µ2 , µ3 + 1) (1, ir, l1 − action + 1, l2 , h1 + action, h2 , h1 + action, µ1 , µ2 , µ1 ) (1, ir, l1 − action, l2 + 1, h1 + action, h2 , h1 + action, µ1 , µ2 , µ1 ) (1, r, l1 − action, l2 , h1 + action, h2 , h1 + action + 1, µ1 , µ2 , µ1 + 1) (2, r, l1 + 1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (2, r, l1 − action, 1, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 + 1, µ2 + 1) (loc, f12 , l1 − action + 1, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ3 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 , µ2 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 + 1, µ2 + 1) (2, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (3, ir, l1 − action + 1, 0, 0, 0, 0, 0, 0, 0) (3, ir, l1 − action, 1, 0, 0, 0, 0, 0, 0) (3, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (1, ir, 1, l2 , h3 , h2 , h3 , µ3 , µ2 , µ3 ) (1, ir, 0, l2 + 1, h3 , h2 , h3 , µ3 , µ2 , µ3 ) (2, r, 0, 0, h3 , h2 + l2 , h2 + l2 , µ3 , µ2 , µ2 ) (1, ir, 1, l2 , h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (1, ir, 0, l2 + 1, h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (2, r, 0, 0, h3 − h2 , l2 , l2 , h3 − h2 , 0, 0) (loc, ir, l1 − action + 1, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 − action, l2 + 1, h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (loc, f13 , l1 − action + 1, l2 , h3 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, f13 , l1 − action, l2 + 1, h3 , h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (2, r, l1 + 1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (2, r, l1 − action, 1, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 + 1, µ2 + 1) (loc, f12 , l1 − action + 1, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ3 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 , µ2 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 + 1, µ2 + 1) (2, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (3, ir, l1 − action + 1, 0, 0, 0, 0, 0, 0, 0) (3, ir, l1 − action, 1, 0, 0, 0, 0, 0, 0) (3, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (1, ir, 1, l2 , h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (1, ir, 0, l2 + 1, h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (loc, f23 , 0, 0, h3 − h2 , h3 − h2 + 1, h3 − h2 + 1, 0, 0, 1) (loc, ir, l1 − action + 1, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 − action, l2 + 1, h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, f23 , l1 − action, 0, h1 + action, h3 + 1, h3 + 1, µ1 , µ2 , µ3 + 1) (loc, f12 , l1 − action + 1, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ3 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 , µ2 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 + 1, µ2 + 1) (2, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (3, ir, l1 − action + 1, 0, 0, 0, 0, 0, 0, 0) (3, ir, l1 − action, 1, 0, 0, 0, 0, 0, 0) (3, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (0, 0, 0) (0, h2 − µ2 , µ2 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) (h1 + action − µ1 , 0, µ1 ) (0, 0, 0) (0, h2 − µ2 , µ2 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) (h1 + action − µ1 , 0, µ1 ) (0, h2 − µ2 , µ2 ) (0, 0, 0) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) (h1 + action − µ1 , 0, µ1 ) 22 f23 , f123 , loc = 2 f23 , f123 , loc! = 2 adopt (r, ir, loc = 2), f12 (r, ir, loc! = 2), f13 f12 f13 l2 + h2 = h3 action = 0 f23 f123 r, ir f23 action = h3 − h1 > 0 r, h2 = µ2 r, h2 ! = µ2 h1 + action > l2 + h2 α1 γ2 α1 (1 − γ2 ) α2 αh γ2 αh (1 − γ2 ) α1 γ2 α1 (1 − γ2 ) α2 αh γ2 αh (1 − γ2 ) α1 α2 αh α1 α2 αh α1 α2 αh β1 αh β2 α1 α2 γ1 α2 (1 − γ1 ) αh γ1 αh (1 − γ1 ) α1 α2 αh γ2 αh (1 − γ2 ) α1 α2 αh θ1 αh θ2 αh (1 − θ1 − θ2 ) α1 α2 αh α1 α2 αh θ1 αh θ2 αh (1 − θ1 − θ2 ) α1 α2 γ1 α2 (1 − γ1 ) αh γ1 αh (1 − γ1 ) α1 α2 αh β1 αh β2 α1 α2 αh (1, ir, 0, 0, 0, 0, 0, 0, 0, 0) (3, ir, 0, 0, 0, 0, 0, 0, 0, 0) (2, ir, 0, 0, 0, 0, 0, 0, 0, 0) (2, r, 0, 0, 0, 0, 0, 0, 0, 0) (3, r, 0, 0, 0, 0, 0, 0, 0, 0) (1, ir, 0, 0, 0, 0, 0, 0, 0, 0) (3, ir, 0, 0, 0, 0, 0, 0, 0, 0) (2, ir, 0, 0, 0, 0, 0, 0, 0, 0) (2, r, 0, 0, 0, 0, 0, 0, 0, 0) (3, r, 0, 0, 0, 0, 0, 0, 0, 0) (3, ir, 1, 0, 0, 0, 0, 0, 0, 0) (3, ir, 0, 1, 0, 0, 0, 0, 0, 0) (3, r, 0, 0, 0, 1, 1, 0, 1, 1) (3, ir, 1, 0, 0, 0, 0, 0, 0, 0) (3, ir, 0, 1, 0, 0, 0, 0, 0, 0) (3, r, 0, 0, 0, 1, 1, 0, 1, 1) (loc, f ork, l1 + 1, l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 , l2 , h1 , h2 + 1, h2 + 1, µ1 , µ2 , µ2 ) (2, r, l1 , 0, 0, 1, 1, 0, 1, 1) (2, r, l1 , l2 , h1 , h2 + 1, h2 + 1, µ1 , µ2 + 1, µ2 + 1) (1oc, f ork, l1 + 1, l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 , l2 , 0, 1, 1, 0, 0, 0) (2, r, l1 , l2 , h1 , h3 + 1, h3 + 1, µ1 , µ3 , µ3 ) (2, r, l1 , 0, 0, 1, 1, 0, 1, 1) (2, r, l1 , l2 , h1 , h3 + 1, h3 + 1, µ1 , µ3 + 1, µ3 + 1) (loc, f ork, l1 + 1, l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 , 0, h1 , h2 + 1, h2 + 1, µ1 , µ2 , µ2 ) (2, r, l1 , 0, h1 , h2 + 1, h2 + 1, µ1 , µ2 + 1, µ2 + 1) (2, r, l1 , 0, h1 , h2 + 1, h2 + 1, µ1 , µ3 + 1, µ3 + 1) (loc, f ork, l1 + 1, l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 , 0, h1 , h2 + 1, h2 + 1, µ1 , µ2 , µ2 ) (2, r, l1 , 0, 0, 1, 1, 0, 1, 1) (2, r, l1 , 0, h1 , h3 + 1, h3 + 1, µ1 , µ2 + 1, µ2 + 1) (2, r, l1 , 0, h1 , h3 + 1, h3 + 1, µ1 , µ3 + 1, µ3 + 1) (loc, ir, l1 + 1, l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 , l2 + 1, h1 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 , l2 , h1 , h2 + 1, h3 + 1, µ1 , µ2 + 1, µ3 + 1) (loc, f123 , l1 − action + 1, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 − action, 0, h1 + action, h2 + 1, h3 + 1, µ1 , µ2 , µ2 ) (2, r, l1 − action, l2 , 0, 1, 1, 0, 1, 1) (2, r, l1 − action, l2 , h1 + action, h2 + 1, h3 + 1, µ1 , µ2 + 1, µ2 + 1) (2, r, l1 − action, l2 , h1 + action, h2 + 1, h3 + 1, µ1 , µ3 + 1, µ3 + 1) (loc, f13 , l1 − action + 1, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 − action, l2 , 0, 1, 1, 0, 0, 0) (2, r, l1 − action, l2 , h1 + action, h2 + 1, h3 + 1, µ1 , µ3 , µ3 ) (2, r, l1 − action, l2 , 0, 1, 1, 0, 1, 1) (2, r, l1 − action, l2 , h1 + action, h2 + 1, h3 + 1, µ1 , µ3 + 1, µ3 + 1) (loc, f12 , l1 − action + 1, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 − action, l2 , h1 + action, h2 + 1, h3 + 1, µ1 , µ2 , µ2 ) (2, r, l1 − action, l2 , 0, 1, 1, 0, 1, 1) (2, r, l1 − action, l2 , h1 + action, h2 + 1, h3 + 1, µ1 , µ3 + 1, µ3 + 1) (3, ir, l1 − action + 1, 0, 0, 0, 0, 0, 0, 0) (3, ir, l1 − action, 1, 0, 0, 0, 0, 0, 0) (3, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (1, h2 − µ2 , µ2 ) (1, h3 − µ3 , µ3 ) (0, h2 + 1 − µ2 , µ2 ) (0, h2 − µ2 , µ2 + 1) (0, h3 − µ3 , µ3 + 1) (1, h2 − µ2 , µ2 ) (1 + h3 − µ3 , 0, µ3 ) (0, h2 + 1 − µ2 , µ2 ) (0, h2 − µ2 , µ2 + 1) (h3 − µ3 , 0, µ3 + 1) (0, h2 − µ2 , µ2 ) (h3 − µ3 , 0, µ3 ) (0, 0, 0) (0, 0, 0) (h1 − µ1 , 0, µ1 ) (0, 0, 0) (0, 0, 0) (h1 − µ1 , 0, µ1 ) (0, 0, 0) (h1 − µ1 , 0, µ1 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 − µ1 , 0, µ1 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 − µ1 , 0, µ1 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) 23 TABLE IV A DESCRIPTION OF THE TRANSITION AND REWARD MATRICES P AND R IN THE DECISION PROBLEM MP O . loc! = 2 adopt loc = 2 h1 + action ≤ h3 r, action 6= (h3 − h1 ) ir h1 + action = h3 l2 + h2 − h3 > 2 r, action > 0 f13 , action = 0 h3 < h1 + action < l2 + h2 − 1 h1 + action = l2 + h2 − 1 h1 + action = l2 + h2 h1 + action > l2 + h2 loc! = 2 adopt loc = 2 r, action 6= (h3 − h1 ) h1 + action ≤ h3 ir r, action > 0 h1 + action = h3 l2 + h2 − h3 = 2 f13 , action = 0 h1 + action = l2 + h2 − 1 h1 + action = l2 + h2 h1 + action > l2 + h2 adopt h1 + action ≤ h3 l2 + h2 − h3 = 1 h1 + action = l2 + h2 h1 + action > l2 + h2 loc = 2, loc = 3 (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh p 1−p α1 p α2 p αh p (1 − p) α1 p α2 p αh pγ1 αh p(1 − γ1 ) (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh p 1−p α1 p α2 p αh pβ2 αh pβ1 (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh pβ2 αh pβ1 (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh pβ2 αh pβ1 (1 − p) α1 p α2 p αh p (loc, ir, 0, l2 , h3 , h2 , h3 , µ3 , µ2 , µ3 ) (1, ir, 1, l2 , h3 , h2 , h3 , µ3 , µ2 , µ3 ) (1, ir, 0, l2 + 1, h3 , h2 , h3 , µ3 , µ2 , µ3 ) (1, r, 0, l2 , h3 , h2 , h3 + 1, µ3 , µ2 , µ3 + 1) (3, ir, 0, l2 , h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (1, ir, 1, l2 , h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (1, ir, 0, l2 + 1, h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (1, r, 0, l2 , h3 − h2 , 0, h3 − h2 + 1, h3 − h2 , 0, h3 − h2 + 1) (loc, ir, l1 − action, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 − action + 1, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 − action, l2 + 1, h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, r, l1 − action, l2 , h1 + action, h2 , h3 + 1, µ1 , µ2 , µ3 + 1) (loc, f13 , l1 − action, l2 , h3 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, f13 , l1 − action + 1, l2 , h3 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, f13 , l1 − action, l2 + 1, h3 , h2 , h3 , µ1 , µ2 , µ3 ) (1, r, l1 − action, l2 , h3 , h2 , h3 + 1, µ1 , µ2 , µ1 + 1) (loc, r, l1 − action, l2 , h3 , h2 , h3 + 1, µ1 , µ2 , µ3 + 1) (1, ir, l1 − action, l2 , h1 + action, h2 , h1 + action, µ1 , µ2 , µ1 ) (1, ir, l1 − action + 1, l2 , h1 + action, h2 , h1 + action, µ1 , µ2 , µ1 ) (1, ir, l1 − action, l2 + 1, h1 + action, h2 , h1 + action, µ1 , µ2 , µ1 ) (1, r, l1 − action, l2 , h1 + action, h2 , h1 + action + 1, µ1 , µ2 , µ1 + 1) (2, ir, l1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (2, r, l1 + 1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (2, r, l1 − action, 1, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 + 1, µ2 + 1) (loc, f12 , l1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ3 ) (loc, f12 , l1 − action + 1, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ3 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 , µ2 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 + 1, µ2 + 1) (2, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (3, ir, l1 − action, 0, 0, 0, 0, 0, 0, 0) (3, ir, l1 − action + 1, 0, 0, 0, 0, 0, 0, 0) (3, ir, l1 − action, 1, 0, 0, 0, 0, 0, 0) (3, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (loc, ir, 0, l2 , h3 , h2 , h3 , µ3 , µ2 , µ3 ) (1, ir, 1, l2 , h3 , h2 , h3 , µ3 , µ2 , µ3 ) (1, ir, 0, l2 + 1, h3 , h2 , h3 , µ3 , µ2 , µ3 ) (2, r, 0, 0, h3 , h2 + l2 , h2 + l2 , µ3 , µ2 , µ2 ) (3, ir, 0, l2 , h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (1, ir, 1, l2 , h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (1, ir, 0, l2 + 1, h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (2, r, 0, 0, h3 − h2 , l2 , l2 , h3 − h2 , 0, 0) (loc, ir, l1 − action, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 − action + 1, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 − action, l2 + 1, h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (loc, f13 , l1 − action, l2 , h3 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, f13 , l1 − action + 1, l2 , h3 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, f13 , l1 − action, l2 + 1, h3 , h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (2, ir, l1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (2, r, l1 + 1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (2, r, l1 − action, 1, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ2 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 + 1, µ2 + 1) (loc, f12 , l1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ3 ) (loc, f12 , l1 − action + 1, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ3 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 , µ2 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 + 1, µ2 + 1) (2, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (3, ir, l1 − action, 0, 0, 0, 0, 0, 0, 0) (3, ir, l1 − action + 1, 0, 0, 0, 0, 0, 0, 0) (3, ir, l1 − action, 1, 0, 0, 0, 0, 0, 0) (3, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (1, ir, 0, l2 , h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (1, ir, 1, l2 , h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (1, ir, 0, l2 + 1, h3 − h2 , 0, h3 − h2 , h3 − h2 , 0, h3 − h2 ) (loc, f23 , 0, 0, h3 − h2 , h3 − h2 + 1, h3 − h2 + 1, 0, 0, 1) (loc, ir, l1 − action, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 − action + 1, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 − action, l2 + 1, h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, f23 , l1 − action, 0, h1 + action, h3 + 1, h3 + 1, µ1 , µ2 , µ3 + 1) (loc, f12 , l1 − action, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ3 ) (loc, f12 , l1 − action + 1, 0, h1 + action, h2 + l2 , h2 + l2 , µ1 , µ2 , µ3 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 , µ2 ) (2, r, l1 − action, 0, h1 + action, h2 + l2 + 1, h2 + l2 + 1, µ1 , µ2 + 1, µ2 + 1) (2, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (3, ir, l1 − action, 0, 0, 0, 0, 0, 0, 0) (3, ir, l1 − action + 1, 0, 0, 0, 0, 0, 0, 0) (3, ir, l1 − action, 1, 0, 0, 0, 0, 0, 0) (3, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (0, 0, 0) (0, h2 − µ2 , µ2 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) (h1 + action − µ1 , 0, µ1 ) (0, 0, 0) (0, h2 − µ2 , µ2 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) (h1 + action − µ1 , 0, µ1 ) (0, h2 − µ2 , µ2 ) (0, 0, 0) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) (h1 + action − µ1 , 0, µ1 ) 24 f23 , f123 , loc = 2 adopt f23 , f123 , loc! = 2 (r, ir, loc = 2), f12 (r, ir, loc! = 2), f13 f12 f13 l2 + h2 = h3 action = 0 f23 f123 r, ir f23 action = h3 − h1 > 0 r, h2 = µ2 r, h2 ! = µ2 h1 + action > l2 + h2 (1 − p) α1 pγ2 α1 p(1 − γ2 ) α2 p αh pγ2 αh p(1 − γ2 ) (1 − p) α1 pγ2 α1 p(1 − γ2 ) α2 p αh pγ2 αh p(1 − γ2 ) (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh pβ1 αh pβ2 (1 − p) α1 p α2 pγ1 α2 p(1 − γ1 ) αh pγ1 αh p(1 − γ1 ) (1 − p) α1 p α2 p αh pγ2 αh p(1 − γ2 ) (1 − p) α1 p α2 p αh pθ1 αh pθ2 αh p(1 − θ1 − θ2 ) (1 − p) α1 p α2 p αh p (1 − p) α1 p α2 p αh pθ1 αh pθ2 αh p(1 − θ1 − θ2 ) (1 − p) α1 p α2 pγ1 α2 p(1 − γ1 ) αh pγ1 αh p(1 − γ1 ) (1 − p) α1 p α2 p αh pβ1 αh pβ2 (1 − p) α1 p α2 p αh p (loc, f23 , 0, l2 , 0, µ3 − µ2 , µ3 − µ2 , 0, 0, µ3 − µ2 ) (1, ir, 0, 0, 0, 0, 0, 0, 0, 0) (3, ir, 0, 0, 0, 0, 0, 0, 0, 0) (2, ir, 0, 0, 0, 0, 0, 0, 0, 0) (2, r, 0, 0, 0, 0, 0, 0, 0, 0) (3, r, 0, 0, 0, 0, 0, 0, 0, 0) (loc, f23 , 0, l2 , h3 , h2 , h3 , µ3 , µ2 , µ3 ) (1, ir, 0, 0, 0, 0, 0, 0, 0, 0) (3, ir, 0, 0, 0, 0, 0, 0, 0, 0) (2, ir, 0, 0, 0, 0, 0, 0, 0, 0) (2, r, 0, 0, 0, 0, 0, 0, 0, 0) (3, r, 0, 0, 0, 0, 0, 0, 0, 0) (3, ir, 0, 0, 0, 0, 0, 0, 0, 0) (3, ir, 1, 0, 0, 0, 0, 0, 0, 0) (3, ir, 0, 1, 0, 0, 0, 0, 0, 0) (3, r, 0, 0, 0, 1, 1, 0, 1, 1) (3, ir, 0, 0, 0, 0, 0, 0, 0, 0) (3, ir, 1, 0, 0, 0, 0, 0, 0, 0) (3, ir, 0, 1, 0, 0, 0, 0, 0, 0) (3, r, 0, 0, 0, 1, 1, 0, 1, 1) (loc, f ork, l1 , l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, f ork, l1 + 1, l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 , l2 , h1 , h2 + 1, h2 + 1, µ1 , µ2 , µ2 ) (2, r, l1 , l2 , 0, 1, 1, 0, 1, 1) (2, r, l1 , l2 , h1 , h2 + 1, h3 + 1, µ1 , µ2 + 1, µ2 + 1) (1oc, f ork, l1 , l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (1oc, f ork, l1 + 1, l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 , l2 , 0, 1, 1, 0, 0, 0) (2, r, l1 , l2 , h1 , h3 + 1, h3 + 1, µ1 , µ3 , µ3 ) (2, r, l1 , 0, 0, 1, 1, 0, 1, 1) (2, r, l1 , l2 , h1 , h3 + 1, h3 + 1, µ1 , µ3 + 1, µ3 + 1) (loc, f ork, l1 , l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, f ork, l1 + 1, l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 , 0, h1 , h2 + 1, h2 + 1, µ1 , µ2 , µ2 ) (2, r, l1 , 0, h1 , h2 + 1, h2 + 1, µ1 , µ2 + 1, µ2 + 1) (2, r, l1 , 0, h1 , h2 + 1, h2 + 1, µ1 , µ3 + 1, µ3 + 1) (loc, f ork, l1 , l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, f ork, l1 + 1, l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 , 0, h1 , h2 + 1, h2 + 1, µ1 , µ2 , µ2 ) (2, r, l1 , 0, 0, 1, 1, 0, 1, 1) (2, r, l1 , 0, h1 , h3 + 1, h3 + 1, µ1 , µ2 + 1, µ2 + 1) (2, r, l1 , 0, h1 , h3 + 1, h3 + 1, µ1 , µ3 + 1, µ3 + 1) (loc, ir, l1 , l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 + 1, l2 , h1 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 , l2 + 1, h1 , h2 , h3 , µ1 , µ2 , µ3 ) (loc, ir, l1 , l2 , h1 , h2 + 1, h3 + 1, µ1 , µ2 + 1, µ3 + 1) (loc, f123 , l1 − action, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, f123 , l1 − action + 1, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 − action, 0, h1 + action, h2 + 1, h3 + 1, µ1 , µ2 , µ2 ) (2, r, l1 − action, l2 , 0, 1, 1, 0, 1, 1) (2, r, l1 − action, l2 , h1 + action, h2 + 1, h3 + 1, µ1 , µ2 + 1, µ2 + 1) (2, r, l1 − action, l2 , h1 + action, h2 + 1, h3 + 1, µ1 , µ3 + 1, µ3 + 1) (loc, f13 , l1 − action, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, f13 , l1 − action + 1, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 − action, l2 , 0, 1, 1, 0, 0, 0) (2, r, l1 − action, l2 , h1 + action, h2 + 1, h3 + 1, µ1 , µ3 , µ3 ) (2, r, l1 − action, l2 , 0, 1, 1, 0, 1, 1) (2, r, l1 − action, l2 , h1 + action, h2 + 1, h3 + 1, µ1 , µ3 + 1, µ3 + 1) (loc, f12 , l1 − action, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (loc, f12 , l1 − action + 1, l2 , h1 + action, h2 , h3 , µ1 , µ2 , µ3 ) (2, r, l1 − action, l2 , h1 + action, h2 + 1, h3 + 1, µ1 , µ2 , µ2 ) (2, r, l1 − action, l2 , 0, 1, 1, 0, 1, 1) (2, r, l1 − action, l2 , h1 + action, h2 + 1, h3 + 1, µ1 , µ3 + 1, µ3 + 1) (3, ir, l1 − action, 0, 0, 0, 0, 0, 0, 0) (3, ir, l1 − action + 1, 0, 0, 0, 0, 0, 0, 0) (3, ir, l1 − action, 1, 0, 0, 0, 0, 0, 0) (3, r, l1 − action, 0, 0, 1, 1, 0, 1, 1) (0, h3 − µ3 , µ2 ) (1, h2 − µ2 , µ2 ) (1, h3 − µ3 , µ3 ) (0, h2 + 1 − µ2 , µ2 ) (0, h2 − µ2 , µ2 + 1) (0, h3 − µ3 , µ3 + 1) (0, 0, 0) (1, h2 − µ2 , µ2 ) (1 + h3 − µ3 , 0, µ3 ) (0, h2 + 1 − µ2 , µ2 ) (0, h2 − µ2 , 1 + µ2 ) (h3 − µ3 , 0, µ3 + 1) (0, h2 − µ2 , µ2 ) (h3 − µ3 , 0, µ3 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 − µ1 , 0, µ1 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 − µ1 , 0, µ1 ) (0, 0, 0) (h1 − µ1 , 0, µ1 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 − µ1 , 0, µ1 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 − µ1 , 0, µ1 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (h1 + action − µ1 , 0, µ1 ) (0, 0, 0) (h1 + action − µ1 , 0, µ1 )