IIT 3.0

Author: Steve Glanz Type: Essay

Consciousness Theory Afterlife Engineering Research Biology CIF Theory Bayesian Summary

arXiv:1806.09373v1 [q-bio.NC] 25 Jun 2018

Measuring Integrated Information: Comparison of Candidate Measures in Theory and Simulation

Pedro A.M. Mediano1,*, Anil K. Seth2, and Adam B. Barrett2

1Department of Computing, Imperial College, London, UK
2Sackler Centre for Consciousness Science and Department of Informatics, University of Sussex, Brighton, UK
*Corresponding author: [email protected]

Abstract

Integrated Information Theory (IIT) is a prominent theory of consciousness that has at its centre measures that quantify the extent to which a system generates more information than the sum of its parts. While several candidate measures of integrated information (‘$\Phi$’) now exist, little is known about how they compare, especially in terms of their behaviour on non-trivial network models. In this article we provide clear and intuitive descriptions of six distinct candidate measures. We then explore the properties of each of these measures in simulation on networks consisting of eight interacting nodes, animated with Gaussian linear autoregressive dynamics. We find a striking diversity in the behaviour of these measures – no two measures show consistent agreement across all analyses. Further, only a subset of the measures appear to genuinely reflect some form of dynamical complexity, in the sense of simultaneous segregation and integration between system components. Our results help guide the operationalisation of IIT and advance the development of measures of integrated information that may have more general applicability.

1 Introduction

Since the seminal work of Tononi, Sporns and Edelman [45], and more recently, of Balduzzi and Tononi [5], there have been many valuable contributions in neuroscience towards understanding and quantifying the dynamical complexity of a wide variety of systems. A system is said to be dynamically complex if it shows a balance between two competing tendencies, namely

• integration, i.e. the system behaves as one; and
• segregation, i.e. the parts of the system behave independently.

The notion of dynamical complexity has also been variously described as a balance between order and disorder, or between chaos and synchrony, and has been related to criticality and metastability [31]. Many quantitative measures of dynamical complexity have been proposed, but a theoretically-principled, one-size-fits-all measure remains elusive.

A prominent framework highlighting the extent of simultaneous integration and segregation is Integrated Information Theory (IIT), which studies dynamical complexity from information-theoretic principles. Measures of integrated information attempt to quantify the extent to which the whole system is generating more information than the ‘sum of its parts’. The information to be quantified is typically the information that the current state contains about a past state (for the information integrated over time window $\tau$, the past state to be considered is that at time $\tau$ from the present). The partitioning is done such that one considers the parts with the weakest links between them, in other words, the partition across which integrated information is computed is the ‘minimum information partition.’
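The idea of measuring information "beyond" the weakest partition can be made concrete with a small numerical sketch. The code below is an illustration only, not the authors' implementation. It assumes a stationary Gaussian VAR(1) system $X_t = A X_{t-1} + \varepsilon_t$, takes the information to be time-delayed mutual information, uses a whole-minus-sum difference, and minimises over bipartitions without normalisation. The function names (`stationary_cov`, `tdmi`, `whole_minus_sum_phi`) and the example coupling matrix are hypothetical.

```python
import itertools
import numpy as np


def stationary_cov(A, Q):
    """Stationary covariance of X_t = A X_{t-1} + e_t, with cov(e_t) = Q.

    Solves Sigma = A Sigma A^T + Q by vectorisation; assumes the spectral
    radius of A is below 1 so that a stationary solution exists.
    """
    n = A.shape[0]
    vec = np.linalg.solve(np.eye(n * n) - np.kron(A, A), Q.reshape(-1))
    return vec.reshape(n, n)


def tdmi(Sigma, C, idx):
    """Gaussian time-delayed mutual information I(M_{t-tau}; M_t), in nats.

    Sigma is the stationary covariance, C = cov(X_t, X_{t-tau}), and idx
    lists the variables that belong to the part M.
    """
    idx = list(idx)
    S = Sigma[np.ix_(idx, idx)]
    K = C[np.ix_(idx, idx)]
    cond = S - K @ np.linalg.solve(S, K.T)  # Sigma(M_t | M_{t-tau})
    return 0.5 * (np.linalg.slogdet(S)[1] - np.linalg.slogdet(cond)[1])


def whole_minus_sum_phi(A, Q, tau=1):
    """Return (phi, part, rest) for the bipartition with the smallest
    unnormalised whole-minus-sum effective information (an assumption made
    here for simplicity)."""
    n = A.shape[0]
    Sigma = stationary_cov(A, Q)
    C = np.linalg.matrix_power(A, tau) @ Sigma  # cov(X_t, X_{t-tau})
    whole = tdmi(Sigma, C, range(n))
    best = None
    for k in range(1, n // 2 + 1):
        for part in itertools.combinations(range(n), k):
            rest = tuple(i for i in range(n) if i not in part)
            phi = whole - tdmi(Sigma, C, part) - tdmi(Sigma, C, rest)
            if best is None or phi < best[0]:
                best = (phi, part, rest)
    return best


# Hypothetical example: two tightly coupled pairs joined by one weak link.
A = np.kron(np.eye(2), np.full((2, 2), 0.3))
A[1, 2] = 0.05
phi, part, rest = whole_minus_sum_phi(A, np.eye(4))
print(f"phi = {phi:.4f} across partition {part} | {rest}")
```

The exhaustive scan over bipartitions grows exponentially with system size, so this sketch is only practical for small systems such as the one shown.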

There are many ways one can operationalise this concept of integrated information. Consequently, there now exists a range of distinct integrated information measures. Proponents of IIT claim that measures of integrated information potentially relate to the quantity of consciousness generated by any physical system [34]. This is however controversial, and empirical evidence of a relationship between any particular measure of integrated information and consciousness remains scarce [15]. Here, we do not focus on the connections of IIT to consciousness, although we do comment on the application of IIT to neural data (see Discussion). We instead consider measures of integrated information more generally as useful operationalisations of notions of dynamical complexity.

We have two goals. First, to provide a unified source of explanation of the principles and practicalities of the various candidate measures of integrated information. Second, to examine the behaviour of candidate measures on non-trivial network models, in order to shed light on their comparative practical utility.

In a recent related paper, Tegmark [41] developed a theoretical taxonomy of all integrated information measures that can be written as a distance between a probability distribution pertaining to the whole and that obtained from the product of probability distributions pertaining to the parts. Here we review in detail five distinct and prominent proposed measures of integrated information, including two ($\psi$ and $\Phi_G$) that were not covered in Tegmark’s taxonomy. These are: whole-minus-sum integrated information $\Phi$ [5]; integrated stochastic interaction $\tilde{\Phi}$ [11]; integrated synergy $\psi$ [19]; decoder-based integrated information $\Phi^*$ [35]; geometric integrated information $\Phi_G$ [37]. We also consider, for comparison, the measure causal density (CD) [39], which can be considered as the sum of independent information transfers in the system (without reference to a minimum information partition). This measure has previously been discussed in conjunction with integrated information measures [40, 39].

All of the measures have the potential to behave in ways which are not obvious a priori, and in a manner difficult to express analytically. While some simulations of some of the measures ($\Phi$, $\tilde{\Phi}$ and CD) on networks have been performed [11, 39], other measures ($\Phi^*$ and $\Phi_G$) have not previously been computed on any model consisting of more than two components. This paper provides a comparison of the full suite of measures on non-trivial network models. We consider eight-node networks with a range of different architectures, animated with basic noisy vector autoregressive dynamics. We examine how network topology as well as coupling strength and correlation of noise inputs affect each measure. We also plot the relation between each measure and the global correlation (a simple dynamical control). Based on these comparisons we discuss the extent to which each measure appears genuinely to capture the co-existence of integration and segregation central to the concepts of dynamical complexity and integrated information.

After covering the necessary preliminaries in Section 2, Section 3 sets out the intuition behind the measures, and summarises the mathematics behind the definition of each measure. In Section 4 we present the simulations.
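As a rough illustration of the kind of simulation referred to here, the following sketch generates an eight-node noisy vector autoregressive process and estimates its time-delayed mutual information from the samples. It is a minimal sketch under stated assumptions, not the paper's simulation code: the directed-ring topology, the coupling strength `a`, the noise correlation `c`, and all function names are hypothetical choices made for the example.

```python
import numpy as np


def ring_coupling(n=8, a=0.4):
    """Hypothetical topology: each node is driven by its predecessor on a ring."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i - 1) % n] = a
    return A


def noise_cov(n=8, c=0.2):
    """Noise covariance with unit variances and uniform correlation c (0 <= c < 1)."""
    return (1.0 - c) * np.eye(n) + c * np.ones((n, n))


def simulate_var1(A, Q, T=20000, burn_in=500, seed=0):
    """Simulate X_t = A X_{t-1} + e_t with e_t ~ N(0, Q); returns a (T, n) array."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    L = np.linalg.cholesky(Q)  # maps white noise to noise with covariance Q
    x = np.zeros(n)
    out = np.empty((T, n))
    for t in range(T + burn_in):
        x = A @ x + L @ rng.standard_normal(n)
        if t >= burn_in:  # discard the transient before stationarity
            out[t - burn_in] = x
    return out


def tdmi_from_samples(X, tau=1):
    """Plug-in Gaussian estimate of I(X_{t-tau}; X_t) in nats, computed as
    H(past) + H(present) - H(past, present) from sample covariances."""
    n = X.shape[1]
    joint = np.cov(np.hstack([X[:-tau], X[tau:]]), rowvar=False)
    ld_past = np.linalg.slogdet(joint[:n, :n])[1]
    ld_pres = np.linalg.slogdet(joint[n:, n:])[1]
    ld_joint = np.linalg.slogdet(joint)[1]
    return 0.5 * (ld_past + ld_pres - ld_joint)


X = simulate_var1(ring_coupling(), noise_cov())
print(X.shape, tdmi_from_samples(X))
```

Varying `a`, `c`, or the coupling matrix in this sketch corresponds loosely to the manipulations of coupling strength, noise correlation, and topology mentioned above.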

Then Section 5 is the Discussion. In the Appendix, Section A.1, we derive new formulae for computing the decoder-based integrated information $\Phi^*$ for Gaussian systems, correcting the previous formulae in Ref. [35]. Other Appendices contain further derivations of mathematical properties of the measures.

2 Notation, convention and preliminaries

In this section we review the fundamental concepts needed to define and discuss the candidate measures of integrated information. In general, we will denote random variables with uppercase letters (e.g. $X$, $Y$) and particular instantiations with the corresponding lowercase letters (e.g. $x$, $y$). Variables can be either continuous or discrete, and we assume that continuous variables can take any value in $\mathbb{R}^n$ and that a discrete variable $X$ can take any value in the finite set $\Omega_X$. Whenever there is a sum involving a discrete variable $X$ we assume the sum runs for all possible values of $X$ (i.e. the whole $\Omega_X$). A partition $\mathcal{P} = \{M^1, M^2, \dots, M^r\}$ divides the elements of system $X$ into $r$ non-overlapping, non-empty sub-systems (or parts), such that $X = M^1 \cup M^2 \cup \dots \cup M^r$ and $M^i \cap M^j = \emptyset$, for any $i \neq j$. We denote each variable in $X$ as $X^i$, and the total number of variables in $X$ as $n$. When dealing with time series, time will be indexed with a subscript, e.g. $X_t$.

Entropy $H$ quantifies the uncertainty associated

with random variable $X$ – i.e. the higher $H(X)$ the harder it is to make predictions about $X$ – and is defined as

$$H(X) := -\sum_{x} p(x) \log p(x) . \tag{1}$$

In many scenarios, a discrete set of states is insufficient to represent a process or time series. This is the case, for example, with brain recordings, which come in real-valued time series and with no a priori discretisation scheme. In these cases, using a continuous variable $X \in \mathbb{R}$ we can similarly define the differential entropy,

$$H[p] := -\int p(x) \log p(x) \, dx . \tag{2}$$

However, differential entropy is not as interpretable and well-behaved as its discrete-variable counterpart. For example, differential entropy is not invariant to rescaling or other transformations on $X$. Moreover, it is only defined if $X$ has a density with respect to the Lebesgue measure $dx$; this assumption will be upheld throughout this paper. We can also define the conditional and joint entropies as

$$H(X \mid Y) := \sum_{y} p(y) H(X \mid Y = y) = -\sum_{y} p(y) \sum_{x} p(x \mid y) \log p(x \mid y) , \tag{3}$$

$$H(X, Y) := -\sum_{x,y} p(x, y) \log p(x, y) , \tag{4}$$

respectively. Conditional and joint entropies can be analogously defined for continuous variables by appropriately replacing sums with integrals.

The Kullback-Leibler (KL) divergence quantifies the dissimilarity between two probability distributions $p$ and $q$:

$$D_{\mathrm{KL}}(p \,\|\, q) := \sum_{x} p(x) \log \frac{p(x)}{q(x)} . \tag{5}$$

The KL

divergence represents a notion of (non-symmetric) distance between two probability distributions. It plays an important role in information geometry, which deals with the geometric structure of manifolds of probability distributions.

Finally, mutual information $I$ quantifies the interdependence between two random variables $X$ and $Y$. It is the KL divergence between the full joint distribution and the product of marginals, but it can also be expressed as the average reduction in uncertainty about $X$ when $Y$ is given:

$$I(X; Y) := D_{\mathrm{KL}}\left(p(X, Y) \,\|\, p(X)p(Y)\right) = H(X) + H(Y) - H(X, Y) = H(X) - H(X \mid Y) . \tag{6}$$

Mutual information is symmetric in the two arguments $X$ and $Y$. We make use of the following properties of mutual information:

1. $I(X; Y) = I(Y; X)$,
2. $I(X; Y) \geq 0$, and
3. $I(f(X); g(Y)) = I(X; Y)$ for any injective functions $f$, $g$.

We highlight one implication of property 3: $I$ is upper-bounded by the entropy of both $X$ and $Y$. This means that the entropy $H(X)$ of a random variable $X$ is the maximum amount of information $X$ can have about any other variable $Y$ (or another variable $Y$ can have about $X$).

Mutual information is defined analogously for continuous variables and, unlike differential entropy, it retains its interpretability in the continuous case.$^{1}$ Furthermore, one can track how much information a system preserves during its temporal evolution by computing the

time-delayed mutual information (TDMI) $I(X_t; X_{t-\tau})$.

Next, we introduce notation and several useful identities to handle Gaussian variables. Given an $n$-dimensional real-valued system $X$, we denote its covariance matrix as $\Sigma(X)_{ij} := \mathrm{cov}(X^i, X^j)$. Similarly, cross-covariance matrices are denoted as $\Sigma(X, Y)_{ij} := \mathrm{cov}(X^i, Y^j)$. We will make use of the conditional (or partial) covariance formula,

$$\Sigma(X \mid Y) := \Sigma(X) - \Sigma(X, Y) \Sigma(Y)^{-1} \Sigma(Y, X) . \tag{7}$$

For Gaussian variables,

$$H(X) = \frac{1}{2} \log(\det \Sigma(X)) + \frac{1}{2} n \log(2\pi e) , \tag{8}$$

$$H(X \mid Y = y) = \frac{1}{2} \log(\det \Sigma(X \mid Y)) + \frac{1}{2} n \log(2\pi e) , \quad \forall y , \tag{9}$$

$$I(X; Y) = \frac{1}{2} \log\left(\frac{\det \Sigma(X)}{\det \Sigma(X \mid Y)}\right) . \tag{10}$$

All systems we deal with in this article are stationary and ergodic, so throughout the paper $\Sigma(X_t) = \Sigma(X_{t-\tau})$ for any $\tau$.

$^{1}$ The formal derivation of the differential entropy proceeds by considering the entropy of a discrete variable with $k$ states, and taking the $k \to \infty$ limit. The result is the differential entropy plus a divergent term that is usually dropped and is ultimately responsible for the undesirable properties of differential entropy. In the case of $I(X; Y)$ the divergent terms for the various entropies involved cancel out, restoring the useful properties of its discrete counterpart [16].

3 Integrated information measures

3.1 Overview

In this section we

review the theoretical underpinnings and practical considerations of several proposed measures of integrated information, and in particular how they relate to intuitions about segregation, integration and complexity. These measures are:

• Whole-minus-sum integrated information, $\Phi$;
• Integrated stochastic interaction, $\tilde{\Phi}$;
• Integrated synergy, $\psi$;
• Decoder-based integrated information, $\Phi^*$;
• Geometric integrated information, $\Phi_G$; and
• Causal density, CD.

All of these measures (besides CD) have been inspired by the measure proposed by Balduzzi and Tononi in [5], which we call $\Phi_{2008}$.$^{2}$ $\Phi_{2008}$ was based on the information the current state contains about a hypothetical maximum entropy past state. In practice, this results in measures that are applicable only to discrete Markovian systems [11]. For broader applicability, it is more practical to build measures based on the ongoing spontaneous information dynamics – that is, based on $p(X_t, X_{t-\tau})$ without applying a perturbation to the system. Measures are then well-defined for any stochastic system (with a well-defined Lebesgue measure across the states), and can be estimated for real data using empirical distributions if stationarity can be assumed. All of the measures we consider in this paper are based on a system’s spontaneous information dynamics.

Table 1 contains a brief description of each measure and a reference to the original publication that introduced it. We refer the reader to the original publications for more detailed descriptions of each measure. Table 2 contains a summary of properties of the measures considered, proven for the case in which the

system is ergodic and stationary, and the spontaneous distribution is used.

Table 1: Integrated information measures considered and original references.

| Measure | Description | Reference |
|---|---|---|
| $\Phi$ | Information lost after splitting the system | [5] |
| $\tilde{\Phi}$ | Uncertainty gained after splitting the system | [11] |
| $\psi$ | Synergistic predictive information between parts of the system | [19] |
| $\Phi^*$ | Past state decoding accuracy lost after splitting the system | [35] |
| $\Phi_G$ | Information-geometric distance to system with disconnected parts | [37] |
| CD | Average pairwise directed information flow | [39]$^{3}$ |

Table 2: Overview of properties of integrated information measures. Proofs in Appendix C.

| Property | $\Phi$ | $\tilde{\Phi}$ | $\psi$ | $\Phi^*$ | $\Phi_G$ | CD |
|---|---|---|---|---|---|---|
| Time-symmetric | ✓ | ✓ | × | × | ✓ | × |
| Non-negative | × | ✓ | ✓ | ✓ | ✓ | ✓ |
| Invariant to variable rescaling | ✓ | × | ✓ | ✓ | ✓ | ✓ |
| Upper-bounded by time-delayed mutual information | ✓ | × | ✓ | ✓ | ✓ | ✓ |
| Computable for arbitrary real-valued systems | ✓ | ✓ | × | × | × | ✓ |
| Closed-form expression in discrete and Gaussian systems | ✓ | ✓ | ✓ | × | × | ✓ |

$^{2}$ Causal density is somewhat distinct, but is still a measure of complexity based on information dynamics between the past and current state; therefore its inclusion here will be useful.

$^{3}$ Although the origins of causal density go as far back as [18], it was not until the last decade that it found its way into neuroscience. The paper referenced in the table acts as a modern review of the properties and behaviour of causal density.

3.2 Minimum information partition

Key to all measures of integrated information is the notion of splitting or partitioning the system to

quantify the effect of such split on the system as a whole. In that spirit, integrated information measures are defined through some measure of effective information, which operationalises the concept of “information beyond a partition” $\mathcal{P}$. This typically involves splitting the system according to $\mathcal{P}$ and computing some form of information loss, via (for example) mutual information ($\Phi$), conditional entropy ($\tilde{\Phi}$), or decoding accuracy ($\Phi^*$) (see Table 1). Integrated information is then the effective information with respect to the partition that identifies the “weakest link” in the system, i.e. the partition for which the parts are least integrated. Formally, integrated information is the effective information beyond the minimum information partition (MIP), which, given an effective information measure $f[X; \tau, \mathcal{P}]$, is defined as

$$\mathcal{P}^{\mathrm{MIP}} = \arg\min_{\mathcal{P}} \frac{f[X; \tau, \mathcal{P}]}{K(\mathcal{P})} , \tag{11}$$

where $K(\mathcal{P})$ is a normalisation coefficient. In other words, the MIP is the partition across which the (normalised) effective information is minimum, and integrated information is the (unnormalised) effective information beyond the MIP. The purpose of the normalisation coefficient is to avoid biasing the minimisation towards unbalanced bipartitions (recall that the extent of information sharing between parts is bounded by the entropy of the smaller part). Balduzzi and Tononi [5] suggest the form

$$K(\mathcal{P}) = (r - 1) \min_{k} H(M^k_t) . \tag{12}$$

However, not all contributions to IIT have followed Balduzzi and Tononi’s treatment of the MIP. Of the measures listed above, $\Phi$ and $\tilde{\Phi}$ share this partition scheme, $\psi$ defines the MIP through an unnormalised effective information, and $\Phi^*$, $\Phi_G$ and CD are defined via the atomic partition without any reference to the MIP. These differences are a confounding factor when it comes to comparing measures – it becomes difficult to ascertain whether differences in behaviour of various measures are due to their definitions of effective information, to their normalisation factor (or lack thereof), or to their partition schemes. We return to this discussion in Sec. 5.1.

In the following we present all measures as they were introduced in their original papers (see Table 1), although it is trivial to combine different effective information measures with different partition optimisation schemes. However, all results presented in Sec. 4 are calculated by minimising each unnormalised effective information measure over even-sized bipartitions – i.e. bipartitions in which both parts have the same number of components. This is to avoid conflating the effect of the partition scan method with the effect of the integrated information measure itself.

3.3 Whole-minus-sum integrated information $\Phi$

We next turn to the different measures of integrated information. As highlighted above, a primary difference among them is how they define the effective information beyond a given partition. Since most measures were inspired by Balduzzi and Tononi’s $\Phi_{2008}$, we start there.

For $\Phi_{2008}$, the effective information $\varphi_{2008}$ is given by the KL divergence between $p_c(X_0 \mid X_1 = x)$ and $\prod_k p_c(M^k_0 \mid M^k_1 = m^k)$, where $p_c(X_0 \mid X_1 = x)$ (and analogously $p_c(M^k_0 \mid M^k_1 = m^k)$) is the conditional distribution for $X_0$ given $X_1 = x$ under the perturbation at time 0 into all states with equal probability – i.e. given that the joint distribution is given by $p_{ce}(X_0, X_1) := p(X_1 \mid X_0) \, p_u(X_0)$, where $p_u$ is the uniform (maximum entropy) distribution.$^{4}$

$^{4}$ Here we follow notation from [26]. The c and e here stand respectively for cause and effect. Without an initial condition, here that the uniform distribution holds at time 0, there would be no well-defined probability distribution for these states. Further, Markovian dynamics are required for these probability distributions to be well-defined; for non-Markovian dynamics, a longer chain of initial states would have to be specified, going beyond just that at time 0.

Averaging $\varphi_{2008}$ over all states $x$, the result can be expressed as either

$$I(X_0; X_1) - \sum_{k=1}^{r} I(M^k_0; M^k_1) , \tag{13}$$

or

$$-H(X_0 \mid X_1) + \sum_{k=1}^{r} H(M^k_0 \mid M^k_1) . \tag{14}$$

These two expressions are equivalent under the uniform perturbation, since they differ only by a factor that vanishes if $p(X_0)$ is the uniform distribution. However, they are not equivalent if the spontaneous distribution of the system is used instead – i.e. if $p(X_{t-\tau}, X_t)$ is used instead of $p_{ce}(X_0, X_1)$. This means that for application to spontaneous dynamics (i.e. without perturbation) we have two alternatives that give rise to two measures that are both equally valid analogs of $\Phi_{2008}$.

We call the first alternative whole-minus-sum integrated information $\Phi$ ($\Phi_E$ in [11]). The effective information $\varphi$ is defined as the difference in time-delayed mutual information between the whole system and the parts. The effective information of the system beyond a certain partition $\mathcal{P}$ is

$$\varphi[X; \tau, \mathcal{P}] := I(X_{t-\tau}; X_t) - \sum_{k=1}^{r} I(M^k_{t-\tau}; M^k_t) . \tag{15}$$

We can interpret $I(X_t; X_{t-\tau})$ as how good the system is at predicting its own future or decoding its own past.$^{5}$ Then $\varphi$ here can be seen as the loss in predictive power incurred by splitting the system according to $\mathcal{P}$. The details of the calculation of $\Phi$ (and the MIP) are shown in Box 3.1.

$\Phi$ is often regarded as a poor measure of integrated information because it can be negative [35]. This is indeed conceptually awkward if $\Phi$ is seen as an absolute measure of integration between the parts of a system, though it is a reasonable property if $\Phi$ is interpreted as a “net synergy” measure [9] – quantifying to what extent the parts have shared or complementary information about the future state. That is, if $\Phi > 0$ we infer that the whole is better than the parts at predicting the future (i.e., $\Phi > 0$ is a sufficient condition), but a negative or zero $\Phi$ does not

imply the opposite. Therefore, from an IIT perspective a negative $\Phi$ can lead to the understandably confusing interpretation of a system having “negative integration,” but through a different lens (net synergy) it can be more easily interpreted as (negative) overall redundancy in the evolution of the system. See Section 3.5 and Ref. [9] for further discussion on whole-minus-sum measures.

Box 3.1: Calculating whole-minus-sum integrated information $\Phi$

$$\Phi[X; \tau] = \varphi[X; \tau, \mathcal{B}^{\mathrm{MIB}}] \tag{16a}$$

$$\mathcal{B}^{\mathrm{MIB}} = \arg\min_{\mathcal{B}} \frac{\varphi[X; \tau, \mathcal{B}]}{K(\mathcal{B})} \tag{16b}$$

$$\varphi[X; \tau, \mathcal{B}] = I(X_{t-\tau}; X_t) - \sum_{k=1}^{2} I(M^k_{t-\tau}; M^k_t) \tag{16c}$$

$$K(\mathcal{B}) = \min\left\{H(M^1_t), H(M^2_t)\right\} \tag{16d}$$

1. For discrete variables:

$$I(X_{t-\tau}; X_t) = \sum_{x,x'} p(X_{t-\tau} = x, X_t = x') \log\left(\frac{p(X_{t-\tau} = x, X_t = x')}{p(X_{t-\tau} = x) \, p(X_t = x')}\right)$$

2. For continuous, linear-Gaussian variables:

$$I(X_{t-\tau}; X_t) = \frac{1}{2} \log\left(\frac{\det \Sigma(X_t)}{\det \Sigma(X_t \mid X_{t-\tau})}\right)$$

3. For continuous variables with an arbitrary distribution, we must resort to the nearest-neighbour methods introduced by [25]. See reference for details.

$^{5}$ Future and past are equivalent because mutual information is symmetric.

3.4 Integrated stochastic interaction $\tilde{\Phi}$

We next consider the second alternative for $\Phi_{2008}$ for spontaneous information dynamics: integrated stochastic interaction $\tilde{\Phi}$. Also introduced in Barrett and Seth [11], this measure embodies similar concepts as $\Phi$, with the main difference being that $\tilde{\Phi}$ utilises a definition of effective information in terms of an increase in uncertainty instead of in terms of a loss of information.

$\tilde{\Phi}$ is based on stochastic interaction $\tilde{\varphi}$, introduced by Ay [4]. Akin to Eq. (15), we define stochastic interaction beyond partition $\mathcal{P}$ as

$$\tilde{\varphi}[X; \tau, \mathcal{P}] := \sum_{k=1}^{r} H(M^k_{t-\tau} \mid M^k_t) - H(X_{t-\tau} \mid X_t) . \tag{17}$$

Stochastic interaction quantifies to what extent uncertainty about the past is increased when the system is split in parts, compared to considering the system as a whole. The details of the calculation of $\tilde{\Phi}$ are similar to those of $\Phi$ and are described in Box 3.2.

The most notable advantage of $\tilde{\Phi}$ over $\Phi$ as a measure of integrated information is that $\tilde{\Phi}$ is guaranteed to be non-negative. In fact, as mentioned above $\varphi$ and $\tilde{\varphi}$ are related through the equation

$$\tilde{\varphi}[X; \tau, \mathcal{P}] = \varphi[X; \tau, \mathcal{P}] + I(M^1_t; M^2_t; \dots; M^r_t) , \tag{18}$$

where

$$I(M^1_t; M^2_t; \dots; M^r_t) = \sum_{k=1}^{r} H(M^k_t) - H(X_t) . \tag{19}$$

This measure is also linked to information destruction, as presented in Wiesner et al. [48]. The quantity $H(X_{t-\tau} \mid X_t)$ measures the amount of irreversibly destroyed information, since $H(X_{t-\tau} \mid X_t) > 0$ indicates that more than one possible past trajectory of the system converged on the same present state, making the system irreversible and indicating a loss of information about the past states. From this perspective, $\tilde{\varphi}$ can be understood as the difference between the information that is considered destroyed when the system is observed as a whole, or split into parts. Note however that this measure is time-symmetric when applied to a stationary system; for stationary systems total instantaneous entropy does not increase with time.

Box 3.2: Calculating integrated stochastic interaction $\tilde{\Phi}$

$$\tilde{\Phi}[X; \tau] = \tilde{\varphi}[X; \tau, \mathcal{B}^{\mathrm{MIB}}] \tag{20a}$$

$$\mathcal{B}^{\mathrm{MIB}} = \arg\min_{\mathcal{B}} \frac{\tilde{\varphi}[X; \tau, \mathcal{B}]}{K(\mathcal{B})} \tag{20b}$$

$$\tilde{\varphi}[X; \tau, \mathcal{B}] = \sum_{k=1}^{2} H(M^k_{t-\tau} \mid M^k_t) - H(X_{t-\tau} \mid X_t) \tag{20c}$$

$$K(\mathcal{B}) = \min\left\{H(M^1_t), H(M^2_t)\right\} \tag{20d}$$

1. For discrete variables:

$$H(X_{t-\tau} \mid X_t) = -\sum_{x,x'} p(X_{t-\tau} = x, X_t = x') \log\left(\frac{p(X_{t-\tau} = x, X_t = x')}{p(X_t = x')}\right)$$

2. For continuous, linear-Gaussian variables:

$$H(X_{t-\tau} \mid X_t) = \frac{1}{2} \log \det \Sigma(X_{t-\tau} \mid X_t) + \frac{1}{2} n \log(2\pi e)$$

3. For continuous variables with an arbitrary distribution, we must resort to the nearest-neighbour methods introduced by [25]. See reference for

details.

3.5 Integrated synergy $\psi$

Originally designed as a “more principled” integrated information measure [19], $\psi$ shares some features with $\Phi$ and $\tilde{\Phi}$ but is grounded in a different branch of information theory, namely the Partial Information Decomposition (PID) framework, as described by Williams and Beer [49]. In the PID, the information that two (source) variables provide about a third (target) variable is decomposed into four non-negative terms as

$$I(X, Y; Z) = U_X(X; Z) + U_Y(Y; Z) + R(X, Y; Z) + S(X, Y; Z) ,$$

where $U_\alpha$ is the unique information of source $\alpha$, $R$ is the redundancy between both sources and $S$ is their synergy. Figure 1 illustrates the involved quantities in a Venn diagram.

Figure 1: Venn diagram of the partial information decomposition [49]. (Regions shown: $I(X, Y; Z)$, $S$, $R$, $U_Y$, $U_X$.)

Integrated synergy $\psi$ is the information that the parts provide about the future of the system that is exclusively synergistic – i.e. cannot be provided by any combination of parts independently:

$$\psi[X; \tau, \mathcal{P}] := I(X_{t-\tau}; X_t) - \max_{\mathcal{P}} I_\cup(M^1_{t-\tau}, M^2_{t-\tau}, \dots, M^r_{t-\tau}; X_t) , \tag{21}$$

where

$$I_\cup(M^1_{t-\tau}, \dots, M^r_{t-\tau}; X_t) := \sum_{\mathcal{S} \subseteq \{M^1, \dots, M^r\}} (-1)^{|\mathcal{S}|+1} I_\cap(\mathcal{S}^1_{t-\tau}, \dots, \mathcal{S}^{|\mathcal{S}|}_{t-\tau}; X_t) , \tag{22}$$

and $I_\cap(\mathcal{S}^1, \dots, \mathcal{S}^{|\mathcal{S}|}; Z)$ denotes the redundant information sources $\mathcal{S}^1, \dots, \mathcal{S}^{|\mathcal{S}|}$ have about target $Z$.

The main problem of PID is that it is underdetermined. For example, for the case of two sources, Shannon’s information theory specifies three quantities ($I(X, Y; Z)$, $I(X; Z)$, $I(Y; Z)$) whereas PID specifies four ($S$, $R$, $U_X$, $U_Y$). Therefore, a complete operational definition of $\psi$ requires a definition of redundancy from which to construct the partial information components [49]. In this sense, the main shortcoming of $\psi$, inherited from PID, is that there is no agreed consensus on a definition of redundancy [9, 12].

Here, we take Griffith’s conceptual definition of $\psi$ and we complement it with available definitions of redundancy. For the linear-Gaussian systems we will be studying in Sec. 4, we use the minimum mutual information PID presented in [9].$^{6}$ Although we do not show any discrete examples here, for completeness we provide complete formulae to calculate $\psi$ for discrete variables using Griffith and Koch’s redundancy measure [20]. Note that alternatives are available for both discrete and linear-Gaussian systems [38, 23, 49, 13, 24].

$^{6}$ Barrett’s derivation of the MMI-PID, which follows Williams and Beer and Griffith and Koch’s procedure, gives this formula when the target is univariate. We generalise the formula here to the case of multivariate target in order

to render $\psi$ computable for Gaussians. This formula leads to synergy being the extra information contributed by the weaker source given the stronger source was previously known.

Box 3.3: Calculating integrated synergy $\psi$

$$\psi[X; \tau, \mathcal{P}] = I(X_{t-\tau}; X_t) - \max_{\mathcal{P}} I_\cup(M^1_{t-\tau}, \dots, M^r_{t-\tau}; X_t) \tag{23}$$

1. For discrete variables (following Griffith and Koch’s [20] PID scheme):

$$I_\cup(M^1_{t-\tau}, \dots, M^r_{t-\tau}; X_t) = \min_{q} \sum_{x,x'} q(x, x') \log\left(\frac{q(x, x')}{q(x) \, q(x')}\right) \quad \text{s.t.} \quad q(M^i_{t-\tau}, X_t) = p(M^i_{t-\tau}, X_t)$$

2. For continuous, linear-Gaussian variables:

$$I_\cup(M^1_{t-\tau}, \dots, M^r_{t-\tau}; X_t) = \max_{k} I(M^k_{t-\tau}; X_t)$$

3. For continuous variables with an arbitrary distribution: unknown.

3.6 Decoder-based integrated information $\Phi^*$

Introduced by Oizumi et al. in Ref. [35], decoder-based integrated information $\Phi^*$ takes a different approach from the previous measures. In general, $\Phi^*$ is given by

$$\Phi^*[X; \tau, \mathcal{P}] := I(X_{t-\tau}; X_t) - I^*[X; \tau, \mathcal{P}] , \tag{25}$$

where $I^*$ is known as the mismatched decoding information, and quantifies how much information can be extracted from a variable if the receiver is using a suboptimal (or mismatched) decoding distribution

[27, 33]. This mismatched information has been used in neuroscience to quantify the contribution of neural correlations in stimulus coding [36], and can similarly be used to measure the contribution of inter-partition correlations to predictive information.

To calculate $\Phi^*$ we formulate a restricted model $q$ in which the correlations between partitions are ignored,

$$q(X_t \mid X_{t-\tau}) = \prod_{i} p(M^i_t \mid M^i_{t-\tau}) , \tag{26}$$

and we calculate $I^*$ for the case where the sender is using the full model $p$ as an encoder and the receiver is using the restricted model $q$ as a decoder. The details of the calculation of $\Phi^*$ and $I^*$ are shown in Box 3.4. Unlike the previous measures shown in this section, $\Phi^*$ does not have an interpretable formulation in terms of simpler information-theoretic functionals like entropy and mutual information.

Calculating $I^*$ involves a one-dimensional optimisation problem, which is straightforwardly solvable if the optimised quantity, $\tilde{I}(\beta)$, has a closed form expression [27]. For systems with continuous variables, it is in general very hard to estimate $\tilde{I}(\beta)$. However, for continuous linear-Gaussian systems and for discrete systems $\tilde{I}(\beta)$ has an analytic closed form as a function of $\beta$ if the covariance or joint probability table of the system are known, respectively. In Appendix A we derive the formulae. (Note the version written down in [35] is incorrect, although their simulations match our results; we checked results from our derived version

of the formulae versus results obtained from numerical integration, and confirmed that our derived formulae are the correct ones.) Conveniently, in both the discrete and the linear-Gaussian case $\tilde{I}(\beta)$ is concave in $\beta$ (proofs in [27] and in Appendix A, respectively), which makes the optimisation significantly easier.

Box 3.4: Calculating decoder-based integrated information $\Phi^*$

$$\Phi^*[X; \tau, \mathcal{P}] = I(X_{t-\tau}; X_t) - I^*[X; \tau, \mathcal{P}] \tag{27a}$$

$$I^*[X; \tau, \mathcal{P}] = \max_{\beta} \tilde{I}(\beta; X, \tau, \mathcal{P}) \tag{27b}$$

1. For discrete variables:

$$\tilde{I}(\beta; X, \tau, \mathcal{P}) = -\sum_{x'} p(X_t = x') \log \sum_{x} p(X_{t-\tau} = x) \, q(X_t = x' \mid X_{t-\tau} = x)^{\beta} + \sum_{x,x'} p(X_{t-\tau} = x, X_t = x') \log q(X_t = x' \mid X_{t-\tau} = x)^{\beta}$$

2. For continuous, linear-Gaussian variables (see appendix for details):

$$\tilde{I}(\beta; X, \tau, \mathcal{P}) = \frac{1}{2} \log\left(|Q| \, |\Sigma_x|\right) + \frac{1}{2} \mathrm{tr}\left(\Sigma_x R\right) + \beta \, \mathrm{tr}\left(\Pi^{-1}_{x \mid \tilde{x}} \, \Pi_{x\tilde{x}} \, \Pi^{-1}_{x} \, \Sigma_{\tilde{x}x}\right)$$

3. For continuous variables with an arbitrary distribution: unknown.

3.7 Geometric integrated information $\Phi_G$

In [37], Oizumi et al. approach the notion of dynamical complexity via yet another formalism. Their approach is based on information geometry [2, 1]. The objects of study in information geometry are spaces of families of probability distributions, considered as differentiable (smooth) manifolds. The natural metric in information geometry is the Fisher information metric, and the

KL divergence provides a natural measure of (asymmetric) distance between probability distributions. Information geometry is the application of differential geometry to the relationships and structure of probability distributions.

To quantify integrated information, Oizumi et al. [37] consider the divergence between the complete model of the system under study $p(X_{t-\tau}, X_t)$ and a restricted model $q(X_{t-\tau}, X_t)$ in which links between the parts of the system have been severed. This is known as the M-projection of the system onto the manifold of restricted models

$$\mathcal{Q} = \left\{ q : q(M^i_t \mid X_{t-\tau}) = q(M^i_t \mid M^i_{t-\tau}) \right\} ,$$

and

$$\Phi_G[X; \tau, \mathcal{P}] := \min_{q \in \mathcal{Q}} D_{\mathrm{KL}}\left(p(X_{t-\tau}, X_t) \,\|\, q(X_{t-\tau}, X_t)\right) . \tag{28}$$

Key to this measure is that in considering the partitioned system, it is only the connections that are cut; correlations between the parts are still allowed on the partitioned system.

Although conceptually simple, $\Phi_G$ is very hard to calculate compared to all other measures we consider here (see Box 3.5). There is no known closed form solution for any system, and we can only find approximate numerical estimates for some systems. In particular, for discrete and linear-Gaussian variables we can formulate $\Phi_G$ as the solution of a pure constrained multivariate optimisation problem, with the advantage that the optimisation objective is differentiable and convex [14].

Box 3.5: Calculating geometric integration $\Phi_G$

$$\Phi_G[X; \tau, \mathcal{P}] = \min_q D_{\mathrm{KL}}(p \,\|\, q) \tag{29a}$$

$$\text{s.t.} \quad q(M^i_{t+\tau} \mid X_t) = q(M^i_{t+\tau} \mid M^i_t) \tag{29b}$$

1. For discrete variables: numerically optimise the objective $D_{\mathrm{KL}}(p \,\|\, q)$ subject to the constraints

$$\sum_{x,x'} q(X_{t-\tau} = x', X_t = x) = 1 \quad \text{and} \quad q(M^i_t \mid X_{t-\tau}) = q(M^i_t \mid M^i_{t-\tau}) \;\; \forall i$$

2. For continuous, linear-Gaussian variables: numerically optimise the objective

$$\Phi_G[X; \tau, \mathcal{P}] = \min_{\Sigma(E)'} \frac{1}{2} \log \frac{|\Sigma(E)'|}{|\Sigma(E)|},$$

where $\Sigma(E) = \Sigma(X_t \mid X_{t-1})$, and subject to the constraints

$$\Sigma(E)' = \Sigma(E) + (A - A')\,\Sigma(X)\,(A - A')^T \quad \text{and} \quad \left(\Sigma(X)\,(A - A')\,\Sigma(E)'^{-1}\right)_{ii} = 0$$

3. For continuous variables with an arbitrary distribution: unknown.

3.8 Causal density

Causal density (CD) is somewhat distinct from the other measures considered so far, in the sense that it is a sum of information transfers rather than a direct measure of the extent to which the whole is greater than the parts. Nevertheless, we include it here because of its relevance and use in the dynamical complexity literature.

CD was originally defined in terms of Granger causality [18], but here we write it in terms of Transfer Entropy (TE) which provides a more general information-theoretic definition [6]. The conditional transfer entropy from X to Y conditioned on Z is defined as

$$\mathrm{TE}_\tau(X \to Y \mid Z) =: I(X_t; Y_{t+\tau} \mid Z_t, Y_t). \tag{30}$$

With this definition of TE we define CD as the average pairwise conditioned TE between all variables in X,

$$\mathrm{CD}[X; \tau, \mathcal{P}] =: \frac{1}{r(r-1)} \sum_{i \neq j} \mathrm{TE}_\tau(M^i \to M^j \mid M^{[ij]}), \tag{31}$$

where $M^{[ij]}$ is the subsystem formed by all variables in X except for those in parts $M^i$ and $M^j$.

In a practical sense, CD has many advantages. It has been thoroughly studied in theory [7] and applied in practice, with application domains ranging from complex systems to neuroscience [28, 29, 32]. Furthermore, there are off-the-shelf algorithms that calculate TE in discrete and continuous systems [8]. For details of the calculation of CD see Box 3.6.

Causal density is a principled measure of dynamical complexity, as it vanishes for purely segregated or purely integrated systems. In a highly segregated system there is no information transfer at all, and in a highly integrated system there is no transfer from one variable to another beyond the rest of the system [39]. Furthermore, CD is non-negative and upper-bounded by the total time-delayed mutual information (proof in Appendix B), therefore satisfying what other authors consider an essential requirement for a measure of integrated information [37].

Box 3.6: Calculating causal density CD

$$\mathrm{CD}[X; \tau, \mathcal{P}] = \frac{1}{r(r-1)} \sum_{i \neq j} \mathrm{TE}_\tau(M^i \to M^j \mid M^{[ij]}) \tag{32}$$

1. For discrete variables:

$$\mathrm{TE}_\tau(X^i \to X^j \mid X^{[ij]}) = \sum_{x,x'} p\left(X^j_{t+\tau} = x'_j, X_t = x\right) \log \frac{p\left(X^j_{t+\tau} = x'_j \mid X_t = x\right)}{p\left(X^j_{t+\tau} = x'_j \mid X^j_t = x_j, X^{[ij]}_t = x_{[ij]}\right)}$$

2. For continuous, linear-Gaussian variables:

$$\mathrm{TE}_\tau(X^i \to X^j \mid X^{[ij]}) = \frac{1}{2} \log \frac{\det \Sigma\left(X^j_{t+\tau} \mid X^j_t \oplus X^{[ij]}_t\right)}{\det \Sigma\left(X^j_{t+\tau} \mid X_t\right)}$$

3. For continuous variables with an arbitrary distribution, we must resort to the nearest-neighbour methods introduced by [25]. See reference for details.

3.9 Other measures

As already mentioned, all the measures reviewed here (besides CD) were inspired by the Φ2008 measure, which arose from the version of IIT laid out in Ref. [5]. The most recent version of IIT [34] is conceptually distinct, and the associated “Φ-3.0” is consequently different to the measures we consider here. The consideration of perturbation of the system, as well as all of its subsets, in both the past and the future renders Φ-3.0 considerably more computationally expensive than other Φ measures. We do not here attempt to consider the construction of an analogue of Φ-3.0 for spontaneous information dynamics. Such an undertaking lies beyond the scope of this paper.

Recently, Tegmark [41] developed a comprehensive taxonomy

of all integrated information measures that can be written as a distance between a probability distribution pertaining to the whole and one obtained as a product of probability distributions pertaining to the parts. Tegmark further identified a shortlist of candidate measures, based on a set of explicit desiderata. This shortlist overlaps with the measures we consider here, and also contains other measures which are minor variants. Of Tegmark’s shortlisted measures, $\phi^{M}$ is equivalent to Φ̃ under the system’s spontaneous distribution, $\phi^{M}_{kk'}$ is its state-resolved version, $\phi^{oak}$ is transfer entropy (which we cover here through CD), and $\phi^{npk}$ is not defined for continuous variables. The measures ΦG and ψ are outside of Tegmark’s classification scheme.

4 Results

All of the measures of integrated information that we have described have the potential to behave in ways which are not obvious a priori, and in a manner difficult to express analytically. While some simulations of Φ, Φ̃ and CD on networks have been performed [11, 39], Φ∗ and ΦG have not previously been computed on models consisting of more than two components, and ψ has not previously been explored at all on systems with continuous variables. In this section, we study all the measures together on small networks. We compare the behaviour of the measures, and assess the extent to which each measure is genuinely capturing dynamical complexity.

To recap, we consider the following 6 measures:

• Whole-minus-sum integrated information, Φ.
• Integrated stochastic interaction, Φ̃.
• Decoder-based integrated information, Φ∗.
• Geometric integrated information, ΦG.
• Integrated synergy, ψ.
• Causal density, CD.

We use models based on stochastic linear auto-regressive (AR) processes with Gaussian variables. These constitute appropriate models for testing the measures of integrated information. They are straightforward to parameterise and simulate, and are amenable to the formulae presented in Section 3. Mathematically, we define an AR process (of order 1) by the update equation

$$X_{t+1} = A X_t + \varepsilon_t, \tag{33}$$

where $\varepsilon_t$ is a serially independent random sample from a zero-mean Gaussian distribution with given covariance Σ(ε), usually referred to as the noise or error term. A particular AR process is completely specified by the coupling matrix or network A and the noise covariance matrix Σ(ε). An AR process is stable, and stationary, if the spectral radius of the coupling matrix is less than 1 [30]. (The spectral radius is the largest of the absolute values of its eigenvalues.) All the example systems we consider are calibrated to be stable, so the Φ measures can be computed from their stationary statistics.

We shall consider how the measures vary with respect to: (i) the strength of connections, i.e. the magnitude of non-zero terms in the coupling matrix; (ii) the topology of the network, i.e. the arrangement of the non-zero terms in the coupling matrix; (iii) the density of connections, i.e. the density of non-zero terms in the coupling matrix; and (iv) the correlation between noise inputs to different system components, i.e. the off-diagonal terms

in Σ(ε). The strength and density of connections can be thought of as reflecting, in different ways, the level of integration in the network. The correlation between noise inputs reflects (inversely) the level of segregation, in some sense.

We also, in each case, compute the control measures

• Time-delayed mutual information (TDMI), $I(X_{t-\tau}, X_t)$; and
• Average absolute correlation Σ̄, defined as the average absolute value of the non-diagonal entries in the system’s correlation matrix.

These simple measures quantify straightforwardly the level of interdependence between elements of the system, across time and space respectively. TDMI captures the total information generated as the system transitions from one time-step to the next, and Σ̄ is another basic measure of the level of integration.

We report the unnormalised measures minimised over even-sized bipartitions – i.e. bipartitions in which both parts have the same number of components. In doing this we avoid conflating the effects of the choice of definition of effective information with those of the choice of partition search (see Sec. 3.2). See Discussion (Sec. 5.1) for more on this.

4.1 Key quantities for computing the integrated information measures

To compute the integrated information measures, the stationary covariance and lagged partial covariance matrices are required. By taking the expected value of $X_t X_t^T$ with Eq. (33) and given that $\varepsilon_t$ is white noise, uncorrelated in time, one obtains that the stationary covariance matrix Σ(X) is given by the solution to the discrete-time Lyapunov equation,

$$\Sigma(X_t) = A\, \Sigma(X_t)\, A^T + \Sigma(\varepsilon_t). \tag{34}$$

This can be easily solved numerically, for example in Matlab via use of the dlyap command. The lagged covariance can also be calculated from the parameters of the AR process as

$$\Sigma(X_{t-1}, X_t) = \left\langle X_t (A X_t + \varepsilon_t)^T \right\rangle = \Sigma(X_t) A^T, \tag{35}$$

and partial covariances can be obtained by applying Eq. (7). Finally, we obtain the analogous quantities for the partitions by the marginalisation properties of the Gaussian distribution. Given a bipartition $X_t = \{M_t, N_t\}$, we write the covariance and lagged covariance matrices as

$$\Sigma(X_t) = \begin{pmatrix} \Sigma(X_t)_{mm} & \Sigma(X_t)_{mn} \\ \Sigma(X_t)_{nm} & \Sigma(X_t)_{nn} \end{pmatrix}, \quad \Sigma(X_{t-1}, X_t) = \begin{pmatrix} \Sigma(X_{t-1}, X_t)_{mm} & \Sigma(X_{t-1}, X_t)_{mn} \\ \Sigma(X_{t-1}, X_t)_{nm} & \Sigma(X_{t-1}, X_t)_{nn} \end{pmatrix}, \tag{36}$$

and we simply read the partition covariance matrices as

$$\Sigma(M_t) = \Sigma(X_t)_{mm}, \quad \Sigma(M_{t-1}, M_t) = \Sigma(X_{t-1}, X_t)_{mm}. \tag{37}$$

4.2 Two-node network

We begin with the simplest non-trivial AR process,

$$A = \begin{pmatrix} a & a \\ a & a \end{pmatrix}, \tag{38a}$$

$$\Sigma(\varepsilon) = \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix}. \tag{38b}$$

Setting a = 0.4 we obtain the same model as depicted in Fig. 3 in Ref. [35]. We simulate the AR process with different levels of noise correlation c and show results for all the measures in Fig. 2. Note that as c approaches 1 the system becomes degenerate, so some matrix determinants in the formulae become zero, causing some measures to diverge.

Figure 2: (A) Graphical

representation of the two-node AR process described in Eq. (38). Two connected nodes with coupling strength a receive noise with correlation c, which can be thought of as coming from a common source. (B) All integrated information measures for different noise correlation levels c.

Inspection of Figure 2 immediately reveals a wide variability of behaviour among the measures, in both value and trend, even for this minimally simple model. Nevertheless, some patterns emerge. Both TDMI and ΦG are unaffected by noise correlation, and both Φ̃ and Σ̄ grow monotonically with c. In fact, Φ̃ diverges to infinity as c → 1. The measures ψ, Φ∗, and CD decrease monotonically to 0 when the effect of the coupling cannot be distinguished from the noise. On the other hand, Φ also decreases monotonically but becomes negative for large enough c.

In Fig. 3 we analyse the same system, but now varying both noise correlation c and coupling strength a. As per the stability condition presented above, any value of a ≥ 0.5 makes the system’s spectral radius greater than or equal to 1, so the system becomes non-stationary and variances diverge. Hence in these plots we evaluate all measures for values of a below the limit a = 0.5.

Figure 3: All integrated information measures for the two-node AR process described in Eq. (38), for different coupling strengths a and noise correlation levels c. Vertical axis is inverted for visualisation purposes.

Again, the measures behave very differently. In this case

TDMI and ΦG remain unaffected by noise correlation, and grow with increasing coupling strength as expected. In contrast, Φ̃ and Σ̄ increase with both a and c. Φ decreases with c but shows non-monotonic behaviour with a. Of all the measures, ψ, Φ∗, and CD show desirable properties consistent with capturing conjoined segregation and integration – they monotonically decrease with noise correlation and increase with coupling strength.

4.3 Eight-node networks

We now turn to networks with eight nodes, enabling examination of a richer space of dynamics and topologies. We first analyse a network optimised using a genetic algorithm to yield high Φ [11]. The noise covariance matrix has ones in the diagonal and c everywhere else, and now a is a global factor applied to all edges of the network. The adjacency matrix is scaled such that its spectral radius is 1 when a = 1. Similar to the previous section, we evaluate all measures for multiple values of a and c and show the results in Fig. 4.

Figure 4: All integrated information measures for the Φ-optimal AR process proposed by [11], for different coupling strengths a and noise correlation levels c. Vertical axis is inverted for visualisation purposes.

Moving to a larger network mostly preserves the features highlighted above. TDMI is unaffected by c; Φ̃ behaves like Σ̄ and diverges for large c; and Φ∗ and CD have the same trend as before, although now the decrease with c is less pronounced. Interestingly, ψ and ΦG

increase slightly with c, and Φ does not show the instability and negative values seen in Fig. 3. Overall, in this more complex network the effect of increasing noise correlation on Φ, ψ, Φ∗, and CD is not as pronounced as in simpler networks, where these measures decrease rapidly towards zero with increasing c.

Thus far we have studied the effect of AR dynamics on integrated information measures, keeping the topology of the network fixed and changing only global parameters. We next examine the effect of network topology, on a set of 6 networks:

(A) A fully connected network without self-loops.
(B) The Φ-optimal binary network presented in [11].
(C) The Φ-optimal weighted network presented in [11].
(D) A bidirectional ring network.
(E) A “small-world” network, formed by introducing two long-range connections to a bidirectional ring network.
(F) A unidirectional ring network.

In each network the adjacency matrix has been normalised to a spectral radius of 0.9. As before, we simulate the system following Eq. (33), and here set noise input correlations to zero (c = 0) so the noise input covariance matrix is just the identity matrix. Figure 5 shows connectivity diagrams of the networks for visual comparison, and Fig. 6 shows the values of all integrated information measures evaluated on all networks.

Figure 5: Networks used in the comparative analysis of integrated information measures. (A) Fully connected network, (B) Φ-optimal binary network from [11], (C) Φ-optimal weighted network from [11], (D) bidirectional ring network, (E) small

world network, and (F) unidirectional ring network.

As before, there is substantial variability in the behaviour of all measures, but some general patterns are apparent. Intriguingly, the unidirectional ring network is consistently judged by all measures (except for Φ̃) as the most complex, followed in most cases by the weighted Φ-optimal network (footnote 7). On the other end of the spectrum, the fully connected network A is also consistently judged as the least complex network, which is explained by the large correlation between its nodes as shown by Σ̄.

The results here can be summarised by comparing the relative complexity assigned to the networks by each measure – that is, to what extent do measures agree on which network is more complex than which. For convenience, we show the measure-dependent ranking of the network complexity in Table 3. Inspecting this table reveals a remarkable alignment between TDMI, ΦG, Φ∗, and ψ, especially given how much their behaviour diverges when varying a and c. Although the particular values are different, the measures largely agree on the ranking of the networks based on their integrated information. This consistency of ranking is initially encouraging with regard to empirical application.

However, the ranking is not what might be expected from topological complexity measures from network theory. If we ranked these networks by e.g. small-world index, we would expect networks B, C, and E to be at the top and networks A, D, and F to be at the bottom – very different from any of the rankings

in Table 3 (footnote 8). In fact, the Spearman correlation between the ranking by small-world index and those by TDMI, ΦG, Φ∗, and ψ is around −0.4, leading to the counterintuitive conclusion that more complex networks in fact integrate less information.

We note that these rankings are very robust to noise correlation (results not shown) for all measures except Φ. Indeed, across all simulations in this study the behaviour of Φ is erratic, undermining prospects for empirical application. (This behaviour is even more prevalent if Φ is optimised over all bipartitions, as opposed to over even bipartitions.)

Footnote 7: Note that in Fig. 6 the Φ-optimal networks B and C score much less than simpler network F. This is because all networks have been scaled to a spectral radius of 0.9 – when the networks are normalised to a spectral radius of 0.5, as in Ref. [11], then B and C are, as expected, the networks with highest Φ.

Footnote 8: The small-world index of a network is defined as the ratio between its clustering coefficient and its mean minimum path length, normalised by the expected value of these measures on a random network of the same density [22]. Since the networks we consider are small and sparse, we use the 4th-order cliques (instead of triangles, which are 3rd-order cliques) to calculate the clustering coefficient [50].

Figure 6: Integrated information measures for all networks in the suite shown in Fig. 5, normalised to spectral radius 0.9 and under the influence of uncorrelated noise. The ring and weighted Φ-optimal networks score consistently at the top, while denser networks like the fully connected and the binary Φ-optimal networks are usually at the bottom. Most measures disagree on specific values but agree on the relative complexity ranking of the networks.

Table 3: Networks ranked according to their value of each integrated information measure (highest value to the left). We add small-world index as a dynamics-agnostic measure of network complexity.

Measure          Ranking
I(Xt, Xt+τ)      F C D E B A
ΦG               F C D E B A
Φ                F C B E D A
Φ∗               F C B E D A
Σ̄                C B A E D F
Φ̃                C F B D E A
ψ                F C D E B A
CD               C F B D E A
SWI              C E B A D F

4.4 Random networks

We next perform a more general analysis of the performance of measures of integrated information, using Erdős-Rényi random networks. We consider Erdős-Rényi random networks parametrised by two numbers: the edge density of the network ρ and the noise correlation c (defined as above), both in the [0, 1) interval. To sample a network with a given ρ, we generate a matrix in which each possible edge is present with probability ρ and then remove self-loops. The stochasticity in the construction of the Erdős-Rényi network induces fluctuations on the integrated information measures, such that for each (ρ, c) we calculate the mean and variance of each

measure.

Figure 7: Average integrated information measures for Erdős-Rényi random networks with given density ρ and noise correlation c.

First, we generate 50 networks for each point in the (ρ, c) plane and take the mean of each integrated information measure evaluated on those 50 networks. As before, the adjacency matrices are normalised to a spectral radius of 0.9. Results are shown in Fig. 7. ΦG increases markedly with ρ and moderately with c, Σ̄ increases sharply with both, and the rest of the measures can be divided in two groups, with Φ, ψ and CD that decrease with c and TDMI, Φ̃ and Φ∗ that increase. Notably, all integrated information measures except ΦG show a band of high value at an intermediate value of ρ. This demonstrates their sensitivity to the level of integration. The decrease when ρ is increased beyond a certain point is due to the weakening of the individual connections in that case (due to the fixed overall coupling strength, as quantified by spectral radius).

Secondly, in Fig. 8 we plot each measure against the average correlation of each network, following the rationale that a good complexity index should peak at an intermediate value of Σ̄ – i.e. it should reach its maximum value in the middle range of Σ̄. To obtain this figure we sampled a large number of Erdős-Rényi networks with random (ρ, c), and evaluated all integrated information measures, as well as their average correlation Σ̄.

Figure 8: Integrated information measures

of random Erdős-Rényi networks, plotted against the average correlation Σ̄ of the same network. (bottom) Normalised histogram of Σ̄ for all sampled networks.

Fig. 8 shows that some of the measures have this intermediate peak, in particular: Φ∗, ψ, ΦG, and CD. Although also showing a modest intermediate peak, Φ̃ has a stronger overall positive trend with Σ̄, and Φ an overall negative trend. These analyses further support Φ∗, ψ, ΦG, and CD as valid complexity measures, although the relation between them remains unclear and not always consistent in other scenarios.

One might worry that these peaks could be due to a biased sampling of the Σ̄ axis – if our sampling scheme were obtaining many more samples in, say, the 0.2 < Σ̄ < 0.4 range, then the points with high Φ we see in that range could be explained by the fact that the high-Φ tails of the distribution are sampled better in that range than in the rest of the Σ̄ axis. However, the histogram at the bottom of Fig. 8 shows this is not the case – on the contrary, the samples are relatively uniformly spread along the axis. Therefore, the peaks shown by Φ∗, ψ, ΦG, and CD are not sampling artefacts.

5 Discussion

In this study we compared several candidate measures of integrated information in terms of their theoretical construction, and their behaviour when applied to the dynamics generated by a range of non-trivial network architectures. We found that no two measures

had precisely the same basic mathematical properties, see Table 2. Empirically, we found a striking variability in the behaviour among the measures even for simple systems, see Table 4 for a summary.

Of the measures we have considered, ψ, Φ∗ and CD best capture conjoined segregation and integration on small networks, when animated with Gaussian linear AR dynamics (Fig. 2). These measures decrease with increasing noise input correlation and increase with increasing coupling strength (Fig. 4). Further, on random networks with fixed overall coupling strength (as quantified by spectral radius), they achieve their highest scores when an intermediate number of connections are present (Fig. 7). They also obtain their highest scores when the average correlation across components takes an intermediate value (Fig. 8).

Table 4: Integrated information measures considered and brief summary of our results.

Measure    Summary of results
Φ          Erratic behaviour, negative when nodes are strongly correlated.
Φ̃          Mostly reflects noise input correlation, not sensitive to changes in coupling.
ψ          Consistent with reflecting both segregation and integration.
Φ∗         Consistent with reflecting both segregation and integration.
ΦG         Mostly reflects changes in coupling, not sensitive to noise input correlation.
CD         Consistent with reflecting both segregation and integration.

In terms of network topology, none of the measures strongly reflect complexity of the network structure in a graph theoretic sense. At fixed overall coupling strength, a simple ring structure (Fig. 5) leads in most cases to the highest scores. Among the other measures: Φ̃ is largely determined by the level of correlation amongst

the noise inputs, and is not very sensitive to changes in coupling strength; ΦG depends mainly on the overall coupling strength, and is not very sensitive to changes in noise input correlation; and Φ generally behaves erratically.

Considered together, our results motivate the continued development of ψ, Φ∗ and CD as theoretically sound and empirically adequate measures of integrated information.

5.1 Partition selection

Integrated information is typically defined as the effective information beyond the minimum information partition [5, 44]. However, when a particular measure of integrated information has first been introduced, it is often with a new operationalisation of both effective information and the minimum information partition. In this paper we have restricted attention to comparing different choices of measure of effective information, while keeping the same partition selection scheme across all measures. Specifically, we restricted the partition search to even-sized bipartitions, which has the advantage of obviating the need for introducing a normalisation factor when comparing bipartitions with different sizes, see Section 3.2. For uneven partitions, normalisation factors are required to compensate for the fact that there is less capacity for information sharing as compared to even partitions. However, such factors are known to introduce instabilities, both under continuous parameter changes, and in terms of numerical errors [11]. Further research is needed to compare different approaches to defining the minimum information partition, or finding an approximation to it in reasonable computation time [42].

In terms of computation time, performing the most thorough search, through all partitions, as in the early formulation of Φ by Balduzzi and Tononi [5] requires time $\mathcal{O}(n^n)$ (footnote 9). Restricting attention to bipartitions reduces this to $\mathcal{O}(2^n)$, whilst restricting to even bipartitions reduces this further to $\mathcal{O}(n^2)$. These observations highlight a trade-off between computation time and comprehensive consideration of possible partitions.

Future comparisons of integrated information measures may benefit from more advanced methods for searching among a restricted set of partitions to obtain a good approximation to the minimum information partition. For example, Toker and Sommer use graph modularity, stochastic block models or spectral clustering as informed heuristics to suggest a small number of partitions likely to be close to the MIP, and then take the minimum over those. With these approximations they are able to calculate the MIP of networks with hundreds of nodes [42, 43]. Alternatively, Hidaka and Oizumi make use of the submodularity of mutual information to perform efficient optimisation and find the bipartition across which there is the least instantaneous mutual information of the system [21]. Presently, however, their method is valid only for instantaneous mutual information and is therefore not applicable to finding the bipartition that minimises any form of normalised effective information as described in Section 3.2.

Further, each measure carries special considerations regarding partition search. For example, for ψ, taking the minimum across all partitions is equivalent to taking it across bipartitions only, thanks to the properties of $I_\cap$ [49, 9, 38]. Arsiwalla and Verschure [3] used Φ̃ and suggested always using the atomic partition on the basis that it is fast, well-defined, and for Φ̃ specifically it can be proven to be the partition of maximum information; and thus it provides a quickly computable upper bound for the measure.

Footnote 9: More precisely, as the Bell number $B_n$.

5.2 Continuous variables and the linear Gaussian assumption

We have compared the various integrated information measures only on systems whose states are given by continuous variables with a Gaussian distribution. This is motivated by measurement variables being best characterised as continuous in many domains of potential application. Future research should continue the comparison of these measures on a test-bed of systems with discrete variables. Moreover, non-Gaussian continuous systems should also be considered because the Gaussian approximation is not always a good fit to real data. For example, the spiking activity of populations of neurons typically exhibits exponentially distributed dynamics [17]. Systems with discrete variables are in principle straightforward to deal with, since calculating probabilities (following the most brute-force approach) amounts simply to counting occurrences of states. General continuous systems, however, are less straightforward. Estimating generic probability densities in a continuous domain is challenging, and calculating information-theoretic quantities on these is difficult [25, 46]. The AR systems we have studied here are a rare exception, in the sense that their probability density can be calculated and all relevant information-theoretic quantities have an analytical expression. Nevertheless, the Gaussian assumption is common in biology, and knowing now how these measures behave on these

Gaussian systems will inform further development of these measures, and motivate their application more broadly.

5.3 Empirical as opposed to maximum entropy distribution

We have considered versions of each measure that quantify information with respect to the empirical, or spontaneous, stationary distribution for the state of the system. This constitutes a significant divergence from the supposedly fundamental measures of intrinsic integrated information of IIT versions 2 and 3 [5, 34]. Those measures are based on information gained about a hypothetical past moment in which the system was equally likely to be in any one of its possible states (the ‘maximum entropy’ distribution). However, as pointed out previously [11], it is not possible to extend those measures, developed for discrete Markovian systems, to continuous systems. This is because there is no uniquely defined maximum entropy distribution for a continuous random variable (unless it has hard bounds, i.e. a closed and bounded set of states). Hence, quantification of information with respect to the empirical distribution is the pragmatic choice for construction of an integrated information measure applicable to continuous time-series data.

The consideration of information with respect to the empirical, as opposed to maximum entropy, distribution does however have an effect on the concept underlying the measure of integrated information – it results in a measure not of mechanism, but of dynamics [10]. That is, what is measured is not information about what the possible mechanistic causes of the current state could be, but rather what the likely preceding states actually are,

on average, statistically; see [11] for further discussion. Given the diversity of behaviour of the various integrated information measures considered here even on small networks with linear dynamics, one must remain cautious about considering them as generalisations or approximations of the proposed ‘fundamental’ Φ measures of IIT versions 2 or 3 [5, 34].

A remaining important challenge, in many practical scenarios, is the identification of stationary epochs. For a relatively long data segment, it can be unrealistic to assume that all the statistics are constant throughout. For shorter data segments, one cannot be confident that the system has explored all the states that it potentially would have, given enough time.

6 Final remarks

The further development and empirical application of Integrated Information Theory require a satisfactory informational measure of dynamical complexity. During the last few years several measures have been proposed, but their behaviour in any but the simplest cases has not been extensively characterised or compared. In this study, we have reviewed several candidate measures of integrated information, and provided a comparative analysis on simulated data, generated by simple Gaussian dynamics applied to a range of network topologies.

Assessing the degree of dynamical complexity, integrated information, or co-existing integration and segregation exhibited by a system remains an important outstanding challenge. Progress in meeting this challenge will have implications not only for theories of consciousness, such as Integrated Information Theory, but more generally in situations where relations between local and global dynamics are of interest. The

review presented here identifies promising theoretical approaches for designing adequate measures of integrated information. Further, our simulations demonstrate the need for empirical investigation of such measures, since measures that share similar theoretical properties can behave in substantially different ways, even on simple systems.

Acknowledgements

The authors would like to thank Michael Schartner for advice. ABB is funded by EPSRC grant EP/L005131/1. ABB and AKS are grateful to the Dr. Mortimer and Theresa Sackler Foundation, which supports the Sackler Centre for Consciousness Science. AKS is additionally grateful to the CIFAR Azrieli programme on Mind, Brain, and Consciousness.

Appendix

A Derivation and concavity proof of $I^*$

A.1 Derivation of $I^*$ in Gaussian systems

Here we provide a closed-form expression for the mismatched decoding information in a Gaussian dynamical system. See Section 3.6 for more information. For clarity, we omit the $X$, $\tau$, $\mathcal{P}$ arguments of $\tilde{I}$ and write it as a function of $\beta$ only. The formula for $\tilde{I}(\beta)$ for a stationary continuous random process is

$$ \tilde{I}(\beta) = -\int dx\, p(x) \log \int d\tilde{x}\, p(\tilde{x})\, q(x|\tilde{x})^{\beta} + \int\!\!\int d\tilde{x}\, dx\, p(x, \tilde{x}) \log q(x|\tilde{x})^{\beta} , \tag{39} $$

where $p(x)$ is the distribution for $X_t$, $p(x, \tilde{x})$ is the joint distribution for $(X_t, X_{t-\tau})$, and $q(x|\tilde{x})$ is the conditional distribution for $X_t$ given $X_{t-\tau}$ under the partitioning in question. The function $\tilde{I}(\beta)$ also depends on $X_t$, $\tau$ and $\mathcal{P}$, but for the sake of clarity we omit all arguments except for $\beta$, which is the
parameter of interest here.

When $X_t$ is Gaussian with covariance matrix $\Sigma_X$ (and mean 0 without loss of generality), we have

$$ p(x) = (2\pi)^{-n/2}\, |\Sigma_X|^{-1/2} \exp\left[ -\tfrac{1}{2}\, \psi\left(x, \Sigma_X^{-1}\right) \right] , \tag{40} $$

where we define

$$ \psi(x, M) := x^T M x \tag{41} $$

for a vector $x$ and a matrix $M$. Further,

$$ q(x|\tilde{x}) = (2\pi)^{-n/2}\, |\Pi_{X|\tilde{X}}|^{-1/2} \exp\left[ -\tfrac{1}{2}\, \psi\left( x - \Pi_{X\tilde{X}} \Pi_X^{-1} \tilde{x},\; \Pi_{X|\tilde{X}}^{-1} \right) \right] , \tag{42} $$

where $\Pi_X$ is the block diagonal covariance matrix for $X_t$ under the partition, $\Pi_{X\tilde{X}} := \Sigma_q(X_t, X_{t-\tau}) = \Pi_{\tilde{X}X}^T$ is the block diagonal auto-covariance matrix associated with the partition, and $\Pi_{X|\tilde{X}}$ is the partial covariance

$$ \Pi_{X|\tilde{X}} = \Pi_X - \Pi_{X\tilde{X}} \Pi_X^{-1} \Pi_{\tilde{X}X} . \tag{43} $$

We start with the integral

$$ \int d\tilde{x}\, p(\tilde{x})\, q(x|\tilde{x})^{\beta} = (2\pi)^{-n\beta/2}\, |\Pi_{X|\tilde{X}}|^{-\beta/2}\, (2\pi)^{-n/2}\, |\Sigma_X|^{-1/2} \int d\tilde{x}\, \exp(E) , \tag{44} $$

where

$$ E = -\tfrac{1}{2} \tilde{x}^T \Sigma_X^{-1} \tilde{x} - \tfrac{\beta}{2} \tilde{x}^T \Pi_X^{-1} \Pi_{\tilde{X}X} \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} \tilde{x} + \beta\, x^T \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} \tilde{x} - \tfrac{\beta}{2} x^T \Pi_{X|\tilde{X}}^{-1} x . \tag{45} $$

If we write

$$ E = -\tfrac{1}{2} (\tilde{x} - Bx)^T Q\, (\tilde{x} - Bx) - \tfrac{1}{2} x^T R_1 x , \tag{46} $$

then

$$ Q = \Sigma_X^{-1} + \beta\, \Pi_X^{-1} \Pi_{\tilde{X}X} \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} , \tag{47a} $$

$$ B^T = \beta\, \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} Q^{-1} , \tag{47b} $$

$$ R_1 = \beta\, \Pi_{X|\tilde{X}}^{-1} - \beta^2\, \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} Q^{-1} \Pi_X^{-1} \Pi_{\tilde{X}X} \Pi_{X|\tilde{X}}^{-1} , \tag{47c} $$

so

$$ \int d\tilde{x}\, \exp(E) = \exp\left( -\tfrac{1}{2} x^T R_1 x \right) \int dy\, \exp\left( -\tfrac{1}{2} y^T Q y \right) = \exp\left( -\tfrac{1}{2} x^T R_1 x \right) (2\pi)^{n/2}\, |Q|^{-1/2} . \tag{48} $$

Hence, using (40) and (44) we obtain the first term in (39):

$$ -\int dx\, p(x) \log \int d\tilde{x}\, p(\tilde{x})\, q(x|\tilde{x})^{\beta} = \frac{n\beta}{2} \log 2\pi + \frac{1}{2} \log\left( |Q| \cdot |\Sigma_X| \cdot |\Pi_{X|\tilde{X}}|^{\beta} \right) + \frac{1}{2} \operatorname{tr}(\Sigma_X R_1) . \tag{49} $$

Now, moving on to the second term in (39),

$$ \int\!\!\int d\tilde{x}\, dx\, p(x, \tilde{x}) \log q(x|\tilde{x})^{\beta} = -\frac{\beta n}{2} \log 2\pi - \frac{\beta}{2} \log |\Pi_{X|\tilde{X}}| - \frac{\beta}{2} I_1 , \tag{50} $$

where

$$ \begin{aligned} I_1 &= \int\!\!\int d\tilde{x}\, dx\, p(x, \tilde{x})\, \psi\left( x - \Pi_{X\tilde{X}} \Pi_X^{-1} \tilde{x},\; \Pi_{X|\tilde{X}}^{-1} \right) \\ &= \int dx\, p(x)\, \psi\left( x, \Pi_{X|\tilde{X}}^{-1} \right) + \int d\tilde{x}\, p(\tilde{x})\, \psi\left( \tilde{x},\; \Pi_X^{-1} \Pi_{\tilde{X}X} \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} \right) - 2 \int\!\!\int d\tilde{x}\, dx\, p(x, \tilde{x})\, x^T \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} \tilde{x} \\ &= \operatorname{tr}\left( \Pi_{X|\tilde{X}}^{-1} \Sigma_X \right) + \operatorname{tr}\left( \Pi_X^{-1} \Pi_{\tilde{X}X} \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} \Sigma_X \right) - 2 \operatorname{tr}\left( \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} \Sigma_{\tilde{X}X} \right) , \end{aligned} \tag{51} $$

where $\Sigma_{\tilde{X}X} := \Sigma(X_{t-\tau}, X_t)$. Thus the second term in (39) is given by

$$ \int\!\!\int d\tilde{x}\, dx\, p(x, \tilde{x}) \log q(x|\tilde{x})^{\beta} = -\frac{\beta n}{2} \log 2\pi - \frac{\beta}{2} \log |\Pi_{X|\tilde{X}}| + \frac{1}{2} \operatorname{tr}(\Sigma_X R_2) + \beta \operatorname{tr}\left( \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} \Sigma_{\tilde{X}X} \right) , \tag{52} $$

where

$$ R_2 = -\beta\, \Pi_{X|\tilde{X}}^{-1} - \beta\, \Pi_X^{-1} \Pi_{\tilde{X}X} \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} . \tag{53} $$

Finally, putting the terms (49) and (52) together, we obtain

$$ \tilde{I}(\beta) = \frac{1}{2} \log\left( |Q| \cdot |\Sigma_X| \right) + \frac{1}{2} \operatorname{tr}(\Sigma_X R) + \beta \operatorname{tr}\left( \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} \Sigma_{\tilde{X}X} \right) , \tag{54} $$

where

$$ Q = \Sigma_X^{-1} + \beta\, \Pi_X^{-1} \Pi_{\tilde{X}X} \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} , \tag{55} $$

$$ R = -\beta\, \Pi_X^{-1} \Pi_{\tilde{X}X} \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} - \beta^2\, \Pi_{X|\tilde{X}}^{-1} \Pi_{X\tilde{X}} \Pi_X^{-1} Q^{-1} \Pi_X^{-1} \Pi_{\tilde{X}X} \Pi_{X|\tilde{X}}^{-1} . \tag{56} $$

Here $R = R_1 + R_2$, and the terms in $\log 2\pi$ and $\log |\Pi_{X|\tilde{X}}|$ cancel between (49) and (52). We note that this formula for $\tilde{I}(\beta)$ has been verified with numerical methods, and it is not the same as the formula reported by Oizumi et al. [35].

A.2 $\tilde{I}(\beta)$ is concave in $\beta$ in Gaussian systems

Throughout this proof we will rely multiple times on the book Convex Optimization by Boyd and Vandenberghe [14]. Our aim is to show that $\tilde{I}(\beta)$ is concave in $\beta$ (see footnote 10), which means it has a unique maximum and can be treated with standard convex optimisation tools.

Footnote 10: We follow Boyd and Vandenberghe's notation: a function $f$ is said to be convex, convex downwards or concave upwards if $f(ax + by) \le a f(x) + b f(y)$ for all $a, b \ge 0$ with $a + b = 1$.

We start with the second term in Eq. (39),

$$ \int\!\!\int d\tilde{x}\, dx\, p(x, \tilde{x}) \log q(x|\tilde{x})^{\beta} = \beta \int\!\!\int d\tilde{x}\, dx\, p(x, \tilde{x}) \log q(x|\tilde{x}) , \tag{57} $$

which is linear in $\beta$. Moving to the first term, using Eq. (42) it can be rewritten as

$$ -\int dx\, p(x) \log\left[ \int d\tilde{x}\, p(\tilde{x})\, q(x|\tilde{x})^{\beta} \right] = -\int dx\, p(x) \left[ -\frac{n\beta}{2} \log 2\pi - \frac{\beta}{2} \log |\Pi_{X|\tilde{X}}| \right] - \int dx\, p(x) \log\left[ \int d\tilde{x}\, p(\tilde{x}) \exp\left( -\beta f(x, \tilde{x}) \right) \right] . \tag{58} $$

We see that the only nonlinear term in $\tilde{I}(\beta)$ is

$$ -\int dx\, p(x) \log\left[ \int d\tilde{x}\, p(\tilde{x}) \exp\left( -\beta f(x, \tilde{x}) \right) \right] , \tag{59} $$

where

$$ f(x, \tilde{x}) = \frac{1}{2}\, \psi\left( x - \Pi_{X\tilde{X}} \Pi_X^{-1} \tilde{x},\; \Pi_{X|\tilde{X}}^{-1} \right) . $$

Now we draw from two lemmas presented in [14]:

• Adding an affine function preserves concavity, in the sense that the sum of a convex (concave) function and an affine function is also convex (concave).
• A non-negative weighted sum preserves concavity. Since $p(x) > 0$, the outer integral in Eq. (58) preserves concavity.

With these two remarks, we know that to prove the concavity of $\tilde{I}(\beta)$ we just need to prove the concavity of

$$ -\log\left[ \int d\tilde{x}\, p(\tilde{x}) \exp\left( -\beta f(x, \tilde{x}) \right) \right] . \tag{60} $$

The logarithm in Eq. (60) is a log-sum-exp function, which as per Sec. 3.1.5 of [14] is convex in $\beta$. Finally, the minus sign in the last equation flips the convexity and we conclude that $\tilde{I}(\beta)$ is concave in $\beta$.

B Bounds on causal density

We now prove that causal density is upper-bounded by time-delayed mutual information, satisfying what other authors have considered a fundamental requirement for a measure of integrated information [37]. As before, we omit the arguments to CD for clarity. We begin by writing down CD in terms of mutual information:

$$ \mathrm{CD} = \frac{1}{n(n-1)} \sum_{i \ne j} \mathrm{TE}_\tau\left( X^i \to X^j \,|\, X^{[ij]} \right) = \frac{1}{n(n-1)} \sum_{i \ne j} I\left( X^i_t ; X^j_{t+\tau} \,|\, X^{[i]}_t \right) , \tag{61} $$

where as before $X^{[i]}_t$ represents the set of all variables in $X_t$ except $X^i_t$. We will use the chain rule of mutual information [16],

$$ I(X; Y, Z) = I(X; Z) + I(X; Y | Z) . \tag{62} $$

Using this chain rule and the non-negativity of mutual information we can state that $I(X^i_t ; X^j_{t+\tau} | X^{[i]}_t) \le I(X_t ; X^j_{t+\tau})$, and therefore

$$ \mathrm{CD} \le \frac{1}{n(n-1)} \sum_{i \ne j} I\left( X_t ; X^j_{t+\tau} \right) . \tag{63} $$

Also by the same chain rule, it is easy to see that $I(X_t ; X^j_{t+\tau}) \le I(X_t ; X_{t+\tau})$. Then

$$ \mathrm{CD} \le \frac{1}{n(n-1)} \sum_{i \ne j} I\left( X_t ; X_{t+\tau} \right) . \tag{64} $$

Given that the sum runs across all $n(n-1)$ pairs, we arrive at our result

$$ \mathrm{CD} \le I\left( X_t ; X_{t+\tau} \right) . \tag{65} $$

C Properties of integrated information measures

We prove the properties listed in Table 2. We will make use of the properties of mutual information introduced in Sec. 2, repeated here for convenience:

MI-1 $I(X; Y) = I(Y; X)$,
MI-2 $I(X; Y) \ge 0$,
MI-3 $I(f(X); g(Y)) = I(X; Y)$ for any injective functions f,
g.

Whole-minus-sum integrated information $\Phi$

Time-symmetric: Follows from (MI-1).
Non-negative (does not hold): Proof by counterexample. If $X^i_t = X^j_t$, we have $\Phi = (1 - N)\, I(X^i_t ; X^i_{t-\tau}) \le 0$.
Rescaling-invariant: Follows from (MI-3) when Balduzzi and Tononi's [5] normalisation factor is not used.
Bounded by TDMI: Follows from (MI-2).

Integrated stochastic interaction $\tilde{\Phi}$

Time-symmetric: Follows from $H(X_t | X_{t-\tau}) = H(X_{t-\tau} | X_t)$, which can be proved starting from the system temporal joint entropy

$$ H(X_t, X_{t-\tau}) = H(X_t | X_{t-\tau}) + H(X_{t-\tau}) = H(X_{t-\tau}, X_t) = H(X_{t-\tau} | X_t) + H(X_t) , $$

and using the fact that by the ergodic property $H(X_t) = H(X_{t-\tau})$. The same logic applies to all parts of the system.
Non-negative: Follows from the fact that $\tilde{\Phi}$ is an M-projection (see Ref. [37]).
Rescaling-invariant (does not hold): Follows from the non-invariance of differential entropy [16] (regardless of whether a normalisation factor is used).
Bounded by TDMI (does not hold): Proof by counterexample. In the AR process of Sec. 4.2, $\tilde{\Phi} \to \infty$ as $c \to 1$, although TDMI remains finite.

Integrated synergy $\psi$

Time-symmetric (does not hold): Proof by counterexample. For the AR system with

$$ A = \begin{pmatrix} a & a \\ 0 & 0 \end{pmatrix} , \qquad \Sigma(\varepsilon) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} , $$

we have $\psi = \tfrac{1}{2} \log\left( 1 + a^2 \right)$, while for the time-reversed process $\psi = \tfrac{1}{2} \log\left( 1 + a^4 \right)$. Note that this proof applies only to the MMI-PID used in this paper and presented in [9].
Non-negative: Follows from $I_\cup(X, Y; Z) \le I(\{X, Y\}; Z)$ [49].
Rescaling-invariant: Follows from (MI-3) and the fact that $I_\cap$ is also invariant (see property (Eq) in Section 5 of [19]).
Bounded by TDMI: Follows from the non-negativity of $I_\cup$ [49].

Decoder-based integrated information $\Phi^*$

Non-negative: Follows from $I^*[X; \tau, \mathcal{P}] \le I(X_t ; X_{t-\tau})$, proven in [33].
Rescaling-invariant: Assume the measure is computed on a time series of rescaled data $X^r_t = X_t A$, where $A$ is a diagonal matrix with positive real entries. Then its covariance is related to the covariance of the original time series as $\Sigma^r_X = E\left[ X^{rT}_t X^r_t \right] = E\left[ A^T X_t^T X_t A \right] = A \Sigma_X A$. We can analogously calculate $\Pi_X$, $\Pi_{X\tilde{X}}$, $\Pi_{X|\tilde{X}}$ and easily verify that all $A$'s cancel out, proving the invariance.
Bounded by TDMI: Follows from $I^*[X; \tau, \mathcal{P}] \ge 0$, proven in [33].

Geometric integrated information $\Phi_G$

Time-symmetric: Follows from the symmetry in the constraints that define the manifold of restricted models $Q$ [37].
Non-negative: Follows from the fact that $\Phi_G$ is an M-projection [37].
Rescaling-invariant: Given a Gaussian distribution $p$ with covariance $\Sigma_p$, its M-projection in $Q$ is another Gaussian with covariance $\Sigma_q$. Given a new distribution $p'$ formed by
rescaling some of the variables in $p$, the M-projection of $p'$ is a Gaussian with covariance $A \Sigma_q A$, with $A$ a diagonal positive matrix (see above), which satisfies $D_{\mathrm{KL}}(p \,\|\, q) = D_{\mathrm{KL}}(p' \,\|\, q')$; therefore $\Phi_G$ is invariant to rescaling.
Bounded by TDMI: TDMI can be defined as the M-projection of the full model $p$ to a manifold of restricted models $Q_{MI} = \{ q : q(X_t, X_{t-\tau}) = q(X_t)\, q(X_{t-\tau}) \}$ [37]. The bound $\Phi_G \le I(X_t ; X_{t-\tau})$ follows from the fact that $Q_{MI} \subset Q$.

Causal density

Time-symmetric (does not hold): Follows from the non-symmetry of transfer entropy [47].
Non-negative: Re-writing CD as a sum of conditional MI terms, follows from (MI-2).
Rescaling-invariant: Follows from (MI-3).
Bounded by TDMI: Proven in Appendix B.

References

[1] S.-i. Amari. Information Geometry in Optimization, Machine Learning and Statistical Inference. Frontiers of Electrical and Electronic Engineering in China, 5(3):241–260, 2010.
[2] S.-i. Amari and H. Nagaoka. Methods of Information Geometry. 2000.
[3] X. D. Arsiwalla and P. F. M. J. Verschure. Integrated Information for Large Complex Networks. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2013.
[4] N. Ay. Information Geometry on Complexity and Stochastic Interaction. Entropy, 17(4):2432–2458, 2015.
[5] D. Balduzzi and G. Tononi. Integrated Information in Discrete Dynamical Systems: Motivation and Theoretical Framework. PLoS Computational Biology, 4(6):e1000091, 2008.
[6] L. Barnett, A. B. Barrett, and A. K. Seth. Granger Causality and Transfer Entropy Are Equivalent for
Gaussian Variables. Physical Review Letters, 103(23):238701, 2009.
[7] L. Barnett and A. K. Seth. Behaviour of Granger Causality under Filtering: Theoretical Invariance and Practical Application. Journal of Neuroscience Methods, 201(2):404–419, 2011.
[8] L. Barnett and A. K. Seth. The MVGC Multivariate Granger Causality Toolbox: A New Approach to Granger-causal Inference. Journal of Neuroscience Methods, 223:50–68, 2014.
[9] A. B. Barrett. An Exploration of Synergistic and Redundant Information Sharing in Static and Dynamical Gaussian Systems. 2014, arXiv:1411.2832.
[10] A. B. Barrett and L. Barnett. Granger Causality is Designed to Measure Effect, not Mechanism. Frontiers in Neuroinformatics, 7:6, 2013.
[11] A. B. Barrett and A. K. Seth. Practical Measures of Integrated Information for Time-series Data. PLoS Computational Biology, 7(1):e1001052, 2011.
[12] N. Bertschinger, J. Rauh, E. Olbrich, and J. Jost. Shared Information – New Insights and Problems in Decomposing Information in Complex Systems. In T. Gilbert, M. Kirkilionis, and G. Nicolis, editors, Proceedings of the European Conference on Complex Systems 2012, Springer Proceedings in Complexity. Springer International Publishing, 2012, arXiv:1210.5902.
[13] N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay. Quantifying Unique Information. Entropy, 16(4):2161–2183, 2014, arXiv:1311.2852.
[14] S. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004, arXiv:1111.6189v1.
[15] M. A. Cerullo. The Problem with Phi: A Critique of Integrated Information Theory. PLoS Computational Biology, 11(9):e1004286, 2015.
[16] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, Hoboken, 2006.
[17] P. Dayan and L. F. Abbott. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, Cambridge, MA, 2001.
[18] C. W. J. Granger. Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica, 37(3):424, 1969.
[19] V. Griffith. A Principled Infotheoretic φ-like Measure. 2014, arXiv:1401.0978.
[20] V. Griffith and C. Koch. Quantifying Synergistic Mutual Information. 2012, arXiv:1205.4265.
[21] S. Hidaka and M. Oizumi. Fast and Exact Search for the Partition with Minimal Information Loss. 2017, arXiv:1708.01444.
[22] M. D. Humphries and K. Gurney. Network ‘Small-world-ness:’ a Quantitative Method for Determining Canonical Network Equivalence. PLoS ONE, 3(4):e0002051, 2008.
[23] R. A. A. Ince. Measuring Multivariate Redundant Information with Pointwise Common Change in Surprisal. Entropy, 19(7):318, 2017, arXiv:1602.05063.
[24] J. W. Kay and R. A. A. Ince. Exact Partial Information Decompositions for Gaussian Systems Based on Dependency Constraints. 2018, arXiv:1803.02030.
[25] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating Mutual Information. Physical Review E, 69(6):066138, 2004.
[26] S. Krohn and D. Ostwald. Computing Integrated Information. 2016, arXiv:1610.03627.
[27] P. E. Latham and S. Nirenberg. Synergy, Redundancy, and Independence in Population Codes, Revisited. The Journal of Neuroscience, 25(21):5195–206, 2005.
[28] M. Lindner, R. Vicente, V. Priesemann, and M. Wibral. TRENTOOL: a Matlab Open Source Toolbox to Analyse Information Flow in Time Series Data with Transfer Entropy. BMC Neuroscience, 12(1):119, 2011.
[29] J. T. Lizier, J. Heinzle, A. Horstmann, J.-D. Haynes, and M. Prokopenko. Multivariate Information-theoretic Measures Reveal Directed Information Structure and Task Relevant Changes in fMRI Connectivity. Journal of Computational Neuroscience, 30(1):85–107, 2010.
[30] H. Lütkepohl. New Introduction to Multiple Time Series Analysis. New York, 2005.
[31] P. A. M. Mediano, J. C. Farah, and M. P. Shanahan. Integrated Information and Metastability in Systems of Coupled Oscillators. 2016, arXiv:1606.08313.
[32] P. A. M. Mediano and M. P. Shanahan. Balanced Information Storage and Transfer in Modular Spiking Neural Networks. 2017, arXiv:1708.04392.
[33] N. Merhav, G. Kaplan, A. Lapidoth, and S. Shamai Shitz. On Information Rates for Mismatched Decoders. IEEE Transactions on Information Theory, 40(6):1953–1967, 1994.
[34] M. Oizumi, L. Albantakis, and G. Tononi. From the Phenomenology to the Mechanisms of Consciousness: Integrated Information Theory 3.0. PLoS Computational Biology, 10(5):e1003588, 2014.
[35] M. Oizumi, S.-i. Amari, T. Yanagawa, N. Fujii, and N. Tsuchiya. Measuring Integrated Information from the Decoding Perspective. 2015, arXiv:1505.04368.
[36] M. Oizumi, T. Ishii, K. Ishibashi, T. Hosoya, and M. Okada. Mismatched Decoding in the Brain. The Journal of Neuroscience, 30(13):4815–26, 2010.
[37] M. Oizumi, N. Tsuchiya, and S.-i. Amari. A Unified Framework for Information Integration Based on Information Geometry. 2015, arXiv:1510.04455.
[38] F. Rosas, V. Ntranos, C. Ellison, S. Pollin, and M. Verhelst. Understanding Interdependency Through Complex Information Sharing. Entropy, 18(2):38, 2016.
[39] A. K. Seth, A. B. Barrett, and L. Barnett. Causal Density and Integrated Information as Measures of Conscious Level. Philosophical Transactions A, 369(1952):3748–67, 2011.
[40] A. K. Seth, E. Izhikevich, G. N. Reeke, and G. M. Edelman. Theories and Measures of Consciousness: An Extended Framework. Proceedings of the National Academy of Sciences, 2006.
[41] M. Tegmark. Improved Measures of Integrated Information. 2016, arXiv:1601.02626.
[42] D. Toker and F. Sommer. Moving Past the Minimum Information Partition: How To Quickly and Accurately Calculate Integrated Information. 2016, arXiv:1605.01096.
[43] D. Toker and F. T. Sommer. Greater Than The Sum: Integrated Information In Large Brain Networks. 2017, arXiv:1708.02967.
[44] G. Tononi and O. Sporns. Measuring Information Integration. BMC Neuroscience, 4(3), 2003.
[45] G. Tononi, O. Sporns, and G. M. Edelman. A Measure for Brain Complexity: Relating Functional Segregation and Integration in the Nervous System. Proceedings of the National Academy of Sciences, 91(11):5033–7, 1994.
[46] Q. Wang, S. R. Kulkarni, and S. Verdu. Divergence Estimation for Multidimensional Densities Via k-Nearest-Neighbor Distances. IEEE Transactions on Information Theory, 55(5):2392–2405, 2009.
[47] M. Wibral, R. Vicente, and J. T. Lizier, editors. Directed Information Measures in Neuroscience. Understanding Complex Systems. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014.
[48] K. Wiesner, M. Gu, E. Rieper, and V. Vedral. Information-theoretic Bound on the Energy Cost of Stochastic Simulation. 2011, arXiv:1110.4217.
[49] P. L. Williams and R. D. Beer. Nonnegative Decomposition of Multivariate Information. 2010, arXiv:1004.2515.
[50] H. Yin, A. R. Benson, and J. Leskovec. Higher-order Clustering in Networks. 2017, arXiv:1704.03913.
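
Numerical note on Appendix A. The closed-form expression for Ĩ(β) derived in Appendix A.1 (Eqs. 54 to 56) can be evaluated directly from covariance matrices. The Python sketch below is an illustration only and is not the authors' code: the function names (stationary_cov, tdmi, i_tilde), the four-node AR(1) coupling matrix, the identity noise covariance, the two-part partition and the choice τ = 1 are all assumptions made for this example. It can be used to check three consequences of Eq. (39) and Appendix A.2: Ĩ(0) = 0, Ĩ(1) equals the time-delayed mutual information when the partition is trivial (q = p), and Ĩ(β) is concave on a grid of β values.

```python
import numpy as np


def stationary_cov(A, S_eps):
    """Stationary covariance S of X_t = A X_{t-1} + eps_t, i.e. S = A S A^T + S_eps."""
    n = A.shape[0]
    vec_S = np.linalg.solve(np.eye(n * n) - np.kron(A, A), S_eps.reshape(-1))
    return vec_S.reshape(n, n)


def tdmi(S, S_lag):
    """Time-delayed mutual information I(X_t; X_{t-tau}) in nats."""
    cond = S - S_lag @ np.linalg.solve(S, S_lag.T)  # Cov(X_t | X_{t-tau})
    return 0.5 * (np.linalg.slogdet(S)[1] - np.linalg.slogdet(cond)[1])


def i_tilde(beta, S, S_lag, parts):
    """Evaluate Eq. (54). S = Cov(X_t), S_lag = Cov(X_t, X_{t-tau}).

    `parts` is a list of index lists defining the partition (hypothetical
    argument name chosen for this sketch).
    """
    n = S.shape[0]
    mask = np.zeros((n, n), dtype=bool)
    for part in parts:
        mask[np.ix_(part, part)] = True
    Pi = np.where(mask, S, 0.0)          # block diagonal Pi_X
    Pi_lag = np.where(mask, S_lag, 0.0)  # block diagonal Pi_{X Xtilde}
    Pi_inv = np.linalg.inv(Pi)
    Pc_inv = np.linalg.inv(Pi - Pi_lag @ Pi_inv @ Pi_lag.T)  # inverse of Eq. (43)
    M = Pc_inv @ Pi_lag @ Pi_inv  # Pi_{X|Xt}^{-1} Pi_{X Xt} Pi_X^{-1}
    K = Pi_inv @ Pi_lag.T @ M     # Pi_X^{-1} Pi_{Xt X} Pi_{X|Xt}^{-1} Pi_{X Xt} Pi_X^{-1}
    Q = np.linalg.inv(S) + beta * K                       # Eq. (55)
    R = -beta * K - beta**2 * M @ np.linalg.inv(Q) @ M.T  # Eq. (56)
    logdet = np.linalg.slogdet(Q)[1] + np.linalg.slogdet(S)[1]
    # S_lag.T plays the role of Sigma_{Xtilde X} = Cov(X_{t-tau}, X_t).
    return 0.5 * logdet + 0.5 * np.trace(S @ R) + beta * np.trace(M @ S_lag.T)


# Example system (assumed values; every row sum is below 1, so the process is stable).
A = np.array([[0.3, 0.2, 0.0, 0.1],
              [0.0, 0.3, 0.2, 0.0],
              [0.1, 0.0, 0.3, 0.2],
              [0.2, 0.1, 0.0, 0.3]])
S = stationary_cov(A, np.eye(4))
S_lag = A @ S                 # Cov(X_t, X_{t-1}) for an AR(1) process
parts = [[0, 1], [2, 3]]

betas = np.linspace(0.0, 2.0, 41)
values = np.array([i_tilde(b, S, S_lag, parts) for b in betas])
print("TDMI:", tdmi(S, S_lag))
print("grid maximum of I~(beta):", values.max(), "at beta =", betas[values.argmax()])
```

The grid maximum is only a coarse stand-in for I* = max over β of Ĩ(β); since Ĩ(β) is concave, any one-dimensional convex optimiser could replace the grid search.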

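
Numerical note on Appendix B. The bound CD ≤ I(X_t; X_{t+τ}) can also be checked numerically for a Gaussian AR(1) system, where every (conditional) mutual information in Eq. (61) reduces to log-determinants of sub-blocks of the joint covariance of (X_t, X_{t+1}). The sketch below is a hedged illustration under assumed values: the helper names (stationary_cov, gaussian_cmi, causal_density_and_tdmi), the coupling matrix, the identity noise covariance and τ = 1 are not taken from the paper.

```python
import numpy as np


def stationary_cov(A, S_eps):
    """Stationary covariance S of X_t = A X_{t-1} + eps_t, i.e. S = A S A^T + S_eps."""
    n = A.shape[0]
    vec_S = np.linalg.solve(np.eye(n * n) - np.kron(A, A), S_eps.reshape(-1))
    return vec_S.reshape(n, n)


def gaussian_cmi(cov, a, b, c=()):
    """I(A; B | C) in nats for jointly Gaussian variables with covariance `cov`.

    Uses I(A; B | C) = H(A, C) + H(B, C) - H(C) - H(A, B, C); the (2 pi e)
    constants in the Gaussian entropies cancel, leaving log-determinants.
    """
    def logdet(idx):
        if not idx:
            return 0.0
        return np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]

    a, b, c = list(a), list(b), list(c)
    return 0.5 * (logdet(a + c) + logdet(b + c) - logdet(c) - logdet(a + b + c))


def causal_density_and_tdmi(A, S_eps):
    """Return (CD, TDMI) for lag tau = 1, with CD as in Eq. (61)."""
    n = A.shape[0]
    S = stationary_cov(A, S_eps)
    # Joint covariance of (X_t, X_{t+1}): indices 0..n-1 are X_t, n..2n-1 are X_{t+1}.
    joint = np.block([[S, S @ A.T], [A @ S, S]])
    te_sum = 0.0
    for i in range(n):
        rest = [k for k in range(n) if k != i]  # X_t^{[i]}: all of X_t except X_t^i
        for j in range(n):
            if j != i:
                te_sum += gaussian_cmi(joint, [i], [n + j], rest)
    cd = te_sum / (n * (n - 1))
    return cd, gaussian_cmi(joint, range(n), range(n, 2 * n))


# Example system (assumed values; every row sum is below 1, so the process is stable).
A = np.array([[0.3, 0.2, 0.0, 0.1],
              [0.0, 0.3, 0.2, 0.0],
              [0.1, 0.0, 0.3, 0.2],
              [0.2, 0.1, 0.0, 0.3]])
cd, mi = causal_density_and_tdmi(A, np.eye(4))
print("causal density:", cd)
print("TDMI:", mi)
```

For this example the printed causal density should lie between zero and the printed TDMI, in line with (MI-2) and Eq. (65).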
Archive: https://archive.tmtresearch.org/transmaterialization-com/iit-30.html
Original source: https://transmaterialization.com