(Warning: These materials may contain many typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

**1. Overview**

Most scientific research is devoted to developing new tools or findings for challenging problems arising in practice, which help to provide explicit solutions to real problems. These works are typically called *constructive*, and characterize tasks which are possible to accomplish. However, there is also another line of work which focuses on the other side, i.e., provides fundamental limits on certain tasks. In other words, these works are mostly *non-constructive* and illustrate targets which are impossible to achieve. Such results are also useful for a number of reasons: we may certify the fundamental limits and optimality of some procedures so that it becomes unnecessary to look for better ones, we may understand the limitations of the current framework, and we may compute the amount of resources necessary to complete certain tasks.

Among different types of fundamental limits (called *converses* in information theory, and *lower bounds* in a number of other fields), throughout these lectures we will be interested in so-called *information-theoretic* ones. In an information-theoretic lower bound, one typically makes observations (either passively or actively) subject to some prescribed rules and argues that certain tasks cannot be accomplished using *any* tool based solely on these observations. To show these impossibility results, one argues that the amount of information contained in these observations is insufficient to perform these tasks. Therefore, the limited-information structure is of utmost importance in such arguments, and this structure occurs in a number of fields including statistics, machine learning, optimization, and reinforcement learning, to name a few: e.g., a given number of samples to perform inference, a given amount of training data to learn a statistical model, a given number of gradient evaluations to optimize a convex function, a given number of rounds of adaptivity to search for a good policy, etc.

These lectures are devoted to providing an overview of (classical and modern) tools to establish information-theoretic lower bounds. We will neither restrict ourselves to certain problems (e.g., statistical inference, optimization, bandit problems), nor to tools from information theory. Instead, we will try to present an extensive set of tools/ideas which are suitable for different problem structures, followed by numerous interdisciplinary examples. We will see that certain tools/ideas can be applied to many seemingly unrelated problems.

**2. Usefulness of Lower Bounds**

We ask the following question: why do we care about lower bounds? We remark that the usefulness of lower bounds is not restricted to providing fundamental limits and telling people what is impossible. In fact, the power of understanding lower bounds lies more in the upper bound, in the sense that it helps us to understand the problem structure better, including figuring out the most difficult part of the problem and the most essential piece of information, which should be fully exploited. In other words, the lower bounds are intertwined with the upper bounds, and should by no means be treated as an independent component. We elaborate on this point via the following examples from puzzle games.

**2.1. Example I: Card Guessing Game**

Consider the following magic show: a 52-card deck is thoroughly shuffled by the audience. Alice draws 5 cards from the top of the deck and reveals 4 of them to Bob one after another, and then Bob can always correctly guess the remaining card in Alice’s hand. How can Alice and Bob achieve that?

Instead of proposing an explicit strategy for Alice and Bob, let us look at the information-theoretic limit of this game first. Suppose the deck consists of n cards: for which values of n does such a strategy still exist? To prove such a bound, we need to understand what the possible strategies are. From Alice’s side, her strategy is simply a mapping f from unordered 5-tuples to ordered 4-tuples, with the additional restriction that f(A)\subseteq A for any unordered 5-tuple A\subseteq [n]. From Bob’s side, his strategy is another mapping g from ordered 4-tuples to a specific card in [n]. Finally, the correctness of Bob’s guess corresponds to

\{ g(f(A))\} \cup f(A) = A, \qquad \forall A\subseteq [n], |A|=5. \ \ \ \ \ (1)

An equivalent way to state (1) is that Bob can recover all 5 cards after observing the 4 revealed cards in order; write h for this strategy. Now we come to our information-theoretic observation: do ordered 4-tuples contain enough information to recover any unordered 5-tuple? Here we quantify information by cardinality: mathematically speaking, the mapping h must be surjective, so |\text{dom}(h)|\ge |\text{range}(h)|. In other words,

4!\cdot \binom{n}{4} \ge \binom{n}{5} \Longrightarrow n \le 124. \ \ \ \ \ (2)

Hence, this magic show must fail for any deck with more than 124 cards. The next question is whether n=124 is achievable. Now we are seeking an upper bound, but the previous lower bound still helps: since equality holds in (2) when n=124, the map h must be a bijection! Keeping the bijective nature of h in mind, it is not too hard to propose the following strategy. Label the cards by 1,2,\cdots,124, and suppose the chosen cards are c_1<c_2<\cdots<c_5. Alice computes s=\sum_{i=1}^5 c_i \pmod 5, taking the representative s\in\{1,\cdots,5\} (so that a sum divisible by 5 gives s=5), keeps c_s, and reveals the others d_1<d_2<d_3<d_4 to Bob. Let t=\sum_{i=1}^4 d_i \pmod 5. To decode c_s, Bob only needs to solve the following equation for c:

c \equiv - t + \min\{i\in [4]: c<d_i \} \pmod 5, \ \ \ \ \ (3)

where we define \min\emptyset = 5. It is easy to show that (3) always admits exactly 24 solutions among the 120 unrevealed cards in [124]\setminus\{d_1,\cdots,d_4\}, so Alice can tell Bob which solution is the hidden card through the order in which she reveals (d_1,\cdots,d_4), as there are 4!=24 possible orderings.
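The strategy for n=124 can be checked numerically. The following Python sketch is one possible instantiation, not necessarily the encoding the author has in mind: the function names `counting_bound`, `candidates`, `alice` and `bob` are hypothetical, and the choice to encode the index of the hidden card among the sorted solutions of (3) by the lexicographic rank of the revealed ordering is an assumption made for illustration.

```python
import random
from itertools import permutations
from math import comb

N = 124  # deck size; cards are labeled 1..N


def counting_bound(limit=1000):
    """Largest n with 4! * C(n,4) >= C(n,5), i.e. the bound in (2)."""
    return max(n for n in range(5, limit) if 24 * comb(n, 4) >= comb(n, 5))


def candidates(d):
    """Sorted solutions of (3) among the cards not in the sorted 4-tuple d."""
    t = sum(d) % 5
    sols = []
    for c in range(1, N + 1):
        if c in d:
            continue
        rank = 1 + sum(x < c for x in d)  # equals min{i : c < d_i}, min(empty) = 5
        if (c + t - rank) % 5 == 0:
            sols.append(c)
    return sols


def alice(hand):
    """Return the 4 revealed cards, in the order Alice shows them."""
    c = sorted(hand)
    s = sum(c) % 5 or 5  # representative of the sum in {1, ..., 5}
    hidden = c[s - 1]
    d = tuple(x for x in c if x != hidden)
    j = candidates(d).index(hidden)  # which solution of (3) is hidden
    return list(permutations(d))[j]  # assumed encoding: j-th ordering of d


def bob(revealed):
    """Recover the hidden card from the ordered revealed cards."""
    d = tuple(sorted(revealed))
    j = list(permutations(d)).index(tuple(revealed))
    return candidates(d)[j]


if __name__ == "__main__":
    hand = random.sample(range(1, N + 1), 5)
    shown = alice(hand)
    print(sorted(hand), shown, bob(shown))
```

Since `permutations` of a sorted tuple enumerates the 24 orderings in lexicographic order, Alice and Bob agree on the index j without further communication.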

**2.2. Example II: Coin Flipping Game**

Now Alice and Bob play another game cooperatively: consider n coins on the table, each of which may show heads or tails, and the initial state is unknown to both Alice and Bob. The audience tells Alice a number in [n], and she comes to the table, looks at all coins and flips exactly one of them. Bob then comes to the table and must correctly announce the number chosen by the audience. The question is: for which values of n can they come up with such a strategy?

To answer this question, again we need to understand the structure of all strategies. Let s\in \{0,1\}^n be the initial state of the coins, which may be arbitrary. We may also identify the states s with the vertices of an n-dimensional hypercube. Let k\in [n] be the number given by the audience. Alice’s strategy is then a mapping f: \{0,1\}^n\times [n]\rightarrow \{e_1,\cdots,e_n\}, choosing a coin to flip (where e_i denotes the i-th canonical vector) based on her knowledge. Bob’s strategy is a mapping g: \{0,1\}^n\rightarrow [n], and the correctness condition is that

g(s \oplus f(s,k)) = k, \qquad \forall s\in\{0,1\}^n, k\in[n]. \ \ \ \ \ (4)

It is clear from (4) that the map k\mapsto f(s,k) must be injective for every s. By a cardinality argument, this map is in fact bijective. Now the problem structure becomes more transparent: for each k\in [n], let g^{-1}(k)\subseteq \{0,1\}^n be the set of states where Bob will claim k. We say that these vertices of the hypercube have the *k-th color*. Then (4) states that, for each vertex s\in \{0,1\}^n, its n different neighbors \{s\oplus e_i \}_{i\in [n]} must have n different colors. The converse is also true: if there exists such a coloring scheme, then we may find f,g such that (4) holds.

Now when does such a coloring scheme exist? A simple observation is that every color must be used by the same number of vertices, by a double counting argument: each vertex has exactly one neighbor of any given color, and each vertex of that color has n neighbors, so every color class contains exactly 2^n/n vertices. Hence, we must have n \mid 2^n, which implies that n=2^m is a power of 2. Based on the coloring intuition given by the lower bound, the strategy also becomes simple: take any finite Abelian group G=\{g_1,\cdots,g_n\} with n elements, identify the colors with the elements of G, and let s\in \{0,1\}^n have color

G(s) = s_1g_1 + s_2g_2 + \cdots + s_ng_n \in G. \ \ \ \ \ (5)

It is straightforward to verify that (4) holds if and only if G has characteristic 2, i.e., 2g=0 for any g\in G. By algebra, such an Abelian group exists if and only if n is a power of 2, e.g., G=\mathbb{F}_2^m when n=2^m (necessity can be shown by taking quotients G/\{0,g\} repeatedly). Hence, such a strategy exists if and only if n=2^m is a power of 2.
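For G=\mathbb{F}_2^m the coloring (5) is easy to test by brute force. The Python sketch below makes some assumptions for illustration: coins and audience numbers are indexed 0,\cdots,n-1 rather than 1,\cdots,n, coin i is identified with the group element given by the binary expansion of i (so group addition is bitwise XOR), and the function names are hypothetical.

```python
from functools import reduce
from itertools import product
from operator import xor


def color(state):
    """Color (5) for G = F_2^m: XOR of the indices of the coins showing 1."""
    return reduce(xor, (i for i, bit in enumerate(state) if bit), 0)


def alice(state, k):
    """Index of the coin Alice flips so that the new color equals k."""
    return color(state) ^ k


def bob(state):
    """Bob announces the color of the state he sees."""
    return color(state)


def strategy_works(n):
    """Exhaustively check condition (4) for n coins, n a power of 2."""
    if n & (n - 1):
        raise ValueError("this sketch only covers n = 2^m")
    for state in product((0, 1), repeat=n):
        for k in range(n):
            flipped = list(state)
            flipped[alice(state, k)] ^= 1  # Alice flips exactly one coin
            if bob(flipped) != k:
                return False
    return True


def passes_counting_test(n):
    """Necessary condition from double counting: n divides 2^n."""
    return pow(2, n, n) == 0


if __name__ == "__main__":
    print([n for n in range(1, 20) if passes_counting_test(n)])
    print(all(strategy_works(n) for n in (2, 4, 8)))
```

Flipping coin i changes the color by exactly i in either direction because every element of \mathbb{F}_2^m is its own inverse, which is the characteristic-2 condition in code form.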

**2.3. Example III: Hat Guessing Game**

Now we look at a hat-guessing game with a more complicated information structure. There are 15 people sitting together, each of whom wears a hat whose color is red or blue, independently with probability \frac{1}{2} each. They cannot talk to each other, and they can see the colors of all hats except their own. Now each of them simultaneously either guesses the color of his own hat, or chooses to pass. They win if at least one person makes a guess and all guesses made are correct. What is the optimal winning probability in this game?

The answer to this question is \frac{15}{16}, which is surprising at first sight because it greatly outperforms the \frac{1}{2} achieved by the naive guessing scheme where only the first person guesses. To understand this improvement, we first think about the fundamental limit of this problem. Let s\in \{0,1\}^{15} be the sequence of hat colors; the state s is called a *success* if they win under s, and a *failure* otherwise. Clearly, at each success state s, there must be some person i\in [15] who makes a correct guess. However, since he cannot see s_i, he will make the same guess even if s_i is flipped, and this guess becomes wrong in the new state. This argument seems to suggest that the \frac{1}{2} winning probability cannot be improved, for each success state corresponds to a failure state and therefore the success states constitute at most half of the states. However, this intuition is wrong since multiple success states may correspond to the same failure state.

Mathematically speaking, let S be the set of success states and F be the set of failure states. By the previous argument, there exists a map f: S\rightarrow F which flips exactly one coordinate. Since there are at most 15 coordinates which can be flipped, we have |f^{-1}(s)|\le 15 for each s\in F. Consequently, |S|\le 15|F|, and the winning probability is

\frac{|S|}{|S|+|F|} \le \frac{15}{16}. \ \ \ \ \ (6)

This converse argument shows that, in order to achieve the optimal winning probability, we must have |f^{-1}(s)|=15 for any s\in F. Consequently, in any failure state, it must be the case that *everyone* makes a wrong guess. This crucial observation motivates the following strategy: identify the 15 people with the nonzero elements of \mathbb{F}_2^4, i.e., \mathbb{F}_2^4 \setminus\{0\}. Each person v computes the sums s_1, s_2 of the indices of the other people (those he can see) wearing red and blue hats, respectively. If v=s_1, he claims blue; if v=s_2, he claims red; otherwise he passes. Then one can check that a failure occurs if and only if the indices of all people with red hats sum to 0\in \mathbb{F}_2^4, which happens with probability (\frac{1}{2})^4 = \frac{1}{16}.
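The hat strategy can also be verified exhaustively. The sketch below is a minimal illustration under stated assumptions: elements of \mathbb{F}_2^m are written as m-bit integers with XOR as addition, the parameter m (with m=4 giving the 15 people above) and all function names are hypothetical, and each person's sums range only over the hats he can see.

```python
from functools import reduce
from itertools import product
from operator import xor


def guess(v, hats, people):
    """Guess of person v ('R', 'B', or None for pass); hats[v] is never read."""
    red_sum = reduce(xor, (u for u in people if u != v and hats[u] == "R"), 0)
    blue_sum = reduce(xor, (u for u in people if u != v and hats[u] == "B"), 0)
    if v == red_sum:
        return "B"
    if v == blue_sum:
        return "R"
    return None


def wins(hats, people):
    """Win: at least one person guesses and every guess made is correct."""
    guesses = {v: guess(v, hats, people) for v in people}
    made = [v for v in people if guesses[v] is not None]
    return bool(made) and all(guesses[v] == hats[v] for v in made)


def count_wins(m):
    """Number of winning states with 2^m - 1 people (m >= 2)."""
    people = range(1, 2 ** m)  # nonzero elements of F_2^m as integers
    total = 0
    for colors in product("RB", repeat=len(people)):
        hats = dict(zip(people, colors))
        total += wins(hats, people)
    return total


if __name__ == "__main__":
    for m in (2, 3):
        n_states = 2 ** (2 ** m - 1)
        print(m, count_wins(m), n_states)
```

With m=4 the same count is 30720 out of 32768 states, i.e., exactly \frac{15}{16}; the enumeration takes a few seconds in plain Python.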

**3. Bibliographic Notes**

Information-theoretic analysis can be used to solve a great many puzzles. We recommend two books for readers who would like to learn more about fun puzzles:

- Peter Winkler, *Mathematical puzzles: a connoisseur’s collection.* AK Peters/CRC Press, 2003.
- Jiri Matousek, *Thirty-three miniatures: Mathematical and Algorithmic applications of Linear Algebra.* Providence, RI: American Mathematical Society, 2010.
