
Submodularity

June 22nd, 2009

Last week I heard about submodularity at least twice, which made me want to write about it. The first time was in the talk by Carlos Guestrin at the first CompSust conference. His very interesting point was: machine learning has exploited convexity at the core of many of its algorithms (SVMs, for example); we should look for other general structures, besides convexity, that have interesting properties and allow us to design algorithms around them. He proposes submodularity as a natural candidate.

The idea of submodular functions generalizes, in a certain way, the concept of concave functions. The main property captured is the diminishing returns property. Consider a ground set {X} and the set {2^X} of its subsets. We say that a function {f: 2^X \rightarrow \mathbb R} is submodular if:

\displaystyle f(S \cup \{z\}) - f(S) \geq f(T \cup \{z\}) - f(T) \text{ for all } S \subseteq T \text{ and } z \notin T

that is, adding a new element {z} to a larger set has a smaller marginal gain than adding it to a smaller set. This is very natural in a lot of circumstances: for example, let {S} be a set of books (or papers) and {f(S)} the knowledge you get by reading them. Since the topics overlap, reading an extra book when you have already read {100} has a smaller marginal gain (or at least no greater) than reading it when you have read only {10}. One can think of submodularity as a natural generalization of concavity, which also has this diminishing returns property. In fact, if {\phi} is a concave function, then {f(S) = \phi(\vert S \vert)} is submodular.
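
As a quick illustration (my own, not from the post), here is a tiny Python sketch of a coverage function, a standard example of a monotone submodular function, showing the diminishing returns property; the books and topics are made up.

    # Coverage function: each book covers a set of topics,
    # f(S) = number of distinct topics covered by the books in S.
    def coverage(books, S):
        covered = set()
        for b in S:
            covered |= books[b]
        return len(covered)

    books = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 4, 5}}

    small, large = {"A"}, {"A", "B"}          # small is a subset of large
    z = "C"
    gain_small = coverage(books, small | {z}) - coverage(books, small)   # = 2
    gain_large = coverage(books, large | {z}) - coverage(books, large)   # = 1
    assert gain_small >= gain_large           # diminishing returns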

In a talk on Cascading Behavior in Networks, Jon Kleinberg described a very interesting submodular function: consider a graph {G = (V,E)} and an epidemic spreading through its nodes. Suppose an initial set {S \subseteq V} of infected nodes. As soon as a node {u} becomes infected, each node {v} such that {(u,v) \in E} becomes infected with probability {p}. Let {f(S)} be the expected number of infected nodes at the end of the process. It can be proved that {f} is submodular. One can think of this infection as a piece of information spreading through a network (say, the blogosphere). If we can start spreading the information from {k} nodes, which nodes should we pick so that as many nodes as possible get it in the end?
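
Since {f(S)} is defined as an expectation, a natural way to evaluate it is by simulation. Here is a rough Monte Carlo sketch (my own illustration, assuming a simple adjacency-list representation of the graph):

    import random

    def expected_spread(graph, S, p, trials=1000):
        """Estimate f(S) = expected number of infected nodes.
        graph: dict mapping each node to a list of neighbours."""
        total = 0
        for _ in range(trials):
            infected = set(S)
            frontier = list(S)
            while frontier:                       # newly infected nodes try to infect
                new = []
                for u in frontier:
                    for v in graph[u]:
                        if v not in infected and random.random() < p:
                            infected.add(v)
                            new.append(v)
                frontier = new
            total += len(infected)
        return total / trials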

Another example is vertex cover: we can think of {S \subseteq V} as a set of nodes in a graph and {f(S)} as the number of edges covered by those nodes. It is not hard to see that this function is submodular. Set cover and other covering problems can also be modelled with submodular functions. Normally, the main question is something like: how can I maximize {f} given that I can pick at most {k} elements for {S}? Stated in mathematical terms:

\displaystyle \max_{\vert S \vert \leq k} f(S)

The fact that {f} is submodular makes it possible to approximate this maximum with a simple greedy algorithm, thanks to the following very interesting result:

Theorem 1 (Nemhauser et al., 1978) Given a monotone (i.e., {f(S) \leq f(T)} for {S \subseteq T}) submodular function {f : 2^X \rightarrow {\mathbb R}_+} with {f(\emptyset) = 0}, the set {S = S_k} obtained by taking {S_0 = \emptyset} and

\displaystyle S_{i+1} = S_i \cup \{z_i\} \text{ where } z_i \in \text{argmax}_{z \notin S_i} f(S_i \cup \{z\})

is a {\left( 1 - \frac{1}{e} \right)} approximation to {\max_{\vert S \vert \leq k} f(S)}.

Proof: First, notice that:

\displaystyle \begin{aligned}f(A \cup B) & = f(A) + \sum_{j=1}^n \left[ f(A \cup \{b_1, \hdots, b_j \}) - f(A \cup \{b_1, \hdots, b_{j-1} \}) \right] \\ & \leq f(A) + \sum_{j=1}^n \left[ f(A \cup \{b_j \}) - f(A) \right] \end{aligned}

where {B = \{ b_1, \hdots, b_n \}} and the inequality follows by submodularity. Now let {T} be the optimal solution of the problem. By monotonicity and the inequality above, we have:

\displaystyle f(T) \leq f(S_i \cup T) \leq f(S_i) + \sum_{t \in T} \left[ f(S_i \cup \{ t \}) - f(S_i) \right] \leq f(S_i) + \sum_{t \in T} \left[ f(S_{i+1}) - f(S_i) \right]

where the last inequality uses our greedy choice of {z_i}. So, writing {\delta_i = f(S_i) - f(S_{i-1})} and using {\vert T \vert \leq k}, we know that:

\displaystyle f(T) \leq f(S_i) + k \delta_{i+1} \Rightarrow \delta_{i+1} \geq \frac{f(T) - f(S_i)}{k}

and therefore:

\displaystyle f(S_{i+1}) = f(S_i) + \delta_{i+1} \geq \frac{1}{k} f(T) + \left( 1 - \frac{1}{k} \right) f(S_i)

Solving that recursion, it is easy to see that:

\displaystyle f(S_i) \geq \left[ 1 - \left( 1 - \frac{1}{k} \right)^i \right] f(T)
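
Indeed, plugging the induction hypothesis into the recursion gives the induction step:

\displaystyle f(S_{i+1}) \geq \frac{1}{k} f(T) + \left( 1 - \frac{1}{k} \right) \left[ 1 - \left( 1 - \frac{1}{k} \right)^i \right] f(T) = \left[ 1 - \left( 1 - \frac{1}{k} \right)^{i+1} \right] f(T)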

Taking {i = k} and using {1 - (1 - 1/k)^k \geq 1 - 1/e}, we get the {1-1/e} bound. \Box
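
Here is a minimal Python sketch of this greedy procedure (my own illustration, not from the paper); {f} is assumed to be given as an oracle taking a Python set:

    def greedy_max(f, ground_set, k):
        """Greedily pick k elements; a (1 - 1/e)-approximation for monotone submodular f."""
        S = set()
        for _ in range(k):
            # pick the element with the largest marginal gain
            best = max((z for z in ground_set if z not in S), key=lambda z: f(S | {z}))
            S.add(best)
        return S

    # e.g. with the coverage example from above:
    # greedy_max(lambda S: coverage(books, S), set(books), k=2)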

The second time I heard about submodularity last week was in a nice talk in the Theory Discussion Group about the paper Approximating submodular functions everywhere. Basically, there is a monotone submodular function {f:2^{[n]} \rightarrow \mathbb R} whose values {f(S)} are accessed through an oracle. The paper shows how to construct a function {\hat{f}(S)} that approximates {f(S)} everywhere using only a {poly(n)} number of oracle calls. The approximation is such that:

\displaystyle \hat{f}(S) \leq f(S) \leq \tilde{O}(\sqrt{n}) \hat{f}(S)

The paper is quite deep, so I will not comment on it here, but rather on some nice things I learned in the introduction of the talk about submodular functions and polymatroids. The polymatroid associated with a submodular function {f} is the following polyhedron:

\displaystyle P_f = \left\{ x \in {\mathbb R}_+^n; x(S) \leq f(S), \forall S \subseteq [n] \right\}

where {x(S) = \sum_{i \in S} x_i}.
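
Just for intuition, membership in {P_f} can be checked (in exponential time, by brute force) simply by testing all the defining inequalities; a small sketch of my own, assuming {f} is given as an oracle on sets:

    from itertools import combinations

    def in_polymatroid(f, ground, x):
        """x: dict mapping each element to a value; f: set-function oracle."""
        if any(v < 0 for v in x.values()):
            return False
        elems = list(ground)
        for r in range(1, len(elems) + 1):
            for S in combinations(elems, r):
                if sum(x[i] for i in S) > f(set(S)):   # violates x(S) <= f(S)
                    return False
        return True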

One nice thing about these objects is that we can easily optimize linear functions over them using a greedy algorithm:

Theorem 2 Consider a weight function {w:[n] \rightarrow {\mathbb R}_+} and a monotone submodular function {f:2^{[n]} \rightarrow {\mathbb R}_+} with {f(\emptyset) = 0}. Then the problem:

\displaystyle \max_{x \in P_f} \langle w, x\rangle

can be solved by the following greedy algorithm: order the elements of the ground set in non-increasing order of {w}-value, i.e., {[n] = \{ s_1, \hdots, s_n\}} with {w(s_1) \geq w(s_2) \geq \hdots \geq w(s_n)}, and define

\displaystyle x(s_i) = f(\{s_1, \hdots, s_i\}) - f(\{s_1, \hdots, s_{i-1}\})

Proof: We need to prove that {x} is a feasible point of {P_f} and then that {x} is optimal. To prove feasibility, we need to show that {x(U) \leq f(U)} for all {U \subseteq [n]}. We prove it for all sets {U \subseteq \{ s_1, \hdots, s_i \}}, for every {i}, by induction on {i}. For {i = 0} it is trivial. Supposing we have proved it for {i}, let’s prove it for {i+1}. Take {U \subseteq \{ s_1, \hdots, s_{i+1} \}}. If {s_{i+1} \notin U}, we are done by the induction hypothesis. If not:

\displaystyle \begin{aligned} x(U) & = x(U \setminus \{ s_{i+1} \}) + x(s_{i+1}) \\ & \leq f(U \setminus \{ s_{i+1} \}) + f(\{s_1, \hdots, s_{i+1}\}) - f(\{s_1, \hdots, s_{i}\}) \leq f(U) \end{aligned}

where the last inequality follows by submodularity, since {U \setminus \{ s_{i+1} \} \subseteq \{s_1, \hdots, s_{i}\}}. Now we need to prove optimality, and this is essentially a primal-dual argument. The dual of the linear program:

\displaystyle \max \langle w, x\rangle \text{ s.t. } x(U) \leq f(U) \ \forall U \subseteq [n]; \ x \geq 0

is the program:

\displaystyle \min \sum_{U \subseteq [n]} f(U) y(U) \text{ s.t. } \sum_{U: i \in U} y(U) \geq w_i \ \forall i \in [n]; \ y(U) \geq 0

Consider the following dual solution: {y(\{s_1, \hdots, s_i\}) = w(s_i) - w(s_{i+1})} (taking {w(s_{n+1}) = 0}) and {y(U) = 0} for all other sets. It is feasible: {y \geq 0} because of the ordering of the weights, and {\sum_{U: s_i \in U} y(U) = \sum_{j \geq i} w(s_j) - w(s_{j+1}) = w(s_i)}. A short computation (summation by parts) shows that its objective value matches that of the primal solution {x}. Both are therefore optimal. \Box
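
In code, the greedy algorithm of Theorem 2 is very short; a sketch (my own illustration; {f} an oracle on sets, {w} a dictionary of nonnegative weights):

    def polymatroid_greedy(f, ground, w):
        """Greedy vertex of P_f maximizing <w, x>."""
        order = sorted(ground, key=lambda s: w[s], reverse=True)   # non-increasing weights
        x, prefix, prev = {}, set(), f(set())                      # f(emptyset) = 0 assumed
        for s in order:
            prefix = prefix | {s}
            val = f(prefix)
            x[s] = val - prev        # marginal value f({s_1..s_i}) - f({s_1..s_{i-1}})
            prev = val
        return x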

Notice that the proof has a “matroid flavour”, and that is where the name polymatroid comes from. Note that the rank function of a matroid is a submodular function and that the polymatroid of the rank function is exactly the independent set polytope of the matroid. More about this in a later blog post.

Sparsest Cut via Metric Embeddings

June 10th, 2009

A very powerful technique for designing approximation algorithms is metric embeddings. A lot of problems become easy when the graph has some special structure, like a tree or a cycle, or when the vertices are given as points in {{\mathbb R}^n}. This extra structure allows us to make geometric considerations about the object we are studying, and that is sometimes the key to finding a good approximation. The basic idea is to start with a space without much structure and approximate it by a space with stronger geometric or topological properties. A lot has been done recently along these lines for finding cuts in graphs and for Steiner tree problems.

Consider the problem of finding the sparsest cut in a graph. Given a graph {G = (V,E)}, we define the sparsity of a cut {(S, \overline{S})} as:

\displaystyle \alpha(S) = \frac{\vert \partial S \vert}{\min\{ \vert S \vert, \vert V \setminus S \vert \}}

where {\partial S} is the set of edges with one endpoint in {S} and the other in {\overline{S}}. The sparsest cut is the cut minimizing the sparsity. A related problem is the flux problem, which asks for a cut minimizing:

\displaystyle f(S) = \frac{\vert \partial S \vert}{ \vert S \vert \cdot \vert V \setminus S \vert}

If we can minimize the flux, we get a {2}-approximation for sparsity, since for all {S} we have {\frac{\alpha(S)}{\vert V \vert} \leq f(S) \leq \frac{2 \alpha(S)}{\vert V \vert}}. And if we can approximate the minimum flux within some factor, we can approximate the sparsest cut within the same factor multiplied by {2}.
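
To see this, write {\vert S \vert \cdot \vert V \setminus S \vert = \min\{ \vert S \vert, \vert V \setminus S \vert \} \cdot \max\{ \vert S \vert, \vert V \setminus S \vert \}} and note that {\vert V \vert / 2 \leq \max\{ \vert S \vert, \vert V \setminus S \vert \} \leq \vert V \vert}, which gives:

\displaystyle \frac{\alpha(S)}{\vert V \vert} = \frac{\vert \partial S \vert}{\min\{ \vert S \vert, \vert V \setminus S \vert \} \cdot \vert V \vert} \leq f(S) \leq \frac{\vert \partial S \vert}{\min\{ \vert S \vert, \vert V \setminus S \vert \} \cdot \vert V \vert / 2} = \frac{2 \alpha(S)}{\vert V \vert}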

Let’s formulate flux as an embedding problem: we would like a mapping {\phi: V \longrightarrow \{0, 1\}} that minimizes:

\displaystyle \frac{\sum_{(u,v) \in E} \vert \phi(u) - \phi(v) \vert }{ \sum_{(u,v) \in V^2}  \vert \phi(u) - \phi(v) \vert }
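
Indeed, if {\phi} is the indicator function of a cut {(S, \overline{S})}, then:

\displaystyle \sum_{(u,v) \in E} \vert \phi(u) - \phi(v) \vert = \vert \partial S \vert \qquad \text{and} \qquad \sum_{(u,v) \in V^2} \vert \phi(u) - \phi(v) \vert = c \cdot \vert S \vert \cdot \vert V \setminus S \vert

where {c} is {1} or {2} depending on whether each pair is counted once or as an ordered pair; either way, the ratio above is the flux {f(S)} up to a constant factor, which does not affect the approximation.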

Let’s relax this a bit. Substitute {\vert \phi(u) - \phi(v) \vert} by a variable {d_{uv}}. So, we want to minimize {\frac{\sum_{(u,v) \in E} d_{uv} }{\sum_{(u,v) \in V^2} d_{uv} }}. A standard trick for minimizing a ratio of two linear functions of the same variables is to fix the denominator to {1} and minimize the numerator. So, we want to minimize {\sum_{(u,v) \in E} d_{uv}} subject to {\sum_{(u,v) \in V^2} d_{uv} = 1}. We also wanted {d_{uv} \in \{ 0, 1\}}; more precisely, we wanted {d_{uv}} to be of a specific form called a cut metric, i.e., there is a partition of {V} into two sets and {d_{uv} = 1} iff {u} and {v} are on different sides. We relax this by asking only that {d} be a metric, i.e., nonnegative values for which the triangle inequality holds:

\displaystyle d_{uv} \leq d_{uw} + d_{wv}

By relaxing a minimization problem we might get a solution whose value is smaller than the optimum of the original problem; this, however, gives a lower bound on that optimum. In general, finding a good lower bound is the first step in the analysis of an approximation algorithm.

Theorem 1 The solution of the flux problem is lower-bounded by the solution of the following LP:

\displaystyle \left. \begin{aligned} & \min \sum_{(u,v) \in E} d_{uv} \text{ s.t. } \\ & \qquad \left\lbrace \begin{aligned} & \sum_{(u,v) \in V^2} d_{uv} = 1 \\ & d_{uv} \leq d_{uw} + d_{wv} \\ & d_{uv} \geq 0 \end{aligned} \right. \end{aligned} \right. \ \ \ \ \ (1)
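
For concreteness, here is a rough sketch (my own, assuming scipy is available and counting each pair of vertices once, which only rescales the program) of how one could write LP (1) down for an off-the-shelf solver:

    from itertools import combinations, permutations
    import numpy as np
    from scipy.optimize import linprog

    def solve_metric_lp(n, edges):
        """Solve the metric LP relaxation; edges is a list of (u, v) index pairs."""
        pairs = list(combinations(range(n), 2))            # one variable d_uv per pair u < v
        idx = {p: i for i, p in enumerate(pairs)}
        key = lambda u, v: idx[(min(u, v), max(u, v))]

        c = np.zeros(len(pairs))                           # objective: sum of d_uv over edges
        for (u, v) in edges:
            c[key(u, v)] += 1.0

        A_ub, b_ub = [], []                                # triangle inequalities d_uv <= d_uw + d_wv
        for u, v, w in permutations(range(n), 3):
            if u < v:
                row = np.zeros(len(pairs))
                row[key(u, v)] += 1
                row[key(u, w)] -= 1
                row[key(w, v)] -= 1
                A_ub.append(row)
                b_ub.append(0.0)

        A_eq, b_eq = [np.ones(len(pairs))], [1.0]          # normalization: sum of all d_uv = 1

        res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                      A_eq=np.array(A_eq), b_eq=b_eq,
                      bounds=(0, None), method="highs")
        return {p: res.x[i] for i, p in enumerate(pairs)}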

The idea for approximating the problem is to relax the cut-metric constraint as above, solve the LP to get a general metric, and then round it to obtain a cut. Suppose {\{d_{uv}\}} is the optimal solution to the LP above. How do we get a cut from it?

To do this, we will use a powerful result due to Bourgain, but first, let’s discuss some basics about metrics and embeddings. A metric space is a pair {(X,d)} where {X} is a set and {d:X^2 \longrightarrow {\mathbb R}_+} is a distance function. Examples of metric spaces are:

  1. {X} is the set of vertices of a weighted graph {G} and {d(u,v)} is the distance from {u} to {v} in the graph;
  2. {X = {\mathbb R}^n} and {d} is the {L_p}-norm, i.e., {d(u,v) = ( \sum_{i=1}^n \vert u_i - v_i \vert^p )^{1/p}}; when {p = 2}, this is the Euclidean norm, and when {p = 1}, it is the Manhattan distance. This space is called {L^n_p};
  3. {X = {\mathbb R}^n} and {d} is the {L_\infty}-norm, i.e., {d(u,v) = \max_i \vert u_i - v_i \vert}. It is called the {L_\infty} norm because {\Vert u - v \Vert_\infty = \lim_{p \rightarrow \infty} \Vert u - v \Vert_p}, where {\Vert u - v \Vert_p} denotes the distance in the {L_p} norm. This space is called {L^n_\infty}.

Given two metric spaces {(X,d_X)} and {(Y, d_Y)}, we say that a mapping {\phi:X \longrightarrow Y} is an embedding with distortion {D} if:

\displaystyle d_X (u,v) \leq d_Y(\phi(u), \phi(v)) \leq D d_X (u,v)

To make our lives easier, we will most of the time use the following (not really) weaker definition: we say that {\phi} is an embedding with distortion {D} if there is some factor {\gamma > 0} such that:

\displaystyle \gamma d_X (u,v) \leq d_Y(\phi(u), \phi(v)) \leq D \gamma d_X (u,v)

This is the same definition, but allowing for some scaling.

Bourgain’s Theorem says that:

Theorem 2 (Bourgain) For any metric space {(X,d)} with {|X| = n} there is an embedding into {L_1^{O(\log^2 n)}} with distortion {O(\log n)}. And we can construct this embedding in poly-time using a randomized algorithm.

Suppose {\phi} is such an embedding. Then we have a function {\phi : V \rightarrow {\mathbb R}^{O(\log^2 n)}} such that {d_{uv} \leq \Vert \phi(u) - \phi(v) \Vert_1 \leq d_{uv} \cdot O(\log n)}. This gives us {O(n \log^2 n)} cuts, obtained in the following way: for each coordinate {i} of {{\mathbb R}^{O(\log^2 n)}}, order the {n} points according to {\phi_i(u)} and, for each {j = 1 \ldots n-1}, take the cut {S} formed by the first {j} points; call this cut {S_{ij}}. Then take the cut of smallest flux among all of them.
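
In code, the rounding step could look like the following rough sketch (my own illustration; {\phi} is given as an {n \times m} numpy array whose rows are the images of the vertices, and edges is a list of vertex-index pairs):

    import numpy as np

    def sweep_cut_rounding(phi, edges):
        """Return the sweep cut of smallest flux over all coordinates."""
        n, m = phi.shape
        best_flux, best_cut = float("inf"), None
        for i in range(m):                          # one sweep per coordinate
            order = np.argsort(phi[:, i])           # vertices sorted by phi_i
            for j in range(1, n):
                S = set(order[:j].tolist())         # cut S_ij: the first j vertices
                boundary = sum(1 for (u, v) in edges if (u in S) != (v in S))
                flux = boundary / (len(S) * (n - len(S)))
                if flux < best_flux:
                    best_flux, best_cut = flux, S
        return best_cut, best_flux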

Theorem 3 The procedure described above generates a cut within a factor of {O(\log n)} to the optimal in poly-time.

To prove this, first we need some notation. Given the cut {(S_{ij}, \overline{S_{ij}})}, let {d_{S_{ij}}} be the cut metric associated with it, i.e., {d_{S_{ij}} (u,v)} is {0} if {u} and {v} are on the same side of the cut and {1} otherwise. Let also {\{ x_{i1} \leq x_{i2} \leq \ldots \leq x_{in} \} = \{ \phi_i(v); v \in V \}}. It is not difficult to see that:

\displaystyle \vert \phi_i(u) - \phi_i(v) \vert = \sum_{j=1}^{n-1} (x_{i,j+1} - x_{i,j}) d_{S_{ij}} (u,v)

and, therefore:

\displaystyle \Vert \phi(u) - \phi(v) \Vert_1 = \sum_{i=1}^{O(\log^2 n)} \sum_{j=1}^{n-1} (x_{i,j+1} - x_{i,j}) d_{S_{ij}} (u,v)

Now, we can go ahead and prove the theorem: Let {f^*} be the optimal flux and {f} the flux obtained by the algorithm. We have that:

\displaystyle f^* \leq f = \min_{ij} \frac{ \sum_{(u,v)\in E} d_{S_{ij}}(u,v)}{ \sum_{(u,v)\in V^2} d_{S_{ij}}(u,v)} = \min_{ij} \frac{ c_{ij} \sum_{(u,v)\in E} d_{S_{ij}}(u,v)}{ c_{ij} \sum_{(u,v)\in V^2} d_{S_{ij}}(u,v)}

where {c_{ij} = x_{i,j+1} - x_{i,j}}. Since a minimum of ratios (with nonnegative numerators and positive denominators) is at most the ratio of the corresponding sums, we then have:

\displaystyle f \leq \frac{\sum_{ij} \left[ c_{ij} \sum_{(u,v)\in E} d_{S_{ij}}(u,v) \right] }{\sum_{ij} \left[ c_{ij} \sum_{(u,v)\in V^2} d_{S_{ij}}(u,v) \right] } = \frac{\sum_{(u,v)\in E} \left[ \sum_{ij} c_{ij} d_{S_{ij}}(u,v) \right] }{ \sum_{(u,v)\in V^2} \left[ \sum_{ij} c_{ij} d_{S_{ij}}(u,v) \right] }

but the inner sums are exactly the expression for the {L_1}-norm above, so:

\displaystyle f \leq \frac{\sum_{(u,v)\in E} \Vert \phi(u) - \phi(v) \Vert_1 }{ \sum_{(u,v)\in V^2} \Vert \phi(u) - \phi(v) \Vert_1 } \leq \frac{\sum_{(u,v)\in E} O(\log n) d_{uv} }{ \sum_{(u,v)\in V^2} d_{uv} }

which is at most {O(\log n) \, f^*}, since {\frac{\sum_{(u,v) \in E} d_{uv}}{\sum_{(u,v) \in V^2} d_{uv}}} is the optimal value of the LP, which is a relaxation of the original problem and therefore at most {f^*}. And this finishes the proof.

This is one of the first applications of metric embeddings; it was introduced in a paper by Linial, London and Rabinovich. There are very nice resources on the web about metric embeddings, such as the course by Anupam Gupta at CMU, the course by Michael Goemans at MIT and the course by Tim Roughgarden at Stanford.

This is also an area full of exciting open problems. It is known, for example, that it is hard to decide whether a given metric is embeddable in {L_1} or not. But given a metric that we know is isometrically embeddable in {L_1}, how do we find the embedding? This is an open problem. In fact, it is not even known how to approximate such an embedding within a factor better than the {O(\log n)} given by Bourgain’s Theorem.