
Neural Networks

Volume 137, May 2021, Pages 119-126

The Kolmogorov–Arnold representation theorem revisited

https://doi.org/10.1016/j.neunet.2021.01.020
Open access under a Creative Commons license

Abstract

There is a longstanding debate about whether the Kolmogorov–Arnold representation theorem can explain the use of more than one hidden layer in neural networks. The Kolmogorov–Arnold representation decomposes a multivariate function into an interior and an outer function and therefore indeed has a structure similar to that of a neural network with two hidden layers. But there are distinctive differences. One of the main obstacles is that the outer function depends on the represented function and can be wildly varying even if the represented function is smooth. We derive modifications of the Kolmogorov–Arnold representation that transfer smoothness properties of the represented function to the outer function and can be well approximated by ReLU networks. It appears that, instead of two hidden layers, a more natural interpretation of the Kolmogorov–Arnold representation is that of a deep neural network where most of the layers are required to approximate the interior function.

Keywords

Kolmogorov–Arnold representation theorem
Function approximation
Deep ReLU networks
Space-filling curves

1. Introduction

Why are additional hidden layers in a neural network helpful? The Kolmogorov–Arnold representation (KA representation in the following) seems to offer an answer to this question as it shows that every continuous function can be represented by a specific network with two hidden layers (Hecht-Nielsen, 1987). But this interpretation has been highly disputed. Articles discussing the connection between both concepts have titles such as “Representation properties of networks: Kolmogorov’s theorem is irrelevant” (Girosi & Poggio, 1989) and “Kolmogorov’s theorem is relevant” (Kurkova, 1991).

The original version of the KA representation theorem states that for any continuous function $f:[0,1]^d\to\mathbb{R}$, there exist univariate continuous functions $g_q$, $\psi_{p,q}$ such that
$$f(x_1,\dots,x_d)=\sum_{q=0}^{2d}g_q\Big(\sum_{p=1}^d\psi_{p,q}(x_p)\Big).\tag{1.1}$$
This means that the $(2d+1)(d+1)$ univariate functions $g_q$ and $\psi_{p,q}$ are enough for an exact representation of a $d$-variate function. Kolmogorov published the result in 1957, disproving the statement of Hilbert's 13th problem, which is concerned with the solution of algebraic equations. The earliest proposals in the literature introducing multiple layers in neural networks date back to the sixties, and the link between the KA representation and multilayer neural networks was made much later.

A ridge function is a function of the form $f(x)=\sum_{p=1}^m g_p(w_p^\top x)$, with vectors $w_p\in\mathbb{R}^d$ and univariate functions $g_p$. The structure of the KA representation can therefore be viewed as the composition of two ridge functions. There exists no exact representation of continuous functions by ridge functions, and matching upper and lower bounds for the best approximations are known (Gordon et al., 2002, Maiorov, 1999, Maiorov et al., 1999). The composition structure is thus essential for the KA representation. A two-hidden-layer feedforward neural network with activation function $\sigma$, hidden layers of width $m_1$ and $m_2$, and one output unit can be written in the form
$$f(x)=\sum_{q=1}^{m_1}d_q\,\sigma\Big(\sum_{p=1}^{m_2}b_{pq}\,\sigma(w_p^\top x+a_p)+c_q\Big),\quad\text{with parameters } w_p\in\mathbb{R}^d,\ a_p,b_{pq},c_q,d_q\in\mathbb{R}.$$
Because of the similarity between the KA representation and neural networks, the argument above suggests that additional hidden layers can lead to unexpected features of neural network functions.
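
As a quick illustration of this architecture (a minimal NumPy sketch with placeholder widths, weights, and activation; not code from the paper), the formula above can be evaluated directly:

```python
import numpy as np

def two_hidden_layer_net(x, w, a, b, c, d_out, sigma=np.tanh):
    """Evaluate sum_q d_q * sigma( sum_p b[p, q] * sigma(w[p]·x + a[p]) + c[q] )."""
    h1 = sigma(w @ x + a)          # first hidden layer: one unit per row of w
    h2 = sigma(b.T @ h1 + c)       # second hidden layer
    return d_out @ h2              # linear output unit

rng = np.random.default_rng(0)
d = 3                              # input dimension (placeholder)
m_inner, m_outer = 5, 4            # hidden-layer widths (placeholders)
x = rng.random(d)
w = rng.standard_normal((m_inner, d))        # inner weight vectors w_p
a = rng.standard_normal(m_inner)
b = rng.standard_normal((m_inner, m_outer))  # weights b_{pq}
c = rng.standard_normal(m_outer)
d_out = rng.standard_normal(m_outer)         # output weights d_q
print(two_hidden_layer_net(x, w, a, b, c, d_out))
```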

There are several reasons why the Kolmogorov–Arnold representation theorem was initially declared irrelevant for neural networks in Girosi and Poggio (1989). The original proof of the KA representation in Kolmogorov (1957) and some later versions are non-constructive and provide very little insight into how the function representation works. Although the $\psi_{p,q}$ are continuous, they are still rough functions sharing similarities with the Cantor function. Meanwhile, more refined KA representation theorems have been derived, strengthening the connection to neural networks (Braun and Griebel, 2009, Sprecher, 1965, Sprecher, 1996, Sprecher, 1997). Maiorov and Pinkus (1999) showed that the KA representation can essentially be rewritten in the form of a two-hidden-layer neural network for a non-computable activation function $\sigma$; see the literature review in Section 4 for more details. The following KA representation is much more explicit and practical.

Theorem 1

Theorem 2.14 in Braun, 2009

Fix $d\ge2$. There are real numbers $a,b_p,c_q$ and a continuous and monotone function $\psi:\mathbb{R}\to\mathbb{R}$, such that for any continuous function $f:[0,1]^d\to\mathbb{R}$, there exists a continuous function $g:\mathbb{R}\to\mathbb{R}$ with
$$f(x_1,\dots,x_d)=\sum_{q=0}^{2d}g\Big(\sum_{p=1}^d b_p\,\psi(x_p+qa)+c_q\Big).$$

This representation is based on translations of one inner function ψ and one outer function g. The inner function ψ is independent of f. The dependence on q in the first layer comes through the shifts qa. The right hand side can be realized by a neural network with two hidden layers. The first hidden layer has d units and activation function ψ and the second hidden layer consists of 2d+1 units with activation function g.

For a given $0<\beta\le1$, we will assume that the represented function $f$ is $\beta$-smooth, which here means that there exists a constant $C$ such that $|f(x)-f(y)|\le C|x-y|_\infty^\beta$ for all $x,y\in[0,1]^d$. Let $m>0$ be arbitrary. To approximate a $\beta$-smooth function up to an error $m^{-\beta}$, it is well known that standard approximation schemes need at least of the order of $m^d$ parameters. This means that any efficient neural network construction mimicking the KA representation and approximating $\beta$-smooth functions up to error $m^{-\beta}$ should have at most of the order of $m^d$ network parameters.

Starting from the KA representation, the objective of this article is to derive a deep ReLU network construction that is optimal in terms of the number of parameters. For that reason, we first present novel versions of the KA representation that are easy to prove and also allow us to transfer smoothness from the multivariate function to the outer function. In Section 3 the link is made to deep ReLU networks.

The efficiency of the approximating neural network is also the main difference to the related work (Kurkova, 1992, Montanelli and Yang, 2020). Based on sigmoidal activation functions, the proof of Theorem 2 in Kurkova (1992) proposes a neural network construction based on the KA representation with two hidden layers and $dm(m+1)$ and $m^2(m+1)^d$ hidden units to achieve an approximation error of the order of $m^{-\beta}$. This means that more than $m^{4+d}$ network weights are necessary, which is sub-optimal in view of the argument above. The very recent work (Montanelli & Yang, 2020) uses a modern version of the KA representation that guarantees some smoothness of the interior function. Combined with the general result on function approximation by deep ReLU networks in Yarotsky (2018), a rate is derived that depends on the smoothness of the outer function via the function class $K_C([0,1]^d;\mathbb{R})$; see p. 4 in Montanelli and Yang (2020) for a definition. The non-trivial dependence of the outer function on the represented function $f$ makes it difficult to derive explicit expressions for the approximation rate if $f$ is $\beta$-smooth. Moreover, as the KA representation only guarantees low regularity of the interior function, it remains unclear whether optimal approximation rates can be obtained.

Although the deep ReLU network proposed in Shen, Yang, and Zhang (2020) is not motivated by the KA representation or space-filling curves, the network construction is quite similar. Section 4 contains a more detailed comparison with this and other related approaches.

2. New versions of the KA representation

The starting point of our work is the apparent connection between the KA representation and space-filling curves (Gorchakov and Mozolenko, 2019, Sprecher and Draghici, 2002). A space-filling curve $\gamma$ is a surjective map $[0,1]\to[0,1]^d$. This means that it hits every point in $[0,1]^d$ and thus “fills” the cube $[0,1]^d$. Known constructions are based on iterative procedures producing fractal-type shapes. If $\gamma^{-1}$ exists, we could then rewrite any function $f:[0,1]^d\to\mathbb{R}$ in the form
$$f=\underbrace{(f\circ\gamma)}_{=:g}\circ\,\gamma^{-1}.\tag{2.1}$$
This would decompose the function $f$ into a function $\gamma^{-1}:[0,1]^d\to[0,1]$ that can be chosen to be independent of $f$ and a univariate function $g=f\circ\gamma:[0,1]\to\mathbb{R}$ containing all the information of the $d$-variate function $f$. Compared to the KA representation, there are two differences. Firstly, the interior function $\gamma^{-1}$ is $d$-variate and not univariate. Secondly, by Netto's theorem (Kupers), a continuous surjective map $[0,1]\to[0,1]^2$ cannot be injective, and $\gamma^{-1}$ does not exist. The argument above can therefore not be made precise for arbitrary dimension $d$ and a continuous space-filling curve $\gamma$.

To illustrate our approach, we first derive a simple KA representation based on (2.1) with $\gamma^{-1}$ an additive function. The identity avoids the continuity of the functions $\psi$ and $g$, which is the major technical obstacle in the proof of the KA representation. Moreover, the proof does not require that the represented function $f$ is continuous.

Lemma 1

Fix integers $d,B\ge2$. There exists a monotone function $\psi:[0,1]\to\mathbb{R}$ such that for any function $f:[0,1]^d\to\mathbb{R}$, we can find a function $g:\mathbb{R}\to\mathbb{R}$ with
$$f(x_1,\dots,x_d)=g\Big(\sum_{p=1}^d B^{-p}\psi(x_p)\Big).\tag{2.2}$$

Proof

The $B$-adic representation of a number is not unique. For the decimal representation, $1$ is for instance the same as $0.999\dots$ To avoid any problems that this may cause, we select for each real number $x\in[0,1]$ one $B$-adic representation $x=\sum_{j\ge1}B^{-j}a_j^x$ with $a_j^x\in\{0,\dots,B-1\}$. Throughout the following, it is often convenient to rewrite $x$ in its $B$-adic expansion. Set
$$x=\sum_{j=1}^\infty a_j^xB^{-j}=:[0.a_1^xa_2^xa_3^x\dots]_B$$
and define the function
$$\psi(x)=\sum_{j=1}^\infty a_j^xB^{-d(j-1)}.$$
The function $\psi$ is monotone and maps $x$ to a number with $B$-adic representation
$$[a_1^x.\underbrace{0\dots0}_{(d-1)\text{-times}}a_2^x\underbrace{0\dots0}_{(d-1)\text{-times}}a_3^x0\dots]_B,$$
inserting always $d-1$ zeros between the original $B$-adic digits of $x$. Multiplication by $B^{-p}$ moreover shifts the digits by $p$ places to the right. From that we obtain the $B$-adic representation
$$\Psi(x_1,\dots,x_d):=\sum_{p=1}^dB^{-p}\psi(x_p)=[0.a_1^{x_1}a_1^{x_2}\dots a_1^{x_d}a_2^{x_1}\dots]_B.\tag{2.3}$$
Because we can recover $x_1,\dots,x_d$ from $\Psi(x_1,\dots,x_d)$, the map $\Psi$ is invertible. Denote the inverse by $\Psi^{-1}$. We can now define $g=f\circ\Psi^{-1}$ and this proves the result. □
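
To make the digit interleaving concrete, here is a minimal numerical sketch (my own illustration with truncated expansions, not code from the paper) of a finite-precision version of $\Psi$ and its inverse:

```python
# A finite-precision sketch of the map Psi from the proof of Lemma 1: the first
# n B-adic digits of the d inputs are interleaved into the digits of a single
# number, from which the inputs can be recovered by de-interleaving.

def digits(x, B, n):
    """First n B-adic digits of x in [0, 1)."""
    out = []
    for _ in range(n):
        x *= B
        a = int(x)
        out.append(a)
        x -= a
    return out

def Psi(xs, B=2, n=8):
    """Truncated Psi(x_1,...,x_d) = sum_p B^{-p} psi(x_p): interleave the digits."""
    d = len(xs)
    digs = [digits(x, B, n) for x in xs]
    # the j-th digit of x_p becomes digit number p + d*(j-1) of Psi
    return sum(digs[p][j] * B ** (-(j * d + p + 1))
               for j in range(n) for p in range(d))

def Psi_inverse(y, d, B=2, n=8):
    """De-interleave the first n*d digits of y to recover (x_1, ..., x_d)."""
    digs = digits(y, B, n * d)
    return tuple(sum(digs[j * d + p] * B ** (-(j + 1)) for j in range(n))
                 for p in range(d))

x = (0.640625, 0.3125)              # both have terminating binary expansions
y = Psi(x)                          # one number encoding both coordinates
print(y, Psi_inverse(y, d=2))       # recovers (0.640625, 0.3125) exactly
```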

The proof provides some insights regarding the structure of the KA representation. Although one might find the construction of $\Psi:[0,1]^d\to[0,1]$ in the proof very artificial, a substantial amount of neighborhood information persists under $\Psi$. Indeed, points that are close are often mapped to nearby values. If for instance $x_1,x_2\in[0,1]^d$ are two points coinciding in all components up to the $k$th $B$-adic digit, then $\Psi(x_1)$ and $\Psi(x_2)$ coincide up to the $kd$-th $B$-adic digit. In this sense, the KA representation can be viewed as a two-step procedure, where the first step $\Psi$ does some extreme dimension reduction. Compared to low-dimensional random embeddings, which by the Johnson–Lindenstrauss lemma nearly preserve the Euclidean distances among points, there seems, however, to be no good general characterization of how the interior function changes distances.

The function $\Psi$ is discontinuous at all points with finite $B$-adic representation. The map $\Psi$ moreover defines an order relation on $[0,1]^d$ via $x<y:\Leftrightarrow\Psi(x)<\Psi(y)$. For $B=d=2$, the inverse map $\Psi^{-1}$ is often called the Morton order and coincides, up to a rotation of 90 degrees, with the z-curve in the theory of space-filling curves (Bader, 2013, Section 7.2).

If $f$ is a piecewise constant function on a dyadic grid, the outer function $g$ is also piecewise constant. As a negative result, we show that for this representation, smoothness of $f$ does not translate into smoothness of $g$.

Lemma 2

Let k be a positive integer. Consider representation (2.2) for B=2 and let g be as in the proof of Lemma 1.

  • (i)

    If $f:\mathbb{R}^d\to\mathbb{R}$ is piecewise constant on the $2^{kd}$ hypercubes $\times_{j=1}^d(\ell_j2^{-k},(\ell_j+1)2^{-k})$, with $\ell_1,\dots,\ell_d\in\{0,\dots,2^k-1\}$, then $g$ is a piecewise constant function on the intervals $(\ell2^{-kd},(\ell+1)2^{-kd})$, $\ell=0,\dots,2^{kd}-1$.

  • (ii)

    If $f(x)=x$, then $g$ is discontinuous.

Proof

(i) If $x\in(\ell2^{-kd},(\ell+1)2^{-kd})$, we can write $x=\Delta+\ell2^{-kd}$ with $0<\Delta<2^{-kd}$. There exist thus $\ell_1,\dots,\ell_d\in\{0,\dots,2^k-1\}$, such that $\Psi^{-1}(x)=\Psi^{-1}(\Delta)+(\ell_12^{-k},\dots,\ell_d2^{-k})$. Since $\Psi^{-1}(\Delta)\in(0,2^{-k})\times\dots\times(0,2^{-k})$, the result follows from $g=f\circ\Psi^{-1}$.

(ii) If $f$ is the identity, $g=f\circ\Psi^{-1}=\Psi^{-1}$. For $x\uparrow\tfrac12$, we find that $\Psi^{-1}(x)\to(\tfrac12,1,1,\dots,1)$ and for $x\downarrow\tfrac12$, $\Psi^{-1}(x)\to(\tfrac12,0,0,\dots,0)$. Even stronger, every point with finite binary representation is a point of discontinuity. □

The discontinuity of the space-filling map $\Psi^{-1}$ causes $g$ to be more irregular than $f$. Many constructions of space-filling curves are known, but to obtain a representation of KA type, $\Psi$ needs to be an additive function. The additivity condition rules out most of the canonical choices, such as for instance the Hilbert curve. Below, we use for $\Psi^{-1}$ the Lebesgue curve and show that this leads to a representation that allows us to transfer smoothness properties of $f$ to smoothness properties of $g$ and therefore overcomes the shortcomings of the representation in (2.2). In contrast to the earlier result, $g$ is now a function that maps from the Cantor set, in the following denoted by $\mathcal{C}$, to the real numbers.

Theorem 2

For fixed dimension $d\ge2$, there exists a monotone function $\phi:[0,1]\to\mathcal{C}$ (the Cantor set) such that for any function $f:[0,1]^d\to\mathbb{R}$, we can find a function $g:\mathcal{C}\to\mathbb{R}$ such that

  • (i)

    (2.4) $f(x_1,\dots,x_d)=g\Big(3\sum_{p=1}^d3^{-p}\phi(x_p)\Big)$;

  • (ii)

    if $f:[0,1]^d\to\mathbb{R}$ is continuous, then also $g:\mathcal{C}\to\mathbb{R}$ is continuous;

  • (iii)

    if there exist $\beta\le1$ and a constant $Q$, such that $|f(x)-f(y)|\le Q|x-y|_\infty^\beta$ for all $x,y\in[0,1]^d$, then $|g(x)-g(y)|\le2^\beta Q|x-y|^{\frac{\beta\log2}{d\log3}}$ for all $x,y\in\mathcal{C}$;

  • (iv)

    if $f(x)=x$, then there exist sequences $(x_k)_k,(y_k)_k\subset\mathcal{C}$ with $\lim_kx_k=\lim_ky_k$ and $|g(x_k)-g(y_k)|=\big(\tfrac{|x_k-y_k|}{2}\big)^{\frac{\log2}{d\log3}}$.

Proof

The construction of the interior function is similar to that in the proof of Lemma 1. We associate with each $x\in[0,1]$ one binary representation $x=[0.a_1^xa_2^x\dots]_2$ and define
$$\phi(x):=\sum_{j=1}^\infty\frac{2a_j^x}{3^{1+d(j-1)}}=[0.(2a_1^x)\underbrace{0\dots0}_{(d-1)\text{-times}}(2a_2^x)\underbrace{0\dots0}_{(d-1)\text{-times}}\dots]_3.\tag{2.5}$$
The function $\phi$ multiplies the binary digits by two (thus, only the values $0$ and $2$ are possible) and then expresses the digits in a ternary expansion, adding $d-1$ zeros between each two digits. By construction, the Cantor set consists of all $y\in[0,1]$ that only have $0$ and $2$ as digits in the ternary expansion. This shows that $\phi:[0,1]\to\mathcal{C}$. Define now
$$\Phi(x_1,\dots,x_d):=3\sum_{p=1}^d3^{-p}\phi(x_p)=[0.(2a_1^{x_1})(2a_1^{x_2})\dots(2a_1^{x_d})(2a_2^{x_1})\dots]_3,\tag{2.6}$$
where the right hand side is written in the ternary system. Because we can recover the binary representation of $x_1,\dots,x_d$, the map $\Phi$ is invertible. Since $2a_\ell^{x_r}\in\{0,2\}$ for all $\ell\ge1$ and $r\in\{1,\dots,d\}$, the image of $\Phi$ is contained in the Cantor set. We can now define the inverse $\Phi^{-1}:\mathcal{C}\to\mathbb{R}^d$ and set $g=f\circ\Phi^{-1}:\mathcal{C}\to\mathbb{R}$, proving (i).

In the next step of the proof, we show that
$$|\Phi^{-1}(x)-\Phi^{-1}(y)|_\infty\le2|x-y|^{\frac{\log2}{d\log3}},\quad\text{for all }x,y\in\mathcal{C}.\tag{2.7}$$
For that we extend the proof in Bader (2013, p. 98). Observe that $\Phi^{-1}$ maps $x=[0.x_1x_2x_3\dots]_3$ to the vector
$$\Big([0.(\tfrac{x_1}{2})(\tfrac{x_{d+1}}{2})\dots]_2,\dots,[0.(\tfrac{x_d}{2})(\tfrac{x_{2d}}{2})\dots]_2\Big)\in[0,1]^d.$$
Given arbitrary $x,y\in\mathcal{C}$, let $k=k(x,y)$ denote the integer $k$ for which $3^{-(k+1)d}\le|x-y|<3^{-kd}$. Suppose that the first $kd$ ternary digits of $x$ and $y$ are not all the same and denote by $J$ the position of the first digit of $x$ that is not the same as that of $y$. Since only the digits $0$ and $2$ are possible, the difference between $x$ and $y$ can be lower bounded by $|x-y|\ge2\cdot3^{-J}-3^{-J}$, where the term $3^{-J}$ accounts for the effect of the later digits. Thus $|x-y|\ge3^{-J}$, and this is a contradiction with $|x-y|<3^{-kd}$ and $J\le kd$. Thus, the first $kd$ ternary digits of $x$ and $y$ coincide. Using the explicit form of $\Phi^{-1}$, this also implies that $\Phi^{-1}(x)$ and $\Phi^{-1}(y)$ coincide in the first $k$ binary digits in each component. This means that $|\Phi^{-1}(x)-\Phi^{-1}(y)|_\infty\le2^{-k}$ and, together with the definition of $k$, we find
$$|\Phi^{-1}(x)-\Phi^{-1}(y)|_\infty\le2\cdot2^{-(k+1)}=2\big(3^{-(k+1)d}\big)^{\frac{\log2}{d\log3}}\le2|x-y|^{\frac{\log2}{d\log3}},\tag{2.8}$$
proving (2.7), since $x,y\in\mathcal{C}$ were arbitrary. Using again that $g=f\circ\Phi^{-1}$, (ii) and (iii) follow.

To prove (iv), take $x_k=0$ and $y_k=2\cdot3^{-kd}$. Then, $\Phi^{-1}(x_k)=(0,\dots,0)$ and $\Phi^{-1}(y_k)=(0,\dots,0,2^{-k})$. Rewriting this yields $|g(x_k)-g(y_k)|=|\Phi^{-1}(x_k)-\Phi^{-1}(y_k)|_\infty=\big(\tfrac{|x_k-y_k|}{2}\big)^{\frac{\log2}{d\log3}}$ for all $k\ge1$. □
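
The construction and the bound (2.7) can be sanity-checked numerically. The following self-contained sketch (my own illustration with truncated digit expansions and exact rational arithmetic, not code from the paper) interleaves doubled binary digits into ternary digits and evaluates both sides of (2.7) for a pair of nearby points:

```python
# Numerical sketch of Phi from the proof of Theorem 2: binary digits are doubled
# and interleaved as ternary digits, so the image lies in the Cantor set.
# Exact rational arithmetic avoids rounding issues.
import math
from fractions import Fraction

def bin_digits(x, n):
    """First n binary digits of a rational x in [0, 1)."""
    out = []
    for _ in range(n):
        x *= 2
        a = int(x)
        out.append(a)
        x -= a
    return out

def Phi(xs, n=12):
    """Truncated Phi(x_1,...,x_d) = 3 * sum_p 3^{-p} phi(x_p)."""
    d = len(xs)
    digs = [bin_digits(x, n) for x in xs]
    return sum(Fraction(2 * digs[p][j], 3 ** (j * d + p + 1))
               for j in range(n) for p in range(d))

def Phi_inverse(t, d, n=12):
    """De-interleave the ternary digits (all 0 or 2) back into d binary numbers."""
    tern = []
    for _ in range(n * d):
        t *= 3
        a = int(t)
        tern.append(a)
        t -= a
    return tuple(sum(Fraction(tern[j * d + p] // 2, 2 ** (j + 1)) for j in range(n))
                 for p in range(d))

d, n = 2, 12
x = (Fraction(1, 3), Fraction(5, 7))
y = (Fraction(1, 3) + Fraction(1, 2 ** 10), Fraction(5, 7))    # a nearby point
u, v = Phi(x, n), Phi(y, n)
lhs = max(abs(a - b) for a, b in zip(Phi_inverse(u, d, n), Phi_inverse(v, d, n)))
rhs = 2 * float(abs(u - v)) ** (math.log(2) / (d * math.log(3)))
print(float(lhs) <= rhs)   # (2.7): |Phi^{-1}(u)-Phi^{-1}(v)|_inf <= 2|u-v|^{log2/(d log3)}
```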

Thus, by restricting to the Cantor set, one can overcome the limitations of Netto's theorem mentioned at the beginning of the section. In fact, by construction and (2.8), $\gamma=\Phi^{-1}$ is a surjective, invertible and continuous space-filling curve. The previous theorem is in a sense more extreme than the KA representation, as the univariate interior function maps to a set of Hausdorff dimension $\log2/\log3<1$. Siegelmann and Sontag (1994) use a similar construction to prove embeddings of the function spaces generated by circuits into neural network function classes.

Representation (2.4) has the advantage that smoothness imposed on $f$ translates into smoothness properties of $g$. The reason is that the function $\Phi$ associates to each $x\in[0,1]^d$ one value in the Cantor set such that values that are far from each other in $[0,1]^d$ are not mapped to nearby values in the Cantor set. Based on this value in the Cantor set, the outer function $g$ reconstructs the function value $f(x)$. Since the distance of the values in $[0,1]^d$ is linked to the distance of the values in the Cantor set, local variability of the function $f$ does not lead to arbitrarily large fluctuations of the outer function $g$. Therefore smoothness imposed on $f$ translates into smoothness properties of $g$.

A natural question is whether we gain or lose something if, instead of approximating $f$ directly, we use (2.4) and approximate $g$. Recall that the approximation rate should be $m^{-\beta}$ if $m^d$ is the number of free parameters of the approximating function, $\beta$ the smoothness and $d$ the dimension. Since $g$ is by (iii) $\alpha$-smooth with $\alpha=\beta\log2/(d\log3)$ and is defined on a set with Hausdorff dimension $d^*=\log2/\log3$, we see that there is no loss in terms of approximation rates since $\beta/d=\alpha/d^*$. Thus, we can reduce multivariate function approximation to univariate function approximation on the Cantor set. This, however, only holds for $\beta\le1$. Indeed, the last statement of the previous theorem means that for the smooth function $f(x)=x$, the outer function $g$ is not more than $\log2/(d\log3)$-smooth, implying that for higher order smoothness, there seems to be a discrepancy between the multivariate and univariate function approximation.
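
Spelled out, the rate bookkeeping in the previous paragraph is the following one-line computation (with $N$ the number of free parameters):

```latex
% Heuristic parameter counting: approximating the alpha-smooth outer function g
% on a set of Hausdorff dimension d^* = log 2 / log 3 with N free parameters gives
N^{-\alpha/d^{*}}
  = N^{-\,\frac{\beta \log 2 /(d\log 3)}{\log 2/\log 3}}
  = N^{-\beta/d},
% which is the rate expected when approximating the d-variate beta-smooth f directly.
```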

The only direct drawback of (2.4) compared to the traditional KA representation is that the interior function ϕ is discontinuous. We will see in Section 3 that ϕ can, however, be well approximated by a deep neural network.

It is also of interest to study the function class containing all $f$ that are generated by the representation in (2.4) for $\beta$-smooth outer functions $g$. Observe that if $g(x)=x$, then $f$ coincides with the interior function, which is discontinuous. This shows that for $\beta\le1$, the class of all $f$ of the form (2.4) with $g$ a $\beta\log2/(d\log3)$-smooth function on the Cantor set $\mathcal{C}$ is strictly larger than the class of $\beta$-smooth functions. Interestingly, the function class with Lipschitz continuous outer function $g$ contains all functions that are piecewise constant on a dyadic partition of $[0,1]^d$.

Lemma 3

Consider representation (2.4) and let $k$ be a positive integer. If $f:[0,1]^d\to\mathbb{R}$ is piecewise constant on the $2^{kd}$ hypercubes $\times_{j=1}^d[\ell_j2^{-k},(\ell_j+1)2^{-k})$, with $\ell_1,\dots,\ell_d\in\{0,\dots,2^k-1\}$, then $g$ is a Lipschitz function with Lipschitz constant bounded by $2\|f\|_\infty3^{kd}$.

Proof

Let $\phi$ and $\Phi$ be the same as in the proof of Theorem 2. For any vector $a=(a_1,\dots,a_{kd})\in\{0,2\}^{kd}$ define $I(a)=\{[0.a_1\dots a_{kd}b_1b_2\dots]_3:b_1,b_2,\dots\in\{0,2\}\}$. There exist integers $\ell_1,\dots,\ell_d\in\{0,\dots,2^k-1\}$ such that $\Phi^{-1}(I(a))\subseteq\times_{j=1}^d[\ell_j2^{-k},(\ell_j+1)2^{-k})$. Since $f$ is constant on these dyadic hypercubes, $g(I(a))=(f\circ\Phi^{-1})(I(a))=\text{const}$. If $a,\tilde a\in\{0,2\}^{kd}$ and $a\ne\tilde a$, then, arguing as in the proof of Theorem 2, we find that $|x-y|\ge3^{-kd}$ whenever $x\in I(a)$ and $y\in I(\tilde a)$. Therefore, we have $|g(x)-g(y)|=0$ if $x,y\in I(a)$ and $|g(x)-g(y)|\le2\|g\|_\infty\le2\|f\|_\infty\le2\|f\|_\infty3^{kd}|x-y|$ if $x\in I(a)$ and $y\in I(\tilde a)$. Since $a,\tilde a$ were arbitrary, the result follows. □

It is important to realize that the space-filling curves and fractal shapes occur because of the exact identity. It is natural to wonder whether the KA representation leads to an interesting approximation theory. For that, one wants to truncate the number of digits in (2.5), hence reducing the complexity of the interior function. We obtain an approximation bound that only depends on the dimension d through the smoothness of the outer function g.

Lemma 4

Let $d\ge2$ and suppose $K$ is a positive integer. For $x=[0.a_1^xa_2^x\dots]_2$, define $\phi_K(x):=\sum_{j=1}^K2a_j^x3^{-1-d(j-1)}$. If there exist $\beta\le1$ and a constant $Q$, such that $|f(x)-f(y)|\le Q|x-y|_\infty^\beta$ for all $x,y\in[0,1]^d$, then we can find a univariate function $g$, such that $|g(x)-g(y)|\le2^\beta Q|x-y|^{\frac{\beta\log2}{d\log3}}$, for all $x,y\in\mathcal{C}$, and
$$\Big|f(x)-g\Big(3\sum_{p=1}^d3^{-p}\phi_K(x_p)\Big)\Big|\le2Q2^{-\beta K},\quad\text{for all }x=(x_1,\dots,x_d)\in[0,1]^d.$$
Moreover, $\|f\|_{L^\infty([0,1]^d)}=\|g\|_{L^\infty(\mathcal{C})}$.

Proof

From the geometric sum formula, $\sum_{q=0}^\infty3^{-q}=\tfrac32$. Let $\phi$ and $g$ be as in Theorem 2 and $\phi_K$ as defined in the statement of the lemma. Since $\beta\le1$, we then have that $|g(x)-g(y)|\le2Q|x-y|^{\frac{\beta\log2}{d\log3}}$, for all $x,y\in\mathcal{C}$. Moreover, (2.6) shows that $3\sum_{p=1}^d3^{-p}\phi(x_p)$ and $3\sum_{p=1}^d3^{-p}\phi_K(x_p)$ are both in the Cantor set $\mathcal{C}$ and have the same first $Kd$ ternary digits. Thus, using (2.4), we find
$$\Big|f(x)-g\Big(3\sum_{p=1}^d3^{-p}\phi_K(x_p)\Big)\Big|=\Big|g\Big(3\sum_{p=1}^d3^{-p}\phi(x_p)\Big)-g\Big(3\sum_{p=1}^d3^{-p}\phi_K(x_p)\Big)\Big|\le2Q\Big|3\sum_{p=1}^d3^{-p}\big(\phi(x_p)-\phi_K(x_p)\big)\Big|^{\frac{\beta\log2}{d\log3}}\le2Q\Big|2\sum_{q=Kd+1}^\infty3^{-q}\Big|^{\frac{\beta\log2}{d\log3}}\le2Q\Big|2\cdot3^{-dK-1}\sum_{q=0}^\infty3^{-q}\Big|^{\frac{\beta\log2}{d\log3}}\le2Q\,3^{-\frac{K\beta\log2}{\log3}}=2Q\,2^{-\beta K}.$$
Finally, $\|f\|_{L^\infty([0,1]^d)}=\|g\|_{L^\infty(\mathcal{C})}$ follows as an immediate consequence of the function representation (2.4). □
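
As a sanity check of the lemma (a self-contained illustration with my own choice of $f$ and exact rational arithmetic, not code from the paper), one can evaluate $g\big(3\sum_p3^{-p}\phi_K(x_p)\big)$ for the Lipschitz function $f(x)=\min(x_1,\dots,x_d)$ and compare the observed error with the bound $2Q2^{-\beta K}$:

```python
# Truncated KA approximation of Lemma 4 for f(x) = min(x_1,...,x_d), which is
# Lipschitz w.r.t. the sup-norm (beta = 1, Q = 1); g = f o Phi^{-1}, and the
# interior function is truncated after K binary digits per coordinate.
import random
from fractions import Fraction

def bin_digits(x, n):
    out = []
    for _ in range(n):
        x *= 2
        a = int(x)
        out.append(a)
        x -= a
    return out

def interior_K(xs, K):
    """3 * sum_p 3^{-p} phi_K(x_p): interleave the first K doubled binary digits."""
    d = len(xs)
    digs = [bin_digits(x, K) for x in xs]
    return sum(Fraction(2 * digs[p][j], 3 ** (j * d + p + 1))
               for j in range(K) for p in range(d))

def g(t, d, K):
    """Outer function g = f o Phi^{-1}: de-interleave the ternary digits, then apply f = min."""
    tern = []
    for _ in range(K * d):
        t *= 3
        a = int(t)
        tern.append(a)
        t -= a
    coords = [sum(Fraction(tern[j * d + p] // 2, 2 ** (j + 1)) for j in range(K))
              for p in range(d)]
    return min(coords)

d, K, Q, beta = 3, 6, 1, 1
random.seed(0)
xs = [tuple(Fraction(random.random()) for _ in range(d)) for _ in range(200)]
worst = max(abs(min(x) - g(interior_K(x, K), d, K)) for x in xs)
print(float(worst), 2 * Q * 2 ** (-beta * K))   # observed error vs. bound 2 Q 2^{-beta K}
```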

3. Deep ReLU networks and the KA representation

This section studies the construction of deep ReLU networks imitating the KA approximation in Lemma 4. A deep/multilayer feedforward neural network is a function $x\mapsto f(x)$ that can be represented by an acyclic graph with vertices arranged in a finite number of layers. The first layer is called the input layer, the last layer is the output layer, and the layers in between are called hidden layers. We say that a deep network has architecture $(L,(p_0,\dots,p_{L+1}))$ if the number of hidden layers is $L$, and $p_0$, $p_j$ and $p_{L+1}$ are the number of vertices in the input layer, $j$th hidden layer and output layer, respectively. The input layer of vertices represents the input $x$. For all other layers, each vertex stands for an operation of the form $y\mapsto\sigma(a^\top y+b)$, with $y$ the output (viewed as a vector) of the previous layer, $a$ a weight vector, $b$ a shift parameter and $\sigma$ the activation function. Each vertex has its own set of parameters $(a,b)$ and also the activation function does not need to be the same for all vertices. If for all vertices in the hidden layers the ReLU activation function $\sigma(x)=\max(x,0)$ is used and $L>1$, the network is called a deep ReLU network. As common for regression problems, the activation function in the output layer will be the identity.


Fig. 1. (Left) A deep neural network with $K$ hidden layers and width three computing the function $x=[0.a_1^xa_2^x\dots]_2\mapsto3\phi_K(x)=\sum_{j=1}^K2a_j^x3^{-d(j-1)}$ exactly. In each hidden layer the linear activation function is applied to the left and right unit. The units in the middle use the threshold activation function $\sigma(x)=\mathbf{1}(x\ge\tfrac12)$. (Right) A deep ReLU network approximating the function $\phi_K$. For the definitions of $S_r$ and $T_r$ see the proof of Theorem 3.

Approximation properties of deep neural networks for composed functions are studied in Bauer and Kohler, 2019, Fan et al., 2020, Horowitz and Mammen, 2007, Kohler and Krzyżak, 2017, Mhaskar and Poggio, 2016, Nakada and Imaizumi, 2020, Poggio et al., 2017 and Schmidt-Hieber, 2019, Schmidt-Hieber, 2020. These approaches, however, do not lead to straightforward constructions of ReLU networks exploiting the specific structure of the KA approximation
$$f(x)\approx g\Big(3\sum_{p=1}^d3^{-p}\phi_K(x_p)\Big)\tag{3.1}$$
in Lemma 4. To find such a construction, recall that the classical neural network interpretation of the KA representation associates the interior function with the activation function in the first layer (Hecht-Nielsen, 1987). Here, we argue that the interior function can be efficiently approximated by a deep ReLU network. The role of the hidden layers is to retrieve the next bit in the binary representation of the input. Interestingly, some of the proposed network constructions to approximate $\beta$-smooth functions use a similar idea without making the link to the KA representation; see Section 4 for more details.

Fig. 1 gives the construction of a network computing $x=[0.a_1^xa_2^x\dots]_2\mapsto3\phi_K(x)=\sum_{j=1}^K2a_j^x3^{-d(j-1)}$, combining units with linear activation function $\sigma(x)=x$ and threshold activation function $\sigma(x)=\mathbf{1}(x\ge\tfrac12)$. The main idea is that for $x=[0.a_1^xa_2^x\dots]_2$, we can extract the first bit using $a_1^x=\mathbf{1}(x\ge\tfrac12)=\sigma(x)$ and then define $2x-\sigma(x)=2\big(x-\tfrac{a_1^x}{2}\big)=[0.a_2^xa_3^x\dots]_2$. Iterating the procedure allows us to extract $a_2^x$ and consequently any further binary digit of $x$. The deep neural network DNN I in Fig. 1 has $K$ hidden layers and network width three. The left units in the hidden layers successively build the output value; the units in the middle extract the next bit in the binary representation and the units on the right compute the remainder of the input after bit extraction. To learn the bit extraction algorithm, deep networks obviously lead to much more efficient representations compared to shallow networks.
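
Written out as plain code (an illustration of the idea only, not the ReLU network itself), the bit-extraction recursion looks as follows:

```python
# Plain-Python version of the bit-extraction idea behind Fig. 1: repeatedly
# threshold at 1/2 to read off a binary digit and rescale the remainder.
def extract_bits(x, K):
    """Return the first K binary digits of x in [0, 1)."""
    bits = []
    for _ in range(K):
        a = 1 if x >= 0.5 else 0     # a_j^x = 1(x >= 1/2), the "threshold unit"
        bits.append(a)
        x = 2 * x - a                # remainder [0.a_{j+1} a_{j+2} ...]_2
    return bits

def phi_K_times_3(x, K, d):
    """3 * phi_K(x) = sum_{j=1}^K 2 a_j^x 3^{-d(j-1)}."""
    return sum(2 * a * 3 ** (-d * j) for j, a in enumerate(extract_bits(x, K)))

print(extract_bits(0.640625, 8))          # 0.640625 = [0.10100100]_2
print(phi_K_times_3(0.640625, 8, d=2))
```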

Constructing $d$ networks computing $\phi_K(x_p)$ for each of $x_1,\dots,x_d$ and combining them yields a network with $K+1$ hidden layers and network width $3d$, computing the interior function $(x_1,\dots,x_d)\mapsto3\sum_{p=1}^d3^{-p}\phi_K(x_p)$ in (3.1). The overall number of non-zero parameters is of the order $Kd$. To approximate a $\beta$-smooth function $f$ by a neural network via the KA approximation (3.1), the interior step makes the approximating network deep but uses only very few parameters compared to the approximation of the univariate function $g$.

A close inspection of the network DNN I in Fig. 1 shows that all linear activation functions get non-negative input and can therefore be replaced by the ReLU activation function without changing the outcome. The threshold activation functions $\sigma(x)=\mathbf{1}(x\ge\tfrac12)$ can be arbitrarily well approximated by the linear combination of two ReLU units via $\varepsilon^{-1}\big(x-\tfrac{1-\varepsilon}{2}\big)_+-\varepsilon^{-1}\big(x-\tfrac{1+\varepsilon}{2}\big)_+\to\mathbf{1}(x\ge\tfrac12)$ for $\varepsilon\downarrow0$. If one accepts potentially huge network parameters, the network DNN I in Fig. 1 can therefore be approximated by a deep ReLU network with $K$ hidden layers and network width four. Consequently, also the construction in (3.1) can be arbitrarily well approximated by deep ReLU networks. It is moreover possible to reduce the size of the network parameters by inserting additional hidden layers in the neural network; see for instance Proposition A.3 in Elbrächter, Perekrestenko, Grohs, and Bölcskei (2019).
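
For instance (a direct transcription of the two-ReLU approximation above, for illustration):

```python
# The two-ReLU ramp that approximates the threshold unit 1(x >= 1/2).
def relu(x):
    return max(x, 0.0)

def soft_threshold(x, eps):
    """(1/eps)*(x - (1-eps)/2)_+ - (1/eps)*(x - (1+eps)/2)_+  ->  1(x >= 1/2)."""
    return (relu(x - (1 - eps) / 2) - relu(x - (1 + eps) / 2)) / eps

for eps in (0.1, 0.01, 0.001):
    print(eps, [round(soft_threshold(x, eps), 3) for x in (0.3, 0.499, 0.501, 0.7)])
# The ramp differs from the hard threshold only on the interval of width eps around 1/2.
```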

Throughout the following we write $\|f\|_p:=\|f\|_{L^p([0,1]^d)}$.

Theorem 3

Let $p\in[1,\infty)$. If there exist $\beta\le1$ and a constant $Q$, such that $|f(x)-f(y)|\le Q|x-y|_\infty^\beta$ for all $x,y\in[0,1]^d$, then there exists a deep ReLU network $\tilde f$ with $2K+3$ hidden layers, network architecture $(2K+3,(d,4d,\dots,4d,d,1,2^{Kd}+1,1))$ and all network weights bounded in absolute value by $2(Kd\vee\|f\|_\infty)2^{K(d\vee(p\beta))}$, such that $\|f-\tilde f\|_p\le2(Q+\|f\|_\infty)2^{-\beta K}$.

Proof

The proof consists of four parts. In part (A) we construct a ReLU network mimicking the approximand constructed in Lemma 4. For that we first build a ReLU network with architecture $(2K,(1,4,\dots,4,1))$ imitating the function $x=[0.a_1^xa_2^x\dots]_2\mapsto3\phi_K(x)=\sum_{j=1}^K2a_j^x3^{-d(j-1)}$. In part (B), it is shown that the ReLU network approximation coincides with the function $3\phi_K$ on a subset of $[0,1]$ with Lebesgue measure at least $1-2^{-K\beta p}/d$. In part (C), we construct a neural network approximation for the outer function $g$ in Lemma 4. The approximation error is controlled in part (D).

(A): Let $r$ be the largest integer such that $2^r\le2Kd\,2^{K\beta p}$ and set $S_1(x):=2^r\big(x-\tfrac12+2^{-r-1}\big)_+-2^r\big(x-\tfrac12-2^{-r-1}\big)_+$ and $T_1(x):=2x$. Given $S_j(x),T_j(x)$, we can then define
$$T_{j+1}(x):=\big(2T_j(x)-2S_j(x)\big)_+,\qquad S_{j+1}(x):=S_1\big(T_j(x)-S_j(x)\big).\tag{3.2}$$
There exists a ReLU network with architecture $(1,(1,2,1))$ and all network weights bounded in absolute value by $2^r$ computing the function $x\mapsto S_1(x)$. Similarly, there exists a ReLU network with architecture $(1,(2,2,1))$ computing $(S_j(x),T_j(x))\mapsto S_{j+1}(x)=S_1(T_j(x)-S_j(x))$. Since $S_1(x)\ge0$, we have that $(S_j(x))_+=S_j(x)$ and $T_j(x)=(T_j(x))_+$. Because of that, we can now concatenate these networks as illustrated in Fig. 1 to construct a deep ReLU network computing $x\mapsto\sum_{j=1}^K2S_j(x)3^{-d(j-1)}$. Recall that computing $S_{j+1}(x)$ from $(S_j(x),T_j(x))$ requires an extra layer with two nodes that is not shown in Fig. 1. Thus, any arrow, except for the ones pointing to the output, adds one additional hidden layer to the ReLU network. The overall number of hidden layers is thus $2K$. Because of the two additional nodes in the non-displayed hidden layers, the width in all hidden layers is four and thus the overall architecture of this deep ReLU network is $(2K,(1,4,\dots,4,1))$. By checking all edges, it can be seen that all network weights are bounded by $2^r\le2Kd\,2^{K\beta p}$.

(B): Recall that $x=[0.a_1^xa_2^x\dots]_2$. We now show that on a large subset of the unit interval, it holds that $S_j(x)=a_j^x$ and $T_j(x)=[a_j^x.a_{j+1}^xa_{j+2}^x\dots]_2$ for all $j=1,\dots,K$, and therefore also $\sum_{j=1}^K2S_j(x)3^{-d(j-1)}=\sum_{j=1}^K2a_j^x3^{-d(j-1)}=3\phi_K(x)$.

We have that $S_1(x)=\mathbf{1}(x>\tfrac12)$ whenever $|x-\tfrac12|\ge2^{-r-1}$. Set $A_{j,r}:=\{x:|[0.a_j^xa_{j+1}^x\dots]_2-\tfrac12|\ge2^{-r-1}\}$. If $x\in A_{j,r}$, then $S_1([0.a_j^xa_{j+1}^x\dots]_2)=a_j^x$. Thus, if $S_j(x)=a_j^x$, $T_j(x)=[a_j^x.a_{j+1}^xa_{j+2}^x\dots]_2$, and $x\in A_{j+1,r}$, then (3.2) implies $S_{j+1}(x)=a_{j+1}^x$ and $T_{j+1}(x)=[a_{j+1}^x.a_{j+2}^xa_{j+3}^x\dots]_2$. Hence, the deep ReLU network constructed in part (A) computes the function $[0,1]\ni x\mapsto3\phi_K(x)$ exactly on the set $\bigcap_{j=1}^KA_{j,r}$.

For fixed $a_1^x,\dots,a_{j-1}^x\in\{0,1\}$, the set $\{x:|[0.a_j^xa_{j+1}^x\dots]_2-\tfrac12|<2^{-r-1}\}$ is an interval of length $2^{-r-j+1}$. As there are $2^{j-1}$ possibilities to choose $a_1^x,\dots,a_{j-1}^x\in\{0,1\}$, the complement $A_{j,r}^c=\{x\in[0,1]:x\notin A_{j,r}\}$ can be written as the union of $2^{j-1}$ subintervals of length $2^{-r-j+1}$. The Lebesgue measure of $A_{j,r}^c$ is therefore bounded by $2^{-r}$. Since $Kd\,2^{K\beta p}\le2^r$, we find that $\big(\bigcap_{j=1}^KA_{j,r}\big)^c$ has Lebesgue measure bounded by $K2^{-r}\le2^{-K\beta p}/d$. This completes the proof of part (B).
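
A small numerical check of parts (A) and (B) (my own illustration, not part of the proof): for an input $x$ whose shifted binary tails stay at distance at least $2^{-r-1}$ from $\tfrac12$, the recursion (3.2) with the ramp $S_1$ reproduces the binary digits exactly:

```python
# Checking the recursion (3.2): with the ramp S_1 and T_1(x) = 2x, the pair
# (S_j, T_j) reproduces the binary digits a_j^x on the sets A_{j,r}.
def relu(z):
    return max(z, 0.0)

def S1(z, r):
    return 2 ** r * relu(z - 0.5 + 2 ** (-r - 1)) - 2 ** r * relu(z - 0.5 - 2 ** (-r - 1))

def digits_via_recursion(x, K, r):
    S, T = S1(x, r), 2 * x                        # S_1(x) and T_1(x)
    out = [S]
    for _ in range(K - 1):
        S, T = S1(T - S, r), relu(2 * T - 2 * S)  # recursion (3.2)
        out.append(S)
    return out

def true_digits(x, K):
    out = []
    for _ in range(K):
        a = 1 if x >= 0.5 else 0
        out.append(a)
        x = 2 * x - a
    return out

x, K, r = 0.7, 8, 10
print([int(s) for s in digits_via_recursion(x, K, r)])   # [1, 0, 1, 1, 0, 0, 1, 1]
print(true_digits(x, K))                                  # same digits
```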

(C): Now we construct a shallow ReLU network interpolating the outer function $g$ in Lemma 4 at the $2^{Kd}+1$ points $\{\sum_{j=1}^{Kd}2t_j3^{-j}:(t_1,\dots,t_{Kd})\in\{0,1\}^{Kd}\}\cup\{1\}$. Denote these points by $0=:s_0<s_1<\dots<s_{2^{Kd}-1}<s_{2^{Kd}}:=1$. For any $x\in[0,1]$,
$$\tilde g(x):=g(s_0)+\sum_{j=1}^{2^{Kd}}\frac{g(s_j)-g(s_{j-1})}{s_j-s_{j-1}}\big((x-s_{j-1})_+-(x-s_j)_+\big)=g(s_0)(-x+1)_++\Big(\frac{g(s_1)-g(s_0)}{s_1-s_0}+g(s_0)\Big)(x)_++\sum_{j=1}^{2^{Kd}-1}\Big(\frac{g(s_{j+1})-g(s_j)}{s_{j+1}-s_j}-\frac{g(s_j)-g(s_{j-1})}{s_j-s_{j-1}}\Big)(x-s_j)_+.$$
The function $\tilde g(x)$ can therefore be represented on $[0,1]$ by a shallow ReLU network with $2^{Kd}+1$ units in the hidden layer. Moreover, $\tilde g(s_j)=g(s_j)$ for all $j=0,\dots,2^{Kd}$. Finally, we bound the size of the network weights. We have $s_{j+1}-s_j\ge3^{-Kd}$. By Lemma 4, $\|f\|_\infty=\|g\|_{L^\infty(\mathcal{C})}$. Since $0\le s_j\le1$ and for any positive $a$, $a(x-s_j)_+=\sqrt a\big(\sqrt ax-\sqrt as_j\big)_+$, we conclude that all network weights can be chosen to be smaller than $2\|f\|_\infty2^{Kd}$.
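
The interpolation step can be checked in isolation. The sketch below (an illustration with a placeholder outer function, not code from the paper) builds $\tilde g$ from its ReLU representation and verifies $\tilde g(s_j)=g(s_j)$ at all knots:

```python
# Shallow-ReLU interpolation as in part (C): one ReLU unit per breakpoint
# reproduces the piecewise linear interpolant of g on [0, 1].
import math

def relu(z):
    return max(z, 0.0)

def make_interpolant(s, gs):
    """g~(x) = g(s_0)(1-x)_+ + (slope_1 + g(s_0)) x_+ + sum_j (slope_{j+1}-slope_j)(x-s_j)_+."""
    slopes = [(gs[j] - gs[j - 1]) / (s[j] - s[j - 1]) for j in range(1, len(s))]
    def g_tilde(x):
        val = gs[0] * relu(1 - x) + (slopes[0] + gs[0]) * relu(x)
        for j in range(1, len(slopes)):
            val += (slopes[j] - slopes[j - 1]) * relu(x - s[j])
        return val
    return g_tilde

Kd = 4   # K*d ternary digits -> 2^{Kd} + 1 interpolation points
s = sorted(sum(2 * t * 3 ** (-(i + 1)) for i, t in enumerate(bits))
           for bits in ([(m >> i) & 1 for i in range(Kd)] for m in range(2 ** Kd))) + [1.0]
g = lambda x: math.sin(7 * x)          # placeholder for the outer function g
gs = [g(sj) for sj in s]
g_tilde = make_interpolant(s, gs)
print(max(abs(g_tilde(sj) - g(sj)) for sj in s))   # ~ machine precision: interpolation at all knots
```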


Fig. 2. Construction of the deep ReLU network in part (D) of the proof for Theorem 3.

(D): Fig. 2 shows how the neural networks $\tilde\phi_K$ and $\tilde g$ can be combined into a deep ReLU network with architecture $(2K+3,(d,4d,\dots,4d,d,1,2^{Kd}+1,1))$ and all network weights bounded in absolute value by $\max\big(2\|f\|_\infty2^{Kd},2Kd\,2^{K\beta p}\big)$, computing the function $\tilde f(x_1,\dots,x_d):=\tilde g\big(3\sum_{q=1}^d3^{-q}\tilde\phi_K(x_q)\big)$. Since $3\sum_{q=1}^d3^{-q}\phi_K(x_q)\in\{s_0,\dots,s_{2^{Kd}}\}$, the interpolation property $\tilde g(s_j)=g(s_j)$ implies that $\tilde g\big(3\sum_{q=1}^d3^{-q}\phi_K(x_q)\big)=g\big(3\sum_{q=1}^d3^{-q}\phi_K(x_q)\big)$. Together with (B), we conclude that
$$\tilde f(x_1,\dots,x_d)=\tilde g\Big(3\sum_{q=1}^d3^{-q}\tilde\phi_K(x_q)\Big)=g\Big(3\sum_{q=1}^d3^{-q}\phi_K(x_q)\Big),\quad\text{if }x_1,\dots,x_d\in\bigcap_{j=1}^KA_{j,r}.$$
As shown in Lemma 4, $\|f\|_\infty=\|g\|_{L^\infty(\mathcal{C})}$. Since $\tilde g$ is a piecewise linear interpolation of $g$, we also have $\|\tilde g\|_{L^\infty([0,1])}\le\|f\|_\infty$. As shown in (B), the Lebesgue measure of $\big(\bigcap_{j=1}^KA_{j,r}\big)^c$ is bounded by $2^{-K\beta p}/d$, so the set of $x\in[0,1]^d$ for which some coordinate $x_i$ lies outside $\bigcap_{j=1}^KA_{j,r}$ has Lebesgue measure at most $2^{-K\beta p}$. Decomposing the integral and using the approximation bound in Lemma 4,
$$\|f-\tilde f\|_p^p\le\int_{\forall i:\,x_i\in\cap_{j=1}^KA_{j,r}}\Big|f(x)-g\Big(3\sum_{q=1}^d3^{-q}\phi_K(x_q)\Big)\Big|^p\,dx+\int_{\exists i:\,x_i\notin\cap_{j=1}^KA_{j,r}}2^p\|f\|_\infty^p\,dx\le2^pQ^p2^{-\beta Kp}+2^p\|f\|_\infty^p2^{-K\beta p}\le2^p(Q+\|f\|_\infty)^p2^{-K\beta p},$$
using for the last inequality that $a^p+b^p\le(a+b)^p$ for all $p\ge1$ and all $a,b\ge0$. □

Recall that for a function class with $m^d$ parameters, the expected optimal approximation rate for a $\beta$-smooth function in $d$ dimensions is $m^{-\beta}$. The previous theorem leads to the rate $2^{-K\beta}$ using of the order of $2^{Kd}$ network parameters. This thus coincides with the expected rate. In contrast to several other constructions, no network sparsity is required to recover the rate. It is unclear whether the construction can be generalized to higher order smoothness or anisotropic smoothness.

The function approximation in Lemma 4 is quite similar to tree-based methods in statistical learning. CART or MARS, for instance, select a partition of the input space by making successive splits along different directions and then fit a piecewise constant (or piecewise linear) function on the selected partition (Hastie, Tibshirani, & Friedman, 2009, Section 9.2). The KA approximation is also piecewise constant and the interior function assigns a unique value to each set in the dyadic partition. Enlarging $K$ refines the partition. The deep ReLU network constructed in the proof of Theorem 3 imitates the KA approximation and also relies on a dyadic partition of the input space. By changing the network parameters in the first layers, the unit cube $[0,1]^d$ can be split into more general subsets, and similar function systems as the ones underlying MARS or CART can be generated using deep ReLU networks; see also Eckle and Schmidt-Hieber (2019) and Kohler, Krzyzak, and Langer (2019).

As typical for neural network constructions that decompose function approximation into a localization and a local approximation step, the deep ReLU network in Theorem 3 only depends on the represented function f via the weights in the last hidden layer. As a consequence, one could use this deep ReLU network construction to initialize stochastic gradient descent. For that, it is natural to sample the weights in the output layer from a given distribution and assign all other network parameters to the corresponding value in the network construction. A comparison with standard network initializations will be addressed in future work.

The fact that in the proposed network construction only the output layer depends on the represented function also matches the observation that in deep learning a considerable amount of information about the represented function is encoded in the last layer. This is exploited in pre-training, where a trained deep network from a different classification problem is taken and only the output layer is learned from the new dataset; see for instance Zeiler and Fergus (2014). The fact that pre-training works shows that deep networks build rather generic function systems in the first layers. For real datasets, the learned parameters in the first hidden layers still exhibit some dependence on the underlying problem, and transfer learning updating all weights based on the new data outperforms pre-training (He, Girshick, & Dollar, 2019).

4. Related literature

The section is intended to provide a brief overview of related approaches.

Shen et al. (2020) propose a similar deep ReLU network construction without making a link to the KA representation or space-filling curves. The similarity between both approaches can be best seen in their Fig. 5 or the outline of the proof of Theorem 2.1 in their Section 3.2. Indeed, in a first step, the input space is partitioned into smaller hypercubes that are enumerated. The first hidden layers map the input to the index of the hypercube. This localization step is closely related to the action of the interior function used here. The last hidden layers of the deep ReLU network perform a piecewise linear approximation, and this is essentially the same as the implementation of the outer function in the modified KA representation in this paper. To ensure good smoothness properties, Shen et al. (2020) also include gaps in the indexing, which fulfill a role similar to that of the gaps in the Cantor set here. Lu, Shen, Yang, and Zhang (2020) combine the approach with local Taylor expansions and achieve optimal approximation rates for functions that are smoother than Lipschitz.

Another direction is to search for activation functions with good representation properties based on the modified KA representation in Maiorov and Pinkus (1999). Their Theorem 4 states that one can find a real analytic, strictly increasing, and sigmoidal activation function $\sigma$, such that for any continuous function $f:[0,1]^d\to\mathbb{R}$ and any $\varepsilon>0$, there exist parameters $w_{pq}\in\mathbb{R}^d$, $a_{pq},b_{pq},c_q,d_q\in\mathbb{R}$, satisfying
$$\sup_{x\in[0,1]^d}\Big|f(x)-\sum_{q=1}^{6d+3}d_q\,\sigma\Big(\sum_{p=1}^{3d}b_{pq}\,\sigma(w_{pq}^\top x+a_{pq})+c_q\Big)\Big|<\varepsilon.$$
This removes the dependence of the outer activation function in the KA representation on the represented function $f$. The main issue is that despite its smoothness properties, the activation function $\sigma$ is not computable, and it is unclear how to transfer the result to popular activation functions such as the ReLU. One step in this direction has been done in the recent work Guliyev and Ismailov (2018), proving that one can design computable activation functions with complexity increasing as $\varepsilon\to0$. Shen, Yang, and Zhang (2020) show that for a neural network with three hidden layers and three explicit and relatively simple but non-differentiable activation functions one can achieve extremely fast approximation rates.

The fact that deep networks can do bit encoding and decoding efficiently has been used previously in Bartlett, Harvey, Liaw, and Mehrabian (2019) to prove (nearly) sharp bounds for the VC dimension of deep ReLU networks, and also in Yarotsky (2018) and Yarotsky and Zhevnerchuk (2019) for a different construction to obtain approximation rates of very deep networks with fixed width. There are, however, several distinctive differences. These works employ bit encoding to compress several function values into one number (see for instance Section 5.2.1 in Yarotsky, 2018), while we apply bit extraction to the input vector. In our approach the bit extraction leads to a localization of the input space and the function values only enter in the last hidden layer. It can be checked that in our construction the weight assignment is continuous, that is, small changes in the represented function will lead to small changes in the network weights. On the contrary, bit encoding in the function values results in a discontinuous weight assignment, and it is known that this is unavoidable for efficient function approximation based on very deep ReLU networks (Yarotsky, 2018).

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The research has been supported by the Dutch STAR network and a Vidi grant from the Dutch science organization (NWO), The Netherlands. This work was done while the author was visiting the Simons Institute for the Theory of Computing. The constructive comments and suggestions shared by the associate editor and the three referees resulted in a significantly improved version of the article. The author wants to thank Matus Telgarsky for helpful remarks and for pointing to the article by Siegelmann and Sontag (1994).