NTIC Primes as Sum of Squares

Section 13.4 Primes as Sum of Squares

In the past few sections, one of the many things you may have conjectured about sums of squares is that every prime of the form

p = 4 k + 1

can be represented as the sum of two squares. Combined with Fact 13.1.9, limiting the question to primes should be sufficient to finish analyzing the question for any positive number. (See Theorem 13.5.5 for the final steps putting this all together.)

🔗

It turns out it is true that

p = 4 k + 1

can always be written as a sum of squares, and we will spend most of the remainder of this chapter proving it. At the end of the chapter, we'll add in Fact 13.1.1 about primes of the form

p = 4 k + 3

to see exactly which numbers can be thus represented.

🔗

Remark 13.4.1.

To keep with the theme of the unity of mathematics, we do this geometrically, not algebraically as in most texts, though the core ideas are similar with both proofs. We roughly follow [E.2.1, Chapter 10.6], but expanded greatly to avoid any direct reference to Hermann Minkowski's theorem on lattice points in a convex symmetric set. Interestingly, [E.4.16, Theorems 4.3 and 8.3] only states this and Lagrange's four square theorem, precisely because although Minkowski's Theorem provides a general framework for existence of such points geometrically, one still requires information about quadratic residues to provide lattice points to work on in the first place.

🔗

Subsection 13.4.1 A useful plot

🔗

First, let's look at the following plot on the integer lattice. As you can see, I am plotting certain points on the circle

x^{2} + y^{2} = n,

with

n = 5

to begin. I have done some ‘magic’ to turn the square root of

- 1 (mod n)

into these points. Before telling you the magic, Figure 13.4.2 (and the interact following it) will help us get ready.

🔗

To be precise, I've used this square root of

- 1

to create the regularly spaced grid of blue points. You can think about it as a bunch of corners of parallelograms.

🔗

Remark 13.4.3.

Sometimes we call things like the set of blue dots a lattice, though in this text I will usually use the word lattice only to refer to the usual integer lattice of the black dots. A general lattice is something related to a concept from linear algebra – vectors generated by a basis, except instead of being vectors over $Q$ or $R,$ they are over $Z .$

🔗

Here is how I constructed the blue grid. First, assume that

p

is our prime and pick

f = (\frac{p - 1}{2})!

as a square root of negative one (or its additive inverse, if you prefer); we can use the residue modulo

p

for convenience. Then the blue points are of the form

(a, a f + b p)

for all integers

a, b .

xxxxxxxxxx
 
@interact
def _(p=(5,[q for q in prime_range(200) if q%4==1])):
    f=mod(factorial((p-1)/2),p)
    viewsize=ceil(math.sqrt(p))+2
    g(x,y)=x^2+y^2
    plot1 = implicit_plot(g-p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
    grid_pts = [[i,j] for i in [-viewsize..viewsize] for j in [-viewsize..viewsize]]
    plot_grid_pts = points(grid_pts,rgbcolor=(0,0,0),pointsize=2)
    lattice_pts = [coords for coords in grid_pts if (coords[1]-f*coords[0])%p==0]
    plot_lattice_pts = points(lattice_pts, rgbcolor = (0,0,1),pointsize=20)
    show(plot1+plot_grid_pts+plot_lattice_pts, figsize = [5,5], xmin = -viewsize, xmax = viewsize, ymin = -viewsize, ymax = viewsize, aspect_ratio=1)

🔗

For one final preliminary, let's define one more thing for any old point

(x, y)

in the integer lattice (and especially for our blue dots).

🔗

Definition 13.4.4.

We call the norm of a point $(x, y)$ the sum of squares, $N (x, y) = x^{2} + y^{2} .$

🔗

Subsection 13.4.2 Primes which are sums of squares

🔗

We are now ready to state our big theorem for the section. (See Fact 14.1.8 for a quite different proof.)

🔗

Theorem 13.4.5.

Every prime $p$ of the form $4 k + 1$ can be written as a sum of squares.

🔗

Proof.

The proof is fairly long. Here is the strategy; the first step will be detailed in Subsection 13.4.3 and Subsection 13.4.4.

Suppose we find some blue dot $(a,af+bp)$ such that

\begin{equation*} 0<N(a,af+bp)=a^2+(af+bp)^2<2p\text{.} \end{equation*}

Then we know, modulo $p\text{,}$ that

\begin{equation*} N(a,af+bp)=a^2+(af+bp)^2 \equiv a^2+(af)^2\equiv a^2+a^2 f^2\equiv a^2-a^2\equiv 0\text{ (mod }p)\text{,} \end{equation*}

so $p$ in fact divides the norm of the point $(a,af+bp)\text{.}$

So we have that $0<a^2+(af+bp)^2<2p$ and that $p\mid a^2+(af+bp)^2\text{,}$ meaning the only possibility is $p=a^2+(af+bp)^2\text{,}$ which gives $p$ explicitly written as a sum of perfect squares.

🔗

Example 13.4.6.

For instance, with $p = 5,$ we have that $f = (\frac{5 - 1}{2})! = 2! = 2,$ so we need to find a point $(a, 2 a + 5 b)$ such that

a^{2} + (2 a + 5 b)^{2} < 2 p .

Guess and check with $a = 1$ and $b = 0$ gives us

N (1, 2 \cdot 1 + 5 \cdot 0) = 1^{2} + (2 \cdot 1 + 5 \cdot 0)^{2} = 5 < 2 \cdot 5 = 10

so this point should work, and this does give the correct statement that

5 = 1^{2} + 2^{2} .

🔗

What remains to be shown is that there actually is such a blue dot.

🔗

Subsection 13.4.3 Visualizing the proof

🔗

To prove the theorem that for any

p = 4 k + 1

we can write it as a sum of squares, we need to prove there is a blue dot (somewhere) that is not at the origin but also has norm smaller than

2 p .

We will prove this by heavy reference to graphics, but all claims also make sense algebraically. Sometimes we need help to be able to think about more involved proofs.

🔗

We include a variation on the graphic in Figure 13.4.7 to make this visually clear. The bigger circle is the one we care about now – it has formula

x^{2} + y^{2} = 2 p,

so radius

\sqrt{2 p} .

If we find a blue point inside the disk bounded by that circle, but not at the origin, then the argument in the proof sketch given for Theorem 13.4.5 shows this point must be on the smaller circle.

🔗
Figure 13.4.7. The lattice with the second circle

🔗

Here is an interactive version.

xxxxxxxxxx
 
@interact
def _(p=(5,[q for q in prime_range(200) if q%4==1])):
    f=mod(factorial((p-1)/2),p)
    viewsize=floor(sqrt(2*p))+2
    g(x,y)=x^2+y^2
    plot1 = implicit_plot(g-p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
    plot2 = implicit_plot(g-2*p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
    grid_pts = [[i,j] for i in [-viewsize..viewsize] for j in [-viewsize..viewsize]]
    plot_grid_pts = points(grid_pts,rgbcolor=(0,0,0),pointsize=2)
    lattice_pts = [coords for coords in grid_pts if (coords[1]-f*coords[0])%p==0]
    plot_lattice_pts = points(lattice_pts, rgbcolor = (0,0,1),pointsize=10)
    show(plot1+plot2+plot_grid_pts+plot_lattice_pts, figsize = [5,5], xmin = -viewsize, xmax = viewsize, ymin = -viewsize, ymax = viewsize, aspect_ratio=1)

🔗

Very strangely, the best way to do this is by considering the areas of the various circles, and showing that they are so big you just must have a blue point in their interior (but not at the origin). Let's see how this works.

🔗

The area of the bigger circle, which has radius

\sqrt{2 p},

π (\sqrt{2 p})^{2} = 2 π p .

Since

π > 2,

we have that

2 π > 2 (2) = 4,

which mean that the area of the bigger circle is bigger than

4 p .

🔗

What we do now is to create a sublattice of the blue dots, which we will color green. (This is just a subset of a lattice which still otherwise satisfies the conditions for being a lattice.) To create the green sublattice, take all blue dots, and just double their coordinates. Naturally, each green dot is still a blue dot, including the origin. See Figure 13.4.8.

🔗
Figure 13.4.8. The lattice with two circles and triangles

🔗

Next, we take a look at certain triangles made by the different colored dots; continue following Figure 13.4.8, or see the interact at the end of this subsection.

🔗

Compare the thinnest such triangles one can form, with respect to the vertical axis.

The thinnest triangle made by blue dots would be of height one. A typical one would have vertices the origin and the points $(p, 0)$ (with $a = p, b = - f$ ) and $(- f, 1)$ (with $a = - f, b = k$ where $f^{2} + 1 = k p$ as above).
The thinnest triangle made by the green dots has height two. It has width $2 p$ (from the origin to $(2 p, 0),$ the previous point doubled); the apex is the point $(- 2 f, 2),$ which is $(- f, 1)$ doubled.

🔗

This triangle has area

4 p / 2 .

(Note that depending on whether

f

is positive or negative this triangle might be above or below the

x

-axis.)

🔗

Now consider the parallelogram with the solid red lines made of two of these triangles – from the origin to

(- 2 f, 2)

(2 p, 0)

(2 p + 2 f, - 2)

and back. (Recall that

f

is a square root of

- 1

modulo

p .

) This quadrilateral has area

4 p,

which means its area is smaller than that of the bigger circle.

🔗

In Figure 13.4.8 we have

p = 5

and

f = - 3 .

To see this all interactively, evaluate the interact; click triangles_on to see the green dot triangle and parallelogram outlined in red.

xxxxxxxxxx
 
@interact
def _(p=(5,[q for q in prime_range(200) if q%4==1]), triangles_on=False):
    f=mod(factorial((p-1)/2),p)
    viewsize=2*p
    g(x,y)=x^2+y^2
    plot1 = implicit_plot(g-p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
    plot2 = implicit_plot(g-2*p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
    plot3 = line([[0,0], [2*p-2*Integer(f),2], [2*p,0], [2*Integer(f),-2], [0,0]], rgbcolor=(1,0,0))
    plot4 = line2d([[0,0], [2*p,0]], rgbcolor=(1,0,0), linestyle='--')
    grid_pts = [[i,j] for i in [-viewsize..viewsize] for j in [-viewsize..viewsize]]
    plot_grid_pts = points(grid_pts,rgbcolor=(0,0,0),pointsize=2)
    lattice_pts = [coords for coords in grid_pts if (coords[1]-f*coords[0])%p==0]
    plot_lattice_pts = points(lattice_pts, rgbcolor = (0,0,1),pointsize=10)
    plot_lattice_pts2 = points([[2*coords[0],2*coords[1]] for coords in lattice_pts], rgbcolor = (0,1,0),pointsize=20)
    if triangles_on:
        show(plot1+plot2+plot3+plot4 + plot_grid_pts + plot_lattice_pts+plot_lattice_pts2, xmin = -viewsize/2, xmax = viewsize, ymin = -viewsize/2, ymax = viewsize/2, aspect_ratio=1)
    else:
        show(plot1+plot2+plot_grid_pts + plot_lattice_pts + plot_lattice_pts2, xmin = -viewsize/2, xmax = viewsize, ymin = -viewsize/2, ymax = viewsize/2, aspect_ratio=1)

🔗

The last stage of the proof is very visual. Before we move on, make sure you believe all the claims of this stage, especially the claims about areas. Those are the ones we will analyze more closely to finish the proof of Theorem 13.4.5. Remember always that we are trying to prove that there is a blue point contained inside the disk bounded by the bigger blue circle, but away from the origin.

🔗

Subsection 13.4.4 Finishing the proof

🔗

Let's take stock.

We've created circles of various sizes to find points in, and two lattices to examine.
The area of the circle is more than the area ( $4 p$ ) of the smallest parallelogram made by green dots.

🔗

To finish the proof, we need to find a blue point other than the origin interior to the bigger blue circle of radius

\sqrt{2 p} .

The gist of the argument splits into two parts.

🔗

First, we will pursue Claim 13.4.11:

Because all points inside the parallelogram (not just green, blue, or lattice points) will “repeat” outside of it in another parallelogram, $4 p$ is the biggest area of a region that you can have and not “repeat” some point. (This parallelogram is often called a fundamental region in more general treatments.)
So, the interior of the circle, having a bigger area, must have two points (not necessarily blue points, just points on the plane) which are “repeated” by translation of this parallelogram.

🔗

We will expand on exactly what “repeat” means momentarily.

🔗

Secondly, we show why the previous claim leads to a proof in Claim 13.4.12:

We start with the two points from Claim 13.4.11 in the disk bounded by the circle (points which are not necessarily on any lattice, blue, green, or even black).
Then we use elementary geometry to construct a blue point (namely, one of the form $(a, a f + b p)$ ) which is strictly in the interior of the disk bounded by the circle of radius $\sqrt{2 p} .$ In particular, this point is not the origin.

🔗

The argument in Theorem 13.4.5 now finishes the proof.

🔗

Let's begin the final push to prove the two claims with a fact and a definition which explain what sort of points we are looking for.

🔗

Fact 13.4.9.

Let $L$ be the parallelogram with vertices $(0, 0),$ $(- 2 f, 2),$ $(2 p, 0),$ and $(2 p + 2 f, - 2)$ and its interior (where $f$ is a square root of $- 1$ modulo $p$ ). Any plane region is the union of its intersection with all possible translations of $L$ by rigidly moving $L$ so that the origin is translated to another green point.

🔗

Proof.

We are not going to prove topological facts in this text, nor explore the further depths of lattices. So it suffices to note that every green point $(2a,2af+2bp)$ can serve as the leftmost vertex of a unique parallelogram not just congruent to, but translated from, $L\text{,}$ and that by construction these cannot overlap (other than possibly along their edges).

🔗

Definition 13.4.10.

We say that two distinct points $v, w$ in a plane region are “repeated” if they are both rigid translations of the same point in $L,$ where the allowed translations are those described in Fact 13.4.9.

🔗

We now prove the two remaining claims to finish the proof of Theorem 13.4.5, after which we encourage the reader to explore the large interact in Example 13.4.13 which ends the section.

🔗

Claim 13.4.11.

Consider the circle of radius $\sqrt{2 p}$ centered at the origin. The interior of the disk bounded by this circle has two points “repeated” by shifting the parallelogram $L .$

🔗

Proof.

Recall from Fact 13.4.9 that the disk is composed of all its intersections with different parallelograms congruent to $L\text{.}$

Suppose that there are not two points “repeated” within the disk (not including the boundary circle). Then every point thereof is a translation of a different point of $L\text{.}$ One can make this a one-to-one function from the disk to $L$ by sending each point in the disk to the corresponding one in $L\text{.}$

Because each such move is rigid, this function is area-preserving³, which means the area of the disk must be less than or equal to that of $L\text{.}$

If you looked at this footnote because you want a proof of this, recall we do not prove topological facts in this text! Next you'll be wanting a proof of the Jordan curve theorem from first principles. More seriously, we have to draw the line somewhere, and I find pedagogically that students would find proving assertions of this kind similar to proving $1+1=2$ using Russell and Whitehead as a text. Convincing students that proving Fact 1.2.2 is useful is hard enough.

However, at the end of Subsection 13.4.3 we asserted the opposite! So by way of contradiction we have our two points.

🔗

Claim 13.4.12.

Given two points $v, w$ (in the interior of the circle of radius $\sqrt{2 p}$ centered at the origin) which “repeat” from $L,$ we can construct a point, not the origin, of the form $(a, a f + b p) .$

🔗

Proof.

Given how we defined “repetition”, we know that the line segments from $v$ and $w$ to the leftmost vertex of their respective translations of $L$ must themselves be rigid translations of each other, hence the line segment connecting $v$ and $w$ can be translated to a segment connecting the origin and another green point. Give this point the name⁴ $v-w\text{.}$

In fact, as vectors of course this is the point, but we minimize formal use of vectors in this text.

Since $v-w$ is of the form $(2a,2af+2bp)$ by definition, then the point halfway between it and the origin (or “$(v-w)/2$”) is a blue point of the form $(a,af+bp)\text{,}$ and clearly not the origin since $v-w$ itself is not the origin. It remains to show that this blue point is in the interior of the circle.

To see this, consider the distance $d$ between $v$ and $w\text{.}$ By definition of a circle, it cannot possibly be further than twice the radius, so $d$ is strictly less than $2\sqrt{2p}\text{.}$ But then $v-w$ cannot be more than $d$ units from the origin, so the point $(a,af+bp)\text{,}$ being exactly half that distance from the origin, is less than distance $\sqrt{2p}$ to the origin. By definition $(a,af+bp)$ is in the interior of the larger circle, as desired.

🔗

Example 13.4.13.

In Figure 13.4.14 we see the picture of how Claims 13.4.11 and 13.4.12 find the blue point in the circle. The black points are $v$ and $w,$ the arrows point between $v$ and $w$ and from the origin to $v - w,$ and the midpoint of the second arrow is indeed blue.

Figure 13.4.14. How to find the lattice point on the circle

🔗

Sage note 13.4.15. Examining code is good for you.

The next Sage cell makes Figure 13.4.14 interactive. But don't just use it to view the proof for other primes; examine the code itself.

This is by far the longest code we've seen up to this point. It is a brute force check of all movements of all points in the parallelogram to find two points in the bigger circle. Can you think of ways to make it more efficient?

xxxxxxxxxx
 
@interact
def _(p=(5,[q for q in prime_range(200) if q%4==1])):
    f=Integer(mod(factorial((p-1)/2),p))
    big = math.floor(math.sqrt(2*p))
    viewsize=2*p
    g(x,y)=x^2+y^2
    plot1 = implicit_plot(g-p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
    plot2 = implicit_plot(g-2*p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
    plot3 = line([[0,0], [2*p-2*f,2], [2*p,0], [2*f,-2], [0,0]], rgbcolor=(1,0,0))
    plot4 = line2d([[0,0],[2*p,0]], rgbcolor=(1,0,0), linestyle='--')
    grid_pts = [[i,j] for i in [-viewsize..viewsize] for j in [-viewsize..viewsize]]
    plot_grid_pts = points(grid_pts,rgbcolor=(0,0,0),pointsize=2)
    lattice_pts = [coords for coords in grid_pts if (coords[1]-f*coords[0])%p==0]
    plot_lattice_pts = points(lattice_pts, rgbcolor = (0,0,1),pointsize=10)
    big_lattice_pts = [[2*coords[0],2*coords[1]] for coords in lattice_pts]
    plot_lattice_pts2 = points(big_lattice_pts, rgbcolor = (0,1,0),pointsize=20)
    w = []
    v = []
    mw = []
    for i in [1..2*p-1]:
        for coords in [l for l in big_lattice_pts if l!=[0,0]]:
            if (i+coords[0])^2+(coords[1]-1)^2 < 2*p:
                for coords2 in [k for k in big_lattice_pts if k!=[0,0] and k!=coords]:
                    if (i+coords2[0])^2 + (coords2[1]-1)^2 < 2*p:
                        w = [i+coords[0],coords[1]-1]
                        v = [i+coords2[0],coords2[1]-1]
                        vmw = [v[0]-w[0],v[1]-w[1]]
                        break
            if w: break
        if w: break
    if not v:
        for i in [j for j in [f..p+f]]:
            for coords in [l for l in big_lattice_pts if l!=[0,0]]:
                if (i+coords[0])^2+(coords[1]-1)^2 < 2*p:
                    for coords2 in [k for k in big_lattice_pts if k!=[0,0] and k!=coords]:
                        if (i+coords2[0])^2 + (coords2[1]-1)^2 < 2*p:
                            w = [i+coords[0],coords[1]-1]
                            v = [i+coords2[0],coords2[1]-1]
                            vmw = [v[0]-w[0],v[1]-w[1]]
                            break
                if w: break
            if w: break
    if not v:
        for i in [j for j in [p-f..2*p-f]]:
            for coords in [l for l in big_lattice_pts if l!=[0,0]]:
                if (i+coords[0])^2+(coords[1]+1)^2 < 2*p:
                    for coords2 in [k for k in big_lattice_pts if k!=[0,0] and k!=coords]:
                        if (i+coords2[0])^2+(coords2[1]+1)^2 < 2*p:
                            w = [i+coords[0],coords[1]+1]
                            v = [i+coords2[0],coords2[1]+1]
                            vmw = [v[0]-w[0],v[1]-w[1]]
                            break
                if w: break
            if w: break
    P1=point(v,pointsize=20,rgbcolor=(0,0,0))
    P2=point(w,pointsize=20,rgbcolor=(0,0,0))
    Z=point(vmw,pointsize=20,rgbcolor=(0,0,0))
    plot4 = arrow(w,v,rgbcolor=(0,0,0), thickness=1, linestyle='--', arrowsize=3)
    plot5 = arrow((0,0),vmw,rgbcolor=(0,0,0), thickness=1, linestyle='--', arrowsize=3)
    plot6 = point((vmw[0]/2,vmw[1]/2),pointsize=30)
    show(plot1+plot2+plot3+plot4 + P1+P2+Z+plot4+plot5+plot6 + plot_grid_pts + plot_lattice_pts + plot_lattice_pts2, figsize = [5,5], xmin = -viewsize/2, xmax = viewsize, ymin = -viewsize/2, ymax = viewsize/2, aspect_ratio=1)

🔗

Believe it or not, we've concluded the proof – whew!

🔗

Why was this so hard? I can think of three reasons.

First, we are trying to prove something about squares by proving something about square roots. It works, but it means there will be many steps.
Secondly, we are not just algebraically proving it exists by solving an equation; we are forced to prove our square root exists with inequalities, which brings another set of complications.
Third, we chose to examine those inequalities geometrically to gain insight, so our proofs must use that insight – worthwhile, but stretching.

🔗

Historical remark 13.4.16. Hermann Minkowski.

Many more theorems of this kind, such as Lagrange's four square theorem, can be proved using similar techniques, which we are intentionally avoiding stating in their full generality. The names of Minkowski and Blichfeldt are associated with theorems using various symmetries and the notion of convexity in order to apply things more generally. Those who have had some physics may have heard of Minkowski before, as his work nearly beat Einstein to the notion of special relativity; his geometric framework for space-time gave Einstein the necessary apparatus to generalize to curved spacetime and general relativity.