Synapse Mishaps
The idea of the Matrix protocol is simple: a federated chat protocol that... works. And for the most part, this is true! Element, the recommended client within the Matrix ecosystem, looks beautiful.
Of course, nothing is as binary as "works" or "doesn't work". While operating a homeserver, we have encountered... curious issues. (Do not ask me about the external federation workers incident.) However, for the most part, it functions well.
This is a story about an issue involving synapse, the recommended homeserver implementation for Matrix.
Background
We have a very unconventional setup for our homeserver, awau.uk. The friend I manage the instance with (the wonderful Erisa) is more than familiar with Cloudflare's technologies. She suggested we rely on cloudflared to expose our server to Cloudflare's proxy, and it's worked well. Facilitating the exchange to synapse is old, trusty nginx, allowing us to route by URL to various synapse workers. So far, this has worked.
However, as is the age-old story, an odd issue appeared after many weeks of smooth operation. For no apparent reason, any attempt to join a room on the homeserver from an external server simply returned an error message:
Invalid signature for server matrix.org with key ed25519:a_RXGa: Unable to verify signature for matrix.org: <class 'nacl.exceptions.BadSignatureError'> Signature was forged or corrupt
As of writing, this issue is so prominent that it has a place in synapse's README. The authors write the following:
This is normally caused by a misconfiguration in your reverse-proxy. See docs/reverse_proxy.md and double-check that your settings are correct.
Normally, I would have agreed. However, this setup had worked perfectly fine previously - for months on end, in fact! Nobody had modified the configuration on either side.
So, dear reader, let's debug together. 🐞
A Simple, Five-Step Plan to Fixing Synapse
1. Research the error.
If we snoop through the synapse GitHub issues, we see quite a lot of causes. Erisa noted that many involve something URL-decoding the request halfway through, or appending an extra / to the request - all things that would invalidate the signature from the remote server. Sounds like we should check for that.
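To make those two failure modes concrete, here is a minimal sketch using only Python's standard library. The path and room ID are made-up placeholders, and the plain string comparison merely stands in for real signature verification.

```python
from urllib.parse import unquote

# Hypothetical request path, percent-encoded the way a remote server would send it.
signed_path = "/_matrix/federation/v1/make_join/%21room%3Aexample.org/%40user%3Amatrix.org"

# Two kinds of tampering a misbehaving proxy might apply before forwarding.
decoded_path = unquote(signed_path)  # percent-decoding halfway through
slashed_path = signed_path + "/"     # appending an extra /

for label, forwarded in [("decoded", decoded_path), ("extra slash", slashed_path)]:
    # The signature covers the exact string, so any difference is fatal.
    print(f"{label}: matches signed path? {forwarded == signed_path}")
```

Both variants still point at the same resource as far as a human is concerned, but neither is the string the remote server signed.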
2. See if the URL is reaching our server...?
Good idea! Let's check the logs:
nginx tells us:
GET /_matrix/federation/v1/make_join/%21xxxxxxxxroom_id_goes_here%3Aawau.uk/%40user_name%3Amatrix.org?ver=1&ver=2&ver=3&ver=4&ver=5&ver=6&ver=7&ver=8&ver=9&ver=org.matrix.msc2176&ver=org.matrix.msc2716v3&ver=org.matrix.msc3787
Authorization: X-Matrix origin="matrix.org",key="ed25519:a_RXGa",sig="xxxxxxxx",destination="awau.uk"
This seems right - notice that components of the URL are still encoded, as we'd expect. Life is good so far.
Next, Erisa had the idea to run tcpdump within the synapse container to see what's forwarded along.
GET /_matrix/federation/v1/make_join/%21xxxxxxxxroom_id_goes_here%3Aawau.uk/%40user_name%3Amatrix.org?ver=1&ver=2&ver=3&ver=4&ver=5&ver=6&ver=7&ver=8&ver=9&ver=org.matrix.msc2176&ver=org.matrix.msc2716v3&ver=org.matrix.msc3787
Authorization: X-Matrix origin="matrix.org",key="ed25519:a_RXGa",sig="xxxxxxxx",destination="awau.uk"
Hm... it doesn't seem like nginx is tampering with the request. However, we can't exactly inspect matrix.org and see what they're sending.
As part of experimenting, we disabled the Cloudflare proxy and temporarily exposed nginx to the public internet. To our dismay, the error persisted.
3. Determine what goes into a request...?????
Wait, yeah... what? This is a GET request, lacking a body (as is standard) - what is being signed here?
Let's delve through the synapse source code to learn a little bit about a request. For this, we'll use tag release-v1.61.
The first thing I want to know is what those ver query parameters represent. Searching their contents, we find that they are room versions.
Given that we're joining a room, this makes sense.
Tracing through the source, we find synapse/http/matrixfederationclient.py, which houses the majority of the logic for inter-server communication. It's here that we find the _send_request function, used elsewhere to dispatch the GET request with make_join. This function calls build_auth_headers, creating our signature!
We know from the above that an authentication header is sent. From some basic searching of X-Matrix, we find a function named _parse_auth_header.
It's here that we learn the signature is verified against JSON that the receiving server fabricates from the incoming request.
Hm... okay. For testing, let's edit awau.uk to print out what it fabricates.
It seems simple - and most importantly, the URL lines up with what we found via nginx and tcpdump. Where is the signature being invalidated?
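As a recap of what gets signed for a body-less request, here is a rough sketch. It assumes the field names given in the Matrix server-server specification (method, uri, origin, destination). The helper names are hypothetical, the JSON encoding only approximates canonical JSON, and a SHA-256 digest stands in for the real ed25519 signature so that the example runs with the standard library alone.

```python
import hashlib
import json


def signable_bytes(method: str, uri: str, origin: str, destination: str) -> bytes:
    """Build the bytes that would be signed for a request without a body."""
    request = {
        "method": method,
        "uri": uri,  # path and query string, exactly as sent
        "origin": origin,
        "destination": destination,
    }
    # Approximation of canonical JSON: sorted keys, no insignificant whitespace.
    encoded = json.dumps(request, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return encoded.encode("utf-8")


def fingerprint(data: bytes) -> str:
    # Stand-in for the ed25519 signature: any changed byte changes the digest.
    return hashlib.sha256(data).hexdigest()


# Placeholder URI, not a real room.
uri = "/_matrix/federation/v1/make_join/%21room%3Aawau.uk/%40user%3Amatrix.org?ver=1&ver=2"
original = signable_bytes("GET", uri, "matrix.org", "awau.uk")
tampered = signable_bytes("GET", uri.replace("%21", "!"), "matrix.org", "awau.uk")

print(original.decode("utf-8"))
print("same fingerprint after tampering?", fingerprint(original) == fingerprint(tampered))
```

The point is that the uri field is compared as an exact string: the receiving side rebuilds this object from whatever it received, so both servers must agree on every byte.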
4. Set up your own synapse instance
And so I did. Via this instance, we can see what message is being signed and sent to awau.uk!
Snooping through references to X-Matrix led me to synapse/crypto/keyring.py.
By adding basic logging to the request, we obtain the following:
5. Complain about this issue to friends
<spot> ok so what differs
<friend> the order
<friend> :^)
Huh... it does differ, doesn't it? Surely the JSON key order doesn't matter.
It was at this point I pasted the URLs side by side in Visual Studio Code, spacing them out for easier reading.
From synapse.example.com:
?ver=1
&ver=2
&ver=3
&ver=4
&ver=5
&ver=6
&ver=org.matrix.msc2176
&ver=7
&ver=8
&ver=9
&ver=org.matrix.msc2716v3
&ver=org.matrix.msc3787
From awau.uk:
?ver=1
&ver=2
&ver=3
&ver=4
&ver=5
&ver=6
&ver=7
&ver=8
&ver=9
&ver=org.matrix.msc2176
&ver=org.matrix.msc2716v3
&ver=org.matrix.msc3787
Oh no... the URLs differed, and we'd neglected to notice this entire time. The query parameters arriving at awau.uk are sorted alphabetically, instead of keeping the order in which the sending server's dictionary emitted them!
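Spotting this by eye took far too long, so here is a small sketch of how the comparison could be automated. The function name first_difference is hypothetical. It parses both query strings with the standard library and reports the first position where the parameters diverge.

```python
from urllib.parse import parse_qsl


def first_difference(sent_query: str, received_query: str):
    """Return (index, sent_pair, received_pair) for the first mismatch, or None."""
    sent = parse_qsl(sent_query, keep_blank_values=True)
    received = parse_qsl(received_query, keep_blank_values=True)
    for index, (a, b) in enumerate(zip(sent, received)):
        if a != b:
            return index, a, b
    if len(sent) != len(received):
        # One query string is a strict prefix of the other.
        return min(len(sent), len(received)), None, None
    return None


sent_versions = ["1", "2", "3", "4", "5", "6", "org.matrix.msc2176",
                 "7", "8", "9", "org.matrix.msc2716v3", "org.matrix.msc3787"]
received_versions = ["1", "2", "3", "4", "5", "6", "7", "8", "9",
                     "org.matrix.msc2176", "org.matrix.msc2716v3", "org.matrix.msc3787"]

sent_query = "&".join(f"ver={v}" for v in sent_versions)
received_query = "&".join(f"ver={v}" for v in received_versions)

print(first_difference(sent_query, received_query))
# Same parameters overall, just in a different order:
print(sorted(parse_qsl(sent_query)) == sorted(parse_qsl(received_query)))
```

For the two URLs above, the first mismatch is at index 6, where the sender had org.matrix.msc2176 and awau.uk saw 7.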
But... why?
(Optional) 6. Remember that Erisa really likes Cloudflare
Within the past few weeks, Erisa received an offer to experiment with Cloudflare Enterprise features. One such enterprise feature is Query String Sort, an option hidden within caching options.
Since the signature covers the URL exactly as it was sent, reordering the query string instantly invalidates it. I... suppose synapse did warn us about reverse proxies, but that's ridiculous.
It's likely that the earlier attempts to expose nginx directly failed due to DNS record caching, with requests still flowing through Cloudflare.
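For the curious, the effect of Query String Sort on our request can be emulated in a few lines. This is only a sketch: it assumes a plain lexicographic sort of the key/value pairs, which is my guess at the behaviour and not Cloudflare's documented algorithm, and sort_query_string is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode


def sort_query_string(query: str) -> str:
    """Emulate a proxy that sorts query parameters by key, then by value."""
    pairs = parse_qsl(query, keep_blank_values=True)
    return urlencode(sorted(pairs))


# Order in which the sending server emitted the room versions.
sent_versions = ["1", "2", "3", "4", "5", "6", "org.matrix.msc2176",
                 "7", "8", "9", "org.matrix.msc2716v3", "org.matrix.msc3787"]
sent_query = "&".join(f"ver={v}" for v in sent_versions)

received_query = sort_query_string(sent_query)
print(received_query)
print("unchanged by the proxy?", received_query == sent_query)
```

Under that assumption, the sorted output matches the order awau.uk received, which is enough to break a signature computed over the original order.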
Conclusion
I am going to be destroying all technical devices and moving to a farm. If you wish to reach out, please do so via smoke signals.