APNs is a service provided by Apple that allows you to send push
notifications to your iOS app's users. To use it, your service
connects to an SSL-secured TCP server located at
gateway.push.apple.com, port 2195, and you stream a protocol-specific
message to the server for each push notification you want to
send. You're intended to keep a persistent connection to their
service, even when you aren't sending any messages.
This is the story of a bug I ran into with my APNs client that led to
a [yak-shaving](http://catb.org/jargon/html/Y/yak-shaving.html)
session that spanned multiple days.
[Apple's
documentation](https://developer.apple.com/library/ios/documentation/NetworkingInternet/Conceptual/RemoteNotificationsPG/Chapters/CommunicatingWIthAPS.html#//apple_ref/doc/uid/TP40008194-CH101-SW4)
states that they will close the connection whenever you send a faulty
message to the service. Before the connection is closed they will send
you an *error response packet* describing the exact reason the
connection was closed. It also states that they can sometimes actively close
the connection for maintenance reasons and in that case will send an
error response packet confirming that it was due to normal
maintenance.
Despite otherwise correct behavior, I noticed that occasionally my
APNs connections were being closed on Apple's end so I added some code
to actually track the error response packet and log it. I was pretty
sure my connections were being closed for maintenance reasons but I
wanted to be completely sure. This was my loop (in pseudocode):
apn_connection = new_apn_connection()
while True:
apn_request = get_apn_request()
if apn_connection.is_readable():
data = apn_connection.read(6) # error response packet is 6 bytes long
if len(data) != 6: raise Exception("error response packet was wrong size!")
log_error_response(parse_error_response(data))
break
apn_connection.send(serialize_apn_request(apn_request))
I noticed that whenever the `is_readable()` method on the connection
object returned true, the `read()` method invocation that followed
would return 0 bytes read. This happened each and every time.
Two things troubled me. The first thing was that I wasn't receiving
the error response packet that APNs promised to send. I searched the
web and found no mention of APNs ever not sending this message before
closing the connection. The second thing that bothered me was that
`is_readable()` was returning true when `read()` was returning
0. Given that my networking code was custom, my initial hypothesis was
that my implementations of `is_readable()` and `read()` were wrong and
that Apple was indeed sending the error response packet.
My first attempt at debugging was to set up a little SSL server on my
Mac and connect to it from my Linux machine (presumably an accurate
reproduction of the setup). Here's the code for the Mac server (in
realcode):
def main(argv):
a = socket.socket()
a.bind(("", 60000))
a.listen(5)
b = ssl.wrap_socket(a, server_side=True, keyfile="key", certfile="cert")
while True:
(c, h) = b.accept()
print(h)
time.sleep(10)
c.send(b"111111")
c.close()
Unfortunately, the behavior was not reproduced. I correctly got the
six bytes in my APNs client. I assumed that my Python server didn't
accurately replicate the APNs server setup (after all SSL is a complex
beast with multiple modes of operation).
I was still suspicious of `is_readable()` and `read()`. For more
context here is their implementation (in pseudocode):
class SSLSocket(object):
def is_readable(self):
return SSL_pending(self._ssl)
def SSL_pending(ssl):
# from OpenSSL source
# see ssl_ctrl() in openssl/ssl/bio_ssl.c
val = ssl.bytes_in_read_buffer()
if not val:
return BIO_pending(ssl.read_bio)
return val
def BIO_pending(bio):
# from OpenSSL source
# ptr to dynamic io implementation
return bio.pending(bio.ctx)
def socket_bio_pending(socket_bio):
return ioctl(socket_bio.fd, FIONREAD)
I at least knew that `is_readable()` was returning true when the
strange behavior was occuring and that normally it was returning false,
therefore data must have been coming through at some stage of the
pipeline. It was just being silently ignored somewhere within
`read()`. I figured that the best route to understand what was going on
was to more formally debug this, i.e. set a breakpoint after
`is_readable()` returns true and step through the implementation of
`read()` line-by-line to see first-hand what was happening.
I compiled my server without optimizations and installed all the
requisite debug packages for the system libraries I was using (libssl,
libc, libstdc++). I ran my server, started up the debugger... then GDB
immediately hit a segmentation fault. What the heck???
After searching through obscure bug reports on the web I eventually
found a comment in an SO thread that said that GDB has been known in
the past to run into an infinite loop when trying to demangle certain
C++ symbol names. They recommended I run GDB on GDB to confirm. So, I GDB'd
GDB and yes I could confirm some type of infinite loop
by looking at the backtrace when the segmetation fault was
happening. Someone else on the SO thread suggested downloading the
latest version of GDB and compiling manually. It was worth a shot.
I downloaded GDB's source. Ran `./configure; make`. After a successful
compile I then ran the local GDB, attached to my application, and saw
a warning about an undemangle-able C++ name that also said the debug
session might be incorrect. This was better than a crash and despite
the warning I was successfully at a GDB prompt. Only problem was that
it wasn't picking up any of the debugging symbols for the system
packages!!!
Some more web searches later and nothing really came up. Normally
people don't use the non-system debugger to debug system packages it
seems. In desperation I searched the GDB manual specificallly and
found [info on separate debug
files](https://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html). Basically
all I had to do was:
(gdb) set debug-file-directory /usr/lib/debug
Previously it was set to `/usr/local/lib/debug`, likely based on the
`./configure --prefix`, d'oh!
After I re-attached to the process I had all my OpenSSL and glibc
symbols, great!!! All I had to do was to set the breakpoint and wait
for the strange behavior to happen. Since this seemed to be happening
randomly and sparsely in time, I added some logging to the APN code
and created a little cron job to send me an email whenever the log
file contained a special string. Once I received the email I would
then know when to return to my computer and start the debugging
session.
* * *
Hours passed and everything was as expected. I got the email, I
came back to the GDB session and I was sitting at a breakpoint. Before
jumping in, I used `netstat` to check my APNs connection and indeed
the socket was in `CLOSE_WAIT` with bytes sitting in the OS receive
buffer. I happily stepped through my code while observing its behavior
but once I stepped into OpenSSL code I realized that GDB didn't have
the OpenSSL sources... I could kind of glean what was going on but it
was very tedious since all I could really do was step through assembly
and look at my current backtrace. I trudged along anyway and was able
to confirm that the code was succesfully reading data off the socket
yet ultimately OpenSSL was still returning 0 bytes. At this point it
seemed less like this was an unexpected error and more like it was
intentional behavior on OpenSSL's part.
I started to inspect the OpenSSL source but there were too many code
paths to check and by this point my patience was running thin. Instead
I just decided to get source debugging working and try again.
I did some web searches on "viewing source of system debug packages in
gdb ubuntu" and nothing conclusive came up. A random comment
in another SO thread just said I had to download the source package
and configure GDB manually. Okay... sure, why not?
$ apt-get source libssl1.0.0
But I didn't know where GDB was expecting to find the source. I gave
up on searching the web at this point, brute force was all I had
patience for:
$ dpkg-query -L libssl1.0.0-dbg
/usr/lib/debug/.build-id/59/71f1f9a8fdce7edd724e0cc78225fa259b5420.debug
$ objdump -x /usr/lib/debug/.build-id/59/71f1f9a8fdce7edd724e0cc78225fa259b5420.debug | less # nothing useful
$ man objdump # some reading
$ objdump --dwarf /usr/lib/debug/.build-id/59/71f1f9a8fdce7edd724e0cc78225fa259b5420.debug | less # ugh wtf???
$ objdump --dwarf /usr/lib/debug/.build-id/59/71f1f9a8fdce7edd724e0cc78225fa259b5420.debug | grep /
<15> DW_AT_comp_dir : (indirect string, offset: 0x1cc): /build/buildd/openssl-1.0.1f/engines
If you squint hard enough you'll see that it looks like the OpenSSL
sources are expected to be at `/build/buildd/openssl-1.0.1f`. I dug
through the GDB manual again and I found that you can [set up source
paths](https://sourceware.org/gdb/onlinedocs/gdb/Source-Path.html):
(gdb) set substitute-path /build/buildd/openssl-1.0.1f /home/rian/openssl-1.0.1f
I set a test breakpoint in my networking code and I confirmed that I
was able to step through OpenSSL at the source-level. Thank goodness
it worked, I was running out of options.
I reset the original breakpoint and waited again.
* * *
Again, hours passed and I was back at the breakpoint. Just as before the
socket was in `CLOSE_WAIT` and there was data sitting in the OS
receive buffer. I started to step through the code. Since there was no
data in OpenSSL's internal receive buffer (as expected since I perform
only a single `read()` call) it had to retrieve an SSL packet from
socket buffer (using `ssl3_get_packet()`). The packet returned was of
type `SSL3_RT_ALERT`, not `SSL3_RT_APPLICATION_DATA`. The code that
retrieves the packet branches on the packet type and it looked like
packets with the `SSL3_RT_ALERT` type were handled differently:
processing of the packet halts and OpenSSL sets some internal state
based on the contents of the "alert" packet. The alert packet in this
case was of subtype `SSL_AD_CLOSE_NOTIFY`. Hmm...
I did some web searches on "SSL_AD_CLOSE_NOTIFY" and "Ssl close notify
packet" and I found that this is a packet that an SSL peer sends when
it's about to close the socket. It's not supposed to contain
application data since it's an SSL protocol-level packet. Aha! It
looked as though APNs was indeed *not* sending me the error response
packet! Instead I was only getting an SSL-level close notify packet
and then the connection was closed immediately after. This also
explained why `is_readable()` would return true: there was data in the
OS buffer but not in the SSL buffer and once it was in the SSL buffer
it was discarded since it wasn't application data.
So... it wasn't an error in my code? I was still highly suspicious. I
needed to reproduce this to give me peace. I pulled out the Python
server again to see if I could provoke the exact same behavior. I
modified it to *not* send the 6 bytes of data and instead just close
the connection:
def main(argv):
a = socket.socket()
a.bind(("", 60000))
a.listen(5)
b = ssl.wrap_socket(a, server_side=True, keyfile="key", certfile="cert")
while True:
(c, h) = b.accept()
time.sleep(10)
print(h)
c.close()
I ran my application, attached GDB, and set the breakpoint as
before. This time `is_readable()` was never returning true. Instead my
server would just fail on the `send()` on the APN connection, not a
surprise since the connection had been closed. Was my Python server
not sending `SSL_AD_CLOSE_NOTIFY`? Quick look into the source of
Python's
[ssl.py](https://hg.python.org/cpython/file/4f48b1e982ca/Lib/ssl.py):
def close(self):
if self._makefile_refs < 1:
self._sslobj = None
socket.close(self)
else:
self._makefile_refs -= 1
It didn't look like it was calling any special SSL routine to send
`SSL_AD_CLOSE_NOTIFY`. It just seemed to be closing the underlying
socket. I wondered how to get OpenSSL to send a close notify
packet. Another web searched revealed that one can do this by calling
`SSL_shutdown()`. Ah, so maybe `SSLSocket.shutdown()` calls
`SSL_shutdown()`!
def shutdown(self, how):
self._checkClosed()
self._sslobj = None
socket.shutdown(self, how)
Hmmm, didn't look like it, just calls the `shutdown()` method of the
underlying socket... This seemed wrong. How could there be such a
blatant bug in Python's SSL implementation? I had no more time to
worry about issues likes that, instead I tried calling
`SSL_shutdown()` manually in my mock server:
# server accept loop
(c, h) = b.accept()
time.sleep(10)
print(h)
c._sslobj.shutdown() # <- new
c.close()
Yes! It worked. The exact same behavior was produced. `is_readable()`
returned True and when I stepped through the subsequent OpenSSL code,
it parsed a `SSL_AD_CLOSE_NOTIFY` packet and returned 0 bytes, exactly
as it was when I was connected to the real APN service. Everything
was starting to match my mental model. Things were becoming peaceful
and good again.
Still, how was there such a blatant bug in Python's SSL
implementation? Another web search didn't pull up anything. Was this
yak-shaving session actually going to end up with a tangible positive
outcome??? No way. No. Way. I refused to believe it. I went over to
[bugs.python.org](http://bugs.python.org/) and searched for
"SSL_shutdown ssl" and I found [this
bug](http://bugs.python.org/issue17672).
No, there was no bug in Python. This was all intentional of
course. The idiomatic way to do this is to call
`ssl_socket.unwrap().close()`. Wow. Maybe I should have grepped for
"shutdown" in ssl.py first. Oh well. For a brief moment I thought of
all the people out there who were blissfully and utterly unaware of
`SSLSocket.unwrap()`, `SSL_shutdown()` and `SSL_AD_CLOSE_NOTIFY`
packets.
One last thing was *bugging* me. I hadn't yet proved that my code
actually was capable of even receiving APNs' error response packet in
the first place. I had no reason to believe it wouldn't be but I
didn't want any more uncertainty in this area. I had enough of APNs
and SSL and OpenSSL and *fuck all*. I modified the code to garble the
device token to provoke a "invalid token" error and yes... APNs sent
the error response packet and my code could receive and parse it
correctly.
That's the end of the story. My code was correct the entire time and
Apple's APNs documentation wasn't entirely accurate. While the process
of verifying this may have sounded tedious and perhaps also
frustrating at times, in retrospect I enjoyed the experience and
learned some generally useful things:
* `FIONREAD` is a well supported socket `ioctl()` on Windows, Linux, and the BSDs
* Writing an SSL server in Python is pretty easy
* GDBing GDB is a thing.
* How to debug Ubuntu/Debian system packages at the source-level with GDB
* I'm now pretty familiar with the OpenSSL source (it gets a bad rap, IMO C is more at fault)
* `SSLsocket.unwrap()` is a thing. So is `SSL_Shutdown()`. So are SSL alert packets.
Rian Hunter, 7/1/15