An Apple Push Notifications Debugging Story

APNs is a service provided by Apple that allows you to send push notifications to your iOS app's users. To use it, your service connects to an SSL-secured TCP server located at gateway.push.apple.com, port 2195, and you stream a protocol-specific message to the server for each push notification you want to send. You're intended to keep a persistent connection to their service, even when you aren't sending any messages. This is the story of a bug I ran into with my APNs client that led to a [yak-shaving](http://catb.org/jargon/html/Y/yak-shaving.html) session that spanned multiple days. [Apple's documentation](https://developer.apple.com/library/ios/documentation/NetworkingInternet/Conceptual/RemoteNotificationsPG/Chapters/CommunicatingWIthAPS.html#//apple_ref/doc/uid/TP40008194-CH101-SW4) states that they will close the connection whenever you send a faulty message to the service. Before the connection is closed they will send you an *error response packet* describing the exact reason the connection was closed. It also states that they can sometimes actively close the connection for maintenance reasons and in that case will send an error response packet confirming that it was due to normal maintenance. Despite otherwise correct behavior, I noticed that occasionally my APNs connections were being closed on Apple's end so I added some code to actually track the error response packet and log it. I was pretty sure my connections were being closed for maintenance reasons but I wanted to be completely sure. This was my loop (in pseudocode): apn_connection = new_apn_connection() while True: apn_request = get_apn_request() if apn_connection.is_readable(): data = apn_connection.read(6) # error response packet is 6 bytes long if len(data) != 6: raise Exception("error response packet was wrong size!") log_error_response(parse_error_response(data)) break apn_connection.send(serialize_apn_request(apn_request)) I noticed that whenever the `is_readable()` method on the connection object returned true, the `read()` method invocation that followed would return 0 bytes read. This happened each and every time. Two things troubled me. The first thing was that I wasn't receiving the error response packet that APNs promised to send. I searched the web and found no mention of APNs ever not sending this message before closing the connection. The second thing that bothered me was that `is_readable()` was returning true when `read()` was returning 0. Given that my networking code was custom, my initial hypothesis was that my implementations of `is_readable()` and `read()` were wrong and that Apple was indeed sending the error response packet. My first attempt at debugging was to set up a little SSL server on my Mac and connect to it from my Linux machine (presumably an accurate reproduction of the setup). Here's the code for the Mac server (in realcode): def main(argv): a = socket.socket() a.bind(("", 60000)) a.listen(5) b = ssl.wrap_socket(a, server_side=True, keyfile="key", certfile="cert") while True: (c, h) = b.accept() print(h) time.sleep(10) c.send(b"111111") c.close() Unfortunately, the behavior was not reproduced. I correctly got the six bytes in my APNs client. I assumed that my Python server didn't accurately replicate the APNs server setup (after all SSL is a complex beast with multiple modes of operation). I was still suspicious of `is_readable()` and `read()`. For more context here is their implementation (in pseudocode): class SSLSocket(object): def is_readable(self): return SSL_pending(self._ssl) def SSL_pending(ssl): # from OpenSSL source # see ssl_ctrl() in openssl/ssl/bio_ssl.c val = ssl.bytes_in_read_buffer() if not val: return BIO_pending(ssl.read_bio) return val def BIO_pending(bio): # from OpenSSL source # ptr to dynamic io implementation return bio.pending(bio.ctx) def socket_bio_pending(socket_bio): return ioctl(socket_bio.fd, FIONREAD) I at least knew that `is_readable()` was returning true when the strange behavior was occuring and that normally it was returning false, therefore data must have been coming through at some stage of the pipeline. It was just being silently ignored somewhere within `read()`. I figured that the best route to understand what was going on was to more formally debug this, i.e. set a breakpoint after `is_readable()` returns true and step through the implementation of `read()` line-by-line to see first-hand what was happening. I compiled my server without optimizations and installed all the requisite debug packages for the system libraries I was using (libssl, libc, libstdc++). I ran my server, started up the debugger... then GDB immediately hit a segmentation fault. What the heck??? After searching through obscure bug reports on the web I eventually found a comment in an SO thread that said that GDB has been known in the past to run into an infinite loop when trying to demangle certain C++ symbol names. They recommended I run GDB on GDB to confirm. So, I GDB'd GDB and yes I could confirm some type of infinite loop by looking at the backtrace when the segmetation fault was happening. Someone else on the SO thread suggested downloading the latest version of GDB and compiling manually. It was worth a shot. I downloaded GDB's source. Ran `./configure; make`. After a successful compile I then ran the local GDB, attached to my application, and saw a warning about an undemangle-able C++ name that also said the debug session might be incorrect. This was better than a crash and despite the warning I was successfully at a GDB prompt. Only problem was that it wasn't picking up any of the debugging symbols for the system packages!!! Some more web searches later and nothing really came up. Normally people don't use the non-system debugger to debug system packages it seems. In desperation I searched the GDB manual specificallly and found [info on separate debug files](https://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html). Basically all I had to do was: (gdb) set debug-file-directory /usr/lib/debug Previously it was set to `/usr/local/lib/debug`, likely based on the `./configure --prefix`, d'oh! After I re-attached to the process I had all my OpenSSL and glibc symbols, great!!! All I had to do was to set the breakpoint and wait for the strange behavior to happen. Since this seemed to be happening randomly and sparsely in time, I added some logging to the APN code and created a little cron job to send me an email whenever the log file contained a special string. Once I received the email I would then know when to return to my computer and start the debugging session. * * * Hours passed and everything was as expected. I got the email, I came back to the GDB session and I was sitting at a breakpoint. Before jumping in, I used `netstat` to check my APNs connection and indeed the socket was in `CLOSE_WAIT` with bytes sitting in the OS receive buffer. I happily stepped through my code while observing its behavior but once I stepped into OpenSSL code I realized that GDB didn't have the OpenSSL sources... I could kind of glean what was going on but it was very tedious since all I could really do was step through assembly and look at my current backtrace. I trudged along anyway and was able to confirm that the code was succesfully reading data off the socket yet ultimately OpenSSL was still returning 0 bytes. At this point it seemed less like this was an unexpected error and more like it was intentional behavior on OpenSSL's part. I started to inspect the OpenSSL source but there were too many code paths to check and by this point my patience was running thin. Instead I just decided to get source debugging working and try again. I did some web searches on "viewing source of system debug packages in gdb ubuntu" and nothing conclusive came up. A random comment in another SO thread just said I had to download the source package and configure GDB manually. Okay... sure, why not? $ apt-get source libssl1.0.0 But I didn't know where GDB was expecting to find the source. I gave up on searching the web at this point, brute force was all I had patience for: $ dpkg-query -L libssl1.0.0-dbg <snip> /usr/lib/debug/.build-id/59/71f1f9a8fdce7edd724e0cc78225fa259b5420.debug </snip> $ objdump -x /usr/lib/debug/.build-id/59/71f1f9a8fdce7edd724e0cc78225fa259b5420.debug | less # nothing useful $ man objdump # some reading $ objdump --dwarf /usr/lib/debug/.build-id/59/71f1f9a8fdce7edd724e0cc78225fa259b5420.debug | less # ugh wtf??? $ objdump --dwarf /usr/lib/debug/.build-id/59/71f1f9a8fdce7edd724e0cc78225fa259b5420.debug | grep / <snip> <15> DW_AT_comp_dir : (indirect string, offset: 0x1cc): /build/buildd/openssl-1.0.1f/engines </snip> If you squint hard enough you'll see that it looks like the OpenSSL sources are expected to be at `/build/buildd/openssl-1.0.1f`. I dug through the GDB manual again and I found that you can [set up source paths](https://sourceware.org/gdb/onlinedocs/gdb/Source-Path.html): (gdb) set substitute-path /build/buildd/openssl-1.0.1f /home/rian/openssl-1.0.1f I set a test breakpoint in my networking code and I confirmed that I was able to step through OpenSSL at the source-level. Thank goodness it worked, I was running out of options. I reset the original breakpoint and waited again. * * * Again, hours passed and I was back at the breakpoint. Just as before the socket was in `CLOSE_WAIT` and there was data sitting in the OS receive buffer. I started to step through the code. Since there was no data in OpenSSL's internal receive buffer (as expected since I perform only a single `read()` call) it had to retrieve an SSL packet from socket buffer (using `ssl3_get_packet()`). The packet returned was of type `SSL3_RT_ALERT`, not `SSL3_RT_APPLICATION_DATA`. The code that retrieves the packet branches on the packet type and it looked like packets with the `SSL3_RT_ALERT` type were handled differently: processing of the packet halts and OpenSSL sets some internal state based on the contents of the "alert" packet. The alert packet in this case was of subtype `SSL_AD_CLOSE_NOTIFY`. Hmm... I did some web searches on "SSL_AD_CLOSE_NOTIFY" and "Ssl close notify packet" and I found that this is a packet that an SSL peer sends when it's about to close the socket. It's not supposed to contain application data since it's an SSL protocol-level packet. Aha! It looked as though APNs was indeed *not* sending me the error response packet! Instead I was only getting an SSL-level close notify packet and then the connection was closed immediately after. This also explained why `is_readable()` would return true: there was data in the OS buffer but not in the SSL buffer and once it was in the SSL buffer it was discarded since it wasn't application data. So... it wasn't an error in my code? I was still highly suspicious. I needed to reproduce this to give me peace. I pulled out the Python server again to see if I could provoke the exact same behavior. I modified it to *not* send the 6 bytes of data and instead just close the connection: def main(argv): a = socket.socket() a.bind(("", 60000)) a.listen(5) b = ssl.wrap_socket(a, server_side=True, keyfile="key", certfile="cert") while True: (c, h) = b.accept() time.sleep(10) print(h) c.close() I ran my application, attached GDB, and set the breakpoint as before. This time `is_readable()` was never returning true. Instead my server would just fail on the `send()` on the APN connection, not a surprise since the connection had been closed. Was my Python server not sending `SSL_AD_CLOSE_NOTIFY`? Quick look into the source of Python's [ssl.py](https://hg.python.org/cpython/file/4f48b1e982ca/Lib/ssl.py): def close(self): if self._makefile_refs < 1: self._sslobj = None socket.close(self) else: self._makefile_refs -= 1 It didn't look like it was calling any special SSL routine to send `SSL_AD_CLOSE_NOTIFY`. It just seemed to be closing the underlying socket. I wondered how to get OpenSSL to send a close notify packet. Another web searched revealed that one can do this by calling `SSL_shutdown()`. Ah, so maybe `SSLSocket.shutdown()` calls `SSL_shutdown()`! def shutdown(self, how): self._checkClosed() self._sslobj = None socket.shutdown(self, how) Hmmm, didn't look like it, just calls the `shutdown()` method of the underlying socket... This seemed wrong. How could there be such a blatant bug in Python's SSL implementation? I had no more time to worry about issues likes that, instead I tried calling `SSL_shutdown()` manually in my mock server: # server accept loop (c, h) = b.accept() time.sleep(10) print(h) c._sslobj.shutdown() # <- new c.close() Yes! It worked. The exact same behavior was produced. `is_readable()` returned True and when I stepped through the subsequent OpenSSL code, it parsed a `SSL_AD_CLOSE_NOTIFY` packet and returned 0 bytes, exactly as it was when I was connected to the real APN service. Everything was starting to match my mental model. Things were becoming peaceful and good again. Still, how was there such a blatant bug in Python's SSL implementation? Another web search didn't pull up anything. Was this yak-shaving session actually going to end up with a tangible positive outcome??? No way. No. Way. I refused to believe it. I went over to [bugs.python.org](http://bugs.python.org/) and searched for "SSL_shutdown ssl" and I found [this bug](http://bugs.python.org/issue17672). No, there was no bug in Python. This was all intentional of course. The idiomatic way to do this is to call `ssl_socket.unwrap().close()`. Wow. Maybe I should have grepped for "shutdown" in ssl.py first. Oh well. For a brief moment I thought of all the people out there who were blissfully and utterly unaware of `SSLSocket.unwrap()`, `SSL_shutdown()` and `SSL_AD_CLOSE_NOTIFY` packets. One last thing was *bugging* me. I hadn't yet proved that my code actually was capable of even receiving APNs' error response packet in the first place. I had no reason to believe it wouldn't be but I didn't want any more uncertainty in this area. I had enough of APNs and SSL and OpenSSL and *fuck all*. I modified the code to garble the device token to provoke a "invalid token" error and yes... APNs sent the error response packet and my code could receive and parse it correctly. That's the end of the story. My code was correct the entire time and Apple's APNs documentation wasn't entirely accurate. While the process of verifying this may have sounded tedious and perhaps also frustrating at times, in retrospect I enjoyed the experience and learned some generally useful things: * `FIONREAD` is a well supported socket `ioctl()` on Windows, Linux, and the BSDs * Writing an SSL server in Python is pretty easy * GDBing GDB is a thing. * How to debug Ubuntu/Debian system packages at the source-level with GDB * I'm now pretty familiar with the OpenSSL source (it gets a bad rap, IMO C is more at fault) * `SSLsocket.unwrap()` is a thing. So is `SSL_Shutdown()`. So are SSL alert packets. Rian Hunter, 7/1/15 <div id="disqus_thread"></div>