Returning to work at the beginning of the week, I learned that one of our clients had an issue that was believed to be reputation-based. Their customers were having problems sending messages to a variety of large email providers, specifically hotmail.com, aol.com and yahoo.com. As you may or may not know, “IP Reputation” is a very important thing for a mail server to maintain. If you end up with a poor reputation, many email servers will not accept mail from you. For an ISP, this can be very bad and can generate a large number of costly support calls.
As the week marched on, it became evident that this was not your regular reputation issue. At this point I took ownership of the issue and decided to dig a little deeper. An analysis of the mail log file indicated that the problem was intermittent: sometimes messages went through, and sometimes the sending server would get disconnected from the recipient server due to a timeout.
Typically, if a mail server does not trust your IP address, it will respond with a 5xx SMTP code (permanent failure) and a short message. Similarly, if a server is rate limiting your IP, it will either block your connections completely or respond with a 4xx SMTP code (temporary failure) and a short message.
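As a rough illustration of how the first digit of a reply code carries that meaning, here is a minimal Python sketch. The function name `classify_smtp_reply` is my own invention for this example, and it only looks at the broad class of the code, not at the accompanying message.

```python
def classify_smtp_reply(code: int) -> str:
    """Map a three-digit SMTP reply code to its broad class (first digit only)."""
    if not 200 <= code <= 599:
        raise ValueError(f"not a recognised SMTP reply code: {code}")
    return {
        2: "success",
        3: "intermediate",
        4: "temporary failure",
        5: "permanent failure",
    }[code // 100]


# Example codes chosen for illustration only.
for code in (250, 354, 421, 550):
    print(code, classify_smtp_reply(code))
```

A log full of 4xx or 5xx replies would point towards rate limiting or a reputation block; a connection that simply times out with no reply at all does not fit either pattern.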
To me, this no longer looked like a reputation issue: the problem was intermittent, the recipient servers were disconnecting due to timeouts, there were no 4xx errors, and it was occurring across multiple recipient domains and MX servers. Furthermore, we had not heard similar reports from any other customers.
At this stage I smelled a network issue. However, I really do not like to simply cry “networking issue” unless I know with certainty that the problem is not being caused by our product (there is a bit of personal pride involved as well).
This is not a tutorial on how to use tcpdump, so I am going to get right into the analysis.
What happened was that the server connected successfully and SMTP commenced as expected: the RCPT TO and DATA commands were sent and received successfully. The server was sending mail data when it suddenly stopped receiving TCP ACK responses from the recipient server. I could see from the dump that retransmissions were attempted until the remote server finally closed the connection due to a timeout.
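To make that pattern concrete, here is a small Python sketch that spots it in a simplified trace. Everything in it is an assumption for illustration: the `Segment` type, the `summarise` helper and the sample numbers are made up, sequence numbers are treated as relative, and wraparound is ignored. It is not how tcpdump or Wireshark detect retransmissions internally.

```python
from collections import Counter
from typing import NamedTuple


class Segment(NamedTuple):
    direction: str  # "out" = sending server to recipient, "in" = the reverse
    seq: int        # sequence number of the segment's first byte
    length: int     # payload length in bytes
    ack: int        # acknowledgement number carried by the segment


def summarise(trace: list[Segment]) -> tuple[list[int], int]:
    """Return (retransmitted outbound sequence numbers, highest ACK from the peer)."""
    sent = Counter(s.seq for s in trace if s.direction == "out" and s.length > 0)
    retransmitted = sorted(seq for seq, count in sent.items() if count > 1)
    highest_ack = max((s.ack for s in trace if s.direction == "in"), default=0)
    return retransmitted, highest_ack


# Made-up trace: the first data segment is acknowledged, the second never is.
trace = [
    Segment("out", 1, 100, 1),
    Segment("in", 1, 0, 101),
    Segment("out", 101, 1400, 1),
    Segment("out", 101, 1400, 1),
    Segment("out", 101, 1400, 1),
]
retransmitted, highest_ack = summarise(trace)
print(retransmitted, highest_ack)
```

In this toy trace the segment starting at sequence number 101 is sent three times, while the peer's acknowledgements never move past 101, which is the same shape as the stall described above.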
This was good news: I could tell that the problem did not exist on our end! After some thought, there were only two possibilities to consider:
ACK responses from the recipient mail server.
I knew at this point that the problem was not on our end of things, but I did not want to leave the client hanging, and I was also curious about what was going on.
I sent the packet capture over to the client and explained the situation to them. I asked if they could do a similar packet capture at their edge router so that I could try to isolate the problem. Depending on their findings, I would be able to reliably determine whether the problem was inside or outside their network.
Admittedly, it is a little hard to follow what is going on here without the full capture, but for hopefully obvious reasons I could not post it. Instead, I will do my best to explain.
What the edge capture showed was that the router never saw the DATA transmissions from the sending server! This meant that the problem was definitely between the sending server and the edge. To explain further, the sending server never received ACK responses because the recipient server never even got the DATA transmissions.
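The comparison between the two captures boils down to a set difference. The Python sketch below shows the idea with made-up numbers; `missing_at_edge` is a hypothetical helper, and it assumes the sequence numbers from both captures are directly comparable (absolute values, with nothing in between rewriting them), which may not hold in every network.

```python
def missing_at_edge(sender_seqs: set[int], edge_seqs: set[int]) -> list[int]:
    """Sequence numbers seen leaving the server but never seen at the edge."""
    return sorted(sender_seqs - edge_seqs)


# Made-up sequence numbers for a single connection.
sender_capture = {1, 101, 1501, 2901}
edge_capture = {1, 101}
print(missing_at_edge(sender_capture, edge_capture))  # prints [1501, 2901]
```

Any segment that shows up in the server-side capture but not in the edge capture was lost somewhere between those two points.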
The client and I discussed whether any unique equipment sat between the sending server and the edge. There was only one possibility: this mail server cluster had a load balancer positioned in front of it, and the mail servers were sending all outbound traffic back through the load balancer.
I performed a simple test and configured one of the servers so that it did not send mail through the load balancer.
Success! The receiving server gave a 2xx response, the sending server sent the SMTP QUIT command and the TCP ACK sequences finished correctly.
Packet analysis is a very helpful tool when you really need to dig into a problem. It definitely is not at the top of my troubleshooting toolbox, but I do make a point of keeping in practice with it. In case anybody is curious, I used tcpdump, Wireshark and tshark for the above.
This also highlights the importance of being familiar with the OSI model and networking. For more information, I recommend settling into the following for a bit of light reading - https://tools.ietf.org/html/rfc793 😂