Recently I had to track down a very peculiar problem. We were experiencing a number of issues that seemed to be unrelated. These issues were sporadic, and at times would appear to go away for weeks at a time.
However, after looking at a number of things including application code and server config, we determined that the problem was more than likely in the network. A good way to test for this is a continuous ping with a large packet size (on Windows, -t keeps pinging until stopped and -l sets the payload size in bytes):
PING x.x.x.x -l 5000 -t
If there are issues within the network this should at least show them up if left running for a period of time. In our case it showed that we were having random timeouts.
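If you leave the ping running for hours, counting the timeouts by eye gets tedious. Below is a minimal sketch, assuming the output was saved to a text file and follows the usual Windows ping wording ("Reply from ..." and "Request timed out."). The function name summarize_ping is my own, and other message types (for example "Destination host unreachable") are simply ignored.

```python
import re

# Windows ping prints lines such as
#   "Reply from 10.0.0.1: bytes=5000 time=2ms TTL=64"
#   "Request timed out."
# 10.0.0.1 is only a placeholder address for illustration.
REPLY = re.compile(r"^Reply from \S+ bytes=\d+ time[=<]\d+ms", re.IGNORECASE)


def summarize_ping(lines):
    """Count successful replies and timeouts in saved ping output."""
    replies = 0
    timeouts = 0
    for line in lines:
        text = line.strip()
        if REPLY.match(text):
            replies += 1
        elif text.lower().startswith("request timed out"):
            timeouts += 1
    total = replies + timeouts
    loss = (timeouts / total * 100) if total else 0.0
    return {"replies": replies, "timeouts": timeouts, "loss_percent": loss}
```

Feed it the lines of the saved output, for example summarize_ping(open("ping.log")), where ping.log is a hypothetical file name.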
The network was a very flat design of 6 switch stacks with 4-port LAGs defined between each switch stack. The switches in question were Dell N2000 series switches.
So how do you look for problems? The command line syntax for the Dells is very Cisco-esque: not 100%, but close enough to at least get you pointing in the right direction.
The command that I used to review the error counters was this:
show interfaces counters errors
This gives us an output table like the one below:
Align-Err: Alignment errors
FCS-Err: FCS errors
Xmit-Err: Total transmit errors
Rcv-Err: Total packets received with MAC errors
UnderSize: Fragments/undersize received
OutDiscard: Total transmit packets discarded
Port      Align-Err  FCS-Err    Xmit-Err   Rcv-Err    UnderSize  OutDiscard
--------- ---------- ---------- ---------- ---------- ---------- ----------
Gi1/0/1   0          0          0          0          0          0
Gi1/0/2   0          0          0          0          0          0
This command was run on all 6 switch stacks, and on one of the switches it showed errors on one of the port channels.
Port      Align-Err  FCS-Err    Xmit-Err   Rcv-Err    UnderSize  OutDiscard
--------- ---------- ---------- ---------- ---------- ---------- ----------
Gi1/0/21  0          0          0          0          0          105597
Gi1/0/22  0          0          0          0          0          147720
Gi1/0/23  0          99521583   0          99835323   812        6386
Gi1/0/24  0          2          0          2          0          4006
Port
Channel   Align-Err  FCS-Err    Xmit-Err   Rcv-Err    UnderSize  OutDiscard
--------- ---------- ---------- ---------- ---------- ---------- ----------
Po1       0          99521585   0          99835325   812        19945
Po2       0          0          0          0          0          301690
You can see that only one of the ports in the LAG (Gi1/0/23) was having problems. Once found, this began to tie together all the problems we were seeing and also explained the intermittent nature of our fault. The switches saw that the ports were up, so to them the link looked OK. We ran the command a few times, and each time the errors had increased. Once this was identified we replaced the cable... and the errors stopped.
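Comparing two runs of the counters by hand across six stacks is error prone, so here is a rough sketch of how the comparison could be scripted. It assumes the output of show interfaces counters errors has been captured as text and that each data row is a port name followed by six numeric columns, as in the tables above. The function and field names are my own, not anything the switch provides.

```python
FIELDS = ("align_err", "fcs_err", "xmit_err", "rcv_err", "undersize", "out_discard")


def parse_counters(text):
    """Return {port: {field: count}} for rows like 'Gi1/0/23 0 99521583 ...'."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # A data row is a port name followed by exactly six integer columns;
        # header, legend and separator lines fail this check and are skipped.
        if len(parts) == 7 and all(p.isdigit() for p in parts[1:]):
            counters[parts[0]] = dict(zip(FIELDS, map(int, parts[1:])))
    return counters


def growing_errors(before, after, fields=("fcs_err", "rcv_err")):
    """Return ports whose selected error counters rose between two snapshots."""
    report = {}
    for port, now in after.items():
        prev = before.get(port, {})
        # Keep only the counters that actually increased.
        grew = {f: now[f] - prev.get(f, 0) for f in fields if now[f] > prev.get(f, 0)}
        if grew:
            report[port] = grew
    return report
```

Capture the command output twice a few minutes apart, parse both with parse_counters, and pass the results to growing_errors. A port with a bad cable should show up with rising FCS and receive errors, while ports that only accumulate discards are left out by default.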
Port-channels (LAGs) do not spread traffic across their member ports packet by packet; they do it by session. If you copy a file across the port-channel, that transfer uses only one of the cables, so you get a maximum of 1Gbps per transfer. A port-channel with 4Gbps of bandwidth really means you can have four separate 1Gbps transfers running at the same time.
In our case the faulty connection was not fully down; it was just corrupting packets, so any unlucky transfer that was placed on this link would be very slow. This is what the large ping was showing: every now and again the ping would be directed onto the faulty link and would then time out.
Now that we have found the problem, I can look back at some of the Wireshark captures we took, which showed a number of duplicate packets on slow-running transfers.
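To make the per-session behaviour concrete, here is a small illustrative sketch. It is not the hash the Dell switches actually use (that is vendor specific and usually configurable); it just assumes a hash over the source and destination addresses and ports, with CRC32 standing in for the real function. The port names match the LAG members shown earlier, and the addresses in the usage note are made up.

```python
import zlib


def pick_member(src_ip, dst_ip, src_port, dst_port, members):
    """Pick one LAG member for a flow by hashing its addresses and ports."""
    # The same flow always produces the same key, so it always lands on
    # the same physical link. CRC32 is only a stand-in for the switch's hash.
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return members[zlib.crc32(key) % len(members)]


members = ["Gi1/0/21", "Gi1/0/22", "Gi1/0/23", "Gi1/0/24"]
```

Calling pick_member("10.0.0.5", "10.0.1.9", 50000, 445, members) repeatedly returns the same port every time, which is why a single transfer never exceeds one link's speed, while different flows spread over the four members. It also shows why one bad cable hurts only the flows that happen to hash onto it.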
I have now reset the counters, and so far we have no errors. :) Just for reference, the command to clear the counters is:
clear counters port-channel 1