Let's Ginger-Fi it!

A blog about my adventures in Wi-Fi

CWNE Essay #2 Proving it’s not the wireless

In this essay I will describe how I used frame analysis to diagnose a RADIUS Certificate issue.

Introduction:

Proving “it’s not the wireless” seems to be a recurring theme I hear from other wireless people. To a lot of users the wireless is the “network,” so it comes up quite often. Sometimes you even have to prove it to the other members of your networking team.

The Aruba controllers we have are old and were supposed to be replaced with our new Cisco 9800s about 3 years ago. The outside contractor is behind on standing up the new Cisco DNA/SDN/single pane of glass network, so we have been keeping the Aruba controllers alive.

The Problem:

An issue arose where one of the two wireless controllers (it could have been either one at any random time) would have a soft failure. It would fail enough to drop all its APs and therefore clients, but not enough to indicate to the other controller that it had died and that the other controller should pick up the slack. This impacted students, faculty, and staff multiple times over a two-week period and resulted in a few TAC calls, our core network team having to do nightly reboots, and then finally a patch from Aruba. Everyone on campus was now watching the wireless network very closely. During this time, we also found ourselves victims of a cyber-attack, which led to lots of password changes and monitoring. Because of this, some network changes were made by our core team. We (on the network side) were unaware that one of these changes was to throttle some of the traffic. When we came in the following morning, lots of things seemed to not be working (again), and the wireless was quickly blamed.

I noticed I could get connected to the network and get to internal sites but could reach nothing external. Our internal speed test site (to the gateway) was showing great speeds, but I could not get speedtest.net or fast.com to load enough to run a speed test. I brought this to the core team and after some pushing, they admitted to making changes the previous evening. It turned out they had throttled traffic at the border firewall but had mistakenly overdone it and almost nothing was getting out. This was changed and everyone was working fine again but it was still being referred to as a wireless incident.
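That internal-versus-external comparison could be scripted roughly as in the sketch below. It is an illustration, not what was run at the time. The internal hostname is a made-up placeholder, and a plain TCP connect is only a coarse signal: heavily throttled traffic may still complete a handshake and then stall.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refusals, and timeouts alike.
        return False


def classify(internal_ok: bool, external_ok: bool) -> str:
    """Turn the two reachability results into a rough first guess."""
    if internal_ok and external_ok:
        return "internal and external both reachable"
    if internal_ok and not external_ok:
        # The client associated, authenticated, and reached the gateway,
        # so the fault is more likely upstream than in the wireless.
        return "look upstream of the wireless (border, firewall, ISP)"
    if not internal_ok and external_ok:
        return "internal service or routing problem"
    return "nothing reachable: check the client, wireless, and access layer"


# Example use (hostnames are assumptions; the internal one is hypothetical):
#   internal = can_connect("speedtest.campus.example")
#   external = can_connect("fast.com")
#   print(classify(internal, external))
```

The point of keeping `classify` separate is that the reasoning is the useful part: a working internal path with a dead external path is already evidence that the wireless hop is doing its job.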

Fast forward another two weeks, and we started getting reports of the wireless being down in many spots on campus. Most people could not connect, but a few (all on phones) in the same spaces could. This was happening in every one of our buildings on all four campuses.

The Solution:

I checked the Aruba controllers to make sure we were not experiencing another “soft failure” and checked with the core team to see if there had been any more changes; nothing. While the director and the core team were trying to hunt down the issue, I started taking packet captures while trying to connect to the wireless network. I could see in the capture that I was associating to the AP but failing during authentication. I asked if there could be an issue with the RADIUS server, because I could see a “certificate expired” response in the capture. I was told that if RADIUS were down, systems would get an alert, and that the certificate had just been renewed before the systems tech went on vacation. I then decided to forget the network on my phone and reconnect, but this time I chose “do not validate” for the CA certificate option. I was able to connect. I relayed this new information in our Teams chat, and the response from core was “The cert is related to authentication not AP.” I responded with “I know, that is why it seems like it is a RADIUS/Certificate issue” and attached a screenshot of the PCAP below.
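For readers who have not seen one, a “certificate expired” response in an 802.1X capture is typically a TLS alert sent by the client inside the EAP exchange. The sketch below decodes such an alert from raw bytes. It assumes a plaintext alert record (as in a TLS 1.2 or earlier handshake, before encryption starts) that has already been extracted from the EAP frames; the essay does not say which EAP method or TLS version was in use. The alert numbers come from the TLS specification.

```python
import struct

TLS_CONTENT_TYPE_ALERT = 21

ALERT_LEVELS = {1: "warning", 2: "fatal"}

# Certificate-related alert descriptions defined by the TLS specification.
ALERT_DESCRIPTIONS = {
    42: "bad_certificate",
    43: "unsupported_certificate",
    44: "certificate_revoked",
    45: "certificate_expired",
    46: "certificate_unknown",
    48: "unknown_ca",
}


def decode_tls_alert(record: bytes) -> tuple[str, str]:
    """Decode a plaintext TLS alert record into (level, description)."""
    if len(record) < 7:
        raise ValueError("too short to be a TLS alert record")
    # Record header: content type (1), version (2), payload length (2).
    content_type, _major, _minor, length = struct.unpack("!BBBH", record[:5])
    if content_type != TLS_CONTENT_TYPE_ALERT:
        raise ValueError("not an alert record")
    if length != 2:
        # A plaintext alert is exactly two bytes; anything else is
        # probably encrypted or malformed.
        raise ValueError("alert payload is not plaintext")
    level, description = record[5], record[6]
    return (
        ALERT_LEVELS.get(level, f"unknown({level})"),
        ALERT_DESCRIPTIONS.get(description, f"other({description})"),
    )


# A fatal certificate_expired alert in a TLS 1.2 record.
print(decode_tls_alert(bytes.fromhex("1503030002022d")))
```

In practice Wireshark does this decoding for you; the value of seeing the bytes is knowing that the alert comes from the client rejecting the server's certificate, which points at the RADIUS side rather than the AP.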

They then got a hold of systems and responded back to me saying, “systems have verified the cert expired and will install the new one.” The tech on vacation had meant to renew the certificate before he left but forgot.
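A forgotten renewal like this is the kind of thing a small scheduled check can catch. The sketch below is one hedged way to do it, assuming the certificate's "not after" date is available as a string in the format Python's `ssl` module uses (for example, copied from the certificate details). The function names and the 30-day warning window are illustrative choices, not anything from our environment.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Days until expiry, given notAfter such as 'Jan  5 09:34:43 2018 GMT'.

    The result is negative once the certificate has expired.
    """
    if now is None:
        now = time.time()
    # cert_time_to_seconds converts the notAfter string to epoch seconds.
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def expiry_status(not_after, now=None, warn_days=30.0):
    """Classify a certificate as 'expired', 'expiring soon', or 'ok'."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "expired"
    if remaining <= warn_days:
        return "expiring soon"
    return "ok"


# Example use with a made-up date:
#   print(expiry_status("Jan  5 09:34:43 2018 GMT"))
```

Run daily against the RADIUS server certificate, a check like this would raise a warning weeks before clients start rejecting the certificate, whether or not the RADIUS service itself is up.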

The next question came from our director (who is a super security-minded person), asking “But why were some users getting connection but some not?” I responded with “If they’ve been clicking ‘do not validate cert’, they would have been able to connect.” It was the type of thing the director, our team, and many other people would never have thought of because it is not the proper procedure. To the average user, though, it is the easiest option, because otherwise they would have to look up instructions somewhere to know what to enter in the certificate validation field.

This made me curious about how user-friendly the instructions we provide to users were. This is how I found that the instructions for Android devices on our official instruction page clearly said to select “Do not verify.”

There had been a miscommunication after a troubleshooting ticket between core and the service desk, and the solution to the certificate issue with Android 11 (which was to select do not verify) was mixed in with the general Android instructions. I put in a request to have the instructions on the site updated right away.

In Conclusion:

This all showed me that there really is no arguing with a packet capture and that I should have just led with that in the chat instead of trying to convince them with words. It also showed the whole networking team how important the wireless network has become on campus and how an issue at almost any stage in the network will present as an issue with the wireless to the users.
