September 22nd Resolution:
Since the maintenance on Thursday, September 18th, the sluggishness of the CalArts internet connection has not resurfaced, and we consider the issue resolved. Network Operations investigated each component of the network last week and concluded that the problem was on the outer edge of the network, since all internal traffic was behaving normally.
Early last week we contacted our ISP, who confirmed that the traffic they were seeing was not high enough to be problematic. NetOps then rebooted the primary router on Thursday and confirmed it was not the culprit; load was extremely low and no errors were found. Focus then shifted to the next step in the connection: the core switch.
Working with Cisco, we found an outdated best-practice CPU usage policy on the switch that was allowing more activity than it should have. The issue hadn't presented itself since we replaced the network equipment in May simply because there weren't enough devices on the network to trigger it. Our internet connection also hit our 250Mb/s cap a few times during that week, which exacerbated the problem and made troubleshooting even more difficult.
Each time a computer requests network activity, the request goes through the switch CPU, which processes it and sends it on to the router or to another internal switch, depending on where the computer is trying to go. The faulty CPU policy added processing time to every request, which made each subsequent request take longer, producing a very quick cascading effect.
Basically, we had a sudden, and massive, traffic jam. The internet connection went from a-ok to completely clogged in a matter of moments.
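For the technically curious, that cascade can be sketched with a toy queueing model (the numbers below are illustrative only, not measurements from our switch): as long as the CPU processes a request faster than the next one arrives, nothing waits; the moment per-request processing time exceeds the gap between arrivals, every request waits longer than the one before it.

```python
# Toy single-CPU queue: requests arrive at a fixed interval and are
# processed one at a time. Hypothetical numbers, for illustration only.

def backlog_after(n_requests, arrival_gap_ms, service_ms):
    """Return the queue wait (ms) experienced by the last of
    n_requests arriving arrival_gap_ms apart at a CPU that takes
    service_ms to process each one."""
    finish = 0.0  # time at which the CPU next becomes free
    wait = 0.0
    for i in range(n_requests):
        arrive = i * arrival_gap_ms
        start = max(arrive, finish)  # wait if the CPU is still busy
        wait = start - arrive
        finish = start + service_ms
    return wait

# Healthy: processing (0.5ms) is faster than arrivals (1ms) -> no wait.
print(backlog_after(1000, arrival_gap_ms=1.0, service_ms=0.5))  # 0.0
# Extra processing time per request (1.5ms) -> the 1000th request
# already waits ~half a second, and it only gets worse from there.
print(backlog_after(1000, arrival_gap_ms=1.0, service_ms=1.5))  # 499.5
```

The jam isn't gradual: the queue grows with every single request, which is why the connection went from fine to clogged in moments.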
During the maintenance on September 18th we noticed a couple of configuration issues with our wireless setup and how it connected to the rest of the network. Those were quickly fixed, and we're hoping the wireless will behave better on the back end as a result (if you notice any improvements, please let us know). That wireless fix reduced core CPU usage by 35%.
We haven't seen the speed problems return in the past four days, but we will continue to monitor the network for this behavior so we can respond more proactively if it comes back.
1:00PM: Wireless maintenance is now complete and systems have been brought back online! We will continue to monitor the network and review the results of the maintenance period.
12:29PM: Wireless has successfully been brought down for maintenance.
September 18th 12:25PM: We will immediately be bringing down the wireless campus-wide for emergency network maintenance. The estimated downtime is approximately 30 minutes.
11:15PM: We have scheduled network maintenance tomorrow at 7:00AM and will post any results to this article thereafter.
September 17th 2:03PM: So far we are seeing a fairly stable pattern of behavior. The sluggishness of internet connections begins at almost exactly 12:10PM and ends at 5:00PM, except on Monday, when it started shortly after 11AM. This points towards a device being brought on campus, possibly one infected with malware, or some other behavior tied to a specific individual. We are gathering our logs and working with vendor support to narrow down the root cause.
12:22PM: After examining yesterday's behavior, we saw a massive influx of requests to our DNS servers, over 17,000 requests per second. As of 12:10PM today the influx has returned, and we expect things to slow down again while we work to determine the cause.
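As a rough picture of what "spotting the influx" looks like, here is a hypothetical sketch (not our actual monitoring tooling): given per-query timestamps pulled from DNS server logs, bucket them into one-second counts and flag any second that crosses a rate threshold like the one we observed.

```python
from collections import Counter

def flag_dns_spikes(timestamps, threshold=17000):
    """Return the seconds (as whole-second integers) whose query
    count meets or exceeds `threshold` queries per second.
    `timestamps` are query arrival times in seconds (floats)."""
    per_second = Counter(int(t) for t in timestamps)
    return sorted(sec for sec, count in per_second.items() if count >= threshold)

# Synthetic example: 20,000 queries packed into second 100,
# plus a normal trickle elsewhere.
synthetic = [100 + i / 20000 for i in range(20000)] + [101.5, 102.7]
print(flag_dns_spikes(synthetic))  # [100]
```

Once the spike seconds are flagged, the interesting question is which clients generated them, which is what we're digging through the logs to answer.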
September 16th 8:44AM: The root cause of this slowdown (if it did indeed originate in our equipment) has yet to be determined. We saw very heavy traffic to our DNS servers, but the behavior has not recurred since yesterday. We won't consider this closed until the traffic pattern has been gone for a week (to account for a computer that may only be on campus once a week).
1:54PM: Performance of the internet connection has been inconsistent. At times the latency appears to ease and return to normal, only to climb again shortly thereafter. NetOps is continuing to investigate.
1:28PM: The issue may be related to DNS servers.
1:24PM: We're experiencing exceedingly high latency on the campus internet connection. Network Operations is currently investigating and we will update this post as developments occur.
Current detected latency across internet backbones: