Hi Guys,
Really hoping someone has some ideas about what is going on with this IPSec tunnel I am trying to get working.
First off, this is the topology:
![VPN Issue.png]()
The AWS part is largely incidental; it is only there to show that we are connected to AWS with an IPSec tunnel and are running BGP through it. We then redistribute the BGP routes into OSPF. As it is a small network, everything runs in area 0.
What we currently have is an OpenVPN tunnel between the 'Remote site' and the 'Colo' site. The Colo ER-8 router is also acting as an OpenVPN server for client VPN access. We have been seeing very high latency and CPU usage on the Colo ER-8, so to alleviate this, we are changing the tunnel to the remote site to be an IPSec tunnel, to make use of the hardware offloading for IPSec and lighten the load on the CPU.
The remote site is located in a shared office and sits behind a managed router for internet access. Its outside interface uses DHCP for the IP and default route, so we are using NAT-T to establish the VPN between the Colo and remote sites, with hardware IPSec offloading at both ends.
My plan for moving over to the IPSec tunnel is to establish it first and configure OSPF on it with a higher cost than on the OpenVPN tunnel. Then it is just a case of shutting down the OpenVPN vtun interface (or raising its cost) and allowing traffic to flow over the IPSec tunnel.
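To make the cutover logic concrete, here is a minimal sketch of the idea: the lowest-cost interface that is up carries the traffic, so shutting the OpenVPN interface hands it to the IPSec one. This is a deliberately simplified model (real OSPF compares total path cost, not just one interface), and the interface names `vtun0` and `vti0` and the costs 10 and 100 are made up for illustration; they are not taken from the attached configs.

```python
from typing import Optional


def preferred_interface(paths: dict[str, tuple[int, bool]]) -> Optional[str]:
    """Return the lowest-cost interface that is up, or None if all are down.

    `paths` maps an interface name to (ospf_cost, is_up).
    """
    usable = {name: cost for name, (cost, is_up) in paths.items() if is_up}
    return min(usable, key=usable.get) if usable else None


# Hypothetical names and costs, for illustration only.
paths = {"vtun0": (10, True), "vti0": (100, True)}
print(preferred_interface(paths))  # vtun0: OpenVPN wins while it is up

paths["vtun0"] = (10, False)       # shut down the OpenVPN interface
print(preferred_interface(paths))  # vti0: traffic moves to the IPSec tunnel
```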
Before doing this, I made sure the tunnel was up and both ends were talking OSPF over it. All looked good: the OSPF timers were behaving as they should, and I saw 0% packet loss after sending a few thousand large pings down the tunnel.
So the time came to move the traffic over to the IPSec tunnel, which I did by shutting down the OpenVPN vtun interface. This is where the problems began. As soon as traffic starts flowing over the IPSec tunnel, we see major packet loss (50%+). The OSPF dead timer frequently expires, breaking the OSPF neighborship. Thinking the issue might just be to do with OSPF, I removed it from the equation and statically routed traffic over the IPSec tunnel, but this did not help; we still see huge packet loss.
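For anyone wondering why the neighborship drops rather than just degrading, here is a rough back-of-the-envelope sketch. It assumes a 10 second hello interval and a 40 second dead interval (common OSPF defaults; I have not stated the actual timers above) and that each hello is lost independently, which real loss on a struggling tunnel probably is not.

```python
def p_all_hellos_lost(loss_rate: float,
                      hello_interval: int = 10,
                      dead_interval: int = 40) -> float:
    """Chance that every hello inside one dead interval is lost.

    Assumes independent loss per packet. The timer values are assumed
    defaults, not values read from the routers.
    """
    hellos_per_dead_interval = dead_interval // hello_interval
    return loss_rate ** hellos_per_dead_interval


for loss in (0.01, 0.5, 0.7):
    print(f"loss={loss:.0%}  P(no hello gets through)={p_all_hellos_lost(loss):.4f}")
```

Under those assumptions, 50% loss means any given run of four hellos has about a 6% chance (0.5 to the power 4 = 0.0625) of being lost completely, so with a hello every 10 seconds the dead timer would be expected to expire many times a day.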
I have tried debugging the VPN, but I just can't seem to find any detailed enough info. I have tried the following:
show vpn debug
sudo swanctl --log
show vpn log
...and there is no info of real use in any of those. The main thing I have got from these commands is that the remote site is frequently re-initiating the connection for no apparent reason:
xxx.xxx.xxx.xxx is initiating a Main Mode IKE_SA
The IKE lifetime is 28800 seconds and the ESP lifetime is 3600 seconds, yet the above log entry is appearing far more often than once an hour.
On the remote site (on the Unifi GW4) I am seeing very frequent re-key and delete events in the log:
Dec 19 08:18:00 04[IKE] <peer-1.1.1.1-tunnel-vti|6> CHILD_SA peer-1.1.1.1-tunnel-vti{1} established with SPIs c0908a3e_i c1f57195_o and TS 0.0.0.0/0 === 0.0.0.0/0
Dec 19 08:19:43 05[KNL] creating rekey job for ESP CHILD_SA with SPI c5ae299b and reqid {1}
Dec 19 08:22:51 14[KNL] creating rekey job for ESP CHILD_SA with SPI c1f79cb6 and reqid {1}
Dec 19 08:35:33 07[KNL] creating delete job for ESP CHILD_SA with SPI c5ae299b and reqid {1}
Dec 19 08:35:33 07[IKE] <peer-1.1.1.1-tunnel-vti|6> closing expired CHILD_SA peer-1.1.1.1-tunnel-vti{1} with SPIs c5ae299b_i c1f79cb6_o and TS 0.0.0.0/0 === 0.0.0.0/0
Dec 19 08:35:33 02[KNL] creating delete job for ESP CHILD_SA with SPI c1f79cb6 and reqid {1}
Dec 19 08:45:01 12[IKE] <peer-1.1.1.1-tunnel-vti|6> reauthenticating IKE_SA peer-1.1.1.1-tunnel-vti[6]
Dec 19 08:45:01 12[IKE] <peer-1.1.1.1-tunnel-vti|6> initiating Main Mode IKE_SA peer-1.1.1.1-tunnel-vti[7] to 1.1.1.1
Dec 19 08:45:02 05[IKE] <peer-1.1.1.1-tunnel-vti|7> IKE_SA peer-1.1.1.1-tunnel-vti[7] established between 10.80.125.63[10.80.125.63]...1.1.1.1[1.1.1.1]
Dec 19 09:01:11 14[KNL] creating rekey job for ESP CHILD_SA with SPI c0908a3e and reqid {1}
Dec 19 09:01:11 07[IKE] <peer-1.1.1.1-tunnel-vti|7> CHILD_SA peer-1.1.1.1-tunnel-vti{1} established with SPIs ccc201f1_i c2b5449c_o and TS 0.0.0.0/0 === 0.0.0.0/0
Dec 19 09:08:56 14[KNL] creating rekey job for ESP CHILD_SA with SPI c1f57195 and reqid {1}
(1.1.1.1 is the public IP of the Colo router)
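To put numbers on the re-key timing, here is a small sketch that pulls the CHILD_SA events out of the log lines above and works out how old each SA was when its rekey job was created. The regexes are my own guess at the line format based only on this excerpt, and `ASSUMED_YEAR` is a placeholder because the log lines carry no year.

```python
import re
from datetime import datetime

ASSUMED_YEAR = 2000  # assumption: the log has no year, any fixed year works for deltas

SAMPLE_LOG = """\
Dec 19 08:18:00 04[IKE] <peer-1.1.1.1-tunnel-vti|6> CHILD_SA peer-1.1.1.1-tunnel-vti{1} established with SPIs c0908a3e_i c1f57195_o and TS 0.0.0.0/0 === 0.0.0.0/0
Dec 19 08:19:43 05[KNL] creating rekey job for ESP CHILD_SA with SPI c5ae299b and reqid {1}
Dec 19 08:22:51 14[KNL] creating rekey job for ESP CHILD_SA with SPI c1f79cb6 and reqid {1}
Dec 19 08:35:33 07[KNL] creating delete job for ESP CHILD_SA with SPI c5ae299b and reqid {1}
Dec 19 08:35:33 02[KNL] creating delete job for ESP CHILD_SA with SPI c1f79cb6 and reqid {1}
Dec 19 09:01:11 14[KNL] creating rekey job for ESP CHILD_SA with SPI c0908a3e and reqid {1}
Dec 19 09:01:11 07[IKE] <peer-1.1.1.1-tunnel-vti|7> CHILD_SA peer-1.1.1.1-tunnel-vti{1} established with SPIs ccc201f1_i c2b5449c_o and TS 0.0.0.0/0 === 0.0.0.0/0
Dec 19 09:08:56 14[KNL] creating rekey job for ESP CHILD_SA with SPI c1f57195 and reqid {1}
"""

TS = r"(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d)"
ESTABLISHED = re.compile(
    TS + r".*CHILD_SA \S+ established with SPIs "
    r"(?P<spi_in>[0-9a-f]+)_i (?P<spi_out>[0-9a-f]+)_o"
)
JOB = re.compile(
    TS + r".*creating (?P<kind>rekey|delete) job for ESP CHILD_SA "
    r"with SPI (?P<spi>[0-9a-f]+)"
)


def parse_ts(text):
    """Parse a syslog-style timestamp, prefixing the assumed year."""
    return datetime.strptime(f"{ASSUMED_YEAR} {text}", "%Y %b %d %H:%M:%S")


def sa_events(log):
    """Map each SPI to the events seen for it: established / rekey / delete."""
    events = {}
    for line in log.splitlines():
        m = ESTABLISHED.search(line)
        if m:
            when = parse_ts(m["ts"])
            for spi in (m["spi_in"], m["spi_out"]):
                events.setdefault(spi, {})["established"] = when
            continue
        m = JOB.search(line)
        if m:
            events.setdefault(m["spi"], {})[m["kind"]] = parse_ts(m["ts"])
    return events


def seconds_between(events, spi, start, end):
    """Seconds from event `start` to event `end` for one SPI, or None if either is missing."""
    seen = events.get(spi, {})
    if start in seen and end in seen:
        return int((seen[end] - seen[start]).total_seconds())
    return None


events = sa_events(SAMPLE_LOG)
for spi, seen in sorted(events.items()):
    age = seconds_between(events, spi, "established", "rekey")
    print(spi, sorted(seen), "age at rekey job (s):", age)
```

On this excerpt, the two SPIs established at 08:18:00 get their rekey jobs 2591 and 3056 seconds later, both under the 3600 second ESP lifetime. That might simply be the rekey being scheduled some margin ahead of expiry rather than anything premature, but that is my interpretation and not something the log states. Running the same thing over a longer stretch of log would show whether the intervals stay in that range.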
I am thinking this is a bug or an issue with IPSec offloading, as I can't think of any reason why the tunnel would work fine before traffic is routed over it and then fall to pieces as soon as it is commissioned. I'd really like some input from UBNT on this, as I feel the solution is out of my hands.
Help me Ubnt, you're my only hope!
Configs are attached