Sunday, January 5, 2020

Introduce End Devices to a Network and Introduce Problems

As I've touted for years now, I have a network of XBees that I (basically) run my house with. I monitor room temperatures in key areas, control my pool, monitor my power, etc., all without a bunch of wires strung around the house. For the first year I ran the network in transparent (AT) mode, which broadcast to all devices, and all of them listened to what was going on. That didn't work well.

I described what the problem turned out to be back then <link> and moved to a more directed network using API mode to control the traffic level and increase throughput. That all worked really well. Then I created the room temperature monitors. They were built as battery operated devices since I wanted to put them in places where there was no power; XBee routers use too much power for such an application, so they had to be XBee End Devices.

That was the beginning of a long-running problem whose cause I simply couldn't find, and it drove me nuts a few times before I finally got the entire mess to work again.

The symptom was that a device would leave or get kicked off the network and then simply refuse to join back in. I'd go for a few days and a sensor would leave and no amount of resets, power cycles, slaps or flights across the room would get it back on the network. Hell, I even programmed another XBee and put it in the same place and it wouldn't work. Then, seemingly at random, it would connect and start working all by itself.

Months of watching the XBee traffic after adding a ton of logging to almost every device in the house led me to nothing at all. Reading every blog and question remotely related on the web told me nothing. I was completely baffled by this problem.

One of the things I tried was to automatically reset the network, forcing it to reform, using the NR=1 command. This dumps all the routing tables and everything rejoins. This would work, but if it happened too often, the entire network would go down and I had to intervene at each device to get it back up.
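For anyone who wants to try the same reset from a script, here is a minimal sketch of building that NR=1 request as an API frame. It assumes API mode without escaping (AP=1) and the local AT command frame type 0x08; the function name is made up for illustration, and actually writing the bytes to the serial port is left out.

```python
def at_command_frame(command, parameter=b"", frame_id=1):
    """Build a local AT command API frame (frame type 0x08, no escaping)."""
    if len(command) != 2:
        raise ValueError("AT commands are two characters, e.g. 'NR'")
    # Frame data: type, frame id, the two command characters, optional value
    body = bytes([0x08, frame_id]) + command.encode("ascii") + parameter
    # Checksum: 0xFF minus the low byte of the sum of the frame data
    checksum = 0xFF - (sum(body) & 0xFF)
    # Start delimiter, 16-bit big-endian length, frame data, checksum
    return b"\x7e" + len(body).to_bytes(2, "big") + body + bytes([checksum])


if __name__ == "__main__":
    nr_frame = at_command_frame("NR", b"\x01")
    print(nr_frame.hex(" "))  # prints: 7e 00 05 08 01 4e 52 01 55
```

The same function covers any other two-letter AT command by changing the command and parameter bytes.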

I also hooked up a tablet, using an OTG cable, to the device that most often failed and monitored the activity for hours hoping to get a clue about what was going on. This was cool because it allowed me to learn how to watch a device using something that wasn't a laptop running the entire Arduino IDE. I could plug into an active device and watch what was being logged without resetting the device. This is a nice thing to have available, but it didn't help find the problem.

I had the device reset itself, issue its own NR=0 command to clear the local tables, reset the XBee, just about anything I could think of, and nothing helped. I could have ignored it if I was only reporting temperatures, but two of the sensors were serving as the temperature sensors for my air conditioning system.

It really sucks when the cooling stops at 110F outside and the house heats up. It sucks about equally when the heater sticks on and the temperature goes up to 90+ inside on a cold day. Power bill didn't like that much either. I didn't want to break down and go back to the old method of measuring temperature; the sensors made the house much more comfortable.

I finally got a hint from a question asked about the XBee end devices not being able to rejoin a network. Seems the XBee routers have a table of 12 entries reserved for end devices that they can parent. That's cool, but I don't have 12 end devices. I still read the device tables on the XBees looking for what the heck was going on though. Then I found it.

I had relatives visit during Christmas and they brought their cell phones. The folks (my kids and grandkids) are ALWAYS on their phones. Talking, playing games, texting, whatever; their eyes and hands are literally glued to the phone. The increased RF and Wi-Fi traffic saturated my house and the network struggled to get packets through the interference that low power RF has to live with. Devices disconnected and couldn't rejoin, packets got lost in the ocean of packets from all the devices, it was a total mess. I dug in again to see if I could get a clue.

I actually found the problem. What was happening to me is that the XBee end devices have to check in with their parent periodically to maintain their connection. If an end device waits too long, the router purges its entry to conserve the table space it has for end devices. The time allowed is set by parameters on the XBee router, and the end device needs to check in often enough not to get purged. I was using a 2-minute timeout on the temperature sensor and the default on the router.
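To put some numbers on that: as far as I can tell from Digi's ZigBee documentation, a router keeps a silent child for 3 x SP x SN, with SP counted in 10 ms units and a floor of 5 seconds. Treat the formula, the floor, and the default values of SP=0x20 and SN=1 as my assumptions and check them against the docs for your firmware. A minimal sketch, with function names made up for illustration:

```python
SP_UNIT_MS = 10          # assumption: SP is expressed in units of 10 ms
MIN_TIMEOUT_MS = 5000    # assumption: the router never uses less than 5 s


def child_poll_timeout_s(sp, sn):
    """Seconds a router keeps a silent end device in its child table."""
    timeout_ms = 3 * sp * sn * SP_UNIT_MS
    return max(timeout_ms, MIN_TIMEOUT_MS) / 1000


def survives_sleep(sp, sn, sleep_s):
    """True if the child entry outlives one sleep period of sleep_s seconds."""
    return child_poll_timeout_s(sp, sn) > sleep_s


if __name__ == "__main__":
    # Assumed router defaults against a 2 minute sleep
    print(child_poll_timeout_s(0x20, 1), survives_sleep(0x20, 1, 120))
    # SP at an assumed maximum of 0xAF0 with SN chosen to cover a full day
    print(child_poll_timeout_s(0xAF0, 1029), survives_sleep(0xAF0, 1029, 120))
```

Under those assumptions the default router settings give up on a child after 5 seconds, nowhere near a 2 minute sleep, while SP=0xAF0 with SN=1029 works out to a little over 24 hours.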

To make things worse, I was using hardware control of the sleep period, and not correctly handling the interaction of the Arduino and the XBee conversation.

A couple of corrections made things much better. I now send the temperature message, ask the XBee to go to sleep, then WAIT until it has actually gone to sleep before sleeping the Arduino. I also let the end device exhaust the stored messages that its XBee router parent was holding, by just waiting until they all came in. The final item was to extend the XBee router timeout to way higher than necessary to cover a couple of missed transactions (like a full day); that took care of the problem of it not being able to rejoin.
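The ordering is the whole trick, so here is a rough model of it in Python rather than the real Arduino code. Everything in it is hypothetical: FakeRadio stands in for the XBee, its methods are invented for the illustration, and draining the parent's queue before asking for sleep is my choice of order for the sketch.

```python
class FakeRadio:
    """Stand-in for an XBee end device; every name here is hypothetical."""

    def __init__(self, queued_at_parent, checks_until_asleep=3):
        self.inbox = list(queued_at_parent)  # messages the parent is holding
        self.sent = []
        self.sleep_requested = False
        self._checks_left = checks_until_asleep

    def send(self, payload):
        self.sent.append(payload)

    def receive(self):
        # Returns the next held message, or None when the queue is empty
        return self.inbox.pop(0) if self.inbox else None

    def request_sleep(self):
        self.sleep_requested = True

    def is_asleep(self):
        # The radio does not drop off instantly; model that as a few checks
        if not self.sleep_requested:
            return False
        if self._checks_left > 0:
            self._checks_left -= 1
            return False
        return True


def report_and_sleep(radio, reading, max_checks=100):
    """Send, drain, request sleep, then confirm the radio really is asleep."""
    radio.send(reading)
    drained = []
    msg = radio.receive()
    while msg is not None:          # let the parent hand over what it holds
        drained.append(msg)
        msg = radio.receive()
    radio.request_sleep()
    for _ in range(max_checks):     # only now is it safe to sleep the host
        if radio.is_asleep():
            return drained, True
    return drained, False           # radio never slept; host should stay up


if __name__ == "__main__":
    radio = FakeRadio(["cmd-1", "cmd-2"])
    print(report_and_sleep(radio, "72.5F"))
```

The False return is the case that matters: if the radio has not confirmed sleep, the host stays awake instead of cutting it off mid-conversation.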

I was actually preventing it from rejoining by sleeping the device too quickly; it just couldn't get back in before I told it to shut down.

Why don't other people have this problem? I think they do sometimes, but didn't spend the time it took me to chase it down. I spent months watching and trying things before I stumbled on it mostly by accident looking at the tables because of some other problem someone else had with their network.

My network is humming along with only an occasional missed message. The extended awake time hasn't seemed to be a problem with the battery life either. The XBee trying to rejoin was a heavier load on the battery than the extra time the receiver is on. Transmit takes more power than receive, and I only transmit one message every two minutes, so the tiny overhead of the acknowledge packets isn't noticeable.

It's been seven full days of bliss because all the sensors and control systems are working perfectly. The network even has more capacity available for even more XBees. This is really tempting because my indoor freezer has a failing thermostat. Stupid thermostats on freezers are expensive and I already monitor the temperature inside it. It may be time to take complete control of the freezer. I wouldn't even consider that with the devices acting the way they were.

I know, in the scheme of things this short a period of time doesn't actually prove the problem is gone. But, the instant clearing of problems that had culminated with the increased number of cell phones pretty much convinces me I have it taken care of.

Maybe I can start thinking about something else now.