I am writing an application that uses OpenNETCF.IO.Serial (open source, see serial code here) for its serial communication on a Windows CE 6.0 device. This application is coded in C# in the Compact Framework 2.0. I do not believe the issue I am about to describe is specifically related to these details, but I may be proven to be wrong in that regard.

The issue I am having is that, seemingly randomly (read as: intermittent issue I cannot reliably duplicate yet), data will fail to transmit or be received until the device itself is rebooted. The Windows CE device communicates with a system that runs an entirely different application. Rebooting this other system and disconnecting/reconnecting communication cables does not appear to resolve this issue, only rebooting the Windows CE device.

The only sign of this issue occurring is a lack of a TxDone event from OpenNETCF firing (look for "TxDone();" in OpenNETCF.IO.Serial class), and no data being received, when I know for a fact that the connected system is sending data.

Any character value from 1 - 255 (0x01 - 0xFF) can be sent and received in our serial communication. Null values are discarded.

My serial settings are 38400 Baud, 8 data bits, no parity, 1 stop bit (38400, 8n1). I've set the input and output buffer sizes to 256 bytes. DataReceived event happens whenever we receive 1 or more characters, and transmission occurs when there's 1 or more bytes in the output buffer, since messages are of variable length.
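For reference, here is a rough sketch of the timing these settings imply. It assumes the usual one start bit per asynchronous frame, and the `SerialTiming` class and its method names are made up for illustration. At 38400 8n1 each byte occupies 10 bit times, so the line carries at most 3840 bytes per second and a 256-byte buffer takes roughly 67 ms to fill at full line rate.

```csharp
using System;

// Hypothetical helper for back-of-the-envelope serial timing.
// Assumes one start bit per frame, as on a standard asynchronous UART link.
public static class SerialTiming
{
    // Bits on the wire per byte: start + data + parity (0 or 1) + stop.
    public static int BitsPerFrame(int dataBits, int parityBits, int stopBits)
    {
        return 1 + dataBits + parityBits + stopBits;
    }

    // Maximum payload bytes per second at the given baud rate.
    public static double BytesPerSecond(int baud, int bitsPerFrame)
    {
        return (double)baud / bitsPerFrame;
    }

    // Time in milliseconds to move byteCount bytes at full line rate.
    public static double MillisecondsFor(int byteCount, int baud, int bitsPerFrame)
    {
        return byteCount * 1000.0 / BytesPerSecond(baud, bitsPerFrame);
    }
}
```

With these numbers, `SerialTiming.MillisecondsFor(256, 38400, 10)` comes out at about 66.7 ms, and a 10-byte message occupies the line for about 2.6 ms.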

No handshaking is used. Since this is RS422, only 4 signals are being used: RX+, RX-, TX+, TX-.

When I receive a "DataReceived" event, I read all data from the input buffer and copy it into my own buffer so I can parse it at my leisure outside of the DataReceived event. When I receive a command message, I send a quick acknowledgment message back. When the other system receives a command message from the Windows CE device, it sends a quick acknowledgment message back. Acknowledgment messages get no further replies, since they're intended as a simple "Yep, got it." In my code, I receive and transmit on multiple threads, so I use the lock keyword to avoid transmitting multiple messages simultaneously from different threads. Double-checking the code has shown that I am not getting hung up on any locks.

At this point, I am wondering if I am continuously missing something obvious about how serial communication works, such as if I need to set some variable or property, rather than just reading from an input buffer when not empty and writing to a transmit buffer.

Any insight, options to check, suggestions, ideas, and so on are welcome. This is something I've been wrestling with on my own for months, I hope that answers or comments I receive here can help in figuring out this issue. Thank you in advance.

Edit, 2/24/2011:
(1) I can only seem to recreate the error on boot-up of the system that the Windows CE device is communicating with, and not on every boot-up. I also looked at the signals: the common-mode voltage fluctuates, but the amplitude of the noise at system boot-up seems unrelated to whether the issue occurs. I've seen 25V peak-to-peak cause no issue, while the issue recurred at 5V peak-to-peak.
The issue keeps sounding more and more hardware-related, but I'm trying to figure out what can cause the symptoms I'm seeing, as none of the hardware actually appears to fail or shut down, at least where I've been able to reach to measure signals. My apologies, but I will not be able to give part numbers for any hardware, so please don't ask about the components being used.

(2) As per @ctacke's suggestion, I ensured all transmits were going through the same location for maintainability, the thread safety I put in is essentially as follows:

lock(transmitLockObj)
{
    try
    {
        comPort.Output = data;
    }
    catch (Exception exc)
    {
        //[various catches and error handling for each]
    }
}

(3) I am getting UART OVERRUN errors in a test where <10 bytes were being sent and received on about a 300msec interval at 38400 Baud. Once it gets an error, it goes to the next loop iteration, and does NOT run ReadFile, and does NOT run the TxDone event (or any other line-checking procedures). Also, closing and reopening the port does nothing to resolve this, and neither does restarting the software while the device is still running. Only a hardware reboot does.

My DataReceived event is as follows:

try
{
    byte[] input = comPort.Input; //set so Input gets FULL RX buffer

    lock(bufferLockObj)
    {
        for (int i = 0; i < input.Length; i++)
        {
            _rxRawBuffer.Enqueue(input[i]);
            //timer regularly checks this buffer and parses data elsewhere
            //there, it is "lock(bufferLockObj){dataByte = _rxRawBuffer.Dequeue();}"
            //so wait is kept short in DataReceived, while remaining safe
        }
    }
}
catch (Exception exc)
{
    //[exception logging and handling]
    //hasn't gotten here, so no point in showing
}

However, the UART OVERRUN errors started immediately after the WriteFile call timed out for the first time in the test. I honestly can't see my code causing a UART OVERRUN condition.

Thoughts? Hardware or software related, I'm checking everything I can think to check.

tshepang
Peter Lacerenza
  • I've had similar problems in the past on the full .NET framework (Windows XP) -- I ended up adding a hack that would close and reopen the serial port if data stopped; not ideal for a long-term solution though. I'm not sure if closing/reopening will bring it back online for you though. – Justin Feb 10 '11 at 15:10
  • Your buffers are small, the baudrate is high. Cr*p happens. – Hans Passant Feb 10 '11 at 15:23
  • @Hans, messages are of variable length, but largest possible message I've calculated out to be ~50 bytes, and that's a rare occurrence. When DataReceived event is properly triggered every time the serial lib checks if there's 1 or more bytes in the buffer, 256 has proven to be a decent size in theory. I've also attempted resizing buffers in testing to be sure practical experience meets theory, buffer sizes didn't resolve anything. Sadly can't do anything about the baudrate in testing. – Peter Lacerenza Feb 10 '11 at 19:56
  • @Justin, tested closing/reopening (with arbitrary, excessive 2 second wait time between closing and reopening) early on in my troubleshooting. Communication would recover once in a blue moon, if that. It did not recover anywhere near reliably, and I don't know if it was simply luck of the draw that it had recovered in the circumstances that it did through close/reopen. Thank you for the tip, though, I agree that it is a solution that should be attempted when someone has an issue like this. – Peter Lacerenza Feb 10 '11 at 20:40
  • Can you reproduce the problem by p/invoking [the WinAPI serial port functions](http://msdn.microsoft.com/en-us/library/aa910699.aspx)? – Ben Voigt Feb 24 '11 at 17:24
  • @Ben, The OpenNETCF serial library works through p/invoking the WinAPI serial port functions, i.e. ReadFile, WriteFile, ClearCommError, WaitCommEvent, etc. So I suppose technically speaking, the answer is yes? I'm already working through p/invoking WinAPI serial port functions, it just so happens that it's being done by a serial library written by another that is open source (again, the download link to the original source code is linked at the top of my initial question). – Peter Lacerenza Feb 24 '11 at 20:35

2 Answers

Everything sounds right, but your observations kind of show that it's not.

Since you've stated that you're sending from multiple threads, the first thing I'd do is put in some sort of mechanism for sending where all send requests come into one location before calling out to the serial object instance. Sure, you say that you've ensured you have thread safety, but serializing these calls through one location would help reinforce that (and make the code a bit more maintainable/extensible).
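As a minimal sketch of that idea (not something from OpenNETCF; the `TransmitQueue` class and `SendHandler` delegate are hypothetical names), every thread could hand its message to a queue, with a single worker thread being the only code that ever touches the port:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical delegate that performs the actual write to the port.
public delegate void SendHandler(byte[] data);

// Hypothetical helper: funnels all sends through one worker thread,
// so only one place in the code ever writes to the serial port.
public sealed class TransmitQueue
{
    private readonly Queue<byte[]> _pending = new Queue<byte[]>();
    private readonly object _sync = new object();
    private readonly AutoResetEvent _wake = new AutoResetEvent(false);
    private readonly SendHandler _send;
    private readonly Thread _worker;
    private bool _stopping;

    public TransmitQueue(SendHandler send)
    {
        if (send == null) throw new ArgumentNullException("send");
        _send = send;
        _worker = new Thread(Run);
        _worker.Start();
    }

    // Safe to call from any thread; returns without waiting for the send.
    public void Enqueue(byte[] message)
    {
        lock (_sync) { _pending.Enqueue(message); }
        _wake.Set();
    }

    // Sends whatever is still queued, then stops the worker thread.
    public void Stop()
    {
        lock (_sync) { _stopping = true; }
        _wake.Set();
        _worker.Join();
    }

    private void Run()
    {
        while (true)
        {
            byte[] next = null;
            bool stopping;
            lock (_sync)
            {
                if (_pending.Count > 0) next = _pending.Dequeue();
                stopping = _stopping;
            }

            if (next != null)
            {
                // Messages go out one at a time, in the order they were queued.
                try { _send(next); }
                catch (Exception) { /* placeholder: real code would log this */ }
                continue;
            }

            if (stopping) return;
            _wake.WaitOne(); // sleep until Enqueue or Stop signals
        }
    }
}
```

It would be wired up with something like `new TransmitQueue(delegate(byte[] data) { comPort.Output = data; })`, after which callers use `Enqueue` instead of touching `comPort` directly. A side benefit is that the queue gives one natural place to log every outgoing message with a timestamp.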

Next I'd probably add some temp handling in the Serial lib to specifically set an event or break in the debugger when you've done a Tx but the TxDone event doesn't fire within some bounding period. It's always possible that the Serial lib has a bug in it (trust me, the author of that code is far from infallible) where some race condition is getting by.
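A rough sketch of such a temporary check follows. `TxWatchdog` is a hypothetical helper, not part of the Serial lib, and how you subscribe to TxDone depends on the version of the library you have, so that wiring is left out here:

```csharp
using System;
using System.Threading;

// Hypothetical diagnostic helper: reports whether a TxDone notification
// arrived within a bounding period after a transmit was started.
public sealed class TxWatchdog
{
    private readonly AutoResetEvent _txDone = new AutoResetEvent(false);

    // Call immediately before handing data to the port.
    public void Arm()
    {
        _txDone.Reset(); // discard any stale completion signal
    }

    // Call from the TxDone event handler.
    public void SignalDone()
    {
        _txDone.Set();
    }

    // Returns true if TxDone arrived within timeoutMs, false if it is overdue.
    public bool WaitForDone(int timeoutMs)
    {
        return _txDone.WaitOne(timeoutMs, false);
    }
}
```

The send path would call `Arm()`, assign `comPort.Output`, and then check `WaitForDone(...)` with some generous timeout (the value is a guess you would tune). When it returns false, log the state or call `System.Diagnostics.Debugger.Break()`. Since the wait blocks the sending thread, this is only meant as temporary instrumentation.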

ctacke
  • Followed your transmit from same location in code suggestion, no change in performance, but I do agree on the maintainability point. In serial lib, added an event for (1) right before ReadFile call, (2) right after the ReadFile call that it finished doing so, on hunch of ReadFile timeout, and (3) in an else of the if(...){DataReceived event}, just in case. I ended up logging UART OVERRUN errors, where it never runs ReadFile again. Ever. DataReceived event only puts bytes in a queue and leaves, and we're talking about <10 bytes every 300ms at 38400 Baud in this test. – Peter Lacerenza Feb 24 '11 at 16:22

Thank you everyone who responded. We've found that this actually appears to be hardware-related. I'm afraid I can't give more information than this, but I thank everyone who contributed possible solutions or troubleshooting steps.

Peter Lacerenza