We ran into an issue this week where we had to figure out how to transmit Unicode data between a C++ program and a Java program over TCP. We already had a working program that sends ASCII just fine. We expected the C++ Unicode work to be difficult and the Java work to be easy. It turns out the C++ side was not as difficult as we thought. The Java side was easy in the end, but it took some time to figure out how to make it work. Here’s a quick explanation of what we found:
C++
Most of our C++ code uses the standard library (e.g. std::string) and boost::asio for the networking. After a bit of research, I created a tstring.hpp file using code I found here and here. I changed the networking code to use the new std::tstring instead of std::string and was able to deal with Unicode data. Here’s an example of the asio code that reads data sent from Java. Java sends a header of the form “file0000000000”, where the 10 digits contain the length of the data in bytes, followed by the data itself. Notice that the buffer size is passed as size * sizeof(TCHAR): asio buffer sizes are given in bytes, and in a Unicode build a TCHAR is a wide char, twice the size of a normal char.
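    #include <cstddef>
    #include <vector>
    #include <boost/asio.hpp>
    #include <boost/bind.hpp>
    #include <boost/enable_shared_from_this.hpp>
    #include <boost/lexical_cast.hpp>
    #include "tstring.hpp"  // defines std::tstring in terms of TCHAR

    using boost::asio::ip::tcp;

    // Excerpt: the constructor that initializes socket_ is not shown.
    class session : public boost::enable_shared_from_this<session>
    {
    public:
        void read_header()
        {
            buffer_.resize(14);  // "file" + 10-digit length
            boost::asio::async_read(
                socket_,
                boost::asio::buffer(buffer_, buffer_.size() * sizeof(TCHAR)),
                boost::bind(
                    &session::handle_read_header,
                    shared_from_this(),
                    boost::asio::placeholders::error,
                    boost::asio::placeholders::bytes_transferred));
        }

        void handle_read_header(const boost::system::error_code& error,
                                size_t bytes_transferred)
        {
            if (error) return;

            std::tstring data(buffer_.begin(), buffer_.end());

            // The 10 digits are a byte count and do not fit in a short.
            // Convert bytes to a TCHAR count before resizing; otherwise the
            // read asks for twice as many bytes as the sender transmits.
            std::size_t length = boost::lexical_cast<std::size_t>(data.substr(4));
            buffer_.resize(length / sizeof(TCHAR));

            boost::asio::async_read(
                socket_,
                boost::asio::buffer(buffer_, buffer_.size() * sizeof(TCHAR)),
                boost::bind(
                    &session::handle_read_data,
                    shared_from_this(),
                    boost::asio::placeholders::error,
                    boost::asio::placeholders::bytes_transferred));
        }

        void handle_read_data(const boost::system::error_code& error,
                              size_t bytes_transferred)
        {
            // EOF is tolerated: the sender closes the connection after writing.
            if (error && error != boost::asio::error::eof) return;

            std::tstring data(buffer_.begin(), buffer_.end());
            // process data

            read_header();
        }

    private:
        tcp::socket socket_;
        std::vector<TCHAR> buffer_;
        int count_;
    };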
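The framing described above can also be checked without a socket. Here is a minimal sketch in Java (chosen so it can double as a test receiver for the sender). FrameDecoder and its method names are hypothetical and not part of our code, and the sketch assumes the 10 digits are a byte count and that the whole frame, header included, is UTF-16LE.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original programs.
final class FrameDecoder {
    static final int HEADER_CHARS = 14;               // "file" + 10 digits
    static final int HEADER_BYTES = HEADER_CHARS * 2; // 2 bytes per UTF-16 code unit

    /** Returns the payload length in bytes announced by the header. */
    static int payloadBytes(byte[] frame) {
        if (frame.length < HEADER_BYTES) {
            throw new IllegalArgumentException("frame shorter than header");
        }
        String header = new String(frame, 0, HEADER_BYTES, StandardCharsets.UTF_16LE);
        if (!header.startsWith("file")) {
            throw new IllegalArgumentException("unexpected header: " + header);
        }
        return Integer.parseInt(header.substring(4));
    }

    /** Decodes the payload that follows the header. */
    static String payload(byte[] frame) {
        int length = payloadBytes(frame);
        if (frame.length < HEADER_BYTES + length) {
            throw new IllegalArgumentException("truncated frame");
        }
        return new String(frame, HEADER_BYTES, length, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // Two characters, so the header announces 4 bytes.
        byte[] frame = "file0000000004h\u00e9".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(payloadBytes(frame) + " payload bytes");
        System.out.println(payload(frame));
    }
}
```

The header is 28 bytes on the wire, not 14, because it is encoded as wide characters like the payload. That is the same reason the C++ code multiplies by sizeof(TCHAR).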
Java
As an example, we created a simple Java program that loads a Unicode file, prepends a header and sends it to the C++ code above. What took a great deal of time was figuring out how to get the data into the correct format. It turns out the C++ program was expecting the data in UTF-16LE: two bytes per UTF-16 code unit, little-endian, with no byte-order mark. We took our data string and called getBytes("UTF-16LE") to get the bytes in the correct format to send.
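    import java.io.BufferedOutputStream;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.util.Locale;

    String data = getFileContents();

    // Payload length in bytes: UTF-16LE uses 2 bytes per Java char.
    // Locale.ROOT keeps the digits ASCII regardless of the default locale.
    String header = String.format(Locale.ROOT, "file%010d", data.length() * 2);

    // Header and payload are encoded together; the C++ side reads both as wide chars.
    byte[] buffer = (header + data).getBytes(StandardCharsets.UTF_16LE);

    // try-with-resources closes the stream and the socket even if the write fails.
    try (Socket socket = new Socket(address, port);
         OutputStream out = new BufferedOutputStream(socket.getOutputStream())) {
        out.write(buffer);
        out.flush();
    }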
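The detail that cost us the most time is easiest to see by comparing byte counts. The sketch below is only an illustration: FrameEncoder and encode are hypothetical names, and the frame layout is the one assumed throughout this post. It shows that Java's plain "UTF-16" charset writes a big-endian byte-order mark, which the C++ side is not expecting, while "UTF-16LE" writes exactly two bytes per char.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the original programs.
final class FrameEncoder {
    /** Builds "file" + 10-digit byte count + payload, all encoded as UTF-16LE. */
    static byte[] encode(String data) {
        String header = String.format(Locale.ROOT, "file%010d", data.length() * 2);
        return (header + data).getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] le = "A".getBytes(StandardCharsets.UTF_16LE);     // 41 00
        byte[] withBom = "A".getBytes(StandardCharsets.UTF_16);  // FE FF 00 41

        System.out.println(le.length);            // prints 2
        System.out.println(withBom.length);       // prints 4
        System.out.println(encode("hi").length);  // prints 32: (14 + 2) chars * 2 bytes
    }
}
```

With the BOM variant, every length computed as data.length() * 2 would be off by two bytes and the byte order would be reversed, so the receiver would see garbage even though both sides are nominally "UTF-16".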
In the end, this turns out to be easy to do. It just takes a bit of time to figure out the settings. We had to spend time searching for bits and pieces of info to put this together. If you happen to know any good resources to help others with this, feel free to add a comment.