|  | Force 'char' to be implemented as 'unsigned char'? |  | |
| | | Matthias Kluwe |  |
| Posted: Sun Aug 17, 2008 3:28 am Post subject: Force 'char' to be implemented as 'unsigned char'? |  |
| |  | |
Hi!
I came across a coding standard manual (http://www.codingstandard.com/HICPPCM/index.html) and decided to make up my mind about the suggestions in there.
I quickly found that I don't understand the justifications for many of the given rules very well. Here's an example (item 2.2):
"Specify in your compiler configuration that plain 'char' is implemented as 'unsigned char'. Justification: Support 8-bit ASCII for internationalisation. The size and sign of char is implementation-defined. If the range of type char corresponds to 7-bit ASCII, and 8-bit characters are used, unpredictable behaviour may result. Otherwise prefer to use wchar_t type."
Hmm, I don't feel very well forcing my compiler in an area where the language does not enforce. Second, I don't see why an '8-bit character' having a 'negative value' (encoded as a char) could ever be harmful. Do you know any examples?
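Whether plain char is signed on a particular compiler can be checked from within the program. The following is a minimal sketch of that check; the helper name `plain_char_is_signed` is made up for illustration, and `constexpr`/`static_assert` assume a C++11 or later compiler.

```cpp
#include <climits>
#include <limits>

// Hypothetical helper: reports how this compiler implements plain char.
constexpr bool plain_char_is_signed()
{
    return std::numeric_limits<char>::is_signed;
}

// CHAR_MIN is negative exactly when plain char is signed,
// so the two sources of information must agree.
static_assert(plain_char_is_signed() == (CHAR_MIN < 0),
              "numeric_limits and <climits> should agree");
```

GCC, for instance, offers `-funsigned-char` and `-fsigned-char` to switch the choice, which is the kind of "compiler configuration" the quoted rule refers to.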
Regards, Matthias
-- [ See LINK for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ] |
| |
| | | Jack Klein |  |
| Posted: Sun Aug 17, 2008 2:50 pm Post subject: Re: Force 'char' to be implemented as 'unsigned char'? |  |
| |  | |
On Sat, 16 Aug 2008 21:28:31 CST, Matthias Kluwe <mkluwe@gmail.com> wrote in comp.lang.c++.moderated:
| Quote: | Hi!
I came across a coding standard manual (http://www.codingstandard.com/HICPPCM/index.html) and decided to make up my mind about the suggestions in there.
I quickly found that I don't understand the justifications for many of the given rules very well. Here's an example (item 2.2):
"Specify in your compiler configuration that plain 'char' is implemented as 'unsigned char'. Justification: Support 8-bit ASCII for internationalisation. The size and sign of char is implementation-defined. If the range of type char corresponds to 7-bit ASCII, and 8-bit characters are used, unpredictable behaviour may result. Otherwise prefer to use wchar_t type."
|
I would be suspicious of someone pontificating on coding standards who can't even be bothered to get either terminology, or logic, right. Here are just a few examples.
"8-bit ASCII" -- no such thing. ASCII, which is pretty much an obsolete concept today, always was, is, and always will be a 7-bit code. There is no such thing as "8-bit ASCII", and never has been.
"If the range of type char corresponds to 7-bit ASCII" -- no such C++ system exists, so the wording is nonsensical. The C++ standard requires a conforming implementation to provide a minimum range of values for type char: either -127 to +127 or 0 to 255, depending on whether it is signed or unsigned. In either case, plain char can hold at least 127 values that are not in the ASCII character set.
"Otherwise prefer to use wchar_t type", which is neither required nor guaranteed to be different from the plain char type.
| Quote: | Hmm, I don't feel very well forcing my compiler in an area where the language does not enforce. Second, I don't see why an '8-bit character' having a 'negative value' (encoded as a char) could ever be harmful. Do you know any examples?
|
Careless mixing of signed and unsigned types, not just signed char or plain char, if signed, can cause unexpected problems.
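As a small illustration of the kind of surprise meant here, consider comparing a signed value with an unsigned one. This is only a sketch; both function names are invented for the example, and the first one spells out explicitly the conversion that the built-in `s < u` performs implicitly.

```cpp
// What the built-in `s < u` does for int and unsigned operands:
// s is first converted to unsigned, so a negative s wraps around
// to a very large value.
inline bool less_builtin(int s, unsigned u)
{
    return static_cast<unsigned>(s) < u;
}

// Compares by mathematical value instead.
inline bool less_by_value(int s, unsigned u)
{
    return s < 0 || static_cast<unsigned>(s) < u;
}
```

With these definitions, `less_builtin(-1, 1u)` is false even though -1 is mathematically less than 1, while `less_by_value(-1, 1u)` is true.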
-- Jack Klein Home: LINK FAQs for comp.lang.c LINK comp.lang.c++ LINK alt.comp.lang.learn.c-c++ LINK
| |
| | | Mathias Gaunard |  |
| Posted: Sun Aug 17, 2008 4:58 pm Post subject: Re: Force 'char' to be implemented as 'unsigned char'? |  |
On 17 Aug, 05:28, Matthias Kluwe <mkl...@gmail.com> wrote:
| Quote: | Hmm, I don't feel very well forcing my compiler in an area where the language does not enforce. Second, I don't see why an '8-bit character' having a 'negative value' (encoded as a char) could ever be harmful. Do you know any examples?
|
It's not harmful unless you use arithmetic or bitwise operations on it.
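A typical bitwise case is assembling a 16-bit value from two bytes. The sketch below uses `signed char` parameters to model a platform where plain char is signed; the function names are hypothetical, and the expected values in the note assume 8-bit chars.

```cpp
// Models the naive `(hi << 8) | lo` on a signed-char platform.
// The conversions to unsigned keep the sign-extended bit pattern that
// integer promotion of a negative char produces, while avoiding a
// shift of a negative value.
inline unsigned combine_sign_extended(signed char hi, signed char lo)
{
    return ((static_cast<unsigned>(hi) << 8) | static_cast<unsigned>(lo)) & 0xFFFFu;
}

// Going through unsigned char first discards the sign extension.
inline unsigned combine_bytes(signed char hi, signed char lo)
{
    return (static_cast<unsigned>(static_cast<unsigned char>(hi)) << 8)
         | static_cast<unsigned char>(lo);
}
```

For the bytes 0x12 and 0xE9 (the latter stored as -23 in a signed char), the first function yields 0xFFE9 because the sign bits of the low byte overwrite the high byte, while the second yields the intended 0x12E9.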
| |
| | | Francis Glassborow |  |
| Posted: Sun Aug 17, 2008 4:58 pm Post subject: Re: Force 'char' to be implemented as 'unsigned char'? |  |
Matthias Kluwe wrote:
| Quote: | Hmm, I don't feel very well forcing my compiler in an area where the language does not enforce. Second, I don't see why an '8-bit character' having a 'negative value' (encoded as a char) could ever be harmful. Do you know any examples?
|
#include <iostream>

bool foo(unsigned char c, char d) {
    return c < d;
}

int main() {
    std::cout << foo(3, 130);
}
What should the output be? Of course this program is silly but it illustrates the problem of char being implemented in one of two ways.
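To see both outcomes on a single compiler, the two possible implementations of plain char can be spelled out explicitly. This is a sketch with invented function names; it assumes 8-bit chars, where 130 does not fit into a signed char and typically ends up as -126.

```cpp
// The comparison as it behaves where plain char is signed.
// Both operands are promoted to int, so a negative d stays negative.
inline bool less_if_char_signed(unsigned char c, signed char d)
{
    return c < d;
}

// The comparison as it behaves where plain char is unsigned.
inline bool less_if_char_unsigned(unsigned char c, unsigned char d)
{
    return c < d;
}
```

`less_if_char_unsigned(3, 130)` is true, whereas `less_if_char_signed(3, -126)` is false, so the original program would print 1 on one kind of platform and 0 on the other.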
-- Note that robinton.demon.co.uk addresses are no longer valid.
| |
| | | Daniel T. |  |
| Posted: Sun Aug 17, 2008 4:58 pm Post subject: Re: Force 'char' to be implemented as 'unsigned char'? |  |
| |  | |
Matthias Kluwe <mkluwe@gmail.com> wrote:
| Quote: | I came across a coding standard manual (http://www.codingstandard.com/HICPPCM/index.html) and decided to make up my mind about the suggestions in there.
I quickly found that I don't understand the justifications for many of the given rules very well. Here's an example (item 2.2):
"Specify in your compiler configuration that plain 'char' is implemented as 'unsigned char'. Justification: Support 8-bit ASCII for internationalisation. The size and sign of char is implementation-defined. If the range of type char corresponds to 7-bit ASCII, and 8-bit characters are used, unpredictable behaviour may result. Otherwise prefer to use wchar_t type."
Hmm, I don't feel very well forcing my compiler in an area where the language does not enforce. Second, I don't see why an '8-bit character' having a 'negative value' (encoded as a char) could ever be harmful. Do you know any examples?
|
Since I have to deal with this on a regular basis, I'll give you an example...
On the Nintendo DS and Nintendo Wii, I have an array of images, each one corresponding to a particular letter. Displaying a word amounts to drawing the correct images in the correct order...
// It's a bit more complex than the below of course, but this gets
// the point across.

void display( char* c, int x, int y )
{
    while ( *c ) {
        image[*c].draw( x, y );
        y += image[*c].width();
        ++c;
    }
}
Do you see the problem with the above if a char is signed and someone tries to draw "René"? (That's 0x52, 0x65, 0x6E, 0xE9.)
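One common way around the negative index is to convert each char to unsigned char before using it as a subscript. The sketch below is not the poster's actual code: `glyph_index` and `text_width` are hypothetical helpers, and a plain table of widths stands in for the `image` array.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: maps a char to an index in 0..UCHAR_MAX,
// regardless of whether plain char is signed.
inline std::size_t glyph_index(char c)
{
    return static_cast<unsigned char>(c);
}

// Sums per-character widths from a table with one entry per
// possible unsigned char value.
inline int text_width(const char* text, const int (&width)[UCHAR_MAX + 1])
{
    int total = 0;
    for (; *text; ++text)
        total += width[glyph_index(*text)];
    return total;
}
```

With this conversion the byte 0xE9 selects entry 233 of the table on every platform, instead of entry -23 where char is signed.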
Also, converting from a char to a wchar_t leads to surprising (and incorrect) results sometimes:
char c = 0xE9; // 'é'
wchar_t wc = c;
assert( wc == c ); // this will succeed
assert( wc == 0xE9 ); // this will fail
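The failing assertion can be avoided by widening through unsigned char. This is a sketch only: `widen_byte` is a made-up name, and treating the byte value as the wide character's code is an assumption that holds for Latin-1 style encodings but not for narrow encodings in general.

```cpp
// Hypothetical helper: widens a single byte without sign extension.
// Assumes the byte's numeric value is the desired wide character code
// (true for Latin-1, not for multi-byte encodings).
inline wchar_t widen_byte(char c)
{
    return static_cast<wchar_t>(static_cast<unsigned char>(c));
}
```

Here `widen_byte(c) == 0xE9` holds for the byte 0xE9 whether plain char is signed or not.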
| |
| | | Jouko Koski |  |
| Posted: Sun Aug 17, 2008 6:53 pm Post subject: Re: Force 'char' to be implemented as 'unsigned char'? |  |
"Matthias Kluwe" <mkluwe@gmail.com> wrote:
| Quote: | I don't see why an '8-bit character' having a 'negative value' (encoded as a char) could ever be harmful. Do you know any examples?
|
Consider implementing a function for checking "good" characters (in the spirit of isalpha, isdigit etc.);
bool isgood(char c)
{
    static bool const good[] = { false, true, true /* etc. whatever... */ };
    return good[c];
}
Of course, this sort of simple implementation may not be portable for other reasons, but it silently assumes only non-negative char values.
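A variant that does not depend on the sign of char could size the table for every unsigned char value and convert the argument before indexing. This is a sketch under that assumption; the name `isgood_safe` and the table contents are placeholders.

```cpp
#include <climits>

bool isgood_safe(char c)
{
    // One entry per possible unsigned char value;
    // entries not listed explicitly default to false.
    static bool const good[UCHAR_MAX + 1] = { false, true, true };
    // The conversion keeps the index within 0..UCHAR_MAX.
    return good[static_cast<unsigned char>(c)];
}
```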
-- Jouko
| |
| | | Alf P. Steinbach |  |
| Posted: Mon Aug 18, 2008 1:28 pm Post subject: Re: Force 'char' to be implemented as 'unsigned char'? |  |
* Matthias Kluwe:
| Quote: | I don't see why an '8-bit character' having a 'negative value' (encoded as a char) could ever be harmful. Do you know any examples?
|
This should probably be a FAQ.
I find it a bit curious that respondents so far have not mentioned the standard library.
#include <ctype.h>
#include <iostream>

int main()
{
    using namespace std;

    cout << isdigit( (unsigned char)'æ' ) << endl; // OK
    cout << isdigit( 'æ' ) << endl; // !OK, UB.
}
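Since the `<ctype.h>` functions expect an argument that is either EOF or representable as unsigned char, the cast is often wrapped once in a small helper. A minimal sketch, with `is_digit_char` being an invented name:

```cpp
#include <cctype>

// Hypothetical wrapper: converts to unsigned char before calling
// std::isdigit, so negative char values never reach the library.
inline bool is_digit_char(char c)
{
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}
```

Callers can then pass plain char values directly, for example `is_digit_char('7')`, without repeating the cast at every call site.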
Cheers, & hth.,
- Alf
-- A: Because it messes up the order in which people normally read text. Q: Why is it such a bad thing? A: Top-posting. Q: What is the most annoying thing on usenet and in e-mail?
| |
| | | Guest |  |
| Posted: Mon Aug 18, 2008 9:08 pm Post subject: Re: Force 'char' to be implemented as 'unsigned char'? |  |
| |  | |
On Aug 17, 7:50 am, Jack Klein <jackkl...@spamcop.net> wrote:
| Quote: | On Sat, 16 Aug 2008 21:28:31 CST, Matthias Kluwe <mkl...@gmail.com> wrote in comp.lang.c++.moderated: I quickly found that I don't understand the justifications for many of the given rules very well. Here's an example (item 2.2):
"Specify in your compiler configuration that plain 'char' is implemented as 'unsigned char'. Justification: Support 8-bit ASCII for internationalisation. The size and sign of char is implementation-defined. If the range of type char corresponds to 7-bit ASCII, and 8-bit characters are used, unpredictable behaviour may result. Otherwise prefer to use wchar_t type."
I would be suspicious of someone pontificating on coding standards who can't even be bothered to get either terminology, or logic, right. Here are just a few examples.
"8-bit ASCII" -- no such thing. ASCII, which is pretty much an obsolete concept today, always was, is, and always will be a 7-bit code. There is no such thing as "8-bit ASCII", and never has been.
"If the range of type char corresponds to 7-bit ASCII" -- no such C++ system exists, so the wording is non-sensical. The C++ standard requires a conforming implementation to provide a minimum range of values for type char, depending on whether it is signed or unsigned, or -127 to +127, or 0 to 255. In either case, plain char can hold at least 127 values that are not in the ASCII character set.
|
Now, IIRC, there are some implementations out there which:
1- Are hardcoded to use ASCII
2- Use 8 bit bytes
3- Decided to use the unused 8th bit as a parity check bit.
Thus, if you're working on one of these machines, converting 'a' to a UTF8 character may not produce the intended result, because the compiler and hardware will just copy the 8 bit byte over, including the 8th parity bit, which is not the intended UTF8 character.
However, if that was the intent of the document quoted by the OP, they did a pisspoor job trying to get that across.
| |
| | | Greg Herlihy |  |
| Posted: Tue Aug 19, 2008 12:42 am Post subject: Re: Force 'char' to be implemented as 'unsigned char'? |  |
| |  | |
On Aug 18, 2:08 pm, JoshuaMaur...@gmail.com wrote:
| Quote: | On Aug 17, 7:50 am, Jack Klein <jackkl...@spamcop.net> wrote:
"If the range of type char corresponds to 7-bit ASCII" -- no such C++ system exists, so the wording is non-sensical. The C++ standard requires a conforming implementation to provide a minimum range of values for type char, depending on whether it is signed or unsigned, or -127 to +127, or 0 to 255. In either case, plain char can hold at least 127 values that are not in the ASCII character set.
Now, IIRC, there are some implementations out there which: 1- Are hardcoded to use ASCII 2- Use 8 bit bytes 3- Decided to use the unused 8th bit as a parity check bit.
|
There are no "unused" bits in a C++ char type. Instead, every bit must participate in the char's value representation. So the possibility that a C++ char could have any kind of parity bit is ruled out. And for unsigned chars, the requirements are even more strict: every possible bit pattern must represent a valid number, whereas the presence of a parity bit in a char would mean that certain bit patterns do not represent valid numbers.
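The requirement on unsigned char can be expressed as a compile-time check. The sketch below assumes a C++11 or later compiler; `unsigned_char_value_count` is a hypothetical helper added only to make the consequence visible.

```cpp
#include <climits>
#include <limits>

// Every bit of an unsigned char is a value bit, so the number of
// value bits equals the number of bits in a byte.
static_assert(std::numeric_limits<unsigned char>::digits == CHAR_BIT,
              "unsigned char has no padding bits");

// Hypothetical helper: number of distinct values an unsigned char holds.
constexpr unsigned long unsigned_char_value_count()
{
    return static_cast<unsigned long>(UCHAR_MAX) + 1;
}
```

On a platform with 8-bit bytes this count is 256, one value per bit pattern, which leaves no room for a parity bit.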
| Quote: | Thus, if you're working on one of these machines, converting 'a' to a UTF8 character may not produce the intended result, because the compiler and hardware will just copy the 8 bit byte over, including the 8th parity bit, which is not the intended UTF8 character.
|
So under these circumstances, converting a char value to, say, an int, will - for certain char values - also wind up adding 128 to the value of the int?
Greg
| |
| | | Guest |  |
| Posted: Tue Aug 19, 2008 9:43 am Post subject: Re: Force 'char' to be implemented as 'unsigned char'? |  |
| |  | |
On Aug 18, 5:42 pm, Greg Herlihy <gre...@mac.com> wrote:
| Quote: | On Aug 18, 2:08 pm, JoshuaMaur...@gmail.com wrote:
On Aug 17, 7:50 am, Jack Klein <jackkl...@spamcop.net> wrote:
"If the range of type char corresponds to 7-bit ASCII" -- no such C++ system exists, so the wording is non-sensical. The C++ standard requires a conforming implementation to provide a minimum range of values for type char, depending on whether it is signed or unsigned, or -127 to +127, or 0 to 255. In either case, plain char can hold at least 127 values that are not in the ASCII character set.
Now, IIRC, there are some implementations out there which: 1- Are hardcoded to use ASCII 2- Use 8 bit bytes 3- Decided to use the unused 8th bit as a parity check bit.
There are no "unused" bits in a C++ char type. Instead, every bit must participate in the char's value representation. So the possiblity that a C++ char could have any kind of parity bit - is ruled out. And for unsigned chars, the requirements are even more strict: for an unsigned char, every possible bit pattern must represent a valid number. Whereas the presence of a parity bit in a char would necessitate that certain bit patterns do not represent valid numbers.
|
If I'm reading the standard correctly, which I think I am, it is not specified how 'a' gets encoded. It may be ASCII. It may not be. Thus:

char a = 'a';
int32_t c = a; // treated as a code point
utf8string str; // utf8string: some hypothetical UTF-8 string class
str.append(c);

Most machines encode English string literals as ASCII, which is therefore also valid UTF-8. However, a machine need not encode English string literals in ASCII.
For example, a machine may encode string literals as an extended ASCII, the least significant 7 bits as ASCII with the 8th bit as a parity check, and this is allowed by the standard. (Within some limits. I think the standard requires that the character digits '0', '1', ..., '9' map to integers '0', '0'+1, ..., '0'+9, so ASCII encoding with an 8th bit parity check wouldn't actually be allowed...) This would result in the above code fragment not doing what you wanted. It would be annoying to catch, as it would work on most machines, since most machines encode English string literals in ASCII.
I brought the 8th bit parity encoding up because I think I remember reading about real systems out there which do that, and this gotcha may have been what the quote in the OP was talking about. (Though using my '0', '1', ..., '9' argument, such a system is probably a figment of my imagination, thus confirming the source in the OP has no clue what it's talking about.)
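The digit guarantee referred to above is real: the standard requires the values of '0' through '9' to be consecutive, whatever the execution character set. A short sketch of what that permits, with `digit_value` as an invented helper name:

```cpp
// Holds on every conforming implementation, ASCII or not.
static_assert('9' - '0' == 9, "decimal digits are contiguous");

// Hypothetical helper: numeric value of a decimal digit character.
// Precondition: c is one of '0'..'9'.
inline int digit_value(char c)
{
    return c - '0';
}
```

No comparable guarantee exists for letters, so `c - 'a'` is not portable in the same way.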
| Quote: | Thus, if you're working on one of these machines, converting 'a' to a UTF8 character may not produce the intended result, because the compiler and hardware will just copy the 8 bit byte over, including the 8th parity bit, which is not the intended UTF8 character.
So under these circumstances, converting a char value to, say, an int, will - for certain char values - also wind up adding 128 to the value of the int?
|
No, but you might not get the usual ASCII encoding.
| |