Index: gcc/ChangeLog

2005-03-14  Geoffrey Keating  <geoffk@apple.com>

	* doc/cppopts.texi (-fexec-charset): Add concept index entry.
	(-fwide-exec-charset): Likewise.
	(-finput-charset): Likewise.
	* doc/invoke.texi (Warning Options): Document -Wnormalized=.
	* c-opts.c (c_common_handle_option): Handle -Wnormalized=.
	* c.opt (Wnormalized): New.

Index: libcpp/ChangeLog

2005-03-14  Geoffrey Keating  <geoffk@apple.com>

	* init.c (cpp_create_reader): Default warn_normalize to normalized_C.
	* charset.c: Update for new format of ucnid.h.
	(ucn_valid_in_identifier): Update for new format of ucnid.h.
	Add NST parameter, and update it; update callers.
	(cpp_valid_ucn): Add NST parameter, update callers.  Replace abort
	with cpp_error.
	(convert_ucn): Pass normalize_state to cpp_valid_ucn.
	* internal.h (struct normalize_state): New.
	(INITIAL_NORMALIZE_STATE): New.
	(NORMALIZE_STATE_RESULT): New.
	(NORMALIZE_STATE_UPDATE_IDNUM): New.
	(_cpp_valid_ucn): New.
	* lex.c (warn_about_normalization): New.
	(forms_identifier_p): Add normalize_state parameter, update callers.
	(lex_identifier): Add normalize_state parameter, update callers.  Keep
	the state current.
	(lex_number): Likewise.
	(_cpp_lex_direct): Pass normalize_state to subroutines.  Check
	it with warn_about_normalization.
	* makeucnid.c: New.
	* ucnid.h: Replace.
	* ucnid.pl: Remove.
	* ucnid.tab: Make appropriate for input to makeucnid.c.  Remove
	comments about obsolete version of C++.
	* include/cpplib.h (enum cpp_normalize_level): New.
	(struct cpp_options): Add warn_normalize field.

Index: gcc/testsuite/ChangeLog

2005-03-14  Geoffrey Keating  <geoffk@apple.com>

	* gcc.dg/cpp/normalize-1.c: New.
	* gcc.dg/cpp/normalize-2.c: New.
	* gcc.dg/cpp/normalize-3.c: New.
	* gcc.dg/cpp/normalize-4.c: New.
	* gcc.dg/cpp/ucnid-4.c: New.
	* gcc.dg/cpp/ucnid-5.c: New.
	* g++.dg/cpp/normalize-1.C: New.
	* g++.dg/cpp/ucnid-1.C: New.

From-SVN: r96459
2024-11-21 13:40:47 +00:00 · 2005-03-15 00:36:33 +00:00 · 2005-03-15 00:36:33 +00:00 · 50668cf626
commit 50668cf626
parent cd8b38b9eb
24 changed files with 1708 additions and 548 deletions
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,12 @@
+2005-03-14  Geoffrey Keating  <geoffk@apple.com>
+
+	* doc/cppopts.texi (-fexec-charset): Add concept index entry.
+	(-fwide-exec-charset): Likewise.
+	(-finput-charset): Likewise.
+	* doc/invoke.texi (Warning Options): Document -Wnormalized=.
+	* c-opts.c (c_common_handle_option): Handle -Wnormalized=.
+	* c.opt (Wnormalized): New.
+
 2005-03-14  Devang Patel  <dpatel@apple.com>

 	* doc/invoke.texi: Add reference to Visibility document.
--- a/gcc/c-opts.c
+++ b/gcc/c-opts.c
@@ -460,6 +460,19 @@ c_common_handle_option (size_t scode, const char *arg, int value)
      cpp_opts->warn_multichar = value;
      break;

+    case OPT_Wnormalized_:
+      if (!value || (arg && strcasecmp (arg, "none") == 0))
+	cpp_opts->warn_normalize = normalized_none;
+      else if (!arg || strcasecmp (arg, "nfkc") == 0)
+	cpp_opts->warn_normalize = normalized_KC;
+      else if (strcasecmp (arg, "id") == 0)
+	cpp_opts->warn_normalize = normalized_identifier_C;
+      else if (strcasecmp (arg, "nfc") == 0)
+	cpp_opts->warn_normalize = normalized_C;
+      else
+	error ("argument %qs to %<-Wnormalized%> not recognized", arg);
+      break;
+
    case OPT_Wreturn_type:
      warn_return_type = value;
      break;
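
Note on the hunk above: the mapping from the -Wnormalized= argument to a normalization level can be sketched as a standalone function. This is only an illustration under stated assumptions: the names parse_wnormalized and wnormalized_examples and the -1 error return are hypothetical and not part of the patch; the enum mirrors cpp_normalize_level as added to libcpp/include/cpplib.h by this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>  /* strcasecmp (POSIX) */

/* Mirrors enum cpp_normalize_level from this commit.  */
enum cpp_normalize_level {
  normalized_KC = 0,
  normalized_C,
  normalized_identifier_C,
  normalized_none
};

/* Hypothetical helper, not part of the patch.  VALUE is zero when the
   warning is switched off; ARG is the text after '=' or NULL.
   Returns the selected level, or -1 for an unrecognized argument
   (the patch reports an error in that case).  */
int
parse_wnormalized (int value, const char *arg)
{
  if (!value || (arg && strcasecmp (arg, "none") == 0))
    return normalized_none;
  else if (!arg || strcasecmp (arg, "nfkc") == 0)
    return normalized_KC;
  else if (strcasecmp (arg, "id") == 0)
    return normalized_identifier_C;
  else if (strcasecmp (arg, "nfc") == 0)
    return normalized_C;
  return -1;
}

/* Usage checks: matching is case-insensitive, and a missing
   argument selects the strictest level, as in the hunk above.  */
void
wnormalized_examples (void)
{
  assert (parse_wnormalized (1, "NFC") == normalized_C);
  assert (parse_wnormalized (1, NULL) == normalized_KC);
  assert (parse_wnormalized (0, "nfc") == normalized_none);
}
```

The order of the tests matters: a zero VALUE wins over any argument, so the warning ends up disabled whatever ARG says.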
--- a/gcc/c.opt
+++ b/gcc/c.opt
@@ -285,6 +285,10 @@ Wnonnull
 C ObjC Var(warn_nonnull)
 Warn about NULL being passed to argument slots marked as requiring non-NULL

+Wnormalized=
+C ObjC C++ ObjC++ Joined
+-Wnormalized=<none|id|nfc|nfkc>	Warn about non-normalized Unicode strings
+
 Wold-style-cast
 C++ ObjC++ Var(warn_old_style_cast)
 Warn if a C-style cast is used in a program
--- a/gcc/doc/cppopts.texi
+++ b/gcc/doc/cppopts.texi
@@ -530,12 +530,14 @@ ignored.  The default is 8.

@item -fexec-charset=@var{charset}
@opindex fexec-charset
+@cindex character set, execution
 Set the execution character set, used for string and character
 constants.  The default is UTF-8.  @var{charset} can be any encoding
 supported by the system's @code{iconv} library routine.

@item -fwide-exec-charset=@var{charset}
@opindex fwide-exec-charset
+@cindex character set, wide execution
 Set the wide execution character set, used for wide string and
 character constants.  The default is UTF-32 or UTF-16, whichever
 corresponds to the width of @code{wchar_t}.  As with
@@ -545,6 +547,7 @@ problems with encodings that do not fit exactly in @code{wchar_t}.

@item -finput-charset=@var{charset}
@opindex finput-charset
+@cindex character set, input
 Set the input character set, used for translation from the character
 set of the input file to the source character set used by GCC@.  If the
 locale does not specify, or GCC cannot get this information from the
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -3039,6 +3039,51 @@ Do not warn if a multicharacter constant (@samp{'FOOF'}) is used.
 Usually they indicate a typo in the user's code, as they have
 implementation-defined values, and should not be used in portable code.

+@item -Wnormalized=<none|id|nfc|nfkc>
+@opindex Wnormalized
+@cindex NFC
+@cindex NFKC
+@cindex character set, input normalization
+In ISO C and ISO C++, two identifiers are different if they are
+different sequences of characters.  However, sometimes when characters
+outside the basic ASCII character set are used, you can have two
+different character sequences that look the same.  To avoid confusion,
+the ISO 10646 standard sets out some @dfn{normalization rules} which
+when applied ensure that two sequences that look the same are turned into
+the same sequence.  GCC can warn you if you are using identifiers which
+have not been normalized; this option controls that warning.
+
+There are four levels of warning that GCC supports.  The default is
+@option{-Wnormalized=nfc}, which warns about any identifier which is
+not in the ISO 10646 ``C'' normalized form, @dfn{NFC}.  NFC is the
+recommended form for most uses.
+
+Unfortunately, there are some characters which ISO C and ISO C++ allow
+in identifiers that when turned into NFC aren't allowable as
+identifiers.  That is, there's no way to use these symbols in portable
+ISO C or C++ and have all your identifiers in NFC.
+@option{-Wnormalized=id} suppresses the warning for these characters.
+It is hoped that future versions of the standards involved will correct
+this, which is why this option is not the default.
+
+You can switch the warning off for all characters by writing
+@option{-Wnormalized=none}.  You would only want to do this if you
+were using some other normalization scheme (like ``D''), because
+otherwise you can easily create bugs that are literally impossible to see.
+
+Some characters in ISO 10646 have distinct meanings but look identical
+in some fonts or display methodologies, especially once formatting has
+been applied.  For instance @code{\u207F}, ``SUPERSCRIPT LATIN SMALL
+LETTER N'', will display just like a regular @code{n} which has been
+placed in a superscript.  ISO 10646 defines the @dfn{NFKC}
+normalization scheme to convert all these into a standard form as
+well, and GCC will warn if your code is not in NFKC if you use
+@option{-Wnormalized=nfkc}.  This warning is comparable to warning
+about every identifier that contains the letter O because it might be
+confused with the digit 0, and so is not the default, but may be
+useful as a local coding convention if the programming environment is
+unable to be fixed to display these characters distinctly.
+
@item -Wno-deprecated-declarations
@opindex Wno-deprecated-declarations
 Do not warn about uses of functions, variables, and types marked as
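
Note on the documentation above: the claim that two character sequences can look the same yet be different identifiers can be shown with plain byte strings. The sketch below compares the UTF-8 encodings of a precomposed letter and its decomposed equivalent. The names precomposed, decomposed and same_spelling are illustrative only, and the example deliberately uses strings rather than identifiers so that it does not depend on a compiler's support for extended identifiers.

```c
#include <assert.h>
#include <string.h>

/* U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, encoded as UTF-8.
   This is the precomposed (NFC) spelling.  */
const char precomposed[] = "\xC3\x85";

/* 'A' followed by U+030A COMBINING RING ABOVE, encoded as UTF-8.
   This is the decomposed spelling; it normally renders identically.  */
const char decomposed[] = "A\xCC\x8A";

/* Two spellings are the same only when they are the same sequence of
   characters, which is the rule the documentation text describes.  */
int
same_spelling (const char *a, const char *b)
{
  assert (a != NULL && b != NULL);
  return strcmp (a, b) == 0;
}
```

Here same_spelling (precomposed, decomposed) is zero even though both strings usually display as the same glyph, which is the confusion the warning is meant to catch.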
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,14 @@
+2005-03-14  Geoffrey Keating  <geoffk@apple.com>
+
+	* gcc.dg/cpp/normalize-1.c: New.
+	* gcc.dg/cpp/normalize-2.c: New.
+	* gcc.dg/cpp/normalize-3.c: New.
+	* gcc.dg/cpp/normalize-4.c: New.
+	* gcc.dg/cpp/ucnid-4.c: New.
+	* gcc.dg/cpp/ucnid-5.c: New.
+	* g++.dg/cpp/normalize-1.C: New.
+	* g++.dg/cpp/ucnid-1.C: New.
+
 2005-03-14  Alexandre Oliva  <aoliva@redhat.com>

 	* gcc.dg/pr18628.c: New.
--- a/gcc/testsuite/g++.dg/cpp/normalize-1.C
+++ b/gcc/testsuite/g++.dg/cpp/normalize-1.C
@@ -0,0 +1,34 @@
+/* { dg-do preprocess } */
+/* { dg-options "-Wnormalized=id" } */
+
+\u00AA
+\u00B7
+\u0F43  /* { dg-warning "not in NFC" } */
+a\u05B8\u05B9\u05B9\u05BBb
+ a\u05BB\u05B9\u05B8\u05B9b  /* { dg-warning "not in NFC" } */
+\u09CB
+\u09C7\u09BE /* { dg-warning "not in NFC" } */
+\u0B4B
+\u0B47\u0B3E /* { dg-warning "not in NFC" } */
+\u0BCA
+\u0BC6\u0BBE /* { dg-warning "not in NFC" } */
+\u0BCB
+\u0BC7\u0BBE /* { dg-warning "not in NFC" } */
+\u0CCA
+\u0CC6\u0CC2 /* { dg-warning "not in NFC" } */
+\u0D4A
+\u0D46\u0D3E /* { dg-warning "not in NFC" } */
+\u0D4B
+\u0D47\u0D3E /* { dg-warning "not in NFC" } */
+
+K
+\u212A
+
+\u03AC
+\u1F71 /* { dg-warning "not in NFC" } */
+
+\uAC00
+\u1100\u1161
+\uAC01
+\u1100\u1161\u11A8
+\uAC00\u11A8
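
Note on the Hangul lines that close the test above: each precomposed syllable is paired with a sequence of combining jamo that composes to it. The arithmetic can be sketched as follows, assuming the ranges quoted in the charset.c comment in this commit (1100-1112, 1161-1175, 11A8-11C2, AC00-D7A3). The function names are hypothetical, and this is not the code libcpp uses.

```c
#include <assert.h>

/* Compose a leading consonant L (U+1100..U+1112), a vowel V
   (U+1161..U+1175) and an optional trailing consonant T
   (U+11A8..U+11C2, or 0 for none) into a precomposed syllable.
   Returns 0 if any argument is out of range.  */
unsigned long
hangul_compose (unsigned long l, unsigned long v, unsigned long t)
{
  unsigned long s;

  if (l < 0x1100 || l > 0x1112 || v < 0x1161 || v > 0x1175)
    return 0;
  if (t != 0 && (t < 0x11A8 || t > 0x11C2))
    return 0;

  /* 21 vowels times 28 trailing slots gives 588 syllables per leading
     consonant; slot 0 (notionally U+11A7) means no trailing consonant.  */
  s = 0xAC00 + (l - 0x1100) * 588 + (v - 0x1161) * 28
      + (t ? t - 0x11A7 : 0);
  assert (s >= 0xAC00 && s <= 0xD7A3);
  return s;
}

/* Nonzero if S is a syllable with no trailing consonant, i.e. one that
   a following U+11A8..U+11C2 would combine with.  This is the same
   condition as the (p - 0xAC00) % 28 test in charset.c.  */
int
hangul_is_lv (unsigned long s)
{
  return s >= 0xAC00 && s <= 0xD7A3 && (s - 0xAC00) % 28 == 0;
}
```

With these definitions, \u1100\u1161 composes to \uAC00 and \u1100\u1161\u11A8 composes to \uAC01, matching the pairs in the test.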
--- a/gcc/testsuite/g++.dg/cpp/ucnid-1.C
+++ b/gcc/testsuite/g++.dg/cpp/ucnid-1.C
@@ -0,0 +1,17 @@
+/* { dg-do preprocess } */
+/* { dg-options "-pedantic" } */
+
+\u00AA /* { dg-error "not valid in an identifier" } */
+\u00AB /* { dg-error "not valid in an identifier" } */
+\u00B6 /* { dg-error "not valid in an identifier" } */
+\u00BA /* { dg-error "not valid in an identifier" } */
+\u00C0
+\u00D6
+\u0384
+
+\u0669 /* { dg-error "not valid in an identifier" } */
+A\u0669 /* { dg-error "not valid in an identifier" } */
+0\u00BA /* { dg-error "not valid in an identifier" } */
+0\u0669 /* { dg-error "not valid in an identifier" } */
+\u0E59
+A\u0E59
--- a/gcc/testsuite/gcc.dg/cpp/normalize-1.c
+++ b/gcc/testsuite/gcc.dg/cpp/normalize-1.c
@@ -0,0 +1,34 @@
+/* { dg-do preprocess } */
+/* { dg-options "-std=c99" } */
+
+\u00AA
+\u00B7
+\u0F43  /* { dg-warning "not in NFC" } */
+a\u05B8\u05B9\u05B9\u05BBb
+ a\u05BB\u05B9\u05B8\u05B9b  /* { dg-warning "not in NFC" } */
+\u09CB
+\u09C7\u09BE /* { dg-warning "not in NFC" } */
+\u0B4B
+\u0B47\u0B3E /* { dg-warning "not in NFC" } */
+\u0BCA
+\u0BC6\u0BBE /* { dg-warning "not in NFC" } */
+\u0BCB
+\u0BC7\u0BBE /* { dg-warning "not in NFC" } */
+\u0CCA
+\u0CC6\u0CC2 /* { dg-warning "not in NFC" } */
+\u0D4A
+\u0D46\u0D3E /* { dg-warning "not in NFC" } */
+\u0D4B
+\u0D47\u0D3E /* { dg-warning "not in NFC" } */
+
+K
+\u212A /* { dg-warning "not in NFC" } */
+
+\u03AC
+\u1F71 /* { dg-warning "not in NFC" } */
+
+\uAC00
+\u1100\u1161 /* { dg-warning "not in NFC" } */
+\uAC01
+\u1100\u1161\u11A8 /* { dg-warning "not in NFC" } */
+\uAC00\u11A8 /* { dg-warning "not in NFC" } */
--- a/gcc/testsuite/gcc.dg/cpp/normalize-2.c
+++ b/gcc/testsuite/gcc.dg/cpp/normalize-2.c
@@ -0,0 +1,34 @@
+/* { dg-do preprocess } */
+/* { dg-options "-std=c99 -Wnormalized=nfkc" } */
+
+\u00AA  /* { dg-warning "not in NFKC" } */
+\u00B7
+\u0F43  /* { dg-warning "not in NFC" } */
+a\u05B8\u05B9\u05B9\u05BBb
+ a\u05BB\u05B9\u05B8\u05B9b  /* { dg-warning "not in NFC" } */
+\u09CB
+\u09C7\u09BE /* { dg-warning "not in NFC" } */
+\u0B4B
+\u0B47\u0B3E /* { dg-warning "not in NFC" } */
+\u0BCA
+\u0BC6\u0BBE /* { dg-warning "not in NFC" } */
+\u0BCB
+\u0BC7\u0BBE /* { dg-warning "not in NFC" } */
+\u0CCA
+\u0CC6\u0CC2 /* { dg-warning "not in NFC" } */
+\u0D4A
+\u0D46\u0D3E /* { dg-warning "not in NFC" } */
+\u0D4B
+\u0D47\u0D3E /* { dg-warning "not in NFC" } */
+
+K
+\u212A /* { dg-warning "not in NFC" } */
+
+\u03AC
+\u1F71 /* { dg-warning "not in NFC" } */
+
+\uAC00
+\u1100\u1161 /* { dg-warning "not in NFC" } */
+\uAC01
+\u1100\u1161\u11A8 /* { dg-warning "not in NFC" } */
+\uAC00\u11A8 /* { dg-warning "not in NFC" } */
--- a/gcc/testsuite/gcc.dg/cpp/normalize-3.c
+++ b/gcc/testsuite/gcc.dg/cpp/normalize-3.c
@@ -0,0 +1,34 @@
+/* { dg-do preprocess } */
+/* { dg-options "-std=c99 -Wnormalized=id" } */
+
+\u00AA
+\u00B7
+\u0F43  /* { dg-warning "not in NFC" } */
+a\u05B8\u05B9\u05B9\u05BBb
+ a\u05BB\u05B9\u05B8\u05B9b  /* { dg-warning "not in NFC" } */
+\u09CB
+\u09C7\u09BE /* { dg-warning "not in NFC" } */
+\u0B4B
+\u0B47\u0B3E /* { dg-warning "not in NFC" } */
+\u0BCA
+\u0BC6\u0BBE /* { dg-warning "not in NFC" } */
+\u0BCB
+\u0BC7\u0BBE /* { dg-warning "not in NFC" } */
+\u0CCA
+\u0CC6\u0CC2 /* { dg-warning "not in NFC" } */
+\u0D4A
+\u0D46\u0D3E /* { dg-warning "not in NFC" } */
+\u0D4B
+\u0D47\u0D3E /* { dg-warning "not in NFC" } */
+
+K
+\u212A
+
+\u03AC
+\u1F71 /* { dg-warning "not in NFC" } */
+
+\uAC00
+\u1100\u1161
+\uAC01
+\u1100\u1161\u11A8
+\uAC00\u11A8
--- a/gcc/testsuite/gcc.dg/cpp/normalize-4.c
+++ b/gcc/testsuite/gcc.dg/cpp/normalize-4.c
@@ -0,0 +1,34 @@
+/* { dg-do preprocess } */
+/* { dg-options "-std=c99 -Wnormalized=none" } */
+
+\u00AA
+\u00B7
+\u0F43
+a\u05B8\u05B9\u05B9\u05BBb
+ a\u05BB\u05B9\u05B8\u05B9b
+\u09CB
+\u09C7\u09BE
+\u0B4B
+\u0B47\u0B3E
+\u0BCA
+\u0BC6\u0BBE
+\u0BCB
+\u0BC7\u0BBE
+\u0CCA
+\u0CC6\u0CC2
+\u0D4A
+\u0D46\u0D3E
+\u0D4B
+\u0D47\u0D3E
+
+K
+\u212A
+
+\u03AC
+\u1F71
+
+\uAC00
+\u1100\u1161
+\uAC01
+\u1100\u1161\u11A8
+\uAC00\u11A8
--- a/gcc/testsuite/gcc.dg/cpp/ucnid-4.c
+++ b/gcc/testsuite/gcc.dg/cpp/ucnid-4.c
@@ -0,0 +1,17 @@
+/* { dg-do preprocess } */
+/* { dg-options "-std=c99" } */
+
+\u00AA
+\u00AB /* { dg-error "not valid in an identifier" } */
+\u00B6 /* { dg-error "not valid in an identifier" } */
+\u00BA
+\u00C0
+\u00D6
+\u0384
+
+\u0669 /* { dg-error "not valid at the start of an identifier" } */
+A\u0669
+0\u00BA
+0\u0669
+\u0E59 /* { dg-error "not valid at the start of an identifier" } */
+A\u0E59
--- a/gcc/testsuite/gcc.dg/cpp/ucnid-5.c
+++ b/gcc/testsuite/gcc.dg/cpp/ucnid-5.c
@@ -0,0 +1,17 @@
+/* { dg-do preprocess } */
+/* { dg-options "-std=c99 -pedantic" } */
+
+\u00AA
+\u00AB /* { dg-error "not valid in an identifier" } */
+\u00B6 /* { dg-error "not valid in an identifier" } */
+\u00BA
+\u00C0
+\u00D6
+\u0384 /* { dg-error "not valid in an identifier" } */
+
+\u0669 /* { dg-error "not valid at the start of an identifier" } */
+A\u0669
+0\u00BA
+0\u0669
+\u0E59 /* { dg-error "not valid at the start of an identifier" } */
+A\u0E59
--- a/libcpp/ChangeLog
+++ b/libcpp/ChangeLog
@@ -1,3 +1,32 @@
+2005-03-14  Geoffrey Keating  <geoffk@apple.com>
+
+	* init.c (cpp_create_reader): Default warn_normalize to normalized_C.
+	* charset.c: Update for new format of ucnid.h.
+	(ucn_valid_in_identifier): Update for new format of ucnid.h.
+	Add NST parameter, and update it; update callers.
+	(cpp_valid_ucn): Add NST parameter, update callers.  Replace abort
+	with cpp_error.
+	(convert_ucn): Pass normalize_state to cpp_valid_ucn.
+	* internal.h (struct normalize_state): New.
+	(INITIAL_NORMALIZE_STATE): New.
+	(NORMALIZE_STATE_RESULT): New.
+	(NORMALIZE_STATE_UPDATE_IDNUM): New.
+	(_cpp_valid_ucn): New.
+	* lex.c (warn_about_normalization): New.
+	(forms_identifier_p): Add normalize_state parameter, update callers.
+	(lex_identifier): Add normalize_state parameter, update callers.  Keep
+	the state current.
+	(lex_number): Likewise.
+	(_cpp_lex_direct): Pass normalize_state to subroutines.  Check
+	it with warn_about_normalization.
+	* makeucnid.c: New.
+	* ucnid.h: Replace.
+	* ucnid.pl: Remove.
+	* ucnid.tab: Make appropriate for input to makeucnid.c.  Remove
+	comments about obsolete version of C++.
+	* include/cpplib.h (enum cpp_normalize_level): New.
+	(struct cpp_options): Add warn_normalize field.
+
 2005-03-11  Geoffrey Keating  <geoffk@apple.com>

 	* directives.c (glue_header_name): Update call to cpp_spell_token.
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -22,7 +22,6 @@ Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.  */
 #include "system.h"
 #include "cpplib.h"
 #include "internal.h"
-#include "ucnid.h"

 /* Character set handling for C-family languages.

@@ -786,43 +785,128 @@ width_to_mask (size_t width)
    return ((size_t) 1 << width) - 1;
 }

+/* A large table of unicode character information.  */
+enum {
+  /* Valid in a C99 identifier?  */
+  C99 = 1,
+  /* Valid in a C99 identifier, but not as the first character?  */
+  DIG = 2,
+  /* Valid in a C++ identifier?  */
+  CXX = 4,
+  /* NFC representation is not valid in an identifier?  */
+  CID = 8,
+  /* Might be valid NFC form?  */
+  NFC = 16,
+  /* Might be valid NFKC form?  */
+  NKC = 32,
+  /* Certain preceding characters might make it not valid NFC/NFKC form?  */
+  CTX = 64
+};
+
+static const struct {
+  /* Bitmap of flags above.  */
+  unsigned char flags;
+  /* Combining class of the character.  */
+  unsigned char combine;
+  /* Last character in the range described by this entry.  */
+  unsigned short end;
+} ucnranges[] = {
+#include "ucnid.h"
+};
+
 /* Returns 1 if C is valid in an identifier, 2 if C is valid except at
   the start of an identifier, and 0 if C is not valid in an
   identifier.  We assume C has already gone through the checks of
-   _cpp_valid_ucn.  The algorithm is a simple binary search on the
-   table defined in cppucnid.h.  */
+   _cpp_valid_ucn.  Also update NST for C if returning nonzero.  The
+   algorithm is a simple binary search on the table defined in
+   ucnid.h.  */

 static int
-ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c)
+ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c,
+			 struct normalize_state *nst)
 {
  int mn, mx, md;

-  mn = -1;
-  mx = ARRAY_SIZE (ucnranges);
-  while (mx - mn > 1)
+  if (c > 0xFFFF)
+    return 0;
+
+  mn = 0;
+  mx = ARRAY_SIZE (ucnranges) - 1;
+  while (mx != mn)
    {
      md = (mn + mx) / 2;
-      if (c < ucnranges[md].lo)
+      if (c <= ucnranges[md].end)
 	mx = md;
-      else if (c > ucnranges[md].hi)
-	mn = md;
      else
-	goto found;
+	mn = md + 1;
    }
-  return 0;

- found:
  /* When -pedantic, we require the character to have been listed by
     the standard for the current language.  Otherwise, we accept the
     union of the acceptable sets for C++98 and C99.  */
+  if (! (ucnranges[mn].flags & (C99 | CXX)))
+      return 0;
+
  if (CPP_PEDANTIC (pfile)
-      && ((CPP_OPTION (pfile, c99) && !(ucnranges[md].flags & C99))
+      && ((CPP_OPTION (pfile, c99) && !(ucnranges[mn].flags & C99))
 	  || (CPP_OPTION (pfile, cplusplus)
-	      && !(ucnranges[md].flags & CXX))))
+	      && !(ucnranges[mn].flags & CXX))))
    return 0;

+  /* Update NST.  */
+  if (ucnranges[mn].combine != 0 && ucnranges[mn].combine < nst->prev_class)
+    nst->level = normalized_none;
+  else if (ucnranges[mn].flags & CTX)
+    {
+      bool safe;
+      cppchar_t p = nst->previous;
+
+      /* Easy cases from Bengali, Oriya, Tamil, Kannada, and Malayalam.  */
+      if (c == 0x09BE)
+	safe = p != 0x09C7;  /* Use 09CB instead of 09C7 09BE.  */
+      else if (c == 0x0B3E)
+	safe = p != 0x0B47;  /* Use 0B4B instead of 0B47 0B3E.  */
+      else if (c == 0x0BBE)
+	safe = p != 0x0BC6 && p != 0x0BC7;  /* Use 0BCA/0BCB instead.  */
+      else if (c == 0x0CC2)
+	safe = p != 0x0CC6;  /* Use 0CCA instead of 0CC6 0CC2.  */
+      else if (c == 0x0D3E)
+	safe = p != 0x0D46 && p != 0x0D47;  /* Use 0D4A/0D4B instead.  */
+      /* For Hangul, characters in the range AC00-D7A3 are NFC/NFKC,
+	 and are combined algorithmically from a sequence of the form
+	 1100-1112 1161-1175 11A8-11C2
+	 (if the third is not present, it is treated as 11A7, which is not
+	 really a valid character).
+	 Unfortunately, C99 allows (only) the NFC form, but C++ allows
+	 only the combining characters.  */
+      else if (c >= 0x1161 && c <= 0x1175)
+	safe = p < 0x1100 || p > 0x1112;
+      else if (c >= 0x11A8 && c <= 0x11C2)
+	safe = (p < 0xAC00 || p > 0xD7A3 || (p - 0xAC00) % 28 != 0);
+      else
+	{
+	  /* Uh-oh, someone updated ucnid.h without updating this code.  */
+	  cpp_error (pfile, CPP_DL_ICE, "Character %x might not be NFKC", c);
+	  safe = true;
+	}
+      if (!safe && c < 0x1161)
+	nst->level = normalized_none;
+      else if (!safe)
+	nst->level = MAX (nst->level, normalized_identifier_C);
+    }
+  else if (ucnranges[mn].flags & NKC)
+    ;
+  else if (ucnranges[mn].flags & NFC)
+    nst->level = MAX (nst->level, normalized_C);
+  else if (ucnranges[mn].flags & CID)
+    nst->level = MAX (nst->level, normalized_identifier_C);
+  else
+    nst->level = normalized_none;
+  nst->previous = c;
+  nst->prev_class = ucnranges[mn].combine;
+
  /* In C99, UCN digits may not begin identifiers.  */
-  if (CPP_OPTION (pfile, c99) && (ucnranges[md].flags & DIG))
+  if (CPP_OPTION (pfile, c99) && (ucnranges[mn].flags & DIG))
    return 2;

  return 1;
@@ -853,7 +937,8 @@ ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c)

 cppchar_t
 _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
-		const uchar *limit, int identifier_pos)
+		const uchar *limit, int identifier_pos,
+		struct normalize_state *nst)
 {
  cppchar_t result, c;
  unsigned int length;
@@ -873,7 +958,10 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
  else if (str[-1] == 'U')
    length = 8;
  else
-    abort();
+    {
+      cpp_error (pfile, CPP_DL_ICE, "In _cpp_valid_ucn but not a UCN");
+      length = 4;
+    }

  result = 0;
  do
@@ -915,10 +1003,11 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
 	  CPP_OPTION (pfile, warn_dollars) = 0;
 	  cpp_error (pfile, CPP_DL_PEDWARN, "'$' in identifier or number");
 	}
+      NORMALIZE_STATE_UPDATE_IDNUM (nst);
    }
  else if (identifier_pos)
    {
-      int validity = ucn_valid_in_identifier (pfile, result);
+      int validity = ucn_valid_in_identifier (pfile, result, nst);

      if (validity == 0)
 	cpp_error (pfile, CPP_DL_ERROR,
@@ -950,9 +1039,10 @@ convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
  int rval;
  struct cset_converter cvt
    = wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
+  struct normalize_state nst = INITIAL_NORMALIZE_STATE;

  from++;  /* Skip u/U.  */
-  ucn = _cpp_valid_ucn (pfile, &from, limit, 0);
+  ucn = _cpp_valid_ucn (pfile, &from, limit, 0, &nst);

  rval = one_cppchar_to_utf8 (ucn, &bufp, &bytesleft);
  if (rval)
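
Note on the charset.c changes above: the new table stores only the last character of each range, so the lookup in ucn_valid_in_identifier always lands on exactly one entry and no longer needs a "not found" exit. A minimal sketch of that search over a made-up five-entry table follows. The table contents, the flag values and the name find_range are assumptions for illustration; the real table is generated into ucnid.h.

```c
#include <assert.h>
#include <stddef.h>

struct range
{
  unsigned char flags;   /* Placeholder flag: 1 = "interesting", 0 = not.  */
  unsigned short end;    /* Last character covered by this entry.  */
};

/* Hypothetical miniature table.  Entries are sorted by END and together
   cover 0..0xFFFF; each entry starts just after the previous END.  */
static const struct range ranges[] = {
  { 0, 0x00A9 },
  { 1, 0x00AA },
  { 0, 0x00BF },
  { 1, 0x00D6 },
  { 0, 0xFFFF }
};

/* Return the index of the entry whose range contains C, using the same
   loop shape as the patch: narrow [mn, mx] until the two meet.  */
size_t
find_range (unsigned long c)
{
  size_t mn = 0;
  size_t mx = sizeof ranges / sizeof ranges[0] - 1;

  assert (c <= 0xFFFF);   /* The patch returns 0 early for larger values.  */
  while (mx != mn)
    {
      size_t md = (mn + mx) / 2;
      if (c <= ranges[md].end)
	mx = md;
      else
	mn = md + 1;
    }
  return mn;
}
```

Because the final entry ends at 0xFFFF, every value that passes the range check falls inside some entry, which is why the loop can terminate on mx == mn alone.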
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -236,6 +236,19 @@ typedef CPPCHAR_SIGNED_T cppchar_signed_t;
 /* Style of header dependencies to generate.  */
 enum cpp_deps_style { DEPS_NONE = 0, DEPS_USER, DEPS_SYSTEM };

+/* The possible normalization levels, from most restrictive to least.  */
+enum cpp_normalize_level {
+  /* In NFKC.  */
+  normalized_KC = 0,
+  /* In NFC.  */
+  normalized_C,
+  /* In NFC, except for subsequences where being in NFC would make
+     the identifier invalid.  */
+  normalized_identifier_C,
+  /* Not normalized at all.  */
+  normalized_none
+};
+
 /* This structure is nested inside struct cpp_reader, and
   carries all the options visible to the command line.  */
 struct cpp_options
@@ -373,6 +386,10 @@ struct cpp_options
  /* Holds the name of the input character set.  */
  const char *input_charset;

+  /* The minimum permitted level of normalization before a warning
+     is generated.  */
+  enum cpp_normalize_level warn_normalize;
+
  /* True to warn about precompiled header files we couldn't use.  */
  bool warn_invalid_pch;

--- a/libcpp/init.c
+++ b/libcpp/init.c
@@ -153,6 +153,7 @@ cpp_create_reader (enum c_lang lang, hash_table *table,
  CPP_OPTION (pfile, dollars_in_ident) = 1;
  CPP_OPTION (pfile, warn_dollars) = 1;
  CPP_OPTION (pfile, warn_variadic_macros) = 1;
+  CPP_OPTION (pfile, warn_normalize) = normalized_C;

  /* Default CPP arithmetic to something sensible for the host for the
     benefit of dumb users like fix-header.  */
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -564,8 +564,31 @@ extern unsigned char *_cpp_copy_replacement_text (const cpp_macro *,
 extern size_t _cpp_replacement_text_len (const cpp_macro *);

 /* In charset.c.  */
+
+/* The normalization state at this point in the sequence.
+   It starts initialized to all zeros, and at the end
+   'level' is the normalization level of the sequence.  */
+
+struct normalize_state 
+{
+  /* The previous character.  */
+  cppchar_t previous;
+  /* The combining class of the previous character.  */
+  unsigned char prev_class;
+  /* The lowest normalization level so far.  */
+  enum cpp_normalize_level level;
+};
+#define INITIAL_NORMALIZE_STATE { 0, 0, normalized_KC }
+#define NORMALIZE_STATE_RESULT(st) ((st)->level)
+
+/* We saw a character that matches ISIDNUM(), update a
+   normalize_state appropriately.  */
+#define NORMALIZE_STATE_UPDATE_IDNUM(st) \
+  ((st)->previous = 0, (st)->prev_class = 0)
+
 extern cppchar_t _cpp_valid_ucn (cpp_reader *, const unsigned char **,
-				 const unsigned char *, int);
+				 const unsigned char *, int,
+				 struct normalize_state *state);
 extern void _cpp_destroy_iconv (cpp_reader *);
 extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
 					  unsigned char *, size_t, size_t,
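
Note on struct normalize_state above: the prev_class field supports an ordering check. In charset.c, a character whose nonzero combining class is lower than that of the character before it drops the level to normalized_none, and a class of 0 resets the comparison. The sketch below isolates that rule. The struct order_state and both function names are hypothetical, and the caller passes combining classes directly rather than looking them up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced state: just the previous combining class and
   whether every character so far has been in order.  */
struct order_state
{
  unsigned char prev_class;
  int ordered;
};

/* Feed one character's combining class CLS into the state.  */
void
order_update (struct order_state *st, unsigned char cls)
{
  assert (st != NULL);
  /* A nonzero class lower than its predecessor is out of order.  */
  if (cls != 0 && cls < st->prev_class)
    st->ordered = 0;
  st->prev_class = cls;
}

/* Return nonzero if the N combining classes in CLASSES are in order.  */
int
classes_in_order (const unsigned char *classes, size_t n)
{
  struct order_state st = { 0, 1 };
  size_t i;

  for (i = 0; i < n; i++)
    order_update (&st, classes[i]);
  return st.ordered;
}
```

For example, the class sequence 0, 18, 19, 19, 20, 0 is in order, while 0, 20, 19, 18, 19, 0 is not. That is the pattern the two Hebrew-point lines in the normalize tests appear to exercise, assuming the usual combining classes for those marks.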
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -53,9 +53,6 @@ static const struct token_spelling token_spellings[N_TTYPES] = { TTYPE_TABLE };
 static void add_line_note (cpp_buffer *, const uchar *, unsigned int);
 static int skip_line_comment (cpp_reader *);
 static void skip_whitespace (cpp_reader *, cppchar_t);
-static cpp_hashnode *lex_identifier (cpp_reader *, const uchar *, bool);
-static void lex_number (cpp_reader *, cpp_string *);
-static bool forms_identifier_p (cpp_reader *, int);
 static void lex_string (cpp_reader *, cpp_token *, const uchar *);
 static void save_comment (cpp_reader *, cpp_token *, const uchar *, cppchar_t);
 static void create_literal (cpp_reader *, cpp_token *, const uchar *,
@@ -430,10 +427,36 @@ name_p (cpp_reader *pfile, const cpp_string *string)
  return 1;
 }

+/* After parsing an identifier or other sequence, produce a warning about
+   sequences not in NFC/NFKC.  */
+static void
+warn_about_normalization (cpp_reader *pfile, 
+			  const cpp_token *token,
+			  const struct normalize_state *s)
+{
+  if (CPP_OPTION (pfile, warn_normalize) < NORMALIZE_STATE_RESULT (s)
+      && !pfile->state.skipping)
+    {
+      /* Make sure that the token is printed using UCNs, even
+	 if we'd otherwise happily print UTF-8.  */
+      unsigned char *buf = xmalloc (cpp_token_len (token));
+      size_t sz;
+
+      sz = cpp_spell_token (pfile, token, buf, false) - buf;
+      if (NORMALIZE_STATE_RESULT (s) == normalized_C)
+	cpp_error_with_line (pfile, CPP_DL_WARNING, token->src_loc, 0,
+			     "`%.*s' is not in NFKC", sz, buf);
+      else
+	cpp_error_with_line (pfile, CPP_DL_WARNING, token->src_loc, 0,
+			     "`%.*s' is not in NFC", sz, buf);
+    }
+}
+
 /* Returns TRUE if the sequence starting at buffer->cur is invalid in
   an identifier.  FIRST is TRUE if this starts an identifier.  */
 static bool
-forms_identifier_p (cpp_reader *pfile, int first)
+forms_identifier_p (cpp_reader *pfile, int first,
+		    struct normalize_state *state)
 {
  cpp_buffer *buffer = pfile->buffer;

@@ -457,7 +480,8 @@ forms_identifier_p (cpp_reader *pfile, int first)
      && (buffer->cur[1] == 'u' || buffer->cur[1] == 'U'))
    {
      buffer->cur += 2;
-      if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first))
+      if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
+			  state))
 	return true;
      buffer->cur -= 2;
    }
@@ -467,7 +491,8 @@ forms_identifier_p (cpp_reader *pfile, int first)

 /* Lex an identifier starting at BUFFER->CUR - 1.  */
 static cpp_hashnode *
-lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn)
+lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
+		struct normalize_state *nst)
 {
  cpp_hashnode *result;
  const uchar *cur;
@@ -482,13 +507,16 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn)
 	cur++;
      }
  pfile->buffer->cur = cur;
-  if (starts_ucn || forms_identifier_p (pfile, false))
+  if (starts_ucn || forms_identifier_p (pfile, false, nst))
    {
      /* Slower version for identifiers containing UCNs (or $).  */
      do {
 	while (ISIDNUM (*pfile->buffer->cur))
-	  pfile->buffer->cur++;
-      } while (forms_identifier_p (pfile, false));
+	  {
+	    pfile->buffer->cur++;
+	    NORMALIZE_STATE_UPDATE_IDNUM (nst);
+	  }
+      } while (forms_identifier_p (pfile, false, nst));
      result = _cpp_interpret_identifier (pfile, base,
 					  pfile->buffer->cur - base);
    }
@@ -524,7 +552,8 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn)

 /* Lex a number to NUMBER starting at BUFFER->CUR - 1.  */
 static void
-lex_number (cpp_reader *pfile, cpp_string *number)
+lex_number (cpp_reader *pfile, cpp_string *number,
+	    struct normalize_state *nst)
 {
  const uchar *cur;
  const uchar *base;
@@ -537,11 +566,14 @@ lex_number (cpp_reader *pfile, cpp_string *number)

      /* N.B. ISIDNUM does not include $.  */
      while (ISIDNUM (*cur) || *cur == '.' || VALID_SIGN (*cur, cur[-1]))
-	cur++;
+	{
+	  cur++;
+	  NORMALIZE_STATE_UPDATE_IDNUM (nst);
+	}

      pfile->buffer->cur = cur;
    }
-  while (forms_identifier_p (pfile, false));
+  while (forms_identifier_p (pfile, false, nst));

  number->len = cur - base;
  dest = _cpp_unaligned_alloc (pfile, number->len + 1);
@@ -897,9 +929,13 @@ _cpp_lex_direct (cpp_reader *pfile)

    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
-      result->type = CPP_NUMBER;
-      lex_number (pfile, &result->val.str);
-      break;
+      {
+	struct normalize_state nst = INITIAL_NORMALIZE_STATE;
+	result->type = CPP_NUMBER;
+	lex_number (pfile, &result->val.str, &nst);
+	warn_about_normalization (pfile, result, &nst);
+	break;
+      }

    case 'L':
      /* 'L' may introduce wide characters or strings.  */
@@ -922,7 +958,12 @@ _cpp_lex_direct (cpp_reader *pfile)
    case 'S': case 'T': case 'U': case 'V': case 'W': case 'X':
    case 'Y': case 'Z':
      result->type = CPP_NAME;
-      result->val.node = lex_identifier (pfile, buffer->cur - 1, false);
+      {
+	struct normalize_state nst = INITIAL_NORMALIZE_STATE;
+	result->val.node = lex_identifier (pfile, buffer->cur - 1, false,
+					   &nst);
+	warn_about_normalization (pfile, result, &nst);
+      }

      /* Convert named operators to their proper types.  */
      if (result->val.node->flags & NODE_OPERATOR)
@@ -1067,8 +1108,10 @@ _cpp_lex_direct (cpp_reader *pfile)
      result->type = CPP_DOT;
      if (ISDIGIT (*buffer->cur))
 	{
+	  struct normalize_state nst = INITIAL_NORMALIZE_STATE;
 	  result->type = CPP_NUMBER;
-	  lex_number (pfile, &result->val.str);
+	  lex_number (pfile, &result->val.str, &nst);
+	  warn_about_normalization (pfile, result, &nst);
 	}
      else if (*buffer->cur == '.' && buffer->cur[1] == '.')
 	buffer->cur += 2, result->type = CPP_ELLIPSIS;
@@ -1151,11 +1194,13 @@ _cpp_lex_direct (cpp_reader *pfile)
    case '\\':
      {
 	const uchar *base = --buffer->cur;
+	struct normalize_state nst = INITIAL_NORMALIZE_STATE;

-	if (forms_identifier_p (pfile, true))
+	if (forms_identifier_p (pfile, true, &nst))
 	  {
 	    result->type = CPP_NAME;
-	    result->val.node = lex_identifier (pfile, base, true);
+	    result->val.node = lex_identifier (pfile, base, true, &nst);
+	    warn_about_normalization (pfile, result, &nst);
 	    break;
 	  }
 	buffer->cur++;
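
Note on warn_about_normalization above: the decision reduces to one comparison between the configured level and the level computed for the token, followed by a choice of message. The sketch below restates that logic in isolation. The enum normalize_warning and the name classify_warning are hypothetical, and the check of pfile->state.skipping is left out.

```c
#include <assert.h>

/* Mirrors enum cpp_normalize_level from this commit.  */
enum cpp_normalize_level {
  normalized_KC = 0,
  normalized_C,
  normalized_identifier_C,
  normalized_none
};

/* Hypothetical result type standing in for the two warning texts.  */
enum normalize_warning {
  WARN_NOTHING,
  WARN_NOT_NFKC,
  WARN_NOT_NFC
};

/* WARN_NORMALIZE is the option setting; RESULT is the level reached by
   the token.  A warning is due only when the token is less normalized
   than the option requires.  */
enum normalize_warning
classify_warning (enum cpp_normalize_level warn_normalize,
		  enum cpp_normalize_level result)
{
  assert (warn_normalize <= normalized_none && result <= normalized_none);
  if (!(warn_normalize < result))
    return WARN_NOTHING;
  /* A token that is in NFC can only have failed the stricter NFKC
     requirement; anything worse is reported as not in NFC.  */
  return result == normalized_C ? WARN_NOT_NFKC : WARN_NOT_NFC;
}
```

With the default setting normalized_C, a token that reaches only normalized_identifier_C is reported as not in NFC, while under -Wnormalized=id the same token passes silently. That is consistent with the expectations in normalize-1.c and normalize-3.c.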
--- a/libcpp/makeucnid.c
+++ b/libcpp/makeucnid.c
@@ -0,0 +1,342 @@
+/* Make ucnid.h from various sources.
+   Copyright (C) 2005 Free Software Foundation, Inc.
+
+This program is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 2, or (at your option) any
+later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.  */
+
+/* Run this program as
+   ./makeucnid ucnid.tab UnicodeData.txt DerivedNormalizationProps.txt \
+       > ucnid.h
+*/
+
+#include <stdio.h>
+#include <string.h>
+#include <ctype.h>
+#include <stdbool.h>
+#include <stdlib.h>
+
+enum {
+  C99 = 1,
+  CXX = 2,
+  digit = 4,
+  not_NFC = 8,
+  not_NFKC = 16,
+  maybe_not_NFC = 32
+};
+
+static unsigned flags[65536];
+static unsigned short decomp[65536][2];
+static unsigned char combining_value[65536];
+
+/* Print S to stderr and exit with a failure status.  */
+
+static void
+fail (const char *s)
+{
+  fprintf (stderr, "%s\n", s);
+  exit (1);
+}
+
+/* Read ucnid.tab and set the C99 and CXX flags in flags[].  */
+
+static void
+read_ucnid (const char *fname)
+{
+  FILE *f = fopen (fname, "r");
+  unsigned fl = 0;
+  
+  if (!f)
+    fail ("opening ucnid.tab");
+  for (;;)
+    {
+      char line[256];
+
+      if (!fgets (line, sizeof (line), f))
+	break;
+      if (strcmp (line, "[C99]\n") == 0)
+	fl = C99;
+      else if (strcmp (line, "[CXX]\n") == 0)
+	fl = CXX;
+      else if (isxdigit (line[0]))
+	{
+	  char *l = line;
+	  while (*l)
+	    {
+	      unsigned long start, end;
+	      char *endptr;
+	      start = strtoul (l, &endptr, 16);
+	      if (endptr == l || (*endptr != '-' && ! isspace (*endptr)))
+		fail ("parsing ucnid.tab [1]");
+	      l = endptr;
+	      if (*l != '-')
+		end = start;
+	      else
+		{
+		  end = strtoul (l + 1, &endptr, 16);
+		  if (end < start)
+		    fail ("parsing ucnid.tab, end before start");
+		  l = endptr;
+		  if (! isspace (*l))
+		    fail ("parsing ucnid.tab, junk after range");
+		}
+	      while (isspace (*l))
+		l++;
+	      if (end > 0xFFFF)
+		fail ("parsing ucnid.tab, end too large");
+	      while (start <= end)
+		flags[start++] |= fl;
+	    }
+	}
+    }
+  if (ferror (f))
+    fail ("reading ucnid.tab");
+  fclose (f);
+}
+
+/* Read UnicodeData.txt and set the 'digit' flag, and
+   also fill in the 'decomp' table to be the decompositions of
+   characters for which both the character decomposed and all the code
+   points in the decomposition are either C99 or CXX.  */
+
+static void
+read_table (const char *fname)
+{
+  FILE * f = fopen (fname, "r");
+  
+  if (!f)
+    fail ("opening UnicodeData.txt");
+  for (;;)
+    {
+      char line[256];
+      unsigned long codepoint, this_decomp[4];
+      char *l;
+      int i;
+      int decomp_useful;
+
+      if (!fgets (line, sizeof (line), f))
+	break;
+      codepoint = strtoul (line, &l, 16);
+      if (l == line || *l != ';')
+	fail ("parsing UnicodeData.txt, reading code point");
+      if (codepoint > 0xffff || ! (flags[codepoint] & (C99 | CXX)))
+	continue;
+
+      do {
+	l++;
+      } while (*l != ';');
+      /* Category value; things starting with 'N' are numbers of some
+	 kind.  */
+      if (*++l == 'N')
+	flags[codepoint] |= digit;
+
+      do {
+	l++;
+      } while (*l != ';');
+      /* Canonical combining class; in NFC/NFKC, the classes of consecutive
+	 combining characters must be non-decreasing (zero starts a new run).  */
+      if (! isdigit (*++l))
+	fail ("parsing UnicodeData.txt, combining class not number");
+      combining_value[codepoint] = strtoul (l, &l, 10);
+      if (*l++ != ';')
+	fail ("parsing UnicodeData.txt, junk after combining class");
+	
+      /* Skip over bidi value.  */
+      do {
+	l++;
+      } while (*l != ';');
+      
+      /* Decomposition mapping.  */
+      decomp_useful = flags[codepoint];
+      if (*++l == '<')  /* Compatibility mapping. */
+	continue;
+      for (i = 0; i < 4; i++)
+	{
+	  if (*l == ';')
+	    break;
+	  if (!isxdigit (*l))
+	    fail ("parsing UnicodeData.txt, decomposition format");
+	  this_decomp[i] = strtoul (l, &l, 16);
+	  decomp_useful &= flags[this_decomp[i]];
+	  while (isspace (*l))
+	    l++;
+	}
+      if (i > 2)  /* Decomposition too long.  */
+	fail ("parsing UnicodeData.txt, decomposition too long");
+      if (decomp_useful)
+	while (--i >= 0)
+	  decomp[codepoint][i] = this_decomp[i];
+    }
+  if (ferror (f))
+    fail ("reading UnicodeData.txt");
+  fclose (f);
+}
+
+/* Read DerivedNormalizationProps.txt and set the flags that say whether
+   a character is in NFC, NFKC, or is context-dependent.  */
+
+static void
+read_derived (const char *fname)
+{
+  FILE * f = fopen (fname, "r");
+  
+  if (!f)
+    fail ("opening DerivedNormalizationProps.txt");
+  for (;;)
+    {
+      char line[256];
+      unsigned long start, end;
+      char *l;
+      bool not_NFC_p, not_NFKC_p, maybe_not_NFC_p;
+
+      if (!fgets (line, sizeof (line), f))
+	break;
+      not_NFC_p = (strstr (line, "; NFC_QC; N") != NULL);
+      not_NFKC_p = (strstr (line, "; NFKC_QC; N") != NULL);
+      maybe_not_NFC_p = (strstr (line, "; NFC_QC; M") != NULL);
+      if (! not_NFC_p && ! not_NFKC_p && ! maybe_not_NFC_p)
+	continue;
+	
+      start = strtoul (line, &l, 16);
+      if (l == line)
+	fail ("parsing DerivedNormalizationProps.txt, reading start");
+      if (start > 0xffff)
+	continue;
+      if (*l == '.' && l[1] == '.')
+	end = strtoul (l + 2, &l, 16);
+      else
+	end = start;
+
+      while (start <= end)
+	flags[start++] |= ((not_NFC_p ? not_NFC : 0) 
+			   | (not_NFKC_p ? not_NFKC : 0)
+			   | (maybe_not_NFC_p ? maybe_not_NFC : 0)
+			   );
+    }
+  if (ferror (f))
+    fail ("reading DerivedNormalizationProps.txt");
+  fclose (f);
+}
+
+/* Write out the table.
+   Each entry holds three values: the flags, the canonical combining class,
+   and the last Unicode code point (inclusive) that the entry covers.  */
+
+static void
+write_table (void)
+{
+  unsigned i;
+  unsigned last_flag = flags[0];
+  bool really_safe = decomp[0][0] == 0;
+  unsigned char last_combine = combining_value[0];
+  
+  for (i = 1; i <= 65536; i++)
+    if (i == 65536
+	|| (flags[i] != last_flag && ((flags[i] | last_flag) & (C99 | CXX)))
+	|| really_safe != (decomp[i][0] == 0)
+	|| combining_value[i] != last_combine)
+      {
+	printf ("{ %s|%s|%s|%s|%s|%s|%s, %3d, %#06x },\n",
+		last_flag & C99 ? "C99" : "  0",
+		last_flag & digit ? "DIG" : "  0",
+		last_flag & CXX ? "CXX" : "  0",
+		really_safe ? "CID" : "  0",
+		last_flag & not_NFC ? "  0" : "NFC",
+		last_flag & not_NFKC ? "  0" : "NKC",
+		last_flag & maybe_not_NFC ? "CTX" : "  0",
+		combining_value[i - 1],
+		i - 1);
+	last_flag = i < 65536 ? flags[i] : 0;
+	last_combine = i < 65536 ? combining_value[i] : 0;
+	really_safe = i < 65536 && decomp[i][0] == 0;
+      }
+}
+
+/* Print out the huge copyright notice.  */
+
+static void
+write_copyright (void)
+{
+  static const char copyright[] = "\
+/* Unicode characters and various properties.\n\
+   Copyright (C) 2003, 2005 Free Software Foundation, Inc.\n\
+\n\
+   This program is free software; you can redistribute it and/or modify it\n\
+   under the terms of the GNU General Public License as published by the\n\
+   Free Software Foundation; either version 2, or (at your option) any\n\
+   later version.\n\
+\n\
+   This program is distributed in the hope that it will be useful,\n\
+   but WITHOUT ANY WARRANTY; without even the implied warranty of\n\
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n\
+   GNU General Public License for more details.\n\
+\n\
+   You should have received a copy of the GNU General Public License\n\
+   along with this program; if not, write to the Free Software\n\
+   Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.\n\
+\n\
+\n\
+   Copyright (C) 1991-2005 Unicode, Inc.  All rights reserved.\n\
+   Distributed under the Terms of Use in\n\
+   http://www.unicode.org/copyright.html.\n\
+\n\
+   Permission is hereby granted, free of charge, to any person\n\
+   obtaining a copy of the Unicode data files and any associated\n\
+   documentation (the \"Data Files\") or Unicode software and any\n\
+   associated documentation (the \"Software\") to deal in the Data Files\n\
+   or Software without restriction, including without limitation the\n\
+   rights to use, copy, modify, merge, publish, distribute, and/or\n\
+   sell copies of the Data Files or Software, and to permit persons to\n\
+   whom the Data Files or Software are furnished to do so, provided\n\
+   that (a) the above copyright notice(s) and this permission notice\n\
+   appear with all copies of the Data Files or Software, (b) both the\n\
+   above copyright notice(s) and this permission notice appear in\n\
+   associated documentation, and (c) there is clear notice in each\n\
+   modified Data File or in the Software as well as in the\n\
+   documentation associated with the Data File(s) or Software that the\n\
+   data or software has been modified.\n\
+\n\
+   THE DATA FILES AND SOFTWARE ARE PROVIDED \"AS IS\", WITHOUT WARRANTY\n\
+   OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE\n\
+   WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND\n\
+   NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE\n\
+   COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR\n\
+   ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY\n\
+   DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,\n\
+   WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS\n\
+   ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE\n\
+   OF THE DATA FILES OR SOFTWARE.\n\
+\n\
+   Except as contained in this notice, the name of a copyright holder\n\
+   shall not be used in advertising or otherwise to promote the sale,\n\
+   use or other dealings in these Data Files or Software without prior\n\
+   written authorization of the copyright holder.  */\n";
+   
+   puts (copyright);
+}
+
+/* Main program.  */
+
+int
+main(int argc, char ** argv)
+{
+  if (argc != 4)
+    fail ("wrong number of arguments to makeucnid");
+  read_ucnid (argv[1]);
+  read_table (argv[2]);
+  read_derived (argv[3]);
+
+  write_copyright ();
+  write_table ();
+  return 0;
+}
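Review note: write_table above emits run-length entries, each covering every code point up to and including the entry's last value. A consumer can find the entry for a code point with a binary search on that value. The following is a minimal sketch only; the `struct range_entry` layout, the `F_C99`/`F_CXX` names, `lookup_range`, and the sample rows are all hypothetical illustrations and are not taken from ucnid.h or charset.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry layout mirroring the three printed fields:
   flags, canonical combining class, last code point covered.  */
struct range_entry
{
  unsigned short flags;
  unsigned char combine;
  unsigned short end;
};

/* Hypothetical flag bits, for illustration only.  */
enum { F_C99 = 1, F_CXX = 2 };

/* Made-up sample rows, sorted by END; the last row must end at 0xffff
   so that every 16-bit code point falls in some entry.  */
static const struct range_entry sample_table[] = {
  { 0,     0, 0x00a9 },
  { F_C99, 0, 0x00aa },
  { 0,     0, 0x00b4 },
  { F_C99, 0, 0x00b5 },
  { 0,     0, 0xffff },
};

/* Return the first entry whose END is >= C, which is the entry
   that covers C.  */
static const struct range_entry *
lookup_range (const struct range_entry *table, size_t n, unsigned c)
{
  size_t lo = 0, hi = n;

  assert (n > 0 && table[n - 1].end == 0xffff);
  assert (c <= 0xffff);
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (table[mid].end < c)
	lo = mid + 1;
      else
	hi = mid;
    }
  return &table[lo];
}
```

Storing only the inclusive end of each run keeps entries small, and the search never has to handle a gap because each entry implicitly starts one past the previous entry's end.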
--- a/libcpp/ucnid.h
+++ b/libcpp/ucnid.h
--- a/libcpp/ucnid.pl
+++ b/libcpp/ucnid.pl
@@ -1,130 +0,0 @@
-#! /usr/bin/perl -w
-use strict;
-
-# Convert cppucnid.tab to cppucnid.h.  We use two arrays of length
-# 65536 to represent the table, since this is nice and simple.  The
-# first array holds the tags indicating which ranges are valid in
-# which contexts.  The second array holds the language name associated
-# with each element.
-
-our(@tags, @names);
-@tags = ("") x 65536;
-@names = ("") x 65536;
-
-
-# Array mapping tag numbers to standard #defines
-our @stds;
-
-# Current standard and language
-our($curstd, $curlang);
-
-# First block of the file is a template to be saved for later.
-our @template;
-
-while (<>) {
-    chomp;
-    last if $_ eq '%%';
-    push @template, $_;
-};
-
-# Second block of the file is the UCN tables.
-# The format looks like this:
-#
-# [std]
-#
-# ; language
-# xxxx-xxxx xxxx xxxx-xxxx ....
-#
-# with comment lines starting with #.
-
-while (<>) {
-    chomp;
-    /^#/ and next;
-    /^\s*$/ and next;
-    /^\[(.+)\]$/ and do {
-	$curstd = $1;
- 	next;
-    };
-    /^; (.+)$/ and do {
-	$curlang = $1;
-	next;
-    };
-
-    process_range(split);
-}
-
-# Print out the template, inserting as requested.
-$\ = "\n";
-for (@template) {
-    print("/* Automatically generated from cppucnid.tab, do not edit */"),
-        next if $_ eq "[dne]";
-    print_table(), next if $_ eq "[table]";
-    print;
-}
-
-sub print_table {
-    my($lo, $hi);
-    my $prevname = "";
-
-    for ($lo = 0; $lo <= $#tags; $lo = $hi) {
-	$hi = $lo;
-	$hi++ while $hi <= $#tags
-	    && $tags[$hi] eq $tags[$lo]
-	    && $names[$hi] eq $names[$lo];
-
-	# Range from $lo to $hi-1.
-	# Don't make entries for ranges that are not valid idchars.
-	next if ($tags[$lo] eq "");
-	my $tag = $tags[$lo];
-        $tag = "    ".$tag if $tag =~ /^C99/;
-
-	if ($names[$lo] eq $prevname) {
-	    printf("  { 0x%04x, 0x%04x, %-11s },\n",
-		   $lo, $hi-1, $tag);
-	} else {
-	    printf("  { 0x%04x, 0x%04x, %-11s },  /* %s */\n",
-		   $lo, $hi-1, $tag, $names[$lo]);
-	}
-	$prevname = $names[$lo];
-    }
-}
-
-# The line is a list of four-digit hexadecimal numbers or
-# pairs of such numbers.  Each is a valid identifier character
-# from the given language, under the given standard.
-sub process_range {
-    for my $range (@_) {
-	if ($range =~ /^[0-9a-f]{4}$/) {
-	    my $i = hex($range);
-	    if ($tags[$i] eq "") {
-		$tags[$i] = $curstd;
-	    } else {
-		$tags[$i] = $curstd . "|" . $tags[$i];
-	    }
-	    if ($names[$i] ne "" && $names[$i] ne $curlang) {
-		warn sprintf ("language overlap: %s/%s at %x (tag %d)",
-			      $names[$i], $curlang, $i, $tags[$i]);
-		next;
-	    }
-	    $names[$i] = $curlang;
-	} elsif ($range =~ /^ ([0-9a-f]{4}) - ([0-9a-f]{4}) $/x) {
-	    my ($start, $end) = (hex($1), hex($2));
-	    my $i;
-	    for ($i = $start; $i <= $end; $i++) {
-		if ($tags[$i] eq "") {
-		    $tags[$i] = $curstd;
-		} else {
-		    $tags[$i] = $curstd . "|" . $tags[$i];
-		}
-		if ($names[$i] ne "" && $names[$i] ne $curlang) {
-		    warn sprintf ("language overlap: %s/%s at %x (tag %d)",
-				  $names[$i], $curlang, $i, $tags[$i]);
-		    next;
-		}
-		$names[$i] = $curlang;
-	    }
-	} else {
-	    warn "malformed range expression $range";
-	}
-    }
-}
--- a/libcpp/ucnid.tab
+++ b/libcpp/ucnid.tab
@@ -1,47 +1,25 @@
-/* Table of UCNs which are valid in identifiers.
-   Copyright (C) 2003 Free Software Foundation, Inc.
-
-This program is free software; you can redistribute it and/or modify it
-under the terms of the GNU General Public License as published by the
-Free Software Foundation; either version 2, or (at your option) any
-later version.
-
-This program is distributed in the hope that it will be useful,
-but WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-GNU General Public License for more details.
-
-You should have received a copy of the GNU General Public License
-along with this program; if not, write to the Free Software
-Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.  */
-
-[dne]
-
-/* This file reproduces the table in ISO/IEC 9899:1999 (C99) Annex
-   D, which is itself a reproduction from ISO/IEC TR 10176:1998, and
-   the similar table from ISO/IEC 14882:1988 (C++98) Annex E, which is
-   a reproduction of ISO/IEC PDTR 10176.  Unfortunately these tables
-   are not identical.  */
-
-#ifndef LIBCPP_UCNID_H
-#define LIBCPP_UCNID_H
-
-#define C99 1
-#define CXX 2
-#define DIG 4
-
-struct ucnrange
-{
-  unsigned short lo, hi;
-  unsigned short flags;
-};
-
-static const struct ucnrange ucnranges[] = {
-[table]
-};
-
-#endif /* LIBCPP_UCNID_H */
-%%
+; Table of UCNs which are valid in identifiers.
+; Copyright (C) 2003, 2005 Free Software Foundation, Inc.
+; 
+; This program is free software; you can redistribute it and/or modify it
+; under the terms of the GNU General Public License as published by the
+; Free Software Foundation; either version 2, or (at your option) any
+; later version.
+; 
+; This program is distributed in the hope that it will be useful,
+; but WITHOUT ANY WARRANTY; without even the implied warranty of
+; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+; GNU General Public License for more details.
+; 
+; You should have received a copy of the GNU General Public License
+; along with this program; if not, write to the Free Software
+; Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+; 
+; This file reproduces the table in ISO/IEC 9899:1999 (C99) Annex
+; D, which is itself a reproduction from ISO/IEC TR 10176:1998, and
+; the similar table from ISO/IEC 14882:1998 (C++98) Annex E, which is
+; a reproduction of ISO/IEC PDTR 10176.  Unfortunately these tables
+; are not identical.

 [C99]

@@ -141,7 +119,6 @@ ac00-d7a3
 0b3d 1fbe 203f-2040 2102 2107 210a-2113 2115 2118-211d 2124 2126 2128
 212a-2131 2133-2138 2160-2182 3005-3007 3021-3029

-[C99|DIG]
 ; Digits
 0660-0669 06f0-06f9 0966-096f 09e6-09ef 0a66-0a6f 0ae6-0aef 0b66-0b6f
 0be7-0bef 0c66-0c6f 0ce6-0cef 0d66-0d6f 0e50-0e59 0ed0-0ed9 0f20-0f33
@@ -201,16 +178,12 @@ ac00-d7a3
 ; Malayalam
 0d05-0d0c 0d0e-0d10 0d12-0d28 0d2a-0d39 0d60-0d61

-# CORRECTION: Exclude 0e50-0e59 from the Thai range and make a fake
-# Digits range for it, to match C99.  cppcharset.c knows that C++
-# doesn't distinguish digits from other UCNs valid in identifiers.
 ; Thai
-0e01-0e30 0e32-0e33 0e40-0e46 0e4f-0e49 0e5a-0e5b
+0e01-0e30 0e32-0e33 0e40-0e46 0e4f-0e5b

 ; Digits
 0e50-0e59

-# CORRECTION: Change 0e0d to 0e8d (typo in standard; see C++ DR 131)
 ; Lao
 0e81-0e82 0e84 0e87-0e88 0e8a 0e8d 0e94-0e97 0e99-0e9f 0ea1-0ea3 0ea5
 0ea7 0eaa-0eab 0ead-0eb0 0eb2 0eb3 0ebd 0ec0-0ec4 0ec6
@@ -224,7 +197,6 @@ ac00-d7a3
 ; Katakana
 30a1-30fe

-# CORRECTION: language spelled "Bopmofo" in C++98.
 ; Bopomofo
 3105-312c