
utf-8 encoding for Swedish ä and ö

    utf-8 encoding for Swedish ä and ö

    The GEDCOM file written by Reunion 10 is improperly encoding

    #2
    Re: Improper utf-8 encoding for Swedish

    This is not improper UTF-8. Unicode text can be stored as a series of code points in either a "composed" or "decomposed" form. Reunion is writing its GEDCOMs in the decomposed form, in which characters such as "
    Brad Mohr
    https://bradandkathy.com/genealogy/


      #3
      Re: Improper utf-8 encoding for Swedish

      And I'm not sure it matters much - the specific situation Paul has encountered is some odd transfer of data, including filenames with
      Roger Moffat
      http://lisaandroger.com/genealogy/
      http://genealogy.clanmoffat.org/


        #4
        Re: Improper utf-8 encoding for Swedish

        Thanks for the interesting post on Unicode fine points. I understand that two Unicode strings may display identically and that GEDCOMs don't need to use any canonical forms. So now I have another basic question.

        After I've exported a GEDCOM from Reunion I upload it to my web site and import it into TNG. Then I want to search for a string that contains a Swedish character such as ä. Some are found and some aren't, and the reason is that the search code doesn't consider that ä has been encoded in more than one way.

        What code must be changed so that the search works for all possible byte strings that represent ä?
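To make the mismatch concrete, here is a minimal Python sketch of the two encodings of ä discussed in this thread and of a comparison that ignores the difference. It is not TNG's search code (TNG is PHP/MySQL); the helper name `same_text` is hypothetical, and normalizing both sides to NFC is just one assumed way to get a match.

```python
import unicodedata

composed = "\u00e4"      # ä as a single code point, U+00E4
decomposed = "a\u0308"   # "a" followed by U+0308 COMBINING DIAERESIS

# The two forms display identically but are different byte strings in UTF-8.
print(composed.encode("utf-8"))    # b'\xc3\xa4'
print(decomposed.encode("utf-8"))  # b'a\xcc\x88'
print(composed == decomposed)      # False: a plain comparison sees two strings


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(composed, decomposed))  # True
```

The same idea applies to a search: if the stored text and the search term are both brought to one normalization form first, it no longer matters which form the exporter wrote.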


          #5
          Re: Improper utf-8 encoding for Swedish

          [QUOTE=Paul Johnson;39715]After I've exported a GEDCOM from Reunion I upload it to my web site and import it into TNG. Then I want to search for a string that contains a Swedish character such as
          Brad Mohr
          https://bradandkathy.com/genealogy/


            #6
            Re: Improper utf-8 encoding for Swedish

            Originally posted by bmohr
            It sounds like you're using the wrong collation setting for your MySQL tables. The utf8_bin collation makes comparisons codepoint-by-codepoint, so semantically-identical strings won't necessarily match if they're composed differently. In most situations, you would probably want to use the utf8_unicode_ci collation (utf8_general_ci would work, too, but there's no real reason to use it over utf8_unicode_ci these days).
            I set up the collation sequence as utf8_swedish_ci. This is required in order for the 3 Swedish characters (


              #7
              Re: Improper utf-8 encoding for Swedish

              [QUOTE=Paul Johnson;39718]I set up the collation sequence as utf8_swedish_ci. This is required in order for the 3 Swedish characters (
              Brad Mohr
              https://bradandkathy.com/genealogy/


                #8
                Re: Improper utf-8 encoding for Swedish

                Originally posted by bmohr
                I noticed that TNG sets the database connection character set to utf8 only if the browser session charset is UTF-8. Most modern browsers default to UTF-8, but it's worth checking. You might also verify that your database has the same collation settings at the field, table, and database levels.
                As long as TNG is set to use UTF-8 as the character set, then TNG will use that for the session charset, and every call to the database invokes the MySQL command

                Code:
                	if ($session_charset == 'UTF-8')
                	  	@mysql_query("SET NAMES 'utf8'");
                so as long as the database and all its tables are set to the right collation, the right queries should be run.

                As I noted above, a GEDCOM file I exported from Paul's Reunion database imported, apparently correctly, into one of my TNG testing sites set to utf8_unicode_ci. Using utf8_swedish_ci should not have made a difference.

                Roger
                Roger Moffat
                http://lisaandroger.com/genealogy/
                http://genealogy.clanmoffat.org/


                  #9
                  Re: utf-8 encoding for Swedish

                  [QUOTE=Paul Johnson;39702]The GEDCOM file written by Reunion 10 is improperly encoding


                    #10
                    Re: utf-8 encoding for Swedish

                    In my Reunion family file I have two different encodings of the 3 Swedish characters (


                      #11
                      Re: utf-8 encoding for Swedish

                      I'm attaching a .jpg file that shows the first 50 or so lines where my post processor needed to change the UTF-8 encoding in the exported GEDCOM file.
                      Attached Files


                        #12
                        Re: utf-8 encoding for Swedish

                        [QUOTE=Paul Johnson;40004]In my Reunion family file I have two different encodings of the 3 Swedish characters (
                        Surnames Dresch, Eyden, Lunn, Mountfort, Page, Robinson, Ryan, Whitworth, and more.


                          #13
                          Re: utf-8 encoding for Swedish

                          Thanks, Tom. I agree that it's probably best to just run my "post-processor" on any GEDCOM created by Reunion to ensure that all the Unicode characters are in a canonical form.

                          I did try replacing the Swedish characters (


                            #14
                            Re: utf-8 encoding for Swedish

                            I can post the code for the post-processor -- it's very short -- if someone can tell me how to preserve the appearance of the code in a post (e.g. by enclosing in /code's).


                              #15
                              Re: utf-8 encoding for Swedish

                              code for GEDCOM post-processor:
                              Attached Files
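The attached code is not reproduced in this copy of the thread. For readers who want the general idea, the following is a rough Python sketch of what a post-processor of this kind might do. It is not Paul's code: the function names and file paths are hypothetical, and it assumes the GEDCOM is a UTF-8 file and that "canonical form" means Unicode NFC (the composed form).

```python
import unicodedata


def normalize_gedcom_text(text: str) -> str:
    """Return the text with every character sequence in composed (NFC) form."""
    return unicodedata.normalize("NFC", text)


def normalize_gedcom_file(src_path: str, dst_path: str) -> None:
    """Read a UTF-8 GEDCOM file and write an NFC-normalized copy.

    newline="" keeps the file's original line endings untouched.
    """
    with open(src_path, encoding="utf-8", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(normalize_gedcom_text(text))


# A decomposed NAME line as an exporter might write it (hypothetical data).
line = "1 NAME Go\u0308ran /A\u030angstro\u0308m/"
print(normalize_gedcom_text(line) == "1 NAME G\u00f6ran /\u00c5ngstr\u00f6m/")  # True
```

A call such as `normalize_gedcom_file("export.ged", "export-nfc.ged")` (both file names made up here) would then be run on each exported file before uploading it.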
