The GEDCOM file written by Reunion 10 is improperly encoding
Improper utf-8 encoding for Swedish ä and ö
-
Re: Improper utf-8 encoding for Swedish
This is not improper UTF-8. Unicode text can be stored as a series of code points in either a "composed" or "decomposed" form. Reunion is writing its GEDCOMs in the decomposed form, in which characters such as "
Brad Mohr
https://bradandkathy.com/genealogy/
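To make the composed/decomposed distinction concrete, here is a minimal Python sketch. It is only an illustration and does not show how Reunion or TNG handle text internally; the variable names are made up for the example. It builds ä both ways and shows that the two strings differ at the code point and byte level but are equal after normalization.

```python
import unicodedata

# Composed form: a single code point, LATIN SMALL LETTER A WITH DIAERESIS.
composed = "\u00e4"
# Decomposed form: plain "a" followed by COMBINING DIAERESIS.
decomposed = "a\u0308"

# The two strings display identically but are not equal code point by code point.
print(composed == decomposed)        # False

# Their UTF-8 byte sequences differ as well.
print(composed.encode("utf-8"))      # b'\xc3\xa4'
print(decomposed.encode("utf-8"))    # b'a\xcc\x88'

# Normalizing to a common form (NFC = composed, NFD = decomposed) makes them equal.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```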
-
Re: Improper utf-8 encoding for Swedish
Thanks for the interesting post on Unicode fine points. I understand that two Unicode strings may display identically and that GEDCOMs don't need to use any canonical forms. So now I have another basic question.
After I've exported a GEDCOM from Reunion I upload it to my web site and import it into TNG. Then I want to search for a string that contains a Swedish character such as ä. Some are found and some aren't, and the reason is that the search code doesn't consider that ä has been encoded in more than one way.
What code must be changed so that the search works for all possible byte strings that represent ä?
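One general way to make a search insensitive to composed versus decomposed spellings is to normalize both the stored text and the query to the same form before comparing. The Python sketch below illustrates that idea only; it is not TNG's search code, and `nfc`, `find_matches`, and the sample names are hypothetical.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


def find_matches(names, query):
    """Case-insensitive substring search that ignores composed/decomposed differences."""
    needle = nfc(query).casefold()
    return [name for name in names if needle in nfc(name).casefold()]


# Hypothetical sample data: the same surname stored in both forms.
names = [
    "Sj\u00f6berg",    # composed ö
    "Sjo\u0308berg",   # decomposed ö (o + combining diaeresis)
    "Larsson",
]

hits = find_matches(names, "sj\u00f6")
print(len(hits))  # 2: both spellings of the surname are found
```

The same effect can be had by normalizing the GEDCOM text once before import, so that only one form ever reaches the database.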
-
Re: Improper utf-8 encoding for Swedish
[QUOTE=Paul Johnson;39715]After I've exported a GEDCOM from Reunion I upload it to my web site and import it into TNG. Then I want to search for a string that contains a Swedish character such as
Brad Mohr
https://bradandkathy.com/genealogy/
-
Re: Improper utf-8 encoding for Swedish
Originally posted by bmohr
It sounds like you're using the wrong collation setting for your MySQL tables. The utf8_bin collation makes comparisons codepoint-by-codepoint, so semantically-identical strings won't necessarily match if they're composed differently. In most situations, you would probably want to use the utf8_unicode_ci collation (utf8_general_ci would work, too, but there's no real reason to use it over utf8_unicode_ci these days).
I set up the collation sequence as utf8_swedish_ci. This is required in order for the 3 Swedish characters (
-
Re: Improper utf-8 encoding for Swedish
[QUOTE=Paul Johnson;39718]I set up the collation sequence as utf8_swedish_ci. This is required in order for the 3 Swedish characters (
Brad Mohr
https://bradandkathy.com/genealogy/
-
Re: Improper utf-8 encoding for Swedish
Originally posted by bmohr
I noticed that TNG sets the database connection character set to utf8 only if the browser session charset is UTF-8. Most modern browsers default to UTF-8, but it's worth checking. You might also verify that your database has the same collation settings at the field, table, and database levels.
As long as TNG is set to use UTF-8 as the character set, then TNG will use that for the session charset, and every call to the database invokes the MySQL command
Code:
if ($session_charset == 'UTF-8') @mysql_query("SET NAMES 'utf8'");
so as long as the database and all its tables are set to the right collation, the right queries should be run.
As I noted above, a GEDCOM file I exported from Paul's Reunion database imported, apparently correctly, into one of my TNG testing sites set to utf8_unicode_ci. Using utf8_swedish_ci should not have made a difference.
Roger
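For anyone wanting to check whether an exported GEDCOM actually contains decomposed characters before importing it, a small Python sketch like the following could flag the affected lines. It assumes Python 3.8 or later (for `unicodedata.is_normalized`); the function name `non_nfc_lines` and the sample GEDCOM lines are made up for illustration and are not taken from any real export.

```python
import unicodedata


def non_nfc_lines(lines):
    """Return (line_number, text) pairs for lines that are not in composed form (NFC)."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if not unicodedata.is_normalized("NFC", line)
    ]


# Hypothetical GEDCOM-style lines; a real file would be read with encoding="utf-8".
sample = [
    "0 @I1@ INDI",
    "1 NAME Anna /Sjo\u0308berg/",   # decomposed ö
    "1 NAME Karl /L\u00f6fgren/",    # composed ö
]

flagged = non_nfc_lines(sample)
for number, line in flagged:
    print(number, line)
```

With the sample above, only the second line is reported, since it is the only one that uses a combining diaeresis.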