When ‘Für_Elise.md’ Isn’t the Same as ‘Für_Elise.md’

Introduction

I’m currently in the middle of a “spring-cleaning” phase for my Obsidian vault. That’s why I decided to create a small Python library with functions for analyzing and managing my notes. As part of this project, I wrote a script to help me find specific notes in the vault based on tags, content, titles, and properties. The script worked well: it found the notes and generated wiki-links to them in a new note saved in my Obsidian vault.

But then I noticed some­thing strange. When I clicked on some of the links, Obsid­i­an cre­at­ed a new note instead of open­ing the exist­ing one. This unex­pect­ed behav­ior caught me off guard. After inves­ti­gat­ing, I real­ized the issue was relat­ed to dia­crit­i­cal let­ters in the file­names. But why was this hap­pen­ing?

Why ‘Für_Elise.md’ and ‘Für_Elise.md’ Aren’t Always the Same

Mod­ern oper­at­ing sys­tems like macOS, Win­dows, and Lin­ux allow the use of dia­crit­i­cal let­ters in file­names with­out any issues. Nat­u­ral­ly, I assumed that a file­name like “Für_Elise.md” would be treat­ed the same every­where. But that’s not the case.

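The problem lies in something called Unicode normalization, which is a separate matter from the UTF‑8 encoding of the characters. File names on Windows and Linux usually end up in Normalization Form C (NFC), because that is what keyboards and most applications produce, while macOS has traditionally used Normalization Form D (NFD) for file names. But what’s the difference?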

In NFD, the “D” stands for “decomposed.” This means characters are stored as a base character combined with a diacritical mark. For example, “ü” becomes “u” plus a combining diaeresis, the two dots above the “u”. In contrast, NFC (the “C” stands for “composed”) stores such characters as a single, precomposed code point.
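To make the difference concrete, here is a minimal Python sketch using the standard unicodedata module. It is only an illustration: the sample string and the helper name code_points are my own choices. The string is written with \u escapes so that no editor or terminal can silently change its normalization.

```python
import unicodedata

# "Für_Elise.md" written with the precomposed "ü" (U+00FC)
name = "F\u00fcr_Elise.md"

nfc = unicodedata.normalize("NFC", name)  # composed form
nfd = unicodedata.normalize("NFD", name)  # decomposed form


def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(len(nfc), code_points(nfc)[:3])  # "ü" is one code point
print(len(nfd), code_points(nfd)[:4])  # "u" followed by U+0308
print(nfc == nfd)                      # the two strings compare as different
```

Both strings look identical on screen, but the NFD version is one code point longer, and a plain string comparison treats them as different.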

When I asked Chat­G­PT how to deter­mine the nor­mal­iza­tion of a file­name, it sug­gest­ed the fol­low­ing com­mand:

> $ echo -n  "Für_Elise.md" | cat -p  -A                                                                                                       
Fu\u{308}r_Elise.md

How­ev­er, this only works if I copy the file­name from Find­er and paste it into the ter­mi­nal. If I type the file­name man­u­al­ly or use the tab-key for path auto­com­ple­tion, I get this dif­fer­ent result:

> $ echo -n Für_Elise.md | cat -p -A                                                                                                           
F\u{fc}r_Elise.md

This dif­fer­ence reflects the two nor­mal­iza­tion forms in action. The first exam­ple Fu\u{308}r_Elise.md is in Nor­mal­iza­tion Form D (NFD), where char­ac­ters are stored as a base char­ac­ter plus a com­bin­ing dia­crit­i­cal mark. The sec­ond exam­ple F\u{fc}r_Elise.md is in Nor­mal­iza­tion Form C (NFC), where char­ac­ters are stored as a sin­gle pre­com­posed enti­ty.
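The same check can be done without copying strings around in the terminal. The following is a small sketch, assuming Python 3.8 or newer (unicodedata.is_normalized was added in 3.8). The function name normalization_form is made up for this example.

```python
import unicodedata


def normalization_form(text):
    """Classify text as 'NFC', 'NFD', 'both' or 'neither'."""
    is_nfc = unicodedata.is_normalized("NFC", text)
    is_nfd = unicodedata.is_normalized("NFD", text)
    if is_nfc and is_nfd:
        return "both"     # e.g. plain ASCII, which is identical in both forms
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "neither"      # a mix of precomposed and combining characters


print(normalization_form("Fu\u0308r_Elise.md"))  # decomposed spelling
print(normalization_form("F\u00fcr_Elise.md"))   # precomposed spelling
print(normalization_form("Elise.md"))            # ASCII only
```

Note the “both” case: a name without any diacritical letters is valid NFC and valid NFD at the same time, which is why the problem only shows up for names like “Für_Elise”.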

NFC is the nor­mal­iza­tion form used by Win­dows, Lin­ux, and even macOS in cer­tain con­texts, such as the ter­mi­nal when using e.g. zsh as I do. On the oth­er hand, NFD is the default for macOS at the file sys­tem lev­el.

This discrepancy explains the odd behavior of my script. The script finds the file Fu\u{308}r_Elise.md and builds the wiki-link using the same NFD normalization. However, Obsidian cannot resolve this link. Even though both the link and the file are NFD-normalized, Obsidian seems to work with NFC file names internally. As a result, it fails to find the file and creates a new one, which is then stored with an NFC-normalized name, even on the macOS file system.

Why I have this “Normalization Chaos” on my Computer

I think there is a mix­ture of rea­sons.

1. Obsidian

As I already mentioned above, Obsidian cannot resolve wiki-links that are stored in NFD in the note and refer to files stored in NFD on macOS. Instead of finding the file, Obsidian creates a new one. For example:

  • A link like Fu\u{308}r_Elise (NFD) in a note will fail to resolve to a file named Fu\u{308}r_Elise.md (also NFD) on macOS.
  • How­ev­er, a link like F\u{fc}r_Elise.md (NFC) will cor­rect­ly resolve to the NFD-nor­mal­ized file Fu\u{308}r_Elise.md.

This sug­gests that Obsid­i­an inter­nal­ly defaults to work­ing with NFC-nor­mal­ized names, even when the macOS file sys­tem stores file­names in NFD. The incon­sis­ten­cy caus­es issues with dia­crit­i­cal let­ters in file­names. I will def­i­nite­ly post a bug report about this. 😉
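I don’t know how Obsidian resolves links internally, so the following is not its implementation. It is only a sketch of how a lookup could ignore the NFC/NFD difference by normalizing both sides before comparing. The function names nfc and resolve_link and the sample file list are hypothetical.

```python
import unicodedata


def nfc(text):
    """Return text in Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def resolve_link(link_target, filenames):
    """Return the first filename whose stem matches link_target,
    ignoring the NFC/NFD difference, or None if nothing matches."""
    wanted = nfc(link_target)
    for filename in filenames:
        # Strip the ".md" extension, because wiki-links omit it
        stem = filename[:-3] if filename.endswith(".md") else filename
        if nfc(stem) == wanted:
            return filename
    return None


files = ["Fu\u0308r_Elise.md", "Other.md"]   # first name is NFD
print(resolve_link("F\u00fcr_Elise", files) == files[0])   # NFC link
print(resolve_link("Fu\u0308r_Elise", files) == files[0])  # NFD link
```

With a comparison like this, both spellings of the link find the same file, regardless of how the file name is stored.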

2. macOS

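I’m still not sure why Unicode has several normalization forms (they are defined by Unicode itself, not by the UTF‑8 encoding, and there are four of them: NFC, NFD, NFKC, and NFKD). I also don’t know why Apple chose NFD, while “the others” (Windows and Linux) use NFC.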

That being said, the situation becomes even more complicated in the macOS file system, because file names can be stored in either NFD or NFC. For example, a file created using the touch command in the terminal will be NFC-normalized. Here’s what happens, but keep in mind that you have to paste a copy of the filename from the Finder into the echo line.

> $ touch Für_Elise_touch
> $ echo -n Für_Elise_touch | cat -p -A
F\u{fc}r_Elise_touch

This happens because the terminal app and the shell (e.g., zsh) produce NFC when you type, even on a Mac.

As we’ve already seen, Obsid­i­an also cre­ates files with NFC-nor­mal­ized file­names. How­ev­er, oth­er appli­ca­tions, such as Sub­lime Text, TextE­d­it, or Word, save files with NFD-nor­mal­ized filenames—all of them using the stan­dard “Save File” dia­log box.
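Since different applications leave differently normalized names behind, it can help to list which files in a folder are not NFC. Here is a rough sketch, again assuming Python 3.8 or newer. The function names non_nfc_names and scan_directory are made up, and what os.walk reports depends on how the file system stores the names.

```python
import os
import unicodedata


def non_nfc_names(names):
    """Return the names that are not NFC-normalized."""
    return [n for n in names if not unicodedata.is_normalized("NFC", n)]


def scan_directory(path):
    """Walk path and yield (directory, filename) for every non-NFC filename."""
    for root, _dirs, files in os.walk(path):
        for name in non_nfc_names(files):
            yield root, name


# In-memory example: only the first name uses a combining diaeresis
sample = ["Fu\u0308r_Elise.md", "F\u00fcr_Elise_touch", "README.md"]
print(len(non_nfc_names(sample)))
```

To check a real vault, something like `for root, name in scan_directory("/path/to/vault"): print(os.path.join(root, name))` should do, where the path is a placeholder.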

I created some test files, and for the first time, I noticed how filenames with diacritical letters appear when listed in the terminal (e.g., New_Note, created by Obsidian). I had never consciously noticed these visual differences until I started investigating the “normalization problem.”

Image: how different the 'ü' looks in the output of an ls command for NFC- and NFD-normalized filenames. The NFD letters look as if they were not part of the font.

One of my Test-Scripts

Dur­ing my jour­ney through this prob­lem space, I used var­i­ous scripts and ter­mi­nal com­mands to under­stand what was hap­pen­ing. Some of these I found on the inter­net, while oth­ers were gen­er­at­ed by Chat­G­PT. One script, in par­tic­u­lar, I’d like to intro­duce now as a small add-on.

Using the CLI com­mands men­tioned ear­li­er was always a bit tricky because the strings had to be copied to the right place for test­ing. To sim­pli­fy this, I asked Chat­G­PT to cre­ate a script that ana­lyzes the Uni­code details of a giv­en input. The input can be a text string, a direc­to­ry of files, a sin­gle file, or the con­tent of a file.

Below is the result of this script ana­lyz­ing three key ele­ments:

1. The wiki-link created by my search script in the result note.
Line 3: - [[Für_Elise]]
  '-' -> HYPHEN-MINUS
  ' ' -> SPACE
  '[' -> LEFT SQUARE BRACKET
  '[' -> LEFT SQUARE BRACKET
  'F' -> LATIN CAPITAL LETTER F
  'u' -> LATIN SMALL LETTER U
  '̈' -> COMBINING DIAERESIS
  'r' -> LATIN SMALL LETTER R
  '_' -> LOW LINE
  'E' -> LATIN CAPITAL LETTER E
  'l' -> LATIN SMALL LETTER L
  'i' -> LATIN SMALL LETTER I
  's' -> LATIN SMALL LETTER S
  'e' -> LATIN SMALL LETTER E
  ']' -> RIGHT SQUARE BRACKET
  ']' -> RIGHT SQUARE BRACKET

2. The name of the original NFD-normalized file.

Filename: Für_Elise.md (file name)
  'F' -> LATIN CAPITAL LETTER F
  'u' -> LATIN SMALL LETTER U
  '̈' -> COMBINING DIAERESIS
  'r' -> LATIN SMALL LETTER R
  '_' -> LOW LINE
  'E' -> LATIN CAPITAL LETTER E
  'l' -> LATIN SMALL LETTER L
  'i' -> LATIN SMALL LETTER I
  's' -> LATIN SMALL LETTER S
  'e' -> LATIN SMALL LETTER E
  '.' -> FULL STOP
  'm' -> LATIN SMALL LETTER M
  'd' -> LATIN SMALL LETTER D

3. The new­ly cre­at­ed NFC file name gen­er­at­ed when click­ing the wiki-link.

Filename: Für_Elise.md
  'F' -> LATIN CAPITAL LETTER F
  'ü' -> LATIN SMALL LETTER U WITH DIAERESIS
  'r' -> LATIN SMALL LETTER R
  '_' -> LOW LINE
  'E' -> LATIN CAPITAL LETTER E
  'l' -> LATIN SMALL LETTER L
  'i' -> LATIN SMALL LETTER I
  's' -> LATIN SMALL LETTER S
  'e' -> LATIN SMALL LETTER E
  '.' -> FULL STOP
  'm' -> LATIN SMALL LETTER M
  'd' -> LATIN SMALL LETTER D

And here is the script:

#!/usr/bin/python3
import unicodedata
import os
import argparse

def analyze_text(text):
    """
    Analyze the Unicode details of each character in the text.

    Args:
        text (str): The input text to analyze.
    """
    for char in text:
        try:
            unicode_name = unicodedata.name(char)
        except ValueError:
            unicode_name = "UNKNOWN CHARACTER"
        print(f"  '{char}' -> {unicode_name}")

def analyze_filenames(input_path):
    """
    Analyze filenames for a given path. If the path is a file, analyze its filename.
    If it is a directory, analyze all filenames in the directory.

    Args:
        input_path (str): Path to the file or directory.
    """
    if os.path.isfile(input_path):
        filename = os.path.basename(input_path)
        print(f"Analyzing filename: {filename}")
        print("-" * 50)
        analyze_text(filename)
        print("-" * 50)
    elif os.path.isdir(input_path):
        print(f"Analyzing filenames in directory: {input_path}")
        print("-" * 50)
        for root, _, files in os.walk(input_path):
            for file in files:
                print(f"Filename: {file}")
                analyze_text(file)
                print("-" * 50)
    else:
        print(f"Error: {input_path} is neither a valid file nor a directory.")
        raise SystemExit(1)  # non-zero exit status for invalid input

def check_unicode_details(text):
    """
    Check and display the Unicode names for each character in the given text.

    Args:
        text (str): The input text to analyze.
    """
    print("Unicode Analysis:")
    print("-" * 50)
    lines = text.splitlines()
    for line_no, line in enumerate(lines, start=1):
        print(f"Line {line_no}: {line}")
        analyze_text(line)
        print("-" * 50)

def main():
    parser = argparse.ArgumentParser(
        description="Analyze Unicode details of text, file content, or filenames.",
        epilog="Examples:\n"
               "  python check_unicode.py \"Hello, world!\"\n"
               "  python check_unicode.py /path/to/file.txt\n"
               "  python check_unicode.py /path/to/directory -n",
        formatter_class=argparse.RawTextHelpFormatter
    )
    parser.add_argument("input", help="Text, file path, or directory path")
    parser.add_argument("-n", "--name-only", action="store_true",
                        help="Analyze filenames. If input is a directory, scan all filenames; "
                             "if input is a file, analyze the file's name.")
    args = parser.parse_args()

    if args.name_only:
        analyze_filenames(args.input)
    else:
        if os.path.isdir(args.input):
            print(f"Error: {args.input} is a directory. Use the -n option to analyze filenames.")
            raise SystemExit(1)  # non-zero exit status for invalid input

        try:
            # Check if the input is a file
            with open(args.input, 'r', encoding='utf-8') as file:
                text = file.read()
                print(f"Analyzing text from file: {args.input}")
        except FileNotFoundError:
            # Treat input as a direct text argument
            text = args.input
            print("Analyzing provided text input:")
        except UnicodeDecodeError:
            print("Error: The file cannot be decoded using UTF-8. Please check the file encoding.")
            raise SystemExit(1)  # non-zero exit status for undecodable files

        check_unicode_details(text)

if __name__ == "__main__":
    main()

How to set it up:

  1. Open your favorite text edi­tor.
  2. Paste the con­tent of the code block into the edi­tor. Don’t for­get to copy the she­bang line at the top. This line should point to your Python 3 instal­la­tion. You can ver­i­fy the cor­rect path using the com­mand: which python3.
  3. Save the file as check_unicode.
  4. Make the script exe­cutable by run­ning:
chmod +x check_unicode
  5. Run the script directly with ./check_unicode, or move it to a folder included in your PATH environment variable (e.g., ~/Applications). Once it’s in your PATH, you can simply execute the script by typing check_unicode in the terminal.

Chat­G­PT also cre­at­ed a help­ful man page for this script, mak­ing it even eas­i­er to use.

CHECK_UNICODE(1)                                   User Commands                                   CHECK_UNICODE(1)

NAME
       check_unicode - Analyze Unicode details of text, file content, or filenames in a directory or file.

SYNOPSIS
       check_unicode [OPTIONS] INPUT

DESCRIPTION
       The check_unicode utility provides a detailed breakdown of Unicode characters in a given text, 
       file content, or filenames within a directory or a single file. It is useful for inspecting and debugging 
       Unicode encoding issues and identifying special characters.

       The utility can operate in two modes:
       1. Analyze the characters in text or file content (default).
       2. Analyze filenames in a directory or a single file (with the -n option).

OPTIONS
       -n, --name-only
              Analyze filenames. If INPUT is:
              - A directory: Scans and analyzes all filenames in the directory.
              - A file: Analyzes the filename of the specified file.

       -h, --help
              Display a help message and exit.

ARGUMENTS
       INPUT
              The text, file path, or directory path to analyze.

USAGE
       Analyze a simple text string:
              check_unicode "Hello, Ünicode!"

       Analyze the content of a file:
              check_unicode /path/to/file.txt

       Analyze the filename of a single file:
              check_unicode /path/to/file.txt -n

       Analyze all filenames in a directory:
              check_unicode /path/to/directory -n

       Display help information:
              check_unicode --help

ERRORS
       If INPUT is a directory but the -n option is not specified:
              Error: /path/to/directory is a directory. Use the -n option to analyze filenames.

       If INPUT is invalid:
              Error: INPUT is neither a valid file nor a directory.

       If a file cannot be decoded as UTF-8:
              Error: The file cannot be decoded using UTF-8. Please check the file encoding.

EXIT STATUS
       The check_unicode utility exits with the following codes:
       0      Successful execution.
       1      Invalid input or runtime error (e.g., file not found, invalid directory).

EXAMPLES
       Analyze the characters in a string:
              check_unicode "Hello, world!"

       Analyze the content of a UTF-8 text file:
              check_unicode /Users/name/Documents/example.txt

       Analyze the filename of a single file:
              check_unicode /Users/name/Documents/example.txt -n

       Analyze filenames in a directory:
              check_unicode /Users/name/folder -n

Conclusion

It took me some time to understand the real problem because so many levels are involved. I hope I have shed some light on the question of Unicode normalization. I also think I have only scratched the surface of the UTF-Hell 😉

I now have to add something like this to my search script:

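import unicodedata

# Example input: a note name in NFD (decomposed) form, as found on the macOS file system
nfd_string = "Fu\u0308r_Elise"

# Normalize the string to NFC (precomposed form)
nfc_normalized_string = unicodedata.normalize("NFC", nfd_string)

# Format the string as a wiki-link
wiki_link = f"[[{nfc_normalized_string}]]"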

And then my script will also work with “Für_Elise”, regardless of how she is normalized 😉

If you have any additions, corrections, or other feedback, feel free to use the comment box below. If you comment via the Fediverse, I’m still not 100% sure that my answers in this blog will reach you, so just check this page again.
