When ‘Für_Elise.md’ Isn’t the Same as ‘Für_Elise.md’

Jan 28, 2025

CodingTips, macOS, Obsidian, Python, UniCode

Introduction

I’m currently in the middle of a “spring-cleaning” phase for my Obsidian vault. That’s why I decided to create a small Python library with functions for analyzing and managing my notes. As part of this project, I wrote a script to help me find specific notes in the vault based on tags, content, titles, and properties. The script worked well: it found the notes and generated wiki-links to them in a new note save in my Obsidian Vault.

But then I noticed something strange. When I clicked on some of the links, Obsidian created a new note instead of opening the existing one. This unexpected behavior caught me off guard. After investigating, I realized the issue was related to diacritical letters in the filenames. But why was this happening?

Why ‘Für_Elise.md’ and ‘Für_Elise.md’ Aren’t Always the Same

Modern operating systems like macOS, Windows, and Linux allow the use of diacritical letters in filenames without any issues. Naturally, I assumed that a filename like “Für_Elise.md” would be treated the same everywhere. But that’s not the case.

The problem lies in something called Unicode Normalization, which goes beyond the UTF‑8 encoding of letters. While Windows and Linux use Normalization Form C (NFC), macOS uses Normalization Form D (NFD). But what’s the difference?

In NFD, the “D” stands for “decomposed.” This means characters are stored as a base character combined with a diacritical mark. For example, “ü” becomes “u” plus a combining diaeresis, the two dot above the “u”. In contrast, NFC (used by Windows and Linux) stores characters as a single, precomposed entity.

When I asked ChatGPT how to determine the normalization of a filename, it suggested the following command:

> $ echo -n  "Für_Elise.md" | cat -p  -A                                                                                                       
Fu\u{308}r_Elise.md

However, this only works if I copy the filename from Finder and paste it into the terminal. If I type the filename manually or use the tab-key for path autocompletion, I get this different result:

> $ echo -n Für_Elise.md | cat -p -A                                                                                                           
F\u{fc}r_Elise.md

This difference reflects the two normalization forms in action. The first example Fu\u{308}r_Elise.md is in Normalization Form D (NFD), where characters are stored as a base character plus a combining diacritical mark. The second example F\u{fc}r_Elise.md is in Normalization Form C (NFC), where characters are stored as a single precomposed entity.

NFC is the normalization form used by Windows, Linux, and even macOS in certain contexts, such as the terminal when using e.g. zsh as I do. On the other hand, NFD is the default for macOS at the file system level.

This discrepancy explains now the odd behavior of my script. The script finds the file Fu\u{308}r_Elise.mdand builds the wiki link using the same NFD normalization. However, Obsidian cannot resolve this link. Even though both the link and the file are NFD normalized, Obsidian seems somehow to internally work with NFC file names. As a result, it fails to find the file and creates a new one, which is then stored in NFC normalization even on the macOS file system.

Why I have this “Normalization Chaos” on my Computer

I think there is a mixture of reasons.

1. Obsidian

As I already mentioned above, qObsidian cannot resolve wiki-links that are both stored in NFD normalization in the note and refer to files stored in NFD normalization on macOS. Instead of finding the file, Obsidian creates a new one.
For example:

A link like Fu\u{308}r_Elise (NFD) in a note will fail to resolve to a file named Fu\u{308}r_Elise.md (also NFD) on macOS.
However, a link like F\u{fc}r_Elise.md (NFC) will correctly resolve to the NFD-normalized file Fu\u{308}r_Elise.md.

This suggests that Obsidian internally defaults to working with NFC-normalized names, even when the macOS file system stores filenames in NFD. The inconsistency causes issues with diacritical letters in filenames. I will definitely post a bug report about this. 😉

2. macOS

I’m still not sure why UTF‑8 has two different (or more?) normalization forms, and why Apple chose to use NFD, while “the others” (Windows and Linux) use NFC.

That being said, the situation becomes even more complicated in the macOS file system. Files could be stored in NFD and NFC normalization. For example, a file created using the touch command in the terminal will be NFC normalized. Here’s what happens, but keep in mind you have to use a copy of the filename from the Finder in the echoline.

> $ touch Für_Elise_touch
> $ echo -n Für_Elise_touch | cat -p -A                                                                  F\u{fc}r_Elise_touch

This happens because the terminal app and shell (e.g., zsh) always use NFC normalization when typing, also on a Mac.

As we’ve already seen, Obsidian also creates files with NFC-normalized filenames. However, other applications, such as Sublime Text, TextEdit, or Word, save files with NFD-normalized filenames—all of them using the standard “Save File” dialog box.

I created some test files, and for the first time, I noticed how filenames with diacritical letters appear when listed in the terminal (e.g., New_Note, created by Obsidian). I had never consciously noticed these visual differences until I started investigating the “normalization problem

One of my Test-Scripts

During my journey through this problem space, I used various scripts and terminal commands to understand what was happening. Some of these I found on the internet, while others were generated by ChatGPT. One script, in particular, I’d like to introduce now as a small add-on.

Using the CLI commands mentioned earlier was always a bit tricky because the strings had to be copied to the right place for testing. To simplify this, I asked ChatGPT to create a script that analyzes the Unicode details of a given input. The input can be a text string, a directory of files, a single file, or the content of a file.

Below is the result of this script analyzing three key elements:

The wiki-link created by my search script in the result note.

Line 3: - [[Für_Elise]]
  '-' -> HYPHEN-MINUS
  ' ' -> SPACE
  '[' -> LEFT SQUARE BRACKET
  '[' -> LEFT SQUARE BRACKET
  'F' -> LATIN CAPITAL LETTER F
  'u' -> LATIN SMALL LETTER U
  '̈' -> COMBINING DIAERESIS
  'r' -> LATIN SMALL LETTER R
  '_' -> LOW LINE
  'E' -> LATIN CAPITAL LETTER E
  'l' -> LATIN SMALL LETTER L
  'i' -> LATIN SMALL LETTER I
  's' -> LATIN SMALL LETTER S
  'e' -> LATIN SMALL LETTER E
  ']' -> RIGHT SQUARE BRACKET
  ']' -> RIGHT SQUARE BRACKET

2. The filename of the original NFD normalized file name

Filename: Für_Elise.md (file name)
  'F' -> LATIN CAPITAL LETTER F
  'u' -> LATIN SMALL LETTER U
  '̈' -> COMBINING DIAERESIS
  'r' -> LATIN SMALL LETTER R
  '_' -> LOW LINE
  'E' -> LATIN CAPITAL LETTER E
  'l' -> LATIN SMALL LETTER L
  'i' -> LATIN SMALL LETTER I
  's' -> LATIN SMALL LETTER S
  'e' -> LATIN SMALL LETTER E
  '.' -> FULL STOP
  'm' -> LATIN SMALL LETTER M
  'd' -> LATIN SMALL LETTER D

3. The newly created NFC file name generated when clicking the wiki-link.

Filename: Für_Elise.md
  'F' -> LATIN CAPITAL LETTER F
  'ü' -> LATIN SMALL LETTER U WITH DIAERESIS
  'r' -> LATIN SMALL LETTER R
  '_' -> LOW LINE
  'E' -> LATIN CAPITAL LETTER E
  'l' -> LATIN SMALL LETTER L
  'i' -> LATIN SMALL LETTER I
  's' -> LATIN SMALL LETTER S
  'e' -> LATIN SMALL LETTER E
  '.' -> FULL STOP
  'm' -> LATIN SMALL LETTER M
  'd' -> LATIN SMALL LETTER D

And here is the script:

#!/usr/bin/python3
import unicodedata
import os
import argparse

def analyze_text(text):
    """
    Analyze the Unicode details of each character in the text.

    Args:
        text (str): The input text to analyze.
    """
    for char in text:
        try:
            unicode_name = unicodedata.name(char)
        except ValueError:
            unicode_name = "UNKNOWN CHARACTER"
        print(f"  '{char}' -> {unicode_name}")

def analyze_filenames(input_path):
    """
    Analyze filenames for a given path. If the path is a file, analyze its filename.
    If it is a directory, analyze all filenames in the directory.

    Args:
        input_path (str): Path to the file or directory.
    """
    if os.path.isfile(input_path):
        filename = os.path.basename(input_path)
        print(f"Analyzing filename: {filename}")
        print("-" * 50)
        analyze_text(filename)
        print("-" * 50)
    elif os.path.isdir(input_path):
        print(f"Analyzing filenames in directory: {input_path}")
        print("-" * 50)
        for root, _, files in os.walk(input_path):
            for file in files:
                print(f"Filename: {file}")
                analyze_text(file)
                print("-" * 50)
    else:
        print(f"Error: {input_path} is neither a valid file nor a directory.")

def check_unicode_details(text):
    """
    Check and display the Unicode names for each character in the given text.

    Args:
        text (str): The input text to analyze.
    """
    print("Unicode Analysis:")
    print("-" * 50)
    lines = text.splitlines()
    for line_no, line in enumerate(lines, start=1):
        print(f"Line {line_no}: {line}")
        analyze_text(line)
        print("-" * 50)

def main():
    parser = argparse.ArgumentParser(
        description="Analyze Unicode details of text, file content, or filenames.",
        epilog="Examples:\n"
               "  python check_unicode.py \"Hello, world!\"\n"
               "  python check_unicode.py /path/to/file.txt\n"
               "  python check_unicode.py /path/to/directory -n",
        formatter_class=argparse.RawTextHelpFormatter
    )
    parser.add_argument("input", help="Text, file path, or directory path")
    parser.add_argument("-n", "--name-only", action="store_true",
                        help="Analyze filenames. If input is a directory, scan all filenames; "
                             "if input is a file, analyze the file's name.")
    args = parser.parse_args()

    if args.name_only:
        analyze_filenames(args.input)
    else:
        if os.path.isdir(args.input):
            print(f"Error: {args.input} is a directory. Use the -n option to analyze filenames.")
            return

        try:
            # Check if the input is a file
            with open(args.input, 'r', encoding='utf-8') as file:
                text = file.read()
                print(f"Analyzing text from file: {args.input}")
        except FileNotFoundError:
            # Treat input as a direct text argument
            text = args.input
            print("Analyzing provided text input:")
        except UnicodeDecodeError:
            print("Error: The file cannot be decoded using UTF-8. Please check the file encoding.")
            return

        check_unicode_details(text)

if __name__ == "__main__":
    main()

How it works:

Open your favorite text editor.
Paste the content of the code block into the editor. Don’t forget to copy the shebang line at the top. This line should point to your Python 3 installation. You can verify the correct path using the command: which python3.
Save the file as check_unicode.
Make the script executable by running:

chmod +x check_unicode

Run the script directly with ./check_unicode, or move it to a folder included in your PATH environment variable (e.g., ~/Applications). Once it’s in your PATH, you can simply execute the script by typing check_unicode in the terminal.

ChatGPT also created a helpful man page for this script, making it even easier to use.

CHECK_UNICODE(1)                                   User Commands                                   CHECK_UNICODE(1)

NAME
       check_unicode - Analyze Unicode details of text, file content, or filenames in a directory or file.

SYNOPSIS
       check_unicode [OPTIONS] INPUT

DESCRIPTION
       The check_unicode utility provides a detailed breakdown of Unicode characters in a given text, 
       file content, or filenames within a directory or a single file. It is useful for inspecting and debugging 
       Unicode encoding issues and identifying special characters.

       The utility can operate in two modes:
       1. Analyze the characters in text or file content (default).
       2. Analyze filenames in a directory or a single file (with the -n option).

OPTIONS
       -n, --name-only
              Analyze filenames. If INPUT is:
              - A directory: Scans and analyzes all filenames in the directory.
              - A file: Analyzes the filename of the specified file.

       -h, --help
              Display a help message and exit.

ARGUMENTS
       INPUT
              The text, file path, or directory path to analyze.

USAGE
       Analyze a simple text string:
              check_unicode "Hello, Ünicode!"

       Analyze the content of a file:
              check_unicode /path/to/file.txt

       Analyze the filename of a single file:
              check_unicode /path/to/file.txt -n

       Analyze all filenames in a directory:
              check_unicode /path/to/directory -n

       Display help information:
              check_unicode --help

ERRORS
       If INPUT is a directory but the -n option is not specified:
              Error: /path/to/directory is a directory. Use the -n option to analyze filenames.

       If INPUT is invalid:
              Error: INPUT is neither a valid file nor a directory.

       If a file cannot be decoded as UTF-8:
              Error: The file cannot be decoded using UTF-8. Please check the file encoding.

EXIT STATUS
       The check_unicode utility exits with the following codes:
       0      Successful execution.
       1      Invalid input or runtime error (e.g., file not found, invalid directory).

EXAMPLES
       Analyze the characters in a string:
              check_unicode "Hello, world!"

       Analyze the content of a UTF-8 text file:
              check_unicode /Users/name/Documents/example.txt

       Analyze the filename of a single file:
              check_unicode /Users/name/Documents/example.txt -n

       Analyze filenames in a directory:
              check_unicode /Users/name/folder -n

Conclusion

It took my some time to understand the real problem because there are so much level involved. So I hope I brought some light into the question of UTF Normalization. Also I think i only scratches the surface of the UTF-Hell 😉

I now have to add into my search script something like this:

# Normalize the string to NFC (precomposed form)
nfc_normalized_string = unicodedata.normalize("NFC", nfd_string)

# Format the string as a wiki-link
wiki_link = f"[[{nfc_normalized_string}]]"

And then my script will also work “Für_Elise” regardless how she is normalized 😉

So if there are any additions, corrections or other feedback feel free to use the comment box below. If you comment via the Ferdiverse, I still not 100 sure if my answers in this blog will reach you, so just check this page again.

dit und dat