id	summary	reporter	owner	description	type	status	priority	milestone	component	version	resolution	keywords	cc	blockedby	blocking	branch_state	votes
3616	Speed up of str_utf8_normalize() for a frequent special case	devzero	andrew_b	"When content of a large directory is being sorted by file names, a significant amount of CPU time is spent in str_utf8_normalize() that is called from str_utf8_create_key_gen().

For example, /usr/bin/ contains 5437 files on my Archlinux box. Running `mc /usr/bin/ /usr/bin/` takes approx. 75 000 000 CPU instructions to sort file names, or 25% of total program run time. From these 75 000 000 instructions, 42 500 000 instruction are spent in str_utf8_normalize().

str_utf8_normalize() uses g_utf8_normalize() to do the work. g_utf8_normalize() is a heavyweight function, that converts UTF-8 into UCS-4, does the normalization and then converts UCS-4 back into UTF-8.

Since file names are composed of ASCII characters in most cases, we can speed up str_utf8_normalize() by checking if the heavyweight Unicode normalization is actually needed. Normalization of ASCII string is no-op, so it is effectively ""normalized"" by just strdup(). Here is the patch:

{{{
diff --git a/lib/strutil/strutilutf8.c b/lib/strutil/strutilutf8.c
index 8ec754d..8c7f909 100644
--- a/lib/strutil/strutilutf8.c
+++ b/lib/strutil/strutilutf8.c
@@ -1080,6 +1080,17 @@ str_utf8_normalize (const char *text)
     const char *start;
     const char *end;
 
+    const char *p = text;
+    while (1)
+    {
+        char c = *p;
+        if (c == 0)
+            return g_strndup(text, p - text);
+        else if ((c & 0x80) != 0)
+            break;
+        p++;
+    }
+
     fixed = g_string_sized_new (4);
 
     start = text;
}}}

With this patch, running `mc /usr/bin/ /usr/bin/` requires just 37 000 000 instructions to sort the file names (down from 75 000 000) and 4 500 000 instuctions to do str_utf8_normalize() (down from 42 500 000).

"	enhancement	closed	major	4.8.20	mc-core	master	fixed					merged	committed-master
