mcedit, 8-bit terminal, encoding=UTF-8: characters between 0x80 and 0xff are broken #3843

mc-butler · 2017-07-25T14:54:54Z

Important

This issue was migrated from Trac:

Origin	https://midnight-commander.org/ticket/3843
Reporter	`lzsiga` (@lzsiga)
Mentions	`egmont` (@egmontkob)
Keywords	`mcedit`, `utf8`

Hi, this problem is present in mc-4.8.19 (reproduced on Debian/Linux and AIX), it affects the editor.

Problem description:

I use a 8-bit terminal-emulator with LC_CYTPE=hu_HU.ISO-8859-2 (for Hungarian language), and it works properly; in mcedit's 'Choose encoding' dialog I select 'UTF-8' (as I want to create a file in UTF-8).

Then I try some accented letters in the editor, such as á é ő ű. They are all in ISO-8859-2 (codes e1, e9, f5, fb), but only first two are in ISO-8859-1; the unicodes are: U+E1, U+E9, U+0151, U+0171

The problem is that only the two latter characters are properly displayed and stored in file; for the two former, editor displays dots instead of them; and saving into file, instead of UTF8-sequences (c3e1, c3e9) it stores single-bytes (the ISO-8859-2 codes: e1 e9)

I think I found the source of the problem in src/editor/edit.c, line 3559

       if (char_for_insertion > 255 && !mc_global.utf8_display)

It ignores characters between 128 and 255 even if 'UTF-8' is selected (it is mc_global.source_codepage==12 in my case)
The change I suggest is this:

        if ((char_for_insertion > 255 ||
            (char_for_insertion > 127 && str_isutf8 (get_codepage_id (mc_global.source_codepage)))) &&
            !mc_global.utf8_display)
        {

I tested it on linux and AIX in different contexts (8-bit emulator vs unicode-emulator; 8-bit file-encoding vs unicode), and it seemed working in all cases.

(I admit, the method of checking whether mc_global.source_codepage is UTF-8 or not is a bit clumsy, but I couldn't find a simpler method.)

Note

Original attachments:

edit.patch (raw) by lzsiga (@lzsiga) on Jul 25, 2017 at 14:57 UTC
edit2.patch (raw) by zaytsev (@zyv) on Jul 26, 2017 at 6:28 UTC

The text was updated successfully, but these errors were encountered:

mc-butler · 2017-07-25T14:57:40Z

Changed by lzsiga (@lzsiga) on Jul 25, 2017 at 14:57 UTC

Attachment edit.patch (raw) added

suggested change in src/editor/edit.c

mc-butler · 2017-07-25T20:22:01Z

Changed by zaytsev (@zyv) on Jul 25, 2017 at 20:22 UTC (comment 1)

Cc set to egmont
Priority changed from trivial to major

At first I was skeptical about this patch, but then it occurred to me that it actually does make complete sense. If target encoding is UTF-8, then everything above 0x9F must be encoded, even if it has the code point < 255 and representable in the current encoding as a character between 0xA0 and 0xFF.

I also managed to reproduce it using CP1251 (Cyrillic) locale, there characters « & » are representable in the higher table and get inserted literally when target is UTF-8, although they should be encoded.

Egmont, WDYT?

mc-butler · 2017-07-25T20:30:05Z

Changed by egmont (@egmontkob) on Jul 25, 2017 at 20:30 UTC (comment 2)

I can take a look next week. Feel free to ping me if I forget about it.

mc-butler · 2017-07-25T20:34:52Z

Changed by zaytsev (@zyv) on Jul 25, 2017 at 20:34 UTC (comment 3)

Milestone changed from Future Releases to 4.8.20

Ok, then I'm retargeting to this milestone, so that I won't forget to ping you :)

mc-butler · 2017-07-26T06:18:41Z

Changed by zaytsev (@zyv) on Jul 26, 2017 at 6:18 UTC (comment 4)

I've tested the patch on master while I was at it, and it works, but now I'm wondering whether the original condition makes sense at all??? I have no idea how the editor works, but it feels like if charset conversions are enabled AND current display encoding is NOT UTF-8, it encodes everything >255 as UTF-8 no matter what target encoding is.

To my mind, the logical behavior should be actually encode >127 as UTF-8 if the target encoding is UTF-8, and if it's singlebyte (we don't have other multibytes on the list), insert a replacement character (\0 ?) for everything >255. I still don't completely understand what it has to do with display encoding, but sadly no more time to investigate now...

mc-butler · 2017-07-26T06:28:01Z

Changed by zaytsev (@zyv) on Jul 26, 2017 at 6:28 UTC

Attachment edit2.patch (raw) added

mc-butler · 2017-07-26T06:31:17Z

Changed by zaytsev (@zyv) on Jul 26, 2017 at 6:31 UTC (comment 5)

Okay, I guess checking for display encoding has to do with input already being UTF-8, so the correct patch seems to be as follows. Still, apparently there is an independent bug:

Switch locale to CP1251
Edit a file with a target encoding of UTF-8
Position the cursor at the end of the file
Insert characters
They are inserted at the wrong place, because the bytes are not counted correctly.

Now off to work!

mc-butler · 2017-07-26T10:57:18Z

Changed by lzsiga (@lzsiga) on Jul 26, 2017 at 10:57 UTC (comment 6)

Hi,
I'd like attach a table about the four possibilities when arriving to the line in question (editor.c:3359)

 terminal file  mc_global.    mc_global.       char_to_insert
                .utf8_display .source_codepage
 -------- ----  ------------- ---------------- --------------
 8-bit    8-bit 0             2                iso-code 00..ff
 8-bit    utf8  0             12               unicode  00..10ffff
 utf8     8-bit 1             2                iso-code 00..ff
 utf8     utf8  1             12               utf8 1-4 bytes 00..ff

In the last case function edit_execute_cmd is called 1-4 times
(once for each byte of the utf8-sequence)

mc-butler · 2017-07-31T19:39:33Z

Changed by zaytsev (@zyv) on Jul 31, 2017 at 19:39 UTC (comment 7)

Egmont, ping! I think my patch is in line with the table by lzsiga.

mc-butler · 2017-08-13T09:04:29Z

Changed by zaytsev (@zyv) on Aug 13, 2017 at 9:04 UTC (comment 8)

Hi Egmont, I see you're alive, any input here ;-) ? @z

mc-butler · 2017-08-13T11:50:36Z

Changed by egmont (@egmontkob) on Aug 13, 2017 at 11:50 UTC (comment 9)

Sorry I'm absolutely overloaded these days. Do you have any particular concern or question? If you're confident enough with your patch, just would prefer to have a second eye take a look at it, you should rather go ahead and commit :)

mc-butler · 2017-08-13T17:52:14Z

Changed by zaytsev (@zyv) on Aug 13, 2017 at 17:52 UTC (comment 10)

Sure I have a particular concern: I think the patch is fine, but I completely lack the understanding of how the editor is supposed to work, so I feel uncomfortable committing it without a second opinion.

If I read the code correctly, currently, for code points > 255, if the current display encoding is NOT utf-8, they are encoded as utf-8, no matter what is actually the target encoding of the file being edited. This sounds crazy to me for two reasons:

Why encoding into utf-8 if target encoding is NOT utf-8?
Even if you decide to encode into utf-8, why miss code points 127 < x <= 255?

My patch changes it such that code points > 127 will be encoded as utf-8 only if the target encoding is utf-8 AND display encoding is NOT utf-8 (because otherwise they would have been already encoded).

It is not clear to me what exactly would happen under other scenarios after my changes, but I hope that at least it will not make the matters worse than it already is.

It's on those two points (did I fix it correctly for non-utf displays and did I break anything else) that I'd like to get some input.

mc-butler · 2017-09-06T05:32:51Z

Changed by zaytsev (@zyv) on Sep 6, 2017 at 5:32 UTC (comment 11)

Hey Egmont, I'm just back from vacation, is it looking any better for you now and maybe you could have a look into this? --Yury

mc-butler · 2017-09-08T21:57:23Z

Changed by egmont (@egmontkob) on Sep 8, 2017 at 21:57 UTC (comment 12)

Hi guys,

Truly sorry for the delay.

Slightly irrelevant: the opening post says

in mcedit's 'Choose encoding' dialog I select 'UTF-8'

for me that's the default there which I don't think makes sense. If someone fires up mcedit in a Latin2 environment, newly created files should be encoded in Latin2 by default, not UTF-8. What do you think?

Back to our main topic:

lzsiga, you made an excellent investigation, including the table in comment 6 (thanks a lot / köszi szépen! :)) As its last column shows, the current situation is already pretty messed up. Like, char_to_insert is sometimes the Unicode (UCS) value, sometimes each byte of the UTF-8 separately one after the other; it's such a mixture of semantics that shouldn't be done in nice code written from scratch. I know it's lot of legacy code from the 8-bit days with a (relative to that) amazing UTF-8-ification at some point.

I'm inclined to say that this code (and probably its greater area - maybe entire mc? haha) probably deserves a big UTF-8 cleanup. Or rewrite from scratch :P But not now.

As for quick fix, edit2.patch seems to be okay for me and seems to do the best we can do there locally.

One thing though: I'd definitely add a comment there pointing to this ticket.

Thanks a lot to all of you guys for working on this bug!

mc-butler · 2017-09-09T00:02:38Z

Changed by egmont (@egmontkob) on Sep 9, 2017 at 0:02 UTC (comment 13)

Note:

In 8-bit mode (when both the terminal and the file encoding are 8-bit), for the input stream, mc does not follow the "new" philosophy of Unicode that it's the human-facing letter that matters. It still follows the "old" philosophy that just thinks in terms of bytes.

This may or may not be a good thing. Especially that it seems to follow the "new" philosophy for the output.

Example: Set up a Latin2 environment as described in the original post. Edit a file whose encoding you set to Latin1.

Type letters that differ in the two, e.g. ő (o double accent; U+151 aka 0xF5 in Latin2). In the screen ^A appears, for whatever reason this seems to denote characters that are unrepresentable (a 0xF5 byte made it into the file, which is õ (o tilda; U+F5 aka 0xF5 in Latin1) which is not representable in the current Latin2 terminal.

If you switch the encoding to Latin2, ő (o double accent) will appear.

Regardless of whether you switched the encoding or not, upon saving the file 0xF5 will be placed there.

So, when shown on the screen, it's the letter that matters (it could just transfer a 0xF5 as a 0xF5, but no, it's clever to know that it represents a different glyph due to different encoding, and hence it uses a replacement symbol). This logic is not used for the input though, the arriving byte is taken as-is.

mc-butler · 2017-09-09T00:08:38Z

Changed by egmont (@egmontkob) on Sep 9, 2017 at 0:08 UTC (comment 14)

... it does perform the mapping on the input though if the same character is available in the other charset as well, just at a different position (e.g. replace Latin1 by CP852 in the above example).

mc-butler · 2017-09-09T06:55:28Z

Changed by zaytsev (@zyv) on Sep 9, 2017 at 6:55 UTC (comment 15)

Branch state changed from no branch to on review
Owner set to zaytsev
Status changed from new to accepted

Branch: 3843_8bit_to_utf8_broken
Initial [5ee452e]

mc-butler · 2017-09-09T07:07:05Z

Changed by zaytsev (@zyv) on Sep 9, 2017 at 7:07 UTC (comment 16)

Hey Egmont, thank you very much for looking into this! I agree that the editor deserves a rewrite (hmmm, who did one for the viewer, do you happen to know him ;-) ?), and there is this topic of syntax highlighting (and now yet another crazy idea, maybe offload it to Lua ;-) )... but with my new job I can't even manage to find enough time to review Andrew's changes and help him more with maintenance, not speaking of larger projects :-( In the mean time, I understood you're of the opinion my patch is "safe" to commit, hence the branch.

mc-butler · 2017-09-09T07:30:57Z

Changed by egmont (@egmontkob) on Sep 9, 2017 at 7:30 UTC (comment 16.17)

Replying to zaytsev:

I agree that the editor deserves a rewrite (hmmm, who did one for the viewer, do you happen to know him ;-) ?)

Don't even think about it :P :D

In the mean time, I understood you're of the opinion my patch is "safe" to commit, hence the branch.

Yup, definitely, thanks!

I'll open a new bug for my additional findings.

mc-butler · 2017-09-09T09:00:55Z

Changed by zaytsev (@zyv) on Sep 9, 2017 at 9:00 UTC (comment 18)

Branch state changed from on review to approved
Votes set to egmont

mc-butler · 2017-09-09T09:02:39Z

Changed by zaytsev (@zyv) on Sep 9, 2017 at 9:02 UTC (comment 19)

Branch state changed from approved to merged
Status changed from accepted to testing
Votes changed from egmont to committed-master
Resolution set to fixed

mc-butler · 2017-09-09T09:05:29Z

Changed by zaytsev (@zyv) on Sep 9, 2017 at 9:05 UTC (comment 20)

Status changed from testing to closed

I'll open a new bug for my additional findings.

See also the multibyte bug I found & described above, but maybe not worth spending the energy on it and I don't have any to spend anyways :(

mc-butler closed this as completed Sep 9, 2017

mc-butler assigned zyv Feb 28, 2025

mc-butler mentioned this issue Jul 26, 2017

Viewer displays invalid UTF-8 characters >127 <256 as corresponding Unicode codepoints #3844

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mcedit, 8-bit terminal, encoding=UTF-8: characters between 0x80 and 0xff are broken #3843

mcedit, 8-bit terminal, encoding=UTF-8: characters between 0x80 and 0xff are broken #3843

mc-butler commented Jul 25, 2017

mc-butler commented Jul 25, 2017

mc-butler commented Jul 25, 2017

mc-butler commented Jul 25, 2017

mc-butler commented Jul 25, 2017

mc-butler commented Jul 26, 2017

mc-butler commented Jul 26, 2017

mc-butler commented Jul 26, 2017

mc-butler commented Jul 26, 2017

mc-butler commented Jul 31, 2017

mc-butler commented Aug 13, 2017

mc-butler commented Aug 13, 2017

mc-butler commented Aug 13, 2017

mc-butler commented Sep 6, 2017

mc-butler commented Sep 8, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mcedit, 8-bit terminal, encoding=UTF-8: characters between 0x80 and 0xff are broken #3843

mcedit, 8-bit terminal, encoding=UTF-8: characters between 0x80 and 0xff are broken #3843

Comments

mc-butler commented Jul 25, 2017

mc-butler commented Jul 25, 2017

mc-butler commented Jul 25, 2017

mc-butler commented Jul 25, 2017

mc-butler commented Jul 25, 2017

mc-butler commented Jul 26, 2017

mc-butler commented Jul 26, 2017

mc-butler commented Jul 26, 2017

mc-butler commented Jul 26, 2017

mc-butler commented Jul 31, 2017

mc-butler commented Aug 13, 2017

mc-butler commented Aug 13, 2017

mc-butler commented Aug 13, 2017

mc-butler commented Sep 6, 2017

mc-butler commented Sep 8, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017

mc-butler commented Sep 9, 2017