[VM] displaying text/html with utf-8 via w3m
Ralf Fassel
2014-07-22 08:25:17 UTC
Emacs 23.2.1, VM 8.2.0b

Could some kind soul give me a hint where to look?

I get mails with headers
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

with the body e.g.

<br>
Bin auf Gesch=C3=A4ftsreise.
<br>

(The =C3=A4 in there is lowercase umlaut-a).
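
For reference, a minimal Emacs Lisp sketch of what a correct decoding of this body line should yield. It assumes the qp library that ships with Emacs (part of Gnus); my/decode-qp-utf-8 is a hypothetical helper name for illustration, not something VM provides.

```elisp
(require 'qp)  ; quoted-printable support bundled with Emacs (Gnus)

(defun my/decode-qp-utf-8 (text)
  "Undo the quoted-printable encoding of TEXT, then decode the bytes as UTF-8.
Hypothetical helper for illustration only."
  ;; Step 1: =C3=A4 becomes the two raw bytes 0xC3 0xA4.
  ;; Step 2: those two bytes are the UTF-8 encoding of a-umlaut.
  (decode-coding-string (quoted-printable-decode-string text) 'utf-8))

(my/decode-qp-utf-8 "Bin auf Gesch=C3=A4ftsreise.")
;; → "Bin auf Geschäftsreise."
```

This is the text the Presentation buffer would be expected to show once the HTML markup is stripped.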

After decoding via w3m, this looks as follows in the INBOX Presentation
buffer:

Bin auf Gesch ftsreise.

The umlaut-a has been replaced by a space. If I run the text manually
through w3m, the output still contains the UTF-8 character; only the
HTML markup is gone. But after insertion into the Presentation buffer, the
umlaut is changed to a space.

This also happens on emacs -q -no-site-file, so I guess it is something
more basic than my customizations.

Any clues?

TNX
R'
Uday Reddy
2014-07-25 22:58:01 UTC
Post by Ralf Fassel
The umlaut-a has been replaced by a space. If I run the text manually
through w3m, the output still contains the utf-8 character, just the
HTML-markup is gone. But after inserting in the Presentation buffer the
umlaut is changed to a space.
The first thing to check would be whether it is a problem with Emacs. What
happens if you put the w3m output in a file and visit it in Emacs?

If it is not a problem with Emacs, then it must be a bug in VM. Please file
a bug report along with a sample message.

Cheers,
Uday
Ralf Fassel
2015-01-29 09:57:51 UTC
* Uday Reddy <***@gmail.com>
| Ralf Fassel writes:
| > The umlaut-a has been replaced by a space. If I run the text manually
| > through w3m, the output still contains the utf-8 character, just the
| > HTML-markup is gone. But after inserting in the Presentation buffer the
| > umlaut is changed to a space.
| The first thing to check would be whether it is a problem with Emacs. What
| happens if you put the w3m output in a file and visit it in Emacs?
| If it is not a problem with Emacs, then it must be a bug in VM. Please file
| a bug report along with a sample message.

OK, I have now found the time to dig into this. It is a mismatch between
the encoding Emacs uses when writing to the w3m process (Emacs'
default-process-coding-system) and the encoding w3m is told to expect
(its -I option).

The message itself has
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: Quoted-Printable

Now VM prepares the message for w3m by calling
'vm-mime-display-internal-text/html'
which does
(vm-mime-transfer-decode-region layout start end)
(vm-mime-charset-decode-region charset start end)

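So now the region contains the decoded text (the quoted-printable layer
is gone and the UTF-8 characters are real characters in the buffer).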

Then w3m is called via
'vm-mime-display-internal-w3m-text/html'
which uses 'shell-command-on-region'.

However, shell-command-on-region is documented as

By default, the input (from the current buffer) is encoded using
coding-system specified by `process-coding-system-alist', falling
back to `default-process-coding-system' if no match for COMMAND is
found in `process-coding-system-alist'.

In my setup, process-coding-system-alist is nil, but
'default-process-coding-system' is (iso-latin-9-unix . iso-latin-9-unix).
So what is actually sent to w3m is latin-9, while w3m is told to treat its
input as UTF-8. w3m then replaces the non-ASCII latin-9 characters with " ".
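
The byte-level mismatch can be reproduced inside Emacs without w3m. The following is a sketch only; my/umlaut-mismatch-demo is a hypothetical name, and iso-8859-15 is used as the standard name for latin-9.

```elisp
(defun my/umlaut-mismatch-demo ()
  "Show that a-umlaut encoded as latin-9 is not valid UTF-8 input.
Returns (UTF-8-BYTE-COUNT LATIN-9-BYTE-COUNT MISREAD-EQUALS-ORIGINAL)."
  (let* ((text "\u00e4")                                      ; lowercase a-umlaut
         (as-utf-8 (encode-coding-string text 'utf-8))        ; bytes C3 A4
         (as-latin-9 (encode-coding-string text 'iso-8859-15)) ; byte E4
         ;; What a UTF-8 reader makes of the latin-9 byte:
         (misread (decode-coding-string as-latin-9 'utf-8)))
    (list (length as-utf-8)
          (length as-latin-9)
          (equal misread text))))

(my/umlaut-mismatch-demo)
;; → (2 1 nil)
```

A lone E4 byte is the start of a multi-byte UTF-8 sequence with nothing following it, so a UTF-8 consumer cannot recover the umlaut from it. What a given program does with such a byte (here: w3m printing a space) is up to that program.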

If I temporarily set
(default-process-coding-system '(utf-8 . utf-8))
then the Presentation buffer contains the correctly decoded mail message.

IMHO 'vm-mime-display-internal-w3m-text/html' should temporarily adjust
the default-process-coding-system to match what w3m is told to expect.
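
A minimal sketch of such a temporary adjustment, outside of VM: the variable is let-bound only around the call to shell-command-on-region, so the global value is untouched. my/filter-text-as-utf-8 is a hypothetical helper, and the sketch assumes that neither coding-system-for-read/coding-system-for-write nor an entry in process-coding-system-alist overrides the default.

```elisp
(defun my/filter-text-as-utf-8 (command text)
  "Pipe TEXT through shell COMMAND and return the output as a string.
UTF-8 is used both for writing to and for reading from the process,
whatever the global `default-process-coding-system' is.
Hypothetical helper for illustration only."
  (with-temp-buffer
    (insert text)
    ;; Dynamic binding: only process I/O started inside this `let' sees it.
    (let ((default-process-coding-system '(utf-8 . utf-8)))
      ;; OUTPUT-BUFFER nil, REPLACE t: the output replaces the region.
      (shell-command-on-region (point-min) (point-max) command nil t))
    (buffer-string)))
```

With w3m installed, a call such as (my/filter-text-as-utf-8 "w3m -dump -T text/html -I UTF-8 -O UTF-8" "<br>Bin auf Geschäftsreise.<br>") should then hand w3m real UTF-8 bytes, matching its -I option.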

Sample mail message available on request...

HTH
R'
Ralf Fassel
2015-01-30 12:14:18 UTC
* Ralf Fassel <***@gmx.de>
| In my setting process-coding-system-alist is nil, but
| 'default-process-coding-system' is (iso-latin-9-unix . iso-latin-9-unix).
| So what is really sent to w3m is latin-9, but w3m is told to process the
| input as UTF-8. This replaces the latin-9 non-ASCII chars by " ".
| If I temporarily set
| (default-process-coding-system '(utf-8 . utf-8))
| then the Presentation buffer contains the correct decoded mail message.
| IMHO 'vm-mime-display-internal-w3m-text/html' should temporarily adjust
| the default-process-coding-system to match what w3m is told to expect.

In GNU Emacs, this works for me (I don't know about XEmacs):

--- vm-8.2.0b/lisp/vm-mime.el~ 2011-12-27 23:19:28.000000000 +0100
+++ vm-8.2.0b/lisp/vm-mime.el 2015-01-30 13:12:45.488159066 +0100
@@ -2719,7 +2719,9 @@
     part))
 
 (defun vm-mime-display-internal-w3m-text/html (start end layout)
-  (let ((charset (or (vm-mime-get-parameter layout "charset") "us-ascii")))
+  (let* ((charset (or (vm-mime-get-parameter layout "charset") "us-ascii"))
+         (cds (coding-system-from-name charset))
+         (default-process-coding-system (if cds (cons cds cds) default-process-coding-system)))
     (shell-command-on-region
      start (1- end)
      (format "w3m -dump -T text/html -I %s -O %s" charset charset)

Diff finished. Fri Jan 30 13:12:51 2015
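
The charset lookup in the patch can be tried on its own. This sketch isolates that step in a hypothetical helper, my/process-coding-for-charset. It falls back to the current default when no coding system is found, as the patch does. How well unusual or mixed-case charset names are matched is left to coding-system-from-name.

```elisp
(defun my/process-coding-for-charset (charset)
  "Return a (DECODING . ENCODING) pair for the MIME CHARSET name.
If no coding system matches CHARSET, return the current value of
`default-process-coding-system' unchanged.
Hypothetical helper for illustration only."
  (let ((cs (coding-system-from-name charset)))
    ;; Same coding system in both directions, mirroring "-I %s -O %s".
    (if cs (cons cs cs) default-process-coding-system)))

(my/process-coding-for-charset "utf-8")
;; → (utf-8 . utf-8)
```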


HTH
R'
