[VM] searching in mime encoded email

John Hein

2012-01-18 23:18:39 UTC

It seems I'm wanting to get some of my vm wish list items
out in the open today...

M-s is a wonderful search tool for a vm folder, but it searches the
encoded mime (i.e., the gobbledy-gook) instead of the decoded mime.
Given the ever increasing [it seems to me] usage of base64 even for
plain text messages (particularly from certain mobile devices), I
wonder how hard it would be to update the search to decode mime as it
chugs along. It's fairly rare that I need to search for a string in
encoded mime ;)
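
A minimal sketch of the idea, in Python rather than VM's Emacs Lisp: walk each message in an mbox folder, undo the transfer encoding of every text/* part, and search the decoded text. The names `decoded_text_parts` and `search_folder` are hypothetical and are not part of VM; this only illustrates decoding "as it chugs along".

```python
import mailbox


def decoded_text_parts(msg):
    """Yield the decoded text of every text/* part of a parsed message."""
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "us-ascii"
        try:
            yield payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown or mislabelled charset: fall back to a lossless 8-bit map.
            yield payload.decode("latin-1")


def search_folder(path, needle):
    """Return the indexes of messages whose decoded text contains needle."""
    hits = []
    for index, msg in enumerate(mailbox.mbox(path)):
        if any(needle in text for text in decoded_text_parts(msg)):
            hits.append(index)
    return hits
```

Nothing is written back to the folder here; the decoding happens in memory for the duration of the search only.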

I guess one could also write a feature to resave a base64 encoded
attachment as some other encoding (hey, how about 7 bit ascii? instead
of base64-ified text/plain). This would be an edited message, but
that might be a nice feature.

vm-isearch-presentation does help slightly with this, but only
searches the current message, of course. If you want to search
through an entire folder, vm-isearch-presentation won't be of
assistance.

While we're at it, supporting search in virtual folders would be extra
cool.

There... two more wishlist items!
Or two and a half? (the half being the mime re-encode concept)
These may be listed somewhere already, but a quick glance at
https://bugs.launchpad.net/vm didn't find them.

r***@knighten.org

2012-01-19 08:06:43 UTC

Post by John Hein
. . .
I guess one could also write a feature to resave a base64 encoded
attachment as some other encoding (hey, how about 7 bit ascii? instead
of base64-ified text/plain). This would be an edited message, but
that might be a nice feature.
. . .
There... two more wishlist items!
Or two and a half? (the half being the mime re-encode concept)

I've been intending to ask for some time if anyone has coded up something that
would allow automatic saving of messages with the Content-Transfer-Encoding
(either base64 or quoted-printable). It doesn't look hard to do, but I've
neither done it myself nor found it elsewhere.

-- Bob

--
Robert L. Knighten
***@knighten.org

Uday Reddy

2012-01-19 09:08:09 UTC

Post by r***@knighten.org
I've been intending to ask for some time if anyone has coded up something that
would allow automatic saving of messages with the Content-Transfer-Encoding
(either base64 or quoted-printable). It doesn't look hard to do, but I've
neither done it myself nor found it elsewhere.

You can use V C header to search for strings in the message headers, e.g.,
"base64" or "quoted-printable".

Then it is a matter of marking the messages and saving them.

Perhaps we should have a way of marking messages based on selectors so that
we can bypass the creation of a virtual folder.

Good ideas!

Cheers,
Uday

r***@knighten.org

2012-01-19 09:34:16 UTC

Post by Uday Reddy

Post by r***@knighten.org
I've been intending to ask for some time if anyone has coded up something that
would allow automatic saving of messages with the Content-Transfer-Encoding
(either base64 or quoted-printable). It doesn't look hard to do, but I've
neither done it myself nor found it elsewhere.

You can use V C header to search for strings in the message headers, e.g.,
"base64" or "quoted-printable".
Then it is a matter of marking the messages and saving them.
Perhaps we should have a way of marking messages based on selectors so that
we can bypass the creation of a virtual folder.
Good ideas!
Cheers,
Uday

Sorry, my message was missing a critical phrase - I want the relevant parts of
the message decoded before the messages are saved. Right now I do this several
times a day using a couple of crude keyboard macros, so I expect I can fully
automate this but perhaps this has already been done?

Thank you Uday for revitalizing VM. I've been using it for many, many years
and I was fearful for a time that it was going to be lost.

-- Bob

--
Robert L. Knighten
***@knighten.org

John Hein

2012-01-19 14:14:08 UTC

Post by r***@knighten.org
Sorry, my message was missing a critical phrase - I want the relevant parts of
the message decoded before the messages are saved. Right now I do this several
times a day using a couple of crude keyboard macros, so I expect I can fully
automate this but perhaps this has already been done?

Yep, that was the "half" wishlist item I mentioned in the OP.
Since I'm not the only one who wants it, I'll promote it to
full wishlist status. It's nice to be able to use offline
search tools (grep [1]) on mail and base64 makes that hard.

I think we should mark the new saved message as 'edited' since it's
different than the original. It will also be interesting when
"illegal" characters appear in the decoding. We could only allow the
re-coding for text/ mime, but wrong mime type hints are known to
happen. Maybe re-coding to quoted-printable? Or refuse to recode
when non-printable characters show up, but that may be hard to do.

[1] Insert list of good non-plain-text search tools here.

Julian Bradfield

2012-01-19 14:54:32 UTC

Post by John Hein
I think we should mark the new saved message as 'edited' since it's
different than the original. It will also be interesting when
"illegal" characters appear in the decoding. We could only allow the
re-coding for text/ mime, but wrong mime type hints are known to
happen. Maybe re-coding to quoted-printable? Or refuse to recode
when non-printable characters show up, but that may be hard to do.

What do you mean by an illegal character? Why would you want to stop
decoding of, say, PDFs to binary? It would save time and space later.
The main problem with textual search is that the character encoding
may vary from message to message, and even from part to part within a
MIME message. Because VM folders are, and have to be, binary, you
can't search for non-ASCII characters within a folder. I don't see a
good solution to this, excepting transcoding everything to utf-8
before saving.
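
A small Python illustration of the charset problem described above (the sample text is made up): the same word is stored as different bytes depending on the part's charset, so a byte-level search over the folder cannot match both, while decoding each part with its own declared charset can.

```python
needle = "café"

# The same sentence as it would be stored in two differently labelled parts.
latin1_body = "Meet me at the café".encode("iso-8859-1")
utf8_body = "Meet me at the café".encode("utf-8")

print(latin1_body)  # b'Meet me at the caf\xe9'
print(utf8_body)    # b'Meet me at the caf\xc3\xa9'

# A byte-level search using one encoding of the needle misses the other part.
print(needle.encode("utf-8") in utf8_body)    # True
print(needle.encode("utf-8") in latin1_body)  # False

# Decoding each part with its own charset makes the comparison uniform.
parts = [(latin1_body, "iso-8859-1"), (utf8_body, "utf-8")]
found_everywhere = all(needle in raw.decode(charset) for raw, charset in parts)
print(found_everywhere)  # True
```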

John Hein

2012-01-19 16:44:44 UTC

Post by Julian Bradfield

Post by John Hein
I think we should mark the new saved message as 'edited' since it's
different than the original. It will also be interesting when
"illegal" characters appear in the decoding. We could only allow the
re-coding for text/ mime, but wrong mime type hints are known to
happen. Maybe re-coding to quoted-printable? Or refuse to recode
when non-printable characters show up, but that may be hard to do.

What do you mean by an illegal character? Why would you want to stop
decoding of, say, PDFs to binary? It would save time and space later.
The main problem with textual search is that the character encoding
may vary from message to message, and even from part to part within a
MIME message. Because VM folders are, and have to be, binary, you
can't search for non-ASCII characters within a folder. I don't see a
good solution to this, excepting transcoding everything to utf-8
before saving.

Short answer: I'm not sure what the best solution might be either.
And I'm not suffering from any delusions that this would be a simple
task with a one-size-fits-all solution.

I guess an 'illegal' character would be a character that does not
belong in an email message per RFC. Perhaps put another way - no
character that would choke a mail reader such as vm or other mail
handler.

Longer...
That said, when "exporting" a message, one may have plans to use it
outside a mail reader, but that's beyond the scope of what I was
thinking. And we more or less have a tool for that already (to save
mime parts) - vm-mime-save-all-attachments. That doesn't re-save the
parts in place in the message, of course.

So transcoding to utf-8 would probably be out since you can't have raw
utf-8 in an email message and expect all email handling tools to be
happy with it. That said, grep (and emacs?) can be told to search
utf-8 input, so it would be useful at some level. Can vm handle
messages with raw utf-8 in the body?

Changing a plain text base64 that only has 7-bit ascii in the decoded
stream to 7-bit ascii encoding and resaving the mime part with the
appropriate encoding hint would be one example of the sort of "legal"
transformation I had in my head.

Transforming to quoted-printable if possible seems legal as well and
opens up the space beyond 7-bit ascii. Grepping through encoded
quoted-printable may be useful in many cases as long as you
set your expectations appropriately (e.g., = is =3d).
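
The two "legal" transformations above can be sketched in Python as follows. This is an illustration under stated assumptions, not an existing VM feature: `reencode_text_part` is a hypothetical helper, it only touches text/* parts marked base64, and it ignores the consequences for signed messages.

```python
import base64
import quopri
from email import message_from_string


def reencode_text_part(part):
    """Replace a base64 transfer encoding on a text/* part, in place.

    Pure 7-bit payloads with short lines become '7bit'; anything else
    becomes quoted-printable. Returns False if the part was left alone.
    """
    cte = (part.get("Content-Transfer-Encoding") or "").strip().lower()
    if part.get_content_maintype() != "text" or cte != "base64":
        return False
    raw = part.get_payload(decode=True)
    is_7bit = all(0 < byte < 128 for byte in raw)
    short_lines = all(len(line) <= 998 for line in raw.splitlines())
    del part["Content-Transfer-Encoding"]
    if is_7bit and short_lines:
        part.set_payload(raw.decode("ascii"))
        part["Content-Transfer-Encoding"] = "7bit"
    else:
        part.set_payload(quopri.encodestring(raw).decode("ascii"))
        part["Content-Transfer-Encoding"] = "quoted-printable"
    return True


# Usage: a base64 part whose decoded text is plain ASCII.
encoded = base64.b64encode(b"a = b\n").decode("ascii")
msg = message_from_string(
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: base64\n"
    "\n" + encoded + "\n"
)
reencode_text_part(msg)
print(msg["Content-Transfer-Encoding"])  # 7bit
```

When the quoted-printable branch is taken, a literal "=" in the text is stored as "=3D", which is the expectation to keep in mind when grepping the result.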

Generally, I only care about re-encoding to something that will make
it easier to use grep or the like. So saving a pdf to a hunk of
binary within the message would be something I didn't want. If I want
to search a pdf I generally have to use some other tool (from
strings(1) at a minimum for certain pdfs to pdftotext to interactive
pdf reading tools) anyway, so re-saving a pdf mime part to binary then
using the tool isn't really better than having to add a mime decoder
in front of that tool.

I sometimes find myself using vm-edit-message 'mimencode -u -b' on a
mime part for various needs. This imagined re-encoder would help
with that in addition to times when I use $ | with various tools
to muck with decoded mime parts of a message.

Uday Reddy

2012-01-19 17:12:12 UTC

Post by Julian Bradfield
What do you mean by an illegal character? Why would you want to stop
decoding of, say, PDFs to binary? It would save time and space later.
The main problem with textual search is that the character encoding
may vary from message to message, and even from part to part within a
MIME message. Because VM folders are, and have to be, binary, you
can't search for non-ASCII characters within a folder. I don't see a
good solution to this, excepting transcoding everything to utf-8
before saving.

I think it is only the text parts that concern us here, because we want to
be able to search for text.

Saving the folders in UTF-8 or whatever (the actual coding system doesn't
matter), in a non-RFC format, seems to be needed to support search.

Does anybody know how the other mail clients do it, e.g., Thunderbird?

Cheers,
Uday

John Hein

2012-01-19 22:27:27 UTC

Post by Uday Reddy
Saving the folders in UTF-8 or whatever (the actual coding system doesn't
matter), in a non-RFC format, seems to be needed to support search.
Does anybody know how the other mail clients do it, e.g., Thunderbird?

Thunderbird does (now) search through the base64 in multiple messages
(https://bugzilla.mozilla.org/show_bug.cgi?id=132340)... added in
2007-ish.

The quick search in TB doesn't work with (my) imap server if you
specify too many search qualifications (more than one header +
body?)... 'Invalid search parameters: (OR (OR FROM "foo" OR TO "foo"
HEADER CC "foo".' It seems better with local folders. It's a little
buggy but mostly works.

In any case, TB doesn't re-save messages in an unencoded format. It
seems to do it in-memory on the fly. Rudimentary confirmation with
strace.

It looks like they only do it on text/* parts.

I think that seems like the right way to go in vm, too (although
allowing the user to save a re-encoded part might be a nice feature,
too).

Before TB added this, I think it did work for searching base64 in
individual messages - "Find in this message" (much like
vm-isearch-presentation works now in vm, but folder-wide searches
don't). But that's just my recollection - I could be remembering
wrong.

It doesn't seem to work right if the base64 encoded text/* is the
whole message (this is the opposite of how vm-save-message-preview
currently misbehaves in vm where it successfully decodes whole
message encodings, but not mime parts).

Uday Reddy

2012-01-19 15:51:15 UTC

Post by r***@knighten.org
Sorry, my message was missing a critical phrase - I want the relevant
parts of the message decoded before the messages are saved. Right now I
do this several times a day using a couple of crude keyboard macros, so I
expect I can fully automate this but perhaps this has already been done?

I see a function called vm-save-message-preview in vm-rfaddons.el. Please
try it.

If you find any functions in the add-ons that you think should be integrated
into VM, please tell me and perhaps provide a good doc string and some text
to go into the info manual. I will be happy to integrate them.

More generally, I am thinking that there is no reason why we can't have VM
folders stored in some other character set, other than US-ASCII, e.g.,
UTF-8. Those folders won't be interoperable with other mail clients, but do
we care about that really? That should be a big win for people that need to
use international character sets regularly. MIME-decoded text can then be
stored directly into folders and all normal Emacs searching functions will
work.

This needs some thought and discussion, and it will need some amount of
careful reengineering effort. The assumption about 7-bit US-ASCII is
probably pervasive in a lot of VM code. So, it will need extensive testing,
and I would need people that are willing to participate in that. There
could be possibility of corruption of mail folders. They would need to back
up email carefully. But, it could be a big win in the long run.

Cheers,
Uday

Mark Diekhans

2012-01-19 17:38:07 UTC

Hi Uday,

For me, compatibility with mbox format is a feature. I often use
other applications to scan VM mail folders, such as procmail. So
it's not necessarily just mail reading clients. A program to
convert back to mbox would suffice. Although it would still
remain important for VM to be able to read mbox files without
modifying them.

Although, if a non-mbox format is what is required to take VM to
the next level, it might be time to think bigger.

For example, currently I have gigabytes of automatically created
mbox folders that I need to be able to search and read. My
current approach is to use a C program to search for patterns and
then save the matching messages to a mbox file and then open
that file in VM. Of course, this is very sub-optimal. I have
been trying to determine how to best address this.
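
The workflow described above (search, save the hits to a new mbox, open that in VM) can be sketched with Python's standard `mailbox` module. This is only an illustration of the approach, not the C program mentioned; `extract_matches` is a hypothetical name, and the pattern is matched against the raw, still-encoded message bytes.

```python
import mailbox
import re


def extract_matches(src_path, dst_path, pattern):
    """Copy messages whose raw bytes match pattern (a bytes regex) to dst_path.

    Returns the number of messages copied.
    """
    regex = re.compile(pattern)
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    copied = 0
    try:
        for key in src.iterkeys():
            # get_bytes returns the message exactly as stored in the folder.
            if regex.search(src.get_bytes(key)):
                dst.add(src.get_message(key))
                copied += 1
    finally:
        dst.close()  # flushes the added messages to disk
        src.close()
    return copied
```

A call such as `extract_matches("archive.mbox", "hits.mbox", rb"(?i)grant proposal")` (file names made up) would leave the matching messages in `hits.mbox`, which can then be visited as an ordinary folder.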

With imap support, VM is moving towards an abstraction between
the lisp code and mailbox format. It seems that moving
everything toward using some kind of abstraction layer that can
then be mapped on top of mbox, imap, etc, which could do things
like character set transforms, could enable a lot of new things.

Mark

Post by Uday Reddy
More generally, I am thinking that there is no reason why we can't have VM
folders stored in some other character set, other than US-ASCII, e.g.,
UTF-8. Those folders won't be interoperable with other mail clients, but do
we care about that really? That should be a big win for people that need to
use international character sets regularly. MIME-decoded text can then be
stored directly into folders and all normal Emacs searching functions will
work.

Uday Reddy

2012-01-19 18:17:27 UTC

Post by Mark Diekhans
For me, compatibility with mbox format is a feature. I often use
other applications to scan VM mail folders, such as procmail. So
it's not necessarily just mail reading clients. A program to
convert back to mbox would suffice. Although it would still
remain important for VM to be able to read mbox files without
modifying them.

No, there is no attempt to get rid of the mbox format. Recall that the
original question was "how can I save MIME-decoded text?" One option is to
decode it and leave it in the folder. Then the folder becomes non-RFC (not
7-bit ASCII any more). Another option is to decode it and save it to some
other (non-RFC) folder. A worse option would be to decode it and save it in
an independent file. I don't see how saving it as an independent file would
be good for anything. Mail folders are much better for organizing,
categorizing and searching email than file system directory trees are. In
fact, I think that in the long run, VM will probably turn into a
semantically organized file system for email. We don't want to give up this
semantic organization in order to satisfy some RFC, which is a historical
relic really.

Post by Mark Diekhans
For example, currently I have gigabytes of automatically created
mbox folders that I need to be able to search and read. My
current approach is to use a C program to search for patterns and
then save the matching messages to a mbox file and then open
that file in VM. Of course, this is very sub-optimal. I have
been trying to determine how to best address this.

Mairix is an email indexing tool with a VM interface. Check out this blog
post, for example:

http://robert-adesam.blogspot.com/

Cheers,
Uday

Mark Diekhans

2012-01-21 22:57:50 UTC

Hi Uday,

Post by Uday Reddy
No, there is no attempt to get rid of the mbox format.

Understood. I am only worried about incompatibilities that
break other software. Hopefully this is an unnecessary worry.

Post by Uday Reddy
Recall that the
original question was "how can I save MIME-decoded text?" One option is to
decode it and leave it in the folder. Then the folder becomes non-RFC (not
7-bit ASCII any more).

I suspect this will break very little software. As long as
there is a path back to a 7-bit ASCII, not worried at all.
I wouldn't even be worried about non-mbox as long as there
was a bidirectional path to mbox.

Post by Uday Reddy
Mail folders are much better for organizing, categorizing and
searching email than file system directory trees are. In
fact, I think that in the long run, VM will probably turn into
a semantically organized file system for email.

Absolutely. maildir doesn't scale beyond a tiny amount of mail.
For a lot of us, mbox is reaching its limits. This is why I
was musing about an abstraction between an underlying storage
format being a way to scale vm to much larger mail collections.

Post by Uday Reddy
Mairix is an email indexing tool with a VM interface. Check out this blog
http://robert-adesam.blogspot.com/

Thanks for the pointer. I had noticed mairix, but not the
interface to VM, which is an absolute requirement.

Cheers,
Mark

Uday Reddy

2012-01-21 23:35:38 UTC

Post by Mark Diekhans
Absolutely. maildir doesn't scale beyond a tiny amount of mail.
For a lot of us, mbox is reaching its limits. This is why I
was musing about an abstraction between an underlying storage
format being a way to scale vm to much larger mail collections.

I am not sure that mbox is that much of a limit, once you figure out how to
archive old mail. I keep about 3 months worth of email in my "INBOX" and
archive older mail into 3 month quantities of mbox's. Virtual folders then
help me combine these mbox's into whatever combinations I want. Virtual
folders are definitely your friend here.

Mail older than a year or so needs to be "cleaned", throwing out junk,
saving important stuff and quickly figuring out what else needs to be kept.
I don't yet have good ways of doing it, but I am making some progress. The
recent addition of V T search folders is part of that effort. I am also
looking at adopting and enhancing various features in Rob's vm-avirtual.el
library.

Then we have the IMAP folders and external (headers-only) message feature of
VM. These already provide a powerful "abstraction layer" but only for IMAP
users. Eventually, this kind of thing will also become possible to other
forms of external storage: maildir etc.

Cheers,
Uday

m***@kermodei.com

2012-01-28 21:21:51 UTC

Hi Uday,

I am changing the subject of this mail thread to be more
appropriate. Hopefully, it's of interest to others and not just
overloading more mailboxes.

Post by Uday Reddy
I am not sure that mbox is that much of a limit, once you figure out how to
archive old mail.

It's not the format so much as the need to read an entire mbox
into memory.

Post by Uday Reddy
I keep about 3 months worth of email in my "INBOX" and
archive older mail into 3 month quantities of mbox's. Virtual folders then
help me combine these mbox's into whatever combinations I want. Virtual
folders are definitely your friend here.

I am way beyond being able to sort through mail to decide what
to keep. I just automatically archive everything on delivery
and keep per-month folders. Things I need to deal with stay in
the INBOX and get periodically flushed when it gets too big.

It is much less work to just archive everything and dig things
out as needed. It's frequently been a surprise what mail was
needed from the archive. Not necessarily stuff I would have
saved if I were manually archiving.

The volume has gotten absurd, especially when I have to work on
grants with people who insist on e-mailing around Word documents
as a work flow. My archive for December is approaching a
gigabyte :-(

This is way beyond anything that will work with reading mbox
files into memory. Even the current INBOX gets overloaded and
ends up spending way too much time garbage collecting.

What I would really like is the ability to start a search on
the archive within vm, get back a list of hits in something
similar to a summary buffer and then look through them with vm
as if normal email.

Current thinking is that running a private imap server on top of
my mail archive might be an approach.

Post by Uday Reddy
Then we have the IMAP folders and external (headers-only) message feature of
VM. These already provide a powerful "abstraction layer" but only for IMAP
users. Eventually, this kind of thing will also become possible to other
forms of external storage: maildir etc.

YES! This is exactly the example I had in mind. Forgive me
for speculating, since I don't know a lot of the internals of
VM. What I am seeing is VM starting to evolve into a user
interface to various mail back ends rather than an internal
format and the special-case external ones. Better integrating
powerful back-end search engines could come from this.

Cheers,
Mark

Uday Reddy

2012-01-30 17:56:16 UTC

Post by m***@kermodei.com
I am way beyond being able to sort through mail to decide what
to keep. I just automatically archive everything on delivery
and keep per-month folders. Things I need to deal with stay in
the INBOX and get periodically flushed when it gets too big.
It is much less work to just archive everything and dig things
out as needed. It's frequently been a surprise what mail was
needed from the archive. Not necessarily stuff I would have
saved if I were manually archiving.

I personally use the term "archiving" to mean saving away mail without
making any decisions about what to keep. VM's use of "archiving" is a bit
different. `vm-auto-archive-messages' saves messages using
`vm-auto-folder-alist' (which is also used for manual saving). So, it is
closer to saving than archiving.

Post by m***@kermodei.com
The volume has gotten absurd, especially when I have to work on
grants with people who insist on e-mailing around Word documents
as a work flow. My archive for December is approaching a
gigabyte :-(

The solution for that problem is to save away attachments and delete them
from the mbox. vm-rfaddons has a function called
`vm-mime-auto-save-all-attachments' which can be invoked automatically from
`message-arrived-hook'. Of course, you can invoke it manually as well. If
you get a lot of attachments, you should definitely look into this
solution.

Post by m***@kermodei.com
What I would really like is the ability to start a search on
the archive within vm, get back a list of hits in something
similar to a summary buffer and then look through them with vm
as if normal email.
Current thinking is that running a private imap server on top of
my mail archive might be an approach.

That is definitely the right strategy. But, unfortunately, I don't yet have
a front-end for IMAP search implemented in VM. It won't be before the
summer that I can get to it, because things like this need bigger blocks of
time.

However, there is also the solution of mairix which was mentioned a couple
of times in this thread. It works with local folders. And, there is a VM
front-end to it. If somebody can experiment with mairix and give me a
section to include in the info file, that will be great!

Cheers,
Uday

John Hein

2012-01-19 18:42:56 UTC

Post by Uday Reddy

Post by r***@knighten.org
Sorry, my message was missing a critical phrase - I want the relevant
parts of the message decoded before the messages are saved. Right now I
do this several times a day using a couple of crude keyboard macros, so I
expect I can fully automate this but perhaps this has already been done?

I see a function called vm-save-message-preview in vm-rfaddons.el.
Please try it.

That doesn't seem to work for mime parts, but it does save the message
decoded if the entire message was, for instance, marked as encoded
with base64.

It does not save it as a message in a vm folder or even

Uday Reddy

2012-01-19 19:14:56 UTC

Good experiment, John. It is interesting that VM didn't choke on the UTF-8
message whereas Thunderbird did! I should save a link to this UTF-8 demo
file.

Both thunderbird and vm can search the individual message and find
utf-8 characters (vm-isearch-presentation in vm) in the raw utf-8
stream (and both can search the individual message as well when it's
base64 encoded).

vm-isearch-presentation doesn't count, because it is not searching in the
message text but rather in the message presentation. The presentation
buffer is in Emacs native encoding, not UTF-8. If M-s and V C text work, I
would be more impressed. If they work on all mule-capable versions of
Emacs, I would be even more impressed!

Cheers,
Uday

Julian Bradfield

2012-01-19 22:06:39 UTC

Post by Uday Reddy
More generally, I am thinking that there is no reason why we can't have VM
folders stored in some other character set, other than US-ASCII, e.g.,
UTF-8. Those folders won't be interoperable with other mail clients, but do

...

Post by Uday Reddy
careful reengineering effort. The assumption about 7-bit US-ASCII is
probably pervasive in a lot of VM code. So, it will need extensive testing,

I have no idea what you're talking about! VM makes no assumptions at
all about the character set of its folders, except that message
headers are (as required) in ASCII - many of my patches over
the last few years have been removing the accidental cases where it
failed to enforce its agnosticism.

VM folders are simply binary files. The character set of a given
message - or subpart of a message - is determined by its MIME charset.

If you wanted, you could transcode all non-utf-8 parts to utf-8, but
the folder would still be a binary file; it would just be a binary
file that happened also to be valid utf-8 as a whole.

John Hein

2012-01-19 23:04:26 UTC

Post by Julian Bradfield

Post by Uday Reddy
More generally, I am thinking that there is no reason why we can't have VM
folders stored in some other character set, other than US-ASCII, e.g.,
UTF-8. Those folders won't be interoperable with other mail clients, but do

....

Post by Uday Reddy
careful reengineering effort. The assumption about 7-bit US-ASCII is
probably pervasive in a lot of VM code. So, it will need extensive testing,

I have no idea what you're talking about! VM makes no assumptions at
all about the character set of its folders, except that message
headers are (as required) in ASCII - many of my patches over
the last few years have been removing the accidental cases where it
failed to enforce its agnosticism.
VM folders are simply binary files. The character set of a given
message - or subpart of a message - is determined by its MIME charset.
If you wanted, you could transcode all non-utf-8 parts to utf-8, but
the folder would still be a binary file; it would just be a binary
file that happened also to be valid utf-8 as a whole.

I hope vm can handle binary bodies okay. I suspect there may be some
edge cases (e.g., embedded '\n\n

Julian Bradfield

2012-01-20 08:10:00 UTC

Post by John Hein
I hope vm can handle binary bodies okay. I suspect there may be some
edge cases (e.g., embedded '\n\n

John Hein

2012-01-20 19:44:12 UTC

But there are ramifications to having raw binary data in email
messages beyond the scope of vm. Imagine reading a message with raw
binary data in an xterm with emacs -nw - see your terminal window go

Most people don't insert random binary junk into the text of their
messages, so why would you be reading non-text data?

We're talking about re-saving mime sections so they have something
other than 7-bit ascii. That's not necessarily non-text data (if it
really is text/plain valid utf-8 under the covers for instance), but
if you view it in such a way not prepared to handle the "binary" data,
this can cause problems. That is, we're talking about the possibility
of having a feature in vm to "insert binary data".

This whole issue about re-saving email is a side path to the main
issue that started this thread - reaching into base64 encoded content
for searches.

I think re-formatting base64 content and saving it as something else
may have its uses, but it's got some rough edges (that we've mentioned
in this thread) that make it a not very desirable solution as the
primary way to get searching in base64 content working in vm.

Ullrich's mention about re-saving messages messing with signed
email is another good example of the "rough edges", too.

Julian Bradfield

2012-01-20 20:23:30 UTC

Post by John Hein
We're talking about re-saving mime sections so they have something
other than 7-bit ascii. That's not necessarily non-text data (if it
really is text/plain valid utf-8 under the covers for instance), but
if you view it in such a way not prepared to handle the "binary" data,
this can cause problems. That is, we're talking about the possibility
of having a feature in vm to "insert binary data".

But if you ever get any non-English mail, you may well already have
"binary data" in your text.

I don't see how your worry is any different from saying "if you do
cat /bin/ls
in an xterm, weird stuff will happen". Of course it will; so what?
If it's actually utf-8, you'll probably see it as intended, as every
modern distribution is set up to use utf-8 by default.

John Hein

2012-01-21 02:28:07 UTC

Post by Julian Bradfield
But if you ever get any non-English mail, you may well already have
"binary data" in your text.

Indeed. Some, not much, often with mismarked encoding.

Post by Julian Bradfield
I don't see how your worry is any different from saying "if you do
cat /bin/ls
in an xterm, weird stuff will happen". Of course it will; so what?
If it's actually utf-8, you'll probably see it as intended, as every
modern distribution is set up to use utf-8 by default.

Most of the non-english email with "binary" payload I get is base64
encoded or quoted-printable or q-encoding in headers. Does that mean
there aren't mailers out there sending raw binary or converting from
an encoding to binary before delivering it in the user's inbox? No,
just none that I've seen (or noticed at least) yet.

Do I see 8-bit data despite the sender lying with
'Content-Transfer-Encoding: 7bit'? Most definitely - typically things
from misbehaving mailers like 0xa0 (non-breaking space) and 0x92
(seems to be a right single quote in windows-1252, but certainly not
iso-8859-1 like the message I'm looking at claims) and the like.
Except for spam / virus payloads, these broken elements are
fairly innocuous, however.
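
The two stray bytes mentioned above can be checked quickly in Python. The helper `violates_7bit` is a hypothetical name used only for this illustration.

```python
import unicodedata

# 0x92 is a right single quote in windows-1252 ...
quote = b"\x92".decode("cp1252")
print(unicodedata.name(quote))  # RIGHT SINGLE QUOTATION MARK

# ... but in iso-8859-1 the same byte is a C1 control character.
control = b"\x92".decode("iso-8859-1")
print(unicodedata.category(control))  # Cc

# 0xa0 is a no-break space in iso-8859-1.
nbsp = b"\xa0".decode("iso-8859-1")
print(unicodedata.name(nbsp))  # NO-BREAK SPACE


def violates_7bit(body):
    """Return True if a body labelled 7bit contains any byte above 0x7f."""
    return any(byte > 0x7F for byte in body)


print(violates_7bit(b"It\x92s fine"))  # True
print(violates_7bit(b"It's fine"))     # False
```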

Will it always be the case that base64 is used instead of raw binary?
No - some day raw binary may flow freely over email channels. And I
agree vm should be prepared for it. The biggest issue might be
migrating away from the default mbox format for local folders
(separate topic).

Re: so what? If the binary data is marked with a proper mime type and
encoding, and I always use a tool (e.g., vm) that knows what to do with
that chunk of mime, then I agree it really is a dont-care. Those
conditions don't always hold true. But even then (mismarked encoding
like application/pdf marked as text/plain or someone uses cat(1) on a
message in an xterm perhaps), you could still say "so what? that's
operator error or a bug that should be fixed" and sleep well with
that answer.

Will vm currently handle any case of raw binary in the payload of
a message? As I said earlier, I hope so, but I wouldn't be surprised
if it didn't (mainly due to aforementioned storage format).

But going back to the questions at hand.

(a) Whether or not to add a feature in vm to support re-encoding
base64 sections (or any arbitrary mime section) to some other
encoding.

(b) Whether to fix M-s and/or V C text to grok base64 or other
transfer encodings (not to mention complications due to
character sets)...
(1) on the fly in-memory
(2) by way of doing (a)

I think (b)(1) is best if it can be made "not slow" in vm.

If someone adds (a), that'd probably be okay and useful in certain
circumstances, but the user would have to be explicitly aware of any
consequences when he decides to invoke that feature (e.g.,
invalidating of signed messages, possible mail storage issues, etc.).
I don't think vm should automatically do any permanent re-encoding
of messages or message parts for its own needs (e.g., (b)(2)).
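
To make option (b)(1) concrete, here is a rough Python sketch of an on-the-fly, in-memory search that decodes each text part before matching and never rewrites the stored message. The function name search_decoded and the sample message are hypothetical; VM itself is Emacs Lisp and would do this differently.

```python
import base64
import re
from email import message_from_bytes


def search_decoded(raw_message: bytes, pattern: str) -> list:
    """Return regex matches found in the decoded text parts of one message."""
    msg = message_from_bytes(raw_message)
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip attachments and multipart containers
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset("us-ascii")
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to a byte-preserving decode.
            text = payload.decode("latin-1")
        hits.extend(m.group(0) for m in regex.finditer(text))
    return hits


# Invented sample message: UTF-8 text/plain sent as base64.
raw = (
    b"From: a@example.org\n"
    b"Subject: lunch\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    + base64.encodebytes("Meet at the café at noon.\n".encode("utf-8"))
)

print(search_decoded(raw, r"caf\w"))
```

The stored bytes are left alone, so signatures stay valid; the cost is that every search pays for the decoding again, which is the "not slow" concern.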

Julian Bradfield

2012-01-21 09:50:05 UTC

Post by John Hein
Most of the non-English email with "binary" payload I get is base64
encoded or quoted-printable or q-encoding in headers. Does that mean
there aren't mailers out there sending raw binary or converting from
an encoding to binary before delivering it in the user's inbox? No,
just none that I've seen (or noticed at least) yet.

Are you American?
Out here in the rest of the world, 10% (almost exactly) of my incoming
mail arrives with Content-Transfer-Encoding: 8bit

Post by John Hein
No - some day raw binary may flow freely over email channels. And I

It does. Over here in Yerp we've been using ESMTP for decades - hell,
even AOL uses ESMTP, and if the most primitive mail service in America can do
it, surely the rest of America does too!

Post by John Hein
Will vm currently handle any case of raw binary in the payload of
a message? As I said earlier, I hope so, but I wouldn't be surprised
if it didn't (mainly due to aforementioned storage format).

It certainly should - modulo the message delimiter problem, which is
non-trivial.
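
For readers unfamiliar with the delimiter problem: mbox separates messages with lines starting with "From ", so a body line that starts the same way looks like a new message. One known workaround is the "mboxrd" quoting convention, sketched below in Python. This is an illustration only; the thread does not say which convention VM uses, and the helper names are hypothetical.

```python
import re

# mboxrd convention: any body line matching ^>*From<space> gets one extra ">".
_QUOTE = re.compile(rb"^(>*From )", re.MULTILINE)
_UNQUOTE = re.compile(rb"^>(>*From )", re.MULTILINE)


def quote_from_lines(body: bytes) -> bytes:
    """Escape body lines that would be mistaken for an mbox separator."""
    return _QUOTE.sub(rb">\1", body)


def unquote_from_lines(body: bytes) -> bytes:
    """Reverse quote_from_lines when reading the message back."""
    return _UNQUOTE.sub(rb"\1", body)


# Invented body containing raw 8-bit data and two troublesome lines.
body = b"Hi\nFrom here on it is raw \xff bytes\n>From quoted already\n"
stored = quote_from_lines(body)
print(stored)
```

Because the quoting works on bytes and is reversible, it does not care what charset the body is in, but the bytes on disk no longer equal the bytes that were sent until they are unquoted.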

Post by John Hein
(a) Whether or not to add a feature in vm to support re-encoding
base64 sections (or any arbitrary mime section) to some other
encoding.
(b) Whether to fix M-s and/or V C text to grok base64 or other
transfer encodings (not to mention complications due to
character sets)...
(1) on the fly in-memory
(2) by way of doing (a)
I think (b)(1) is best if it can be made "not slow" in vm.
If someone adds (a), that'd probably be okay and useful in certain
circumstances, but the user would have to be explicitly aware of any
consequences when he decides to invoke that feature (e.g.,
invalidating of signed messages, possible mail storage issues, etc.).
I don't think vm should automatically do any permanent re-encoding
of messages or message parts for its own needs (e.g., (b)(2)).

I agree with that. I do an assortment of munges on my incoming mails
(e.g. stripping out entire quoted messages), but *I* do it, and it's
*my* fault if I lose something. I don't want my mailer doing it!

Uday S Reddy

2012-01-21 10:05:59 UTC

Post by John Hein
Will it always be the case that base64 is used instead of raw binary?
No - some day raw binary may flow freely over email channels. And I
agree vm should be prepared for it. The biggest issue might be
migrating away from the default mbox format for local folders
(separate topic).

How email flows over email channels doesn't concern us really. We are
talking about what to do with it after it arrives.

(Even after the world declares to itself that 8bit is safe, there will still
be some mailers that will keep encoding things into 7bit for eternity.
Let us punt that issue. It is totally irrelevant.)

Cheers,
Uday

Uday Reddy

2012-01-20 01:36:13 UTC

Post by Julian Bradfield
I have no idea what you're talking about! VM makes no assumptions at
all about the character set of its folders, except that message
headers are (as required) in ASCII - many of my patches over
the last few years have been removing the accidental cases where it
failed to enforce its agnosticism.

I hope you are right, but I can't be sure.

Post by Julian Bradfield
VM folders are simply binary files. The character set of a given
message - or subpart of a message - is determined by its MIME charset.
If you wanted, you could transcode all non-utf-8 parts to utf-8, but
the folder would still be a binary file; it would just be a binary
file that happened also to be valid utf-8 as a whole.

You have lost me there. What do you mean by "binary"?

Cheers,
Uday

Uday Reddy

2012-01-20 01:54:49 UTC

Post by Julian Bradfield

Post by Uday Reddy
More generally, I am thinking that there is no reason why we can't have VM
folders stored in some other character set, other than US-ASCII, e.g.,
UTF-8. Those folders won't be interoperable with other mail clients, but do

VM folders are simply binary files. The character set of a given
message - or subpart of a message - is determined by its MIME charset.
If you wanted, you could transcode all non-utf-8 parts to utf-8, but
the folder would still be a binary file; it would just be a binary
file that happened also to be valid utf-8 as a whole.

Oh, perhaps you are saying they are "binary" as opposed to "ASCII". I think
that is a matter of view point.

You can load a utf-8 folder into VM but, if there are any multibyte codes in
there, they will get interpreted as separate characters. You can't search
them. If it has 8-bit codes, you might be in slightly better luck. But if
your default is iso-8859-1, and your message text is in iso-8859-X, then you
won't be able to search it either.

You are probably thinking of the folder as being made up of bytes as opposed
to characters. That is a fine view point to take as long as you don't care
to search. But searching is what this thread is about!
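
The effect described here can be imitated outside Emacs. In this Python sketch, decoding with latin-1 stands in for loading a folder one byte per character; the sample text is invented.

```python
# Invented sample text, stored on disk as UTF-8.
stored = "Grüße aus Köln".encode("utf-8")

# Byte-per-character view, roughly what a "binary" load gives you:
# every byte becomes its own character, so multibyte codes are split up.
as_loaded = stored.decode("latin-1")

# The same bytes decoded with the right coding system.
properly = stored.decode("utf-8")

print(len(stored), len(as_loaded), len(properly))
print("Köln" in as_loaded)   # the non-ASCII word cannot be found
print("aus" in as_loaded)    # the pure-ASCII word still can
print("Köln" in properly)
```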

Cheers,
Uday

Ulrich Mueller

2012-01-20 07:07:52 UTC

Post by Uday Reddy

Post by Julian Bradfield

Post by Uday Reddy
More generally, I am thinking that there is no reason why we
can't have VM folders stored in some other character set, other
than US-ASCII, e.g., UTF-8. Those folders won't be interoperable
with other mail clients, but do

VM folders are simply binary files. The character set of a given
message - or subpart of a message - is determined by its MIME
charset.
If you wanted, you could transcode all non-utf-8 parts to utf-8,
but the folder would still be a binary file; it would just be a
binary file that happened also to be valid utf-8 as a whole.

Oh, perhaps you are saying they are "binary" as opposed to "ASCII".
I think that is a matter of view point.

I think that Julian is right here. Folders don't have any specific
character encoding, they are simply a stream of bytes. (In terms of
coding systems, it's "raw-text".) Character sets come into play on the
level of individual messages, and they're specified by the message's
(or part's in case of multipart messages) MIME headers.

There's another aspect why general recoding of saved messages might
not be a good idea: A message can be PGP signed, and any change of
encoding will destroy the message's integrity and therefore render the
signature invalid.

Post by Uday Reddy
[...]
You are probably thinking of the folder as being made up of bytes as
opposed to characters. That is a fine view point to take as long as
you don't care to search. But searching is what this thread is
about!

Seems like the search function must MIME-decode each message then.
I've no idea though if doing so would be fast enough.

Before reinventing the wheel, maybe it would be worthwhile to look at
dedicated search tools like mairix. There's also an Emacs interface
for it, see <http://randomsample.de/mairix-el-doc/>.

Ulrich

Uday S Reddy

2012-01-20 23:39:50 UTC

Post by Ulrich Mueller
I think that Julian is right here. Folders don't have any specific
character encoding, they are simply a stream of bytes. (In terms of
coding systems, it's "raw-text".) Character sets come into play on the
level of individual messages, and they're specified by the message's
(or part's in case of multipart messages) MIME headers.

If a folder is simply a stream of bytes then neither M-s nor V-C-text would
make sense. For searching, the question is how those bytes are interpreted
in Emacs. Since 7-bit ASCII is common to all encodings (including Emacs's
own internal encoding), searching happens to work for the ASCII parts. It
doesn't work for the non-ASCII parts. (So, when I say folders are in
"ASCII", and Julian says they are in "binary", we are really saying the same
thing. I am focusing on the 7 bits that can be searched. He is focusing on
the fact that the 8th bit doesn't get munged by Emacs or VM.)

The two questions that were posed in this thread were:

1. how can we save MIME-decoded messages?

2. how can we search in all the text parts of messages?

For the first question, there is no reason at all why VM can't save
MIME-decoded text in files. If a message arrives using an 8-bit character
set or a multi-byte character set, encoded in base64 or quoted-printable,
it should be possible for VM to produce an equivalent message replacing the
MIME-encoded parts with text/plain parts, so that it can be saved
somewhere.
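
As a rough illustration of that transformation, the Python sketch below rewrites base64 or quoted-printable text/plain parts as 8bit while keeping the original charset label. It uses the standard email package, not anything from VM; the function name decode_text_parts and the sample message are hypothetical.

```python
import base64
from email import message_from_bytes, policy


def decode_text_parts(raw: bytes):
    """Return a message whose encoded text/plain parts are stored as 8bit."""
    msg = message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        cte = str(part.get("Content-Transfer-Encoding", "")).lower()
        if cte not in ("base64", "quoted-printable"):
            continue
        charset = part.get_content_charset("us-ascii")
        text = part.get_content()  # transfer-decoded and charset-decoded
        # Replaces the Content-* headers and payload; other headers are kept.
        part.set_content(text, subtype="plain", charset=charset, cte="8bit")
    return msg


# Invented sample message: UTF-8 text/plain sent as base64.
raw = (
    b"From: a@example.org\n"
    b"Subject: lunch\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    + base64.encodebytes("Meet at the café at noon.\n".encode("utf-8"))
)

out = decode_text_parts(raw)
print(out.as_bytes().decode("utf-8"))
```

The result is a different message from the one that was received, so any signature over the original part no longer verifies; that is the trade-off raised elsewhere in this thread.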

Assuming we have that working, I started wondering where such decoded
messages can be saved: in individual files only, or other VM folders? I
think what Julian and you are saying is that it is perfectly fine to save
them in VM folders. I am coming around to agreeing with that.

However, the second question is now important. Searching in MIME-encoded
text parts is not feasible with our present tool set, but one would hope
that it should become possible at least in the new folders where
MIME-decoded text has been saved. But it is still not possible! It is not
possible for the same reason that you have been crying about. The folders
are streams of bytes. So, it is still only the ASCII parts that can be
searched. Bummer!

So, my idea is to allow VM folders that are *text* files. Internally in
Emacs, they will be in the Emacs internal coding. When they are saved to
disk, they would be saved in an encoding that Emacs chooses or the user
chooses. Full text search should be possible for such folders, including
grep. Otherwise there would be no point in having such folders.

So, if you guys can put your mind to the question of whether such folders
will work, we can make some progress.

Post by Ulrich Mueller
There's another aspect why general recoding of saved messages might
not be a good idea: A message can be PGP signed, and any change of
encoding will destroy the message's integrity and therefore render the
signature invalid.

I would expect that when the user saves decoded messages, they would strip
the PGP signatures and decrypt any encrypted messages. Decryption may not
be always advisable, but it depends on the user what he/she wants to do.

Post by Ulrich Mueller
Before reinventing the wheel, maybe it would be worthwhile to look at
dedicated search tools like mairix. There's also an Emacs interface
for it, see <http://randomsample.de/mairix-el-doc/>.

Indeed, can people that use mairix tell us what is feasible with it?

Cheers,
Uday

Julian Bradfield

2012-01-20 07:26:35 UTC

[Resending from my "vm rocks" gmail account. I hope Julian intended it to
go to the mailing list! -- Uday]

Post by Uday Reddy
Oh, perhaps you are saying they are "binary" as opposed to "ASCII". I think
that is a matter of view point.

No, it's not. You said "7-bit US ASCII", which means what it says: no
bytes with the high bit set, all to be interpreted according to ASCII.

Post by Uday Reddy
You can load a utf-8 folder into VM but, if there are any multibyte codes in
there, they will get interpreted as separate characters. You can't search

Indeed.

Post by Uday Reddy
You are probably thinking of the folder as being made up of bytes as opposed
to characters. That is a fine view point to take as long as you don't care
to search. But searching is what this thread is about!

The folder is made up of bytes. There's no getting round that.
Come to that, *any* file is made up of bytes. It doesn't get converted
to characters until it's read into an Emacs buffer with a given coding
system.
In VM's case, folder files are read into folder buffers with the
binary coding system, and so the folder buffer is also a sequence of
bytes (which are punned with characters 0 to 255).
It's inherently impossible to meaningfully treat a folder as a
sequence of characters - unless you already know that all the
characters in it are represented in the same coding system.
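
The last point can be shown with two invented message bodies in different charsets. In this Python sketch no single decoding of the combined bytes recovers both, while per-part decoding does:

```python
# Two invented message bodies, each in its own charset.
part_a = "señor".encode("latin-1")
part_b = "日本語".encode("utf-8")
folder = part_a + b"\n" + part_b   # stand-in for a folder file

# Decoding the whole folder as UTF-8 fails on the latin-1 byte 0xf1.
try:
    folder.decode("utf-8")
    whole_ok = True
except UnicodeDecodeError:
    whole_ok = False

# Decoding the whole folder as latin-1 never fails, but mangles the UTF-8 part.
mangled = folder.decode("latin-1")

# Decoding each part with its own charset (as MIME headers would say) works.
decoded = [part_a.decode("latin-1"), part_b.decode("utf-8")]

print(whole_ok, "日本語" in mangled, decoded)
```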

Uday Reddy

2012-01-21 09:53:14 UTC

[Julian sent in a response yesterday, but it wasn't copied to mailing list.
Do you want to re-send it to the mailing list, Julian?]

Post by Julian Bradfield
The folder is made up of bytes. There's no getting round that.
Come to that, *any* file is made up of bytes. It doesn't get converted
to characters until it's read into an Emacs buffer with a given coding
system.
In VM's case, folder files are read into folder buffers with the
binary coding system, and so the folder buffer is also a sequence of
bytes (which are punned with characters 0 to 255).
It's inherently impossible to meaningfully treat a folder as a
sequence of characters - unless you already know that all the
characters in it are represented in the same coding system.

Good, I think we are on the same page now.

To re-state what I wrote last night, I am thinking of VM folders made up of
*characters*. They could be used with a new file extension suffix, e.g.,
".vm". On the disk, they would be in some coding system such as UTF-8.
When they are loaded into folder buffers, they would be text files in
Emacs's internal encoding. (In Gnu Emacs, the "coding system" of a buffer
specifies how to move data in and out of the buffer, in particular how it
should be stored on disk. It has nothing to do with how the characters are
represented inside Emacs. They could be represented however Emacs pleases
and we don't care. If XEmacs does it differently, please let me know. I
will check on that.) All the text/plain parts in such a folder would
be in the default coding system of the folder. No "charset" headers. All
the attachments would be MIME-encoded to appear as ASCII. So, they won't
get munged inside Emacs or when they are saved to disk.

The current VM folders are made of *bytes*. But when one saves messages
from a byte folder into a character folder, they get transformed
appropriately. (The signatures are verified and stripped. Encrypted parts?
We will need an option to either decrypt them and store them as "plain text"
or to store them as is.) One could also choose to work with character
folders *all the time*, and transform messages from the byte format to
character format when they arrive.

People being people, they will also want to save messages from ".vm" folders
into byte folders. We either prohibit that, or re-encode the messages as
proper MIME messages before saving into byte folders.

So, I am contesting the idea that it is inherently impossible to have
folders as sequences of characters.

Cheers,
Uday

Julian Bradfield

2012-01-21 11:55:03 UTC

Post by Uday Reddy
[Julian sent in a response yesterday, but it wasn't copied to mailing list.
Do you want to re-send it to the mailing list, Julian?]

That's because you keep replying to me personally when you reply to my
posts, instead of keeping it on the mailing list where it belongs, so I
see it in mail and reply there before I see it on the list.

Post by Uday Reddy
To re-state what I wrote last night, I am thinking of VM folders made up of
*characters*. They could be used with a new file extension suffix, e.g.,
".vm". On the disk, they would be in some coding system such as UTF-8.

In other words, you'd transcode everything to utf-8, and then say "we
know all messages in the folder are in utf-8, so we can load the
folder in utf-8 and bypass per-message decoding", as I remarked
several messages ago.

Post by Uday Reddy
be in the default coding system of the folder. No "charset" headers. All

Why no charset headers? If you're munging a mime message, you should
ensure that it remains a valid mime message.

Post by Uday Reddy
People being people, they will also want to save messages from ".vm" folders
into byte folders. We either prohibit that, or re-encode the messages as
proper MIME messages before saving into byte folders.

Why do this? If you transcode explicitly, the messages are still
proper MIME messages, even if they're in a "character folder" (i.e. a
folder which VM loads using utf-8 (or whatever) coding system rather
than binary).

I don't see why this shouldn't work, but one shouldn't do it by
default. Although Unicode by requirement has an injective mapping from
every legacy standard, it's not the case that Unicode has an injective
mapping from the disjoint union of the legacy standards (Emacs does,
internally). Some Japanese users have very strong feelings about some
of the Japanese/Chinese merges done in Unicode.

Uday Reddy

2012-01-21 22:34:21 UTC

Post by Julian Bradfield
In other words, you'd transcode everything to utf-8, and then say "we
know all messages in the folder are in utf-8, so we can load the
folder in utf-8 and bypass per-message decoding", as I remarked
several messages ago.

My idea is not to transcode anything. Just decode it to Emacs internal
coding and leave it at that. The folder can be saved to disk using whatever
coding Emacs or the user chooses. It could be an ISO coding, for instance.
The coding used to save the folder is not VM's concern.

Post by Julian Bradfield
Why no charset headers? If you're munging a mime message, you should
ensure that it remains a valid mime message.

Is it not valid mime to have text/plain parts without a charset parameter?

Cheers,
Uday

Julian Bradfield

2012-01-22 12:34:59 UTC

Post by Uday Reddy
My idea is not to transcode anything. Just decode it to Emacs internal
coding and leave it at that. The folder can be saved to disk using whatever

That is transcoding...

Post by Uday Reddy
coding Emacs or the user chooses. It could be an ISO coding, for instance.
The coding used to save the folder is not VM's concern.

Only if, as you suggested, you prohibit any manipulation of the folder
file outside VM.
Suppose you use ISO2022. That's a stateful encoding, so if you have,
say, two Japanese messages in a row, if you pull out the second one by
itself from the disk file, you won't know that it's Japanese.
What will you do with win1252 messages, a character/coding set that
has no internal representation, even in the Emacs super-ISO
(ESC-quoted) coding system? You will, by necessity, transcode them to
a standard set.
Or character sets that exist, but which the Emacs installation you're
currently using doesn't happen to have defined? (For example, I don't
have the legacy Thai or Indian systems defined in my Emacs, because I
need the limited charset space for other things.)
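
The ISO 2022 remark is about statefulness: the meaning of a byte depends on an escape sequence that may have appeared much earlier. A small Python sketch using the iso2022_jp codec shows the effect on a single invented string; it is not a reproduction of how Emacs writes a whole folder.

```python
text = "日本"
encoded = text.encode("iso2022_jp")

# Layout: ESC $ B (switch to JIS X 0208), two bytes per character,
# then ESC ( B (switch back to ASCII).
escape_in = encoded[:3]
second_char = encoded[5:7]

# Pulled out by itself, the second character has lost its escape sequence,
# so its two bytes are read as plain ASCII.
alone = second_char.decode("iso2022_jp")

# With the earlier state restored, the same two bytes decode correctly.
with_state = (escape_in + second_char).decode("iso2022_jp")

print(repr(encoded), repr(alone), with_state)
```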

Another trouble is that Emacs (certainly XEmacs, but I think also Emacs)
is not robust wrt character encodings. Generally, you can lose data
without being aware of it, because (de/en)coding happens so low down
that there's no easy way to return an error.
For example, half a year's worth of my non-English mail is
unrecoverably mangled by one of the many VM coding-system problems
I've fixed over the years - because I got no warning at the time that
anything was going wrong. (Now I've patched my XEmacs to at least warn
of trouble.)

Post by Uday Reddy

Post by Julian Bradfield
Why no charset headers? If you're munging a mime message, you should
ensure that it remains a valid mime message.

Is it not valid mime to have text/plain parts without a charset parameter?

Oh, it's valid - but a missing charset parameter MUST be treated as
us-ascii.
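
Python's email package exposes the same rule: a text part with no charset parameter reports no charset, and the caller has to supply us-ascii as the default. A minimal sketch with invented messages:

```python
from email import message_from_string

# Invented messages: one without a charset parameter, one with.
bare = message_from_string("Content-Type: text/plain\n\nhello\n")
labelled = message_from_string(
    'Content-Type: text/plain; charset="UTF-8"\n\nhello\n'
)

declared = bare.get_content_charset()              # nothing was declared
effective = bare.get_content_charset("us-ascii")   # the MIME default applies
explicit = labelled.get_content_charset("us-ascii")

print(declared, effective, explicit)
```

So a "character folder" that drops charset headers while holding non-ASCII text would be read as us-ascii by any other MIME-aware tool.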

I'm not saying your scheme is impossible - or even difficult to do -
just that there are an awful lot of things to be very careful about.

Julian Bradfield

2012-01-19 11:23:31 UTC

Post by Uday Reddy
Perhaps we should have a way of marking messages based on selectors so that
we can bypass the creation of a virtual folder.

You mean M C ??
It's there in 8.1.1...

Uday Reddy

2012-01-19 11:46:42 UTC

Post by Julian Bradfield
You mean M C ??
It's there in 8.1.1...

Gosh, the function is called `vm-mark-matching-messages' and I would have
never thought "matching" means matching a virtual folder selector!

Would people mind if I rename it to `vm-mark-messages-by-selector'?

Cheers,
Uday

John Hein

2012-01-19 13:59:48 UTC

Post by Uday Reddy

Post by Julian Bradfield
You mean M C ??
It's there in 8.1.1...

Gosh, the function is called `vm-mark-matching-messages' and I would have
never thought "matching" means matching a virtual folder selector!
Would people mind if I rename it to `vm-mark-messages-by-selector'?

I don't mind. As a user, I only know it by 'M C <selector>' anyway.

Uday Reddy

2012-01-19 09:05:57 UTC

Post by John Hein
M-s is a wonderful search tool for a vm folder, but it searches the
encoded mime (i.e., the gobbledy-gook) instead of the decoded mime.
Given the ever increasing [it seems to me] usage of base64 even for
plain text messages (particularly from certain mobile devices), I
wonder how hard it would be to update the search to decode mime as it
chugs along. It's fairly rare that I need to search for a string in
encoded mime ;)

M-s is just an interface to the Emacs search engine. So it is quite
unintelligent.

V C text is the better way to go. We also need to get V C text to work with
external search engines (IMAP servers and mairix etc.)

We will eventually recode M-s to work with virtual folders. That will make
it a bit more useful.

Cheers,
Uday

John Hein

2012-01-19 14:22:18 UTC

Post by Uday Reddy

Post by John Hein
M-s is a wonderful search tool for a vm folder, but it searches the
encoded mime (i.e., the gobbledy-gook) instead of the decoded mime.
Given the ever increasing [it seems to me] usage of base64 even for
plain text messages (particularly from certain mobile devices), I
wonder how hard it would be to update the search to decode mime as it
chugs along. It's fairly rare that I need to search for a string in
encoded mime ;)

M-s is just an interface to the Emacs search engine. So it is quite
unintelligent.
V C text is the better way to go. We also need to get V C text to work with
external search engines (IMAP servers and mairix etc.)
We will eventually recode M-s to work with virtual folders. That will make
it a bit more useful.

Yes, V C text is helpful. Good reminder. It doesn't always work right
for me, but I'll have to dig into that separately. And I often
need regex searches which it seems V C text doesn't give.

I still like the incremental highlighted search, with visible context
(and regex and case insensitivity), I get with M-s.
