Brad's Blog: November 2008 Archives

Sat Nov 22 08:16:48 GMT 2008

Adding an option to exclude a user from top

Over the years I've often wanted to exclude a user from top's output (most often 'root' on a fairly quiet system, so I can easily see what other users' processes are up to even when they're not doing much).

I've written a patch for the top in procps that adds j/J command-line and run-time options, which essentially mirror the u/U filters but exclude the given user rather than select them.

I've sent this in to the procps maintainers just in case they want to include it in the trunk.

The patch is here: procps-3.2.7-exclude-user-patch.

The full source bundle is here: procps-3.2.7-exclude-user.tgz.

Posted by Bradley Dean | Permalink

Sat Nov 1 03:20:05 GMT 2008

Converting ISO-8859-1 filenames to UTF-8

While falling asleep at the keyboard I thought I'd leave a process running overnight to process a bunch of my old files, only to find a series of errors popping up telling me that there were unexpected characters in the filenames.

A quick look revealed that I had ended up with some ISO-8859-1 (i.e. Latin-1) character encodings within filenames, which was all very well before I switched to UTF-8!

While contemplating either temporarily resetting my locale (but what about my newer files that include UTF-8 encodings?) or writing some sort of simple filter that dropped the offending characters, I chanced upon Unicode Tools and a neat list of tools available in various languages.

In this case Perl's Encode module came to the rescue with this rather short and nicely effective script:

    #!/usr/bin/env perl

    #
    # For each filepath given on stdin, convert the filename from
    # iso-8859-1 to utf-8 (or try to!)
    #

    use strict;
    use warnings;
    use Encode;

    while ( my $orig_filepath = <STDIN> ) {
      chomp $orig_filepath;

      # Only touch paths containing bytes in the ISO-8859-1 high range
      next unless $orig_filepath =~ /[\xA0-\xFF]/;

      my $new_filepath = $orig_filepath;

      # from_to converts in place and returns undef on failure
      defined( Encode::from_to($new_filepath, 'iso-8859-1', 'utf-8') )
        or die "Could not convert '$orig_filepath'\n";

      print "$orig_filepath ---> $new_filepath\n";
      rename $orig_filepath, $new_filepath
        or warn "Could not rename '$orig_filepath': $!\n";
    }

The check for characters between 0xA0 and 0xFF is far from perfect for general use (indeed differentiating between the various ISO-8859-* encodings would be a bit of a nightmare), but as I knew my encodings were limited to ISO-8859-1 I got away with a bit of a shortcut there.

Comment by: Bradley Dean at Sat, 1 Nov 2008 09:44:11 (BST)

Of course, being somewhat late at night, I missed the obvious blunder in the logic - bytes in the range 0xA0-0xFF also appear in UTF-8 encoded strings, so rerunning the script converts already-converted filenames a second time and makes a bit of a mess.

I'm still thinking about this - the lead byte of a UTF-8 multibyte sequence is in the range 0xC2 to 0xF4 (0xC0 to 0xFD under the original definition of UTF-8). These are all legal ISO-8859-1 character codes, so I think I need to draw some sets out on a piece of paper and find the mutually exclusive bits.
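To make the rerun problem concrete, here is a small sketch (not part of the original script) that pushes a single Latin-1 'é' through the same from_to conversion twice and hex-dumps the result at each step. The helper name hex_bytes is made up for this illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: hex-dump a byte string, e.g. "C3 A9"
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf( '%02X', ord($_) ) } split( //, $bytes );
}

# 'é' as a single ISO-8859-1 byte
my $latin1 = "\xE9";

# First pass: the conversion the script is meant to do
my $once = $latin1;
Encode::from_to( $once, 'iso-8859-1', 'utf-8' );

# Second pass: what happens if the script is run again on the result
my $twice = $once;
Encode::from_to( $twice, 'iso-8859-1', 'utf-8' );

for my $pair ( [ original => $latin1 ], [ once => $once ], [ twice => $twice ] ) {
    my ( $label, $bytes ) = @$pair;
    my $matches = $bytes =~ /[\xA0-\xFF]/ ? 'yes' : 'no';
    printf "%-9s %-12s matches high-byte check: %s\n",
        $label, hex_bytes($bytes), $matches;
}
```

The three dumps come out as E9, then C3 A9 (the correct UTF-8 for 'é'), then C3 83 C2 A9 (which displays as 'Ã©'). All three still match the 0xA0-0xFF check, which is why the script happily converts a filename again on every rerun.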

Comment by: Bradley Dean at Sun, 2 Nov 2008 00:33:40 (BST)

Given that the bytes UTF-8 uses for non-ASCII characters fall within the range 0x80 to 0xF4 (0x80 to 0xFD under the original definition), and given that this overlaps with the bytes used by ISO-8859-1, there's no way to tell the encoding of a high-byte string just by looking at individual bytes.

In my scenario (text will be plain ASCII, ISO-8859-1 or UTF-8) the best test seems to be to try to interpret the string as UTF-8. If errors occur it's going to be ISO-8859-1; if not, it's going to be UTF-8 (or plain ASCII, which is valid UTF-8 anyway).
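A minimal sketch of that test, assuming the three-way scenario above. The function name guess_encoding is invented for this example; it relies on Encode's decode with the FB_CROAK check, which dies on malformed input.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: guess the encoding of a byte string, assuming it can
# only be plain ASCII, ISO-8859-1 or UTF-8
sub guess_encoding {
    my ($bytes) = @_;

    # No high bytes at all: plain ASCII
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;

    # Work on a copy, as decode() with a CHECK value may modify its argument
    my $copy = $bytes;
    my $ok = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK() ); 1 };

    # Malformed UTF-8 makes decode() die, so fall back to ISO-8859-1
    return $ok ? 'utf-8' : 'iso-8859-1';
}

print guess_encoding("caf\xE9"),     "\n";    # Latin-1 bytes
print guess_encoding("caf\xC3\xA9"), "\n";    # UTF-8 bytes
print guess_encoding("cafe"),        "\n";    # plain ASCII
```

This prints iso-8859-1, utf-8 and ascii in turn. It is only a heuristic: a Latin-1 string whose high bytes happen to form valid UTF-8 sequences (such as 'Ã©') would be reported as UTF-8, though that is rare in real filenames.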

Posted by Bradley Dean | Permalink