Sat Nov 22 08:16:48 GMT 2008
Adding an option to exclude a user from top
Over the years I've found that I've wanted to exclude a user from top (most often 'root' on a fairly quiet system, so that I can easily see what the other users' processes are up to, even when they're being quiet).
I've written a patch for the top in procps which adds j/J command-line and run-time options that essentially mirror the u/U filters.
I've sent it to the procps maintainers just in case they want to include it in the trunk.
The patch is here: procps-3.2.7-exclude-user-patch.
The full source bundle is here: procps-3.2.7-exclude-user.tgz.
Sat Nov 1 03:20:05 GMT 2008
Converting ISO-8859-1 filenames to UTF-8
While falling asleep at the keyboard I thought I'd leave a process running overnight to process a bunch of my old files, only to find a series of errors popping up telling me that there were unexpected characters in the filenames.
A quick look revealed that I had ended up with some ISO-8859-1 (i.e. Latin-1) character encodings within filenames, which was all very well before I switched to UTF-8!
Contemplating a temporary reset of my locale (but what about my newer files that include UTF-8 encodings?) or some sort of simple filter that dropped characters, I chanced upon Unicode Tools and a neat list of tools available in various languages.
In this case perl's Encode came to the rescue with this rather short and nicely effective script:
    #!/usr/bin/env perl

    #
    # For each filepath given on stdin, convert the filename from
    # iso-8859-1 to utf-8 (or try to!)
    #

    use strict;
    use warnings;
    use Encode;

    while ( my $orig_filepath = <STDIN> ) {
        if ( $orig_filepath =~ /[\xA0-\xFF]/ ) {
            chomp $orig_filepath;
            my $new_filepath = $orig_filepath;
            Encode::from_to($new_filepath, 'iso-8859-1', 'utf-8')
                or die $!;
            print "$orig_filepath ---> $new_filepath\n";
            rename $orig_filepath, $new_filepath;
        }
    }
The check for characters between 0xA0 and 0xFF is far from perfect for general use (indeed, differentiating between the various ISO-8859-*'s would be a bit of a nightmare), but as I knew my encodings were limited to ISO-8859-1 I got away with a bit of a shortcut there.
Comment by: Bradley Dean at Sat, 1 Nov 2008 09:44:11 (BST)
Of course, being somewhat late at night, I missed the obvious blunder in the logic - there are bytes in the range 0xA0-0xFF in UTF-8 encoded strings, so rerunning can cause a bit of a mess.
I'm still thinking about this - the UTF-8 multibyte marker byte is in the range 0xC0 to 0xFD. These are all legal ISO-8859-1 character codes, so I think I need to draw some sets out on a piece of paper and find the mutually exclusive bits.
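The overlap is easy to demonstrate. Here's a quick sketch (in Python rather than Perl, purely for illustration) showing that the UTF-8 encoding of a non-ASCII character itself contains bytes in 0xA0-0xFF, so the script's /[\xA0-\xFF]/ check would match already-converted filenames too:

```python
# 'café' in ISO-8859-1 is c,a,f,0xE9; in UTF-8 the é becomes 0xC3 0xA9.
name_latin1 = 'café'.encode('iso-8859-1')   # b'caf\xe9'
name_utf8 = 'café'.encode('utf-8')          # b'caf\xc3\xa9'

def has_high_byte(bs):
    """Mimics the script's /[\xA0-\xFF]/ test on raw bytes."""
    return any(0xA0 <= b <= 0xFF for b in bs)

assert has_high_byte(name_latin1)  # matches, as intended
assert has_high_byte(name_utf8)    # also matches: rerunning would double-encode
```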
Comment by: Bradley Dean at Sun, 2 Nov 2008 00:33:40 (BST)
Given that UTF-8 character bytes for non-ASCII characters fall within the range 0x80 to 0xFD, and given that this overlaps with the character bytes used by ISO-8859-1, there's no way to differentiate a high-byte string just by looking at individual bytes.
In my scenario (text will be plain ASCII, ISO-8859-1 or UTF-8) the best test seems to be to try to interpret the string as UTF-8. If errors occur it's going to be ISO-8859-1; if not, it's going to be UTF-8.
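That decode-and-fall-back test can be sketched like so (again in Python rather than Perl, for brevity; the function name is mine, not from the original script):

```python
def guess_encoding(raw):
    """Return 'utf-8' if the bytes decode cleanly as UTF-8, otherwise
    assume 'iso-8859-1' - valid only under the scenario above, where
    the text is known to be ASCII, ISO-8859-1 or UTF-8."""
    try:
        raw.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return 'iso-8859-1'

# 'é' as ISO-8859-1 is the lone byte 0xE9, which is not valid UTF-8:
assert guess_encoding(b'caf\xe9') == 'iso-8859-1'
# The UTF-8 encoding of the same character (0xC3 0xA9) decodes cleanly:
assert guess_encoding(b'caf\xc3\xa9') == 'utf-8'
```

Note this works because a lone high byte (a UTF-8 lead byte without its continuation bytes) always raises a decode error, whereas every byte sequence is "valid" ISO-8859-1 - so the fallback never fails, it just has to come second.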