Wed Apr 23 01:49:42 BST 2008

Playing around with FUSE

I remember reading about FUSE (Filesystem in Userspace) a while ago and thinking it was a pretty groovy idea, without having any particular need for it at the time.

Essentially, FUSE brings filesystem implementation to the masses - where 'the masses' in this case are developers who haven't yet leaped into kernel development. With lots of language bindings available, FUSE also makes filesystem development somewhat less painful - where I'm defining 'pain' in this case as the fun and games of C development. :)

I started thinking about FUSE again just the other day when I noticed that Ubuntu is now mounting NTFS volumes read/write using NTFS-3G (a FUSE-based driver).

As FUSE becomes more popular this opens up lots of interesting possibilities - support has been added for Mac OS X and FreeBSD, and it sounds like there may be one or two MS Windows ports in the works. If that were to happen we might finally have a nice way to mount useful filesystems under MS Windows (FAT32's retirement being well overdue, and now somewhat necessary given that DVD images and other large files won't fit).

What's tickling my fancy at the moment:

  • Something steganographic. There are already plenty of serious implementations of this, but it feels like a meaty enough problem to be interesting to implement. Probably something simple like storing the data in images or music or some such thing.
  • Slightly sillier - and with little to no useful application - I think it could be fun to give the old pick-a-path/choose-your-own-adventure style text-game a filesystem interface.
  • Given my recent work with Amazon S3 it could also be fun to look at a filesystem view. Since transferring massive blobs of data back and forth isn't very reliable this could be quite interesting, as files would need to be chopped up inside the S3 storage but appear as single entries in the filesystem (a rough sketch follows below).
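
A rough sketch of what I have in mind for that last one - completely untested, with 'conn' standing in for an S3.AWSAuthConnection as used in the S3 posts further down the page, and the chunked key names borrowed from my s3store.py convention:

# Hypothetical sketch: one logical file becomes many S3 objects.
# 'conn' is an S3.AWSAuthConnection (see the S3 posts below); the
# '<name>-<counter>' key layout mirrors the s3store.py chunking.

chunk_size = 10 * 1024 * 1024

def store_file_as_chunks(conn, bucket, name, fh):
  """ Write fh to S3 as numbered chunks under a single logical name """
  counter = 0
  chunk = fh.read(chunk_size)
  while len(chunk) > 0:
    conn.put(bucket, '%s-%010d' % (name, counter), chunk)
    counter += 1
    chunk = fh.read(chunk_size)

def logical_files(conn, bucket):
  """ Collapse the chunk keys back into the single names a FUSE
      readdir would present """
  names = set()
  for entry in conn.list_bucket(bucket).entries:
    names.add(entry.key.rsplit('-', 1)[0])
  return sorted(names)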

Final comments - I thought I'd start out with a cute little filesystem which was infinitely deep, containing random entries (thus avoiding the time-consuming part of defining a sensible traversable structure). The first thing to strike me was that exceptions vanish without a trace (somewhere in the internals of FUSE/kernel space).

I'm finding it valuable to wrap FUSE methods in exception catching (logging all errors to a file) - something like:

def readdir(self, path, offset):
  try:
    self.log('readdir(%s)' % path)
    for r in '.', '..', str(random.randint(1,1000)):
      yield fuse.Direntry(r)
  except Exception, e:
    self.log('!!!EXCEPTION THROWN!!! %s' % e)
    raise e

This way pesky little exceptions are at least identifiable, which makes fixing them a bit easier!
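
Rather than pasting that try/except into every method it could be factored out into a decorator - a rough sketch (self.log is just the logging helper used above, not part of the FUSE API; note that generator methods like readdir still need the in-body try/except, since the wrapper only sees exceptions raised at call time, not during iteration):

def logged(method):
  """ Log (and re-raise) any exception escaping a FUSE callback """
  def wrapper(self, *args, **kwargs):
    try:
      return method(self, *args, **kwargs)
    except Exception, e:
      self.log('!!!EXCEPTION THROWN in %s!!! %s' % (method.__name__, e))
      raise
  return wrapper

# Then, inside the filesystem class:
#
#   @logged
#   def getattr(self, path):
#     ...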


Comment by: Bradley Dean at Wed, 23 Apr 2008 18:03:55 (BST)

Szaka (lead developer of NTFS-3G) left a comment (below). I should perhaps clarify my earlier statement to say that FUSE is also extremely useful for developers who do work at the kernel level but have now got a low-coding-cost alternative!


Comment by: Szaka (ntfs-3g.org) at Wed, 23 Apr 2008 14:06:50 +0300 (EEST)

> Essentially, FUSE brings filesystem implementation to the masses - where 
> 'the masses' in this case are developers who haven't yet leaped into 
> kernel development. 

While this is definitely true, exactly the opposite happened with NTFS-3G. 
We always developed the NTFS kernel driver but the breakthrough came when 
we moved to FUSE. 

One of the explanations is that NTFS is really huge. The Microsoft NTFS 
driver is over 500,000 source lines we must be compatible with. That's almost 
more than all the 60+ in-kernel file systems altogether(!!).

Since the kernel/VFS is rapidly changing we had to spend most of our time 
adapting the huge NTFS code base and supporting backward compatibility. 

The popular Linux in-kernel file systems are sponsored by several big 
companies and many developers work on them as their main job. But NTFS 
wasn't sponsored; only a very few of us worked on it in our spare time. 
So time is __extremely__ critical for us (a huge amount of hard work 
must be done highly efficiently).

FUSE is small and lets us isolate ourselves from the kernel, so we can fully 
focus on NTFS (which is not entirely true because there are several related 
kernel reliability and performance issues being worked on which also affect 
other subsystems and user space applications).

Kernel experience helps to write an efficient FUSE file system, but we try to 
make FUSE perform the best out of the box. FUSE is new and keeps rapidly 
improving.

Regards,
	    Szaka
--
NTFS-3G:  http://ntfs-3g.org

Posted by Bradley Dean | Permalink | Categories: Python, Programming

Fri Mar 28 00:22:03 GMT 2008

AmazonS3Store 1.1 - Handle truncated responses

Something I should probably have noticed before, but Amazon S3 list_bucket responses can be truncated if there are a lot of entries in a bucket.

If truncation has occurred, the is_truncated attribute is set to true on the response.

Follow-on requests can be made by setting the marker option in the call to list_bucket to the key of the last entry received:

options = { 'marker' : list_bucket_response.entries[-1].key }
list_bucket_response = conn.list_bucket(bucket, options)

Here's the release of AmazonS3Store 1.1 with the changes: AmazonS3Store-1.1.tar.gz.

Version 1.1 - Fri Mar 28 00:02:04 GMT 2008
 -----------------------------------------
 * Added internal function __list_bucket_iter(cmdopts, cfg, conn)
   which wraps the connection list_bucket. The function is an iterator and
   makes multiple requests where truncated responses are detected.
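
For reference, a minimal sketch of the shape of that iterator (the real __list_bucket_iter in the tarball takes the cmdopts/cfg/conn arguments noted above; this standalone version just shows the marker-following loop):

def list_bucket_iter(conn, bucket):
  """ Yield every entry in a bucket, following truncated responses """
  options = {}
  while True:
    resp = conn.list_bucket(bucket, options)
    for entry in resp.entries:
      yield entry
    if not resp.is_truncated:
      break
    # Start the next request after the last key received so far
    options = { 'marker' : resp.entries[-1].key }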

Posted by Bradley Dean | Permalink | Categories: Python, Programming

Sun Mar 16 11:33:17 GMT 2008

pyprove 1.3 - __import__ compatibility between Python versions 2.4 and 2.5

__import__ changed between versions 2.4 and 2.5 of Python - in 2.5 keyword arguments are available.

I've just used the old-style positional call, which is still supported in 2.5:

-    imported_module = __import__(module_fqname, fromlist = [1])
+    imported_module = __import__(module_fqname, globals(), locals(), [1])

Version 1.3 - Sun Mar 16 11:24:14 GMT 2008
 -----------------------------------------
 * Changed calls to __import__ to be compatible between python versions
   2.4 and 2.5 (in 2.5 __import__ supports keyword arguments)
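
As a reminder of why the fromlist argument matters at all: without it, __import__ on a dotted name returns the top-level package rather than the module itself - hence the throwaway non-empty list (any value, such as [1], will do):

# Without a fromlist the top-level package comes back...
top  = __import__('logging.handlers', globals(), locals(), [])
# ...while a non-empty fromlist returns the submodule itself.
leaf = __import__('logging.handlers', globals(), locals(), [1])
print top.__name__, leaf.__name__   # logging logging.handlers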

pyprove-1.3.tar.gz


Posted by Bradley Dean | Permalink | Categories: Python, Programming

Fri Mar 14 22:04:01 GMT 2008

pyprove 1.2 - 'has no tests' bugfix

And now another release for a bugfix:

Version 1.2 - Fri Mar 14 21:57:01 GMT 2008
 -----------------------------------------
 * When doctest.DocTestSuite throws a ValueError (has no tests)
   recover from the exception and move on
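
A sketch of the shape of that fix (the pyprove source differs in detail):

import doctest
import unittest

def doctest_suite_or_empty(module):
  """ doctest.DocTestSuite raises ValueError if a module has no
      doctests - treat that as 'no tests' rather than a fatal error """
  try:
    return doctest.DocTestSuite(module)
  except ValueError:
    return unittest.TestSuite()   # empty suite - just move on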

pyprove-1.2.tar.gz


Posted by Bradley Dean | Permalink | Categories: Python, Programming

Fri Mar 14 19:52:34 GMT 2008

pyprove 1.1 - better search path configuration

A full several minutes between the first and second release, not too bad...

As per the changelog:

Version 1.1 - Fri Mar 14 19:36:15 GMT 2008
 -----------------------------------------
 * Change appending of library_prefix_paths entries to
   inserting them at the beginning of sys.path to make sure
   local source is found before source installed other places
   in the path

 * Pre-pend os.getcwd() to sys.path to help finding libraries
   in the right place without having to set PYTHONPATH
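
In code terms the effect is roughly this (ordering details aside - a paraphrase, not a copy of the pyprove source):

import os
import sys

library_prefix_paths = [ 'lib' ]   # example value - whatever prefixes apply

# Local source should win over anything installed elsewhere on the path
for prefix in library_prefix_paths:
  sys.path.insert(0, prefix)
sys.path.insert(0, os.getcwd())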

Here it is: pyprove-1.1.tar.gz


Posted by Bradley Dean | Permalink | Categories: Python, Programming

Fri Mar 14 19:20:24 GMT 2008

pyprove - recursively run tests like perl prove

I know this exists in various guises around the place but I've been coming back to wanting something easy to use like Perl's prove when writing Python.

A while back I was doing some work on Zope and they have test.py which does just that - but it's somewhat specific to Zope.

So I've sat down and come up with pyprove - the idea being that it should be possible to run the script within a source code tree and have it go out and find unittest and doctest tests and run them all.

Thus far it's pretty simple - no command line options are available and it only runs from the current directory. It also requires that PYTHONPATH is set if tested modules are not in the current search path.
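
For the curious, the discovery side boils down to something like this rough sketch (not the actual pyprove internals - module names are derived naively and anything unimportable will simply blow up):

import doctest
import os
import unittest

def find_test_modules(top):
  """ Yield dotted module names for every *.py file under top """
  for root, dirs, files in os.walk(top):
    for name in files:
      if name.endswith('.py') and not name.startswith('__'):
        rel = os.path.join(root, name)[len(top):].lstrip(os.sep)
        yield rel[:-3].replace(os.sep, '.')

def build_suite(top='.'):
  suite = unittest.TestSuite()
  for module_fqname in find_test_modules(top):
    module = __import__(module_fqname, globals(), locals(), [1])
    suite.addTest(unittest.defaultTestLoader.loadTestsFromModule(module))
    try:
      suite.addTest(doctest.DocTestSuite(module))
    except ValueError:      # module has no doctests
      pass
  return suite

if __name__ == '__main__':
  unittest.TextTestRunner().run(build_suite())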

Here's a source distribution: pyprove-1.0.tar.gz.

(Debugging is greatly facilitated by re-enabling the commented-out logging configuration line.)


Posted by Bradley Dean | Permalink | Categories: Python, Programming

Mon Feb 25 01:21:58 GMT 2008

Bundling the Amazon S3 code with distutils

Brief note - I've bundled up the Amazon S3 code I've been working on with Python distutils.

This is available here: AmazonS3Store-1.0.tar.gz
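
For anyone who hasn't met distutils, the packaging boils down to a small setup.py sitting next to the code - something along these lines (the module and script names here are illustrative rather than the exact contents of the tarball):

#!/usr/bin/env python
from distutils.core import setup

setup(
  name        = 'AmazonS3Store',
  version     = '1.0',
  description = 'Pipe data in and out of Amazon S3',
  author      = 'Bradley Dean',
  py_modules  = [ 's3storelib', 'S3' ],
  scripts     = [ 's3store.py' ],
)

Then 'python setup.py sdist' builds the distribution tarball and 'python setup.py install' installs it.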


Posted by Bradley Dean | Permalink | Categories: Python, Programming

Fri Feb 22 23:50:40 GMT 2008

Tweaking and bundling the Amazon S3 tools

With a little bit of tweaking I've now used s3store.py to push a tarball of an entire system up to Amazon S3 - which means I've got this code to the point I needed it.

I've bundled up the code into a tarball: amazon-s3-20080222.tbz

The contents of the tarball are:

S3.py - The Amazon S3 Library for REST in Python
backup.py - Walk a directory tree storing regular files to Amazon S3
clear_environment.py - Delete everything stored in Amazon S3
s3config.cfg - Configuration file template
s3store.py - Read, write, list and delete piped data in/out of Amazon S3
s3storelib.py - Module supporting s3store.py
system_backup.sh - Store a tarball built from / to Amazon S3

The main change made to the base library (s3storelib.py) was to include an error-and-retry on writing data to S3:

Index: s3storelib.py
===================================================================
--- s3storelib.py       (revision 18)
+++ s3storelib.py       (revision 20)
@@ -8,6 +8,9 @@
 # Data chunk size
 chunk_size = 10 * 1024 * 1024
 
+# Maximum number of times to retry S3 calls
+max_tries = 5
+
 def usage():
   print "\n".join([
       'Usage:'
@@ -134,8 +137,19 @@
     s3_chunk = '%s-%010d' % (tag, counter)
     print "Uploading chunk: ", s3_chunk,
     sys.stdout.flush()
-    resp = conn.put(bucket, s3_chunk, chunk)
-    assert resp.http_response.status == 200, resp.message
+    tries = 0
+    while True:
+      try:
+        resp = conn.put(bucket, s3_chunk, chunk)
+        assert resp.http_response.status == 200, resp.message
+      except Exception, e:
+        tries += 1
+        if tries > max_tries:
+          raise Exception( "Too many failures: " + str(e) )
+        print "[RETRY]",
+      else:
+        # It worked, break the loop
+        break
     print "[DONE]"
     sys.stdout.flush()
     counter += 1

Sometimes connections seem to fail, but in general when I've gone looking for the failures they haven't recurred, so a retry loop seemed a reasonable approach.


Posted by Bradley Dean | Permalink | Categories: Python, Programming

Fri Feb 22 01:14:13 GMT 2008

Revisiting Amazon S3 - Piping data into S3

So, the other day I was playing around with storing and deleting content on Amazon Simple Storage Service (S3).

At the time I threw together a quick backup script which walked a local directory tree and attempted to push files up to S3. I also noted that there were lots of limitations to that approach - one of the main ones being that it really only stored regular files, so everything else was left behind (symlinks, empty directories, etc.).

It occurred to me that we already have perfectly good tools for packaging files together (tar for instance). The problem was that to use these tools I needed disk space to store the output.

This problem has already been solved - if you need to tar up a set of files onto another computer you can simply tar to stdout and pipe that through an ssh connection:

$ tar cjBf - /source/dir | ssh host "cat > file.tbz"

So what I really needed was to be able to pipe data into S3, something akin to splitting a file, which in turn can be expressed very simply in Python with something like (split.py):

#!/usr/bin/env python

import sys

chunk_size   = int(sys.argv[1])   # bytes per output file
split_prefix = sys.argv[2]        # prefix for the numbered output files
counter      = 0

# Read stdin a chunk at a time, writing each chunk to its own file
chunk = sys.stdin.read(chunk_size)
while len(chunk) > 0:
  fh = open("%s-%05d" % (split_prefix, counter), "wb")
  fh.write(chunk)
  fh.close()
  chunk = sys.stdin.read(chunk_size)
  counter += 1

And that works - because all I need to do is replace those file writes with S3 RESTful PUTs and I'm there:

def write_data(cmdopts, cfg, conn):
  """ Read data from STDIN and store it to Amazon S3

      Exceptions will be raised for non-recoverable errors
  """
  bucket  = cfg.get('Bucket', 'id')
  tag     = cmdopts['tag']
  counter = 0

  chunk = sys.stdin.read(chunk_size)
  while len(chunk) > 0:
    s3_chunk = '%s-%010d' % (tag, counter)
    print "Uploading chunk: ", s3_chunk,
    sys.stdout.flush()
    resp = conn.put(bucket, s3_chunk, chunk)
    assert resp.http_response.status == 200, resp.message
    print "[DONE]"
    sys.stdout.flush()
    counter += 1
    chunk = sys.stdin.read(chunk_size)

And if I can do that, then I should be able to reverse the process and read my S3 content back through a pipe using something like:

def read_data(cmdopts, cfg, conn):
  """ Read data back from Amazon S3 and write it to STDOUT
  """
  bucket  = cfg.get('Bucket', 'id')
  tag     = cmdopts['tag']

  assert bucket in [x.name for x in conn.list_all_my_buckets().entries]

  for name in [x.key for x in conn.list_bucket(bucket).entries]:
    if ( name[:len(tag)+1] == '%s-' % tag ):
      data = conn.get(bucket, name)
      sys.stdout.write(data.object.data)

And because that's all feeling fairly useful I've wrapped it up in a little more code which makes things easy:

s3store.py - The main script: reads from and writes to S3
s3storelib.py - Library for s3store.py

And finally - an example of using the script:

In...

$ tar czf - /path/to/something | ./s3store.py -w -t bob
tar: Removing leading `/' from member names
Deleting old data:  bob-0000000000  [DONE]
Deleting old data:  bob-0000000001  [DONE]
Deleting old data:  bob-0000000002  [DONE]
Uploading chunk:  bob-0000000000 [DONE]
Uploading chunk:  bob-0000000001 [DONE]
Uploading chunk:  bob-0000000002 [DONE]
.
.
.

And out...

$ ./s3store.py -r -t bob | tar tzf -
path/to/something/
path/to/something/a/
path/to/something/a/file.txt
path/to/something/b_file.txt
.
.
.

The script uses a Python ConfigParser configuration file which looks like this:

[Credentials]
aws_access_key_id: XXXXXXXXXXXXXXXXXXXX
aws_secret_access_key: YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY

[Bucket]
id: data-backup
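
For completeness, the script side picks that up with ConfigParser along these lines (the cfg.get('Bucket', 'id') call appears in the functions above; the filename here is just an example):

import ConfigParser

import S3   # the Amazon S3 REST library used throughout these posts

cfg = ConfigParser.ConfigParser()
cfg.read('s3config.cfg')

key_id     = cfg.get('Credentials', 'aws_access_key_id')
secret_key = cfg.get('Credentials', 'aws_secret_access_key')
bucket     = cfg.get('Bucket', 'id')

conn = S3.AWSAuthConnection(key_id, secret_key)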

Posted by Bradley Dean | Permalink | Categories: Python, Programming

Tue Feb 19 00:21:51 GMT 2008

Connecting to Amazon S3 with Python

Every now and then I've looked at and discussed the various Amazon Web Services but have never actually got around to using any of them personally.

I still don't really need a dynamically and automatically scalable cluster cloud of virtual computers at my beck and call - however groovy that might seem.

What I do need at the moment is some extra storage space - both as a place to back up some data reliably and as a place to serve larger chunks of data, which have a tendency to fill up the brilliant but not exactly storage-heavy VPS hosting solutions available these days.

Amazon Simple Storage Service (S3) to the rescue. S3 provides cheap storage via both REST and SOAP interfaces.

Better yet, there are libraries already available in a number of languages - information and documentation is available at the Amazon S3 Community Code site.

In this case I'm using the Amazon S3 Library for REST in Python.

So what can I do with this?

Here's a rudimentary backup script (backup.py):

#!/usr/bin/env python

import os
import os.path
import sys
import time

import S3

AWS_ACCESS_KEY_ID = 'XXXXXXXXXXXXXXXXXXXX'
AWS_SECRET_ACCESS_KEY = 'YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY'
conn = S3.AWSAuthConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

time_stamp = time.strftime("%Y%m%d-%H%M%S")
backup_bucket = "backup"

print "Storing in %s [%s]" % (backup_bucket, time_stamp),
resp = conn.create_bucket(backup_bucket)
print resp.message

# Directories to back up are given on the command line
# (skip sys.argv[0], which is the script itself)
for base_dir in sys.argv[1:]:
  print base_dir
  for root, dirs, files in os.walk(base_dir):
    print root
    for file in files:
      file_path = os.path.join(root, file)
      fh = open(file_path, 'rb')
      data = fh.read()
      fh.close()

      backup_path = os.path.join(time_stamp, file_path.lstrip('/'))
      print " .. %s" % backup_path,
      resp = conn.put(backup_bucket, backup_path, data)
      print " [%s]" % resp.message
This will walk through a given set of directories and try to upload all regular files it finds. Note that no handling exists for failed uploads (I did say rudimentary) or for non-regular files like symlinks.

I suppose the easiest or most reliable way to make this work across all file types would be to just back up tarballs - on the other hand, that means I need the space to store the tarball, which somewhat defeats the purpose of cheaper storage.

Having gone and pushed a whole lot of stuff into my S3 space I may as well delete it (if for nothing else then as an exercise in walking through the S3 contents).

So, here's a deletion script (clear_environment.py):

#!/usr/bin/env python

import S3

AWS_ACCESS_KEY_ID = 'XXXXXXXXXXXXXXXXXXXX'
AWS_SECRET_ACCESS_KEY = 'YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY'
conn = S3.AWSAuthConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

for bucket in conn.list_all_my_buckets().entries:
  print bucket.name.encode('ascii', 'replace')
  for item in conn.list_bucket(bucket.name).entries:
    print " .. %s" % item.key.encode('ascii', 'replace'),
    conn.delete(bucket.name, item.key)
    print " [DELETED]"
  conn.delete_bucket(bucket.name)
  print "Deleted bucket"

Probably the main thing to note here is that Amazon S3 does not store objects in a hierarchy. There are a number of top-level buckets (in this case just the one, named 'backup') which are then simply filled up with uniquely keyed items.

A convention among various S3 file-storage/backup solutions has been to name these keys using a Unix-style path structure. If one of these solutions were to access the files stored by the backup script above it would allow navigation by 'directory' even though no actual directories exist.
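
S3 helps a little with that convention: the list operation accepts a prefix (and optionally a delimiter), so a pseudo-directory can be listed without pulling back every key in the bucket. Something like this ought to work with the same library (an untested sketch - the options dict is passed through as query parameters, just like the marker option mentioned in the truncation fix further up the page):

# List only the keys under one 'directory' of the backup -
# the prefix filtering is done server-side by S3.
options = { 'prefix' : '%s/etc/' % time_stamp }
for item in conn.list_bucket(backup_bucket, options).entries:
  print item.key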


Posted by Bradley Dean | Permalink | Categories: Python, Programming

Tue Feb 12 16:11:51 GMT 2008

Sending ETRN to backup SMTP exchanges

Having played around with checking SMTP services for backup MX exchanges ("Testing SMTP exchanges") I then thought it would be useful to be able to easily trigger ETRN requests. Backup MX servers tend to poll the primary mail server periodically and do this automatically, but being impatient...

Again using smtplib, this is even quicker and easier than the testing script:

#!/usr/bin/env python

import smtplib

backup_servers = { 'mx3.zoneedit.com' : [ 'bjdean.id.au'
                                        , 'orientgrove.net'
                                        ]
                 }

if __name__ == '__main__':
  for backup_mx in backup_servers.keys():
    print ">>> Connecting to", backup_mx
    server = smtplib.SMTP(backup_mx)
    #server.set_debuglevel(1)
    for domain in backup_servers[backup_mx]:
      print ">>> >>> ETRN domain", domain
      server.docmd('ETRN', domain)

And here's what I see (with debugging turned back on):

>>> Connecting to mx3.zoneedit.com
>>> >>> ETRN domain bjdean.id.au
send: 'ETRN bjdean.id.au\r\n'
reply: '250 Queuing started\r\n'
reply: retcode (250); Msg: Queuing started
>>> >>> ETRN domain orientgrove.net
send: 'ETRN orientgrove.net\r\n'
reply: '250 Queuing started\r\n'
reply: retcode (250); Msg: Queuing started

Posted by Bradley Dean | Permalink | Categories: Python, Programming

Tue Feb 12 11:21:14 GMT 2008

Testing SMTP exchanges

With a few domains in tow, and a few different live and backup MX exchanges attached to those, I needed a quick way to work out what was working and what wasn't.

dnspython and smtplib make for a very quick script which tells me everything I need to know.

With a few quick code adjustments I can dissect the failures or view the complete SMTP transcript - particularly handy if I'm discussing issues with upstream providers.

Here's the code:

#!/usr/bin/env python

import smtplib
import dns.resolver

domains = [ 'mydomain.id.au'
          , 'myotherdomain.org'
          , 'bjdean.id.au'
          ]

def test_domain(domain):
  print "Testing", domain

  for server in dns.resolver.query(domain, 'MX'):
    test_smtp(domain, str(server.exchange).strip('.'))

def test_smtp(domain, exchange):
  print "Sending test message via exchange", exchange
  fromaddr = "test_smtp_servers-FROM@%s" % (domain)
  toaddr   = "test_smtp_servers-TO@%s" % (domain)
  subject  = "Test via %s for %s" % (exchange, domain)
  msg = "From: "    + fromaddr + "\r\n" \
      + "To: "      + toaddr   + "\r\n" \
      + "Subject: " + subject  + "\r\n" \
      + "\r\n\r\n"                      \
      + subject

  server = smtplib.SMTP(exchange)
  #server.set_debuglevel(1)
  try:
    server.sendmail(fromaddr, toaddr, msg)
  except Exception, e:
    print "EXCHANGE FAILED:", e
    #import pdb; pdb.set_trace()
  server.quit()

if __name__ == '__main__':
  for domain in domains:
    test_domain(domain)

And here's what I see:

Testing mydomain.id.au
Sending test message via exchange mx1.mydomain.id.au
Sending test message via exchange mx2.mydomain.id.au
Testing myotherdomain.org
Sending test message via exchange myotherdomain.org
Testing bjdean.id.au
Sending test message via exchange mail.bjdean.id.au
Sending test message via exchange mx2.zoneedit.com
EXCHANGE FAILED: {'test_smtp_servers-TO@bjdean.id.au': (554, '5.7.1 : Relay access denied')}

Posted by Bradley Dean | Permalink | Categories: Python, Programming