I remember reading about FUSE (Filesystem in Userspace) a while ago and thinking it was a pretty groovy idea, without having any particular need for it at the time.
Essentially, FUSE brings filesystem implementation to the masses - where 'the masses' in this case are developers who haven't yet leaped into kernel development. With lots of language bindings available, FUSE also makes filesystem development somewhat less painful - where I'm defining pain in this case as the fun and games of C development. :)
I started thinking about FUSE again just the other day when I noticed that Ubuntu is now mounting NTFS volumes read/write using NTFS-3G (a FUSE-based driver).
As FUSE becomes more popular this opens up lots of interesting possibilities - support has been added for MacOS and FreeBSD, and it sounds like there may be one or two MS Windows versions in the works. If this were to happen we might finally have a nice way to mount useful filesystems under MS Windows (FAT32 retirement being well overdue, and now somewhat necessary given that DVD images and other large files won't fit).
What's tickling my fancy at the moment:
Final comments - I thought I'd start out with a cute little filesystem which was infinitely deep, containing random entries (thus avoiding the time-consuming part of defining a sensible traversable structure). The first thing to strike me is that exceptions vanish without a trace (somewhere in the internals of FUSE/kernel space).
I'm finding it valuable to wrap FUSE methods in exception catching (logging all errors to a file) - something like:
def readdir(self, path, offset):
    try:
        self.log('readdir(%s)' % path)
        for r in '.', '..', str(random.randint(1,1000)):
            yield fuse.Direntry(r)
    except Exception, e:
        self.log('!!!EXCEPTION THROWN!!! %s' % e)
        raise e
This way pesky little exceptions are at least identifiable which makes fixing them a bit easier!
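Since every callback ends up wrapped in the same boilerplate, one option is to pull the try/except into a decorator. Here's a minimal sketch (not part of the original code - and note that generator callbacks like readdir above still need the try/except inside the generator body, because exceptions raised while yielding happen after the wrapped call has returned):

import functools

def log_exceptions(method):
    """Log and re-raise any exception escaping a (non-generator) FUSE callback."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        try:
            return method(self, *args, **kwargs)
        except Exception, e:
            # self.log is the same file-based logging helper used above
            self.log('!!!EXCEPTION THROWN!!! %s: %s' % (method.__name__, e))
            raise
    return wrapper

# Usage (applied to non-generator callbacks such as getattr, open, read):
#
#   @log_exceptions
#   def getattr(self, path):
#       ...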
Comment by: Bradley Dean at Wed, 23 Apr 2008 18:03:55 (BST)
Szaka (lead developer of NTFS-3G) left a comment (below). I should perhaps clarify my earlier statement to say that FUSE is also extremely useful for developers who do work at the kernel level but now have a low-coding-cost alternative!
Comment by: Szaka (ntfs-3g.org) at Wed, 23 Apr 2008 14:06:50 +0300 (EEST)
> Essentially, FUSE brings filesystem implementation to the masses - where
> 'the masses' in this case are developers who haven't yet leaped into
> kernel development.

While this is definitely true, exactly the opposite happened with NTFS-3G. We always developed the NTFS kernel driver but the breakthrough came when we moved to FUSE. One of the explanations is that NTFS is really huge. The Microsoft NTFS driver is over 500,000 source lines we must be compatible with. It's almost more than all the 60+ in-kernel file systems altogether(!!). Since the kernel/VFS is rapidly changing we had to spend most of our time adapting the huge NTFS code base and supporting backward compatibility.

The popular Linux in-kernel file systems are sponsored by several big companies and many developers work on them as their main job. But NTFS wasn't sponsored; only a very few of us worked on it in our spare time. So time is __extremely__ critical for us (a huge amount of hard work must be done highly efficiently). FUSE is small and lets us isolate ourselves from the kernel, so we can fully focus on NTFS (which is not entirely true, because there are several related kernel reliability and performance issues being worked on which also affect other subsystems and user space applications).

Kernel experience helps in writing an efficient FUSE file system, but we try to make FUSE perform the best out of the box. FUSE is new and keeps rapidly improving.

Regards, Szaka

--
NTFS-3G: http://ntfs-3g.org
Something I should probably have noticed before: Amazon S3 list_bucket responses can be truncated if there are a lot of entries in a bucket. If truncation has occurred, the is_truncated attribute is set to true on the response. Follow-on requests can be made by setting the marker option of the list_bucket call to the key of the last entry received:
options = { 'marker' : list_bucket_response.entries[-1].key }
list_bucket_response = conn.list_bucket(bucket, options)
Here's the release of AmazonS3Store 1.1 with the changes: AmazonS3Store-1.1.tar.gz.
Version 1.1 - Fri Mar 28 00:02:04 GMT 2008
-----------------------------------------
* Added internal function __list_bucket_iter(cmdopts, cfg, conn) which wraps
  the connection list_bucket. The function is an iterator and makes multiple
  requests where truncated responses are detected.
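For reference, a paging wrapper along those lines can be sketched as follows - this is an illustration of the approach rather than the actual __list_bucket_iter from the tarball, and it assumes the list_bucket/entries/is_truncated interface shown above:

def list_bucket_iter(conn, bucket):
    """Yield every entry in a bucket, issuing follow-on requests
    whenever a response comes back truncated."""
    options = {}
    while True:
        resp = conn.list_bucket(bucket, options)
        for entry in resp.entries:
            yield entry
        if not resp.is_truncated:
            break
        # Continue from the last key received in this response
        options = { 'marker' : resp.entries[-1].key }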
__import__ changed between versions 2.4 and 2.5 of Python - in 2.5 keyword arguments are available. I've just used the old-style positional call, which is still supported in 2.5:
- imported_module = __import__(module_fqname, fromlist = [1])
+ imported_module = __import__(module_fqname, globals(), locals(), [1])
Version 1.3 - Sun Mar 16 11:24:14 GMT 2008
-----------------------------------------
* Changed calls to __import__ to be compatible between python versions 2.4
  and 2.5 (in 2.5 __import__ supports keyword arguments)
And now another release for a bugfix:
Version 1.2 - Fri Mar 14 21:57:01 GMT 2008
-----------------------------------------
* When doctest.DocTestSuite throws a ValueError (has no tests) recover from
  the exception and move on
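In other words, the fix boils down to something like this (a sketch of the approach rather than the exact pyprove code):

import doctest
import unittest

def doctests_for(module):
    """Collect doctests from a module, tolerating modules that have none.

    doctest.DocTestSuite raises ValueError when the module contains no
    doctests, so fall back to an empty suite in that case."""
    try:
        return doctest.DocTestSuite(module)
    except ValueError:
        return unittest.TestSuite()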
A full several minutes between the first and second release, not too bad...
As per the changelog:
Version 1.1 - Fri Mar 14 19:36:15 GMT 2008
-----------------------------------------
* Change appending of library_prefix_paths entries to inserting them at the
  beginning of sys.path to make sure local source is found before source
  installed other places in the path
* Pre-pend os.getcwd() to sys.path to help finding libraries in the right
  place without having to set PYTHONPATH
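In effect the path handling now amounts to something like the following sketch (not the actual pyprove code - the library_prefix_paths value here is purely illustrative):

import os
import sys

# Illustrative only: prepend local source directories so they win over any
# copies installed elsewhere on the path.
library_prefix_paths = ['lib', 'src']

for prefix in reversed(library_prefix_paths):
    sys.path.insert(0, prefix)

# Modules in the current directory are found without setting PYTHONPATH.
sys.path.insert(0, os.getcwd())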
Here it is: pyprove-1.1.tar.gz
I know this exists in various guises around the place, but I've been coming back to wanting something easy to use, like Perl's prove, when writing Python.
A while back I was doing some work on Zope and they have test.py which does just that - but it's somewhat specific to Zope.
So I've sat down and come up with pyprove - the idea being that it should be possible to run the script within a source code tree and have it go out and find unittest and doctest tests and run them all.
Thus far it's pretty simple - no command line options are available and it only runs from the current directory. It also requires that PYTHONPATH is set if tested modules are not in the current search path.
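To give a feel for the approach, here's a rough sketch of a minimal runner along the same lines (a simplified illustration, not the actual pyprove implementation): walk the tree for .py files, import each module with the 2.4-compatible __import__ call discussed above, and collect its unittest and doctest tests into one suite.

#!/usr/bin/env python

import doctest
import os
import unittest

def find_modules(base_dir='.'):
    """Yield dotted module names for every .py file under base_dir."""
    for root, dirs, files in os.walk(base_dir):
        for name in files:
            if name.endswith('.py'):
                path = os.path.join(root, name)[len(base_dir)+1:]
                yield path[:-len('.py')].replace(os.sep, '.')

def build_suite():
    suite = unittest.TestSuite()
    for module_fqname in find_modules():
        module = __import__(module_fqname, globals(), locals(), [1])
        suite.addTest(unittest.defaultTestLoader.loadTestsFromModule(module))
        try:
            suite.addTest(doctest.DocTestSuite(module))
        except ValueError:
            pass    # module has no doctests
    return suite

if __name__ == '__main__':
    unittest.TextTestRunner().run(build_suite())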
Here's a source distribution: pyprove-1.0.tar.gz.
(Debugging is greatly facilitated by re-including the commented out logging configuration line.)
Brief note - I've bundled up the Amazon S3 code I've been working on with python distutils.
This is available here: AmazonS3Store-1.0.tar.gz
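For anyone curious, the packaging is an ordinary distutils setup.py along these lines (the module and script lists here are illustrative rather than the exact manifest):

from distutils.core import setup

setup(
    name = 'AmazonS3Store',
    version = '1.0',
    description = 'Pipe data in and out of Amazon S3 in chunks',
    py_modules = ['s3storelib'],    # illustrative module list
    scripts = ['s3store.py'],       # illustrative script list
)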
With a little bit of tweaking I've now used s3store.py to push a tarball of an entire system up to Amazon S3, which means I've now got this code to the point I needed it.
I've bundled up the code into a tarball: amazon-s3-20080222.tbz
The contents of the tarball are:
The main change made to the base library (s3storelib.py) was to include an error-and-retry on writing data to S3:
Index: s3storelib.py
===================================================================
--- s3storelib.py   (revision 18)
+++ s3storelib.py   (revision 20)
@@ -8,6 +8,9 @@
 # Data chunk size
 chunk_size = 10 * 1024 * 1024

+# Maximum number of times to retry S3 calls
+max_tries = 5
+
 def usage():
     print "\n".join([
         'Usage:'
@@ -134,8 +137,19 @@
         s3_chunk = '%s-%010d' % (tag, counter)
         print "Uploading chunk: ", s3_chunk,
         sys.stdout.flush()
-        resp = conn.put(bucket, s3_chunk, chunk)
-        assert resp.http_response.status == 200, resp.message
+        tries = 0
+        while True:
+            try:
+                resp = conn.put(bucket, s3_chunk, chunk)
+                assert resp.http_response.status == 200, resp.message
+            except Exception, e:
+                tries += 1
+                if tries > max_tries:
+                    raise Exception( "Too many failures: " + str(e) )
+                print "[RETRY]",
+            else:
+                # It worked, break the loop
+                break
         print "[DONE]"
         sys.stdout.flush()
         counter += 1
Connections sometimes seem to fail, but when I've gone looking for the failures they generally haven't recurred, so a retry loop seemed a reasonable approach.
So, the other day I was playing around with storing and deleting content on Amazon Simple Storage Service (S3).
At the time I threw together a quick backup script which walked a local directory tree and attempted to push files up to S3. I also noted that there were lots of limitations to that approach - one of the main ones being that this really just stored regular files so everything else was left behind (symlinks, empty directories etc.)
It occurred to me that we already have perfectly good tools for packaging files together (tar for instance). The problem was that to use these tools I needed disk space to store the output.
This problem has already been solved - if you need to tar up a set of files onto another computer you can simply tar to stdout and pipe that through an ssh connection:
$ tar cjBf - /source/dir | ssh host "cat > file.tbz"
So what I really needed was to be able to pipe data into S3, something akin to splitting a file, which in turn can be expressed very simply in python with something like (split.py):
#!/usr/bin/env python

import sys

chunk_size = int(sys.argv[1])
split_prefix = sys.argv[2]
counter = 0

chunk = sys.stdin.read(chunk_size)
while len(chunk) > 0:
    fh = open("%s-%05d" % (split_prefix, counter), "wb")
    fh.write(chunk)
    fh.close()
    chunk = sys.stdin.read(chunk_size)
    counter += 1
And that works - because all I need to do is replace those file writes with S3 RESTful PUTs and I'm there:
def write_data(cmdopts, cfg, conn):
    """ Read data from STDIN and store it to Amazon S3

    Exceptions will be raised for non-recoverable errors
    """
    bucket = cfg.get('Bucket', 'id')
    tag = cmdopts['tag']
    counter = 0

    chunk = sys.stdin.read(chunk_size)
    while len(chunk) > 0:
        s3_chunk = '%s-%010d' % (tag, counter)
        print "Uploading chunk: ", s3_chunk,
        sys.stdout.flush()
        resp = conn.put(bucket, s3_chunk, chunk)
        assert resp.http_response.status == 200, resp.message
        print "[DONE]"
        sys.stdout.flush()
        counter += 1
        chunk = sys.stdin.read(chunk_size)
And if I can do that, then I should be able to reverse the process and read my S3 content back through a pipe using something like:
def read_data(cmdopts, cfg, conn):
    """ Read data back from Amazon S3 and write it to STDOUT
    """
    bucket = cfg.get('Bucket', 'id')
    tag = cmdopts['tag']

    assert bucket in [x.name for x in conn.list_all_my_buckets().entries]

    for name in [x.key for x in conn.list_bucket(bucket).entries]:
        if ( name[:len(tag)+1] == '%s-' % tag ):
            data = conn.get(bucket, name)
            sys.stdout.write(data.object.data)
And because that's all feeling fairly useful I've wrapped it up in a little more code which makes things easy: s3store.py
And finally - an example of using the script:
In...
$ tar czf - /path/to/something | ./s3store.py -w -t bob
tar: Removing leading `/' from member names
Deleting old data: bob-0000000000 [DONE]
Deleting old data: bob-0000000001 [DONE]
Deleting old data: bob-0000000002 [DONE]
Uploading chunk: bob-0000000000 [DONE]
Uploading chunk: bob-0000000001 [DONE]
Uploading chunk: bob-0000000002 [DONE]
.
.
.
And out...
$ ./s3store.py -r -t bob | tar tzf -
path/to/something/
path/to/something/a/
path/to/something/a/file.txt
path/to/something/b_file.txt
.
.
.
The script uses a Python ConfigParser configuration file which looks like this:
[Credentials]
aws_access_key_id: XXXXXXXXXXXXXXXXXXXX
aws_secret_access_key: YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY

[Bucket]
id: data-backup
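Reading that configuration is then just standard ConfigParser fare - roughly as follows (the filename is illustrative, not necessarily what s3store.py uses):

import ConfigParser

import S3

cfg = ConfigParser.ConfigParser()
cfg.read('s3store.cfg')    # illustrative filename

conn = S3.AWSAuthConnection(cfg.get('Credentials', 'aws_access_key_id'),
                            cfg.get('Credentials', 'aws_secret_access_key'))
bucket = cfg.get('Bucket', 'id')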
Every now and then I've looked at and discussed the various Amazon Web Services but have never actually got around to using any of them personally.
I still don't really need a dynamically and automatically scalable cluster cloud of virtual computers at my beck and call - however groovy that might seem.
What I do need at the moment is some extra storage space - both as a place to backup some data reliably and as a place to serve larger chunks of data which have a tendency to fill up the brilliant but not exactly storage-heavy VPS hosting solutions available these days.
Amazon Simple Storage Service (S3) to the rescue. S3 provides cheap storage via both REST and SOAP interfaces.
Better yet, there are libraries already available in a number of languages - information and documentation is available at the Amazon S3 Community Code site.
In this case I'm using the Amazon S3 Library for REST in Python.
So what can I do with this?
Here's a rudimentary backup script (backup.py):
#!/usr/bin/env python

import os
import os.path
import sys
import time

import S3

AWS_ACCESS_KEY_ID = 'XXXXXXXXXXXXXXXXXXXX'
AWS_SECRET_ACCESS_KEY = 'YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY'
conn = S3.AWSAuthConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

time_stamp = time.strftime("%Y%m%d-%H%M%S")
backup_bucket = "backup"

print "Storing in %s [%s]" % (backup_bucket, time_stamp),
resp = conn.create_bucket(backup_bucket)
print resp.message

for base_dir in sys.argv[1:]:    # skip sys.argv[0], the script name
    print base_dir
    for root, dirs, files in os.walk(base_dir):
        print root
        for file in files:
            file_path = os.path.join(root, file)
            fh = open(file_path, 'rb')
            data = fh.read()
            fh.close()

            backup_path = os.path.join(time_stamp, file_path.lstrip('/'))
            print " .. %s" % backup_path,
            resp = conn.put(backup_bucket, backup_path, data)
            print " [%s]" % resp.message
This will walk through a given set of directories and try and upload all regular files it finds. Note that no handling exists for failed uploads (I did say rudimentary) or for non-regular files like symlinks.
I suppose the easiest or most reliable way to make this work across all file types would be to just back up tarballs - on the other hand that means I need the space to store the tarball, which somewhat defeats the purpose of cheaper storage.
Having gone and pushed a whole lot of stuff into my S3 space I may as well delete it (if for nothing else, then as an exercise in walking through the S3 contents).
So, here's a deletion script (clear_environment.py):
#!/usr/bin/env python

import S3

AWS_ACCESS_KEY_ID = 'XXXXXXXXXXXXXXXXXXXX'
AWS_SECRET_ACCESS_KEY = 'YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY'
conn = S3.AWSAuthConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

for bucket in conn.list_all_my_buckets().entries:
    print bucket.name.encode('ascii', 'replace')
    for item in conn.list_bucket(bucket.name).entries:
        print " .. %s" % item.key.encode('ascii', 'replace'),
        conn.delete(bucket.name, item.key)
        print " [DELETED]"
    conn.delete_bucket(bucket.name)
    print "Deleted bucket"
Probably the main thing to note here is that Amazon S3 does not store objects in a hierarchy. There are a number of base level buckets (in this case named 'backup') which are then just filled up with uniquely keyed items.
A convention among various S3 file-storage/backup solutions has been to name this key using a unix-style path structure. If one of these solutions were to access the files stored by the backup script above they would allow navigation by 'directory' even though no actual directories existed.
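As a small illustration of how that convention works (a hypothetical helper, not something from the scripts above), presenting a flat set of keys as a 'directory' listing is just a matter of grouping on the path component that follows a given prefix:

def list_directory(keys, prefix):
    """Return the immediate 'children' of a directory-style prefix
    among a flat list of S3 keys."""
    children = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):].lstrip('/')
        if rest:
            children.add(rest.split('/', 1)[0])
    return sorted(children)

# e.g. list_directory(['20080222-120000/etc/passwd',
#                      '20080222-120000/etc/hosts'],
#                     '20080222-120000/etc/') -> ['hosts', 'passwd']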
Having played around with checking SMTP services for backup MX exchanges ("Testing SMTP exchanges") I then thought it would be useful to be able to easily trigger ETRN requests. Backup MX servers tend to poll the mail server periodically to do this automatically but being impatient...
Again using smtplib, this is even quicker and easier than the testing script:
#!/usr/bin/env python

import smtplib

backup_servers = { 'mx3.zoneedit.com' : [ 'bjdean.id.au'
                                        , 'orientgrove.net'
                                        ]
                 }

if __name__ == '__main__':
    for backup_mx in backup_servers.keys():
        print ">>> Connecting to", backup_mx
        server = smtplib.SMTP(backup_mx)
        #server.set_debuglevel(1)
        for domain in backup_servers[backup_mx]:
            print ">>> >>> ETRN domain", domain
            server.docmd('ETRN', domain)
And here's what I see (with debugging turned back on):
>>> Connecting to mx3.zoneedit.com
>>> >>> ETRN domain bjdean.id.au
send: 'ETRN bjdean.id.au\r\n'
reply: '250 Queuing started\r\n'
reply: retcode (250); Msg: Queuing started
>>> >>> ETRN domain orientgrove.net
send: 'ETRN orientgrove.net\r\n'
reply: '250 Queuing started\r\n'
reply: retcode (250); Msg: Queuing started
With a few domains in tow, and a few different live and backup MX exchanges attached to those, I needed a quick way to work out what was working and what wasn't.
dnspython and smtplib make for a very quick script which tells me everything I need to know.
With a few quick code adjustments I can dissect the failures or view the complete SMTP transcript - particularly handy if I'm discussing issues with upstream providers.
Here's the code:
#!/usr/bin/env python

import smtplib
import dns.resolver

domains = [ 'mydomain.id.au'
          , 'myotherdomain.org'
          , 'bjdean.id.au'
          ]

def test_domain(domain):
    print "Testing", domain

    for server in dns.resolver.query(domain, 'MX'):
        test_smtp(domain, str(server.exchange).strip('.'))

def test_smtp(domain, exchange):
    print "Sending test message via exchange", exchange
    fromaddr = "test_smtp_servers-FROM@%s" % (domain)
    toaddr = "test_smtp_servers-TO@%s" % (domain)
    subject = "Test via %s for %s" % (exchange, domain)
    msg = "From: " + fromaddr + "\r\n" \
        + "To: " + toaddr + "\r\n" \
        + "Subject: " + subject + "\r\n" \
        + "\r\n\r\n" \
        + subject

    server = smtplib.SMTP(exchange)
    #server.set_debuglevel(1)
    try:
        server.sendmail(fromaddr, toaddr, msg)
    except Exception, e:
        print "EXCHANGE FAILED:", e
        #import pdb; pdb.set_trace()
    server.quit()

if __name__ == '__main__':
    for domain in domains:
        test_domain(domain)
And here's what I see:
Testing mydomain.id.au
Sending test message via exchange mx1.mydomain.id.au
Sending test message via exchange mx2.mydomain.id.au
Testing myotherdomain.org
Sending test message via exchange myotherdomain.org
Testing bjdean.id.au
Sending test message via exchange mail.bjdean.id.au
Sending test message via exchange mx2.zoneedit.com
EXCHANGE FAILED: {'test_smtp_servers-TO@bjdean.id.au': (554, '5.7.1: Relay access denied')}