betabug... Sascha Welter


15 April 2006

Pushing The Limit On BTreeFolder2

Who needs SQL anyway?
 

The last few days I've been working on one part of our app that holds lots of small bits of information. They are structured very uniformly, so the thought came up to take them out of the ZODB and put them into an SQL database. But so far our app has no need for SQL anywhere, and installing an RDBMS on our servers and developers' machines would add a lot of complexity to the setup. BTrees in Python, Zope and the ZODB offered some options that I wanted to test and explore first...


Holding on to an object oriented structure and the ZODB, I can see some main strategies for implementing things:

  • Use BTrees in a main "bucket" style object and keep our bits of information contained in one or several BTrees.
  • Use BTreeFolder2 as a "bucket" to store other Zope objects (in this case based on SimpleItem), which in fact boils down to the first strategy behind the scenes, but the hard parts are already being taken care of.
  • Spread the objects out over a structure of folderish Zope objects, so that no single object is overloaded with too many objects inside it. This works really well if there is some inherent parent-child structure between your objects, but it doesn't work out in this part of our app.

    What do we expect?

    From this point of "brain storming" on I gathered some facts, mainly "how many objects are we expecting?", "how can they be structured?" and "how do they need to be accessed, reported on, etc.?" It turns out we do not expect that many objects, especially if we implement some kind of housekeeping. Most of these objects have to hang around for about a year and a half; after that they are old news. If we do not fall into the old "let's ignore that our db will grow forever" trap and plan for removal of our old objects, we can assume a maximum of about 8000 - 10000 main objects, each with at most 15 - 20 subobjects. Not that much, but we still have to deal with a) a lot of objects for a simple Zope folder and b) a lot of objects in total.
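The estimate above can be turned into a rough worst-case total. This is only a back-of-the-envelope sketch: the numbers are the ones quoted in the text, while the function name and the assumption that every main object carries its full complement of subobjects are mine.

```python
# Ranges quoted in the text; the "all objects are full" case is an assumption.
MAIN_OBJECTS = (8000, 10000)      # expected range of main objects
SUBOBJECTS_PER_MAIN = (15, 20)    # "at most" range of subobjects per main object


def total_objects(main_count, subs_per_main):
    """Main objects plus all of their subobjects."""
    return main_count + main_count * subs_per_main


low = total_objects(MAIN_OBJECTS[0], SUBOBJECTS_PER_MAIN[0])
high = total_objects(MAIN_OBJECTS[1], SUBOBJECTS_PER_MAIN[1])
print(low, high)
```

Since 15 - 20 is an upper limit per main object, the real totals should land below these figures.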

    My goal was to build the stupidest solution I could get away with.

    That sounds dumb, but in real life it's a good plan. A "stupid" solution for me means reusing as many things as possible from standard Zope products, relying on normal Zope object behavior, and refraining from coding "clever" hacks and special stuff nobody understands after 6 months. So doing things the "normal" Zope way would be my choice unless poor performance forced me to do otherwise. Stress testing and mass testing would have to show whether I was on the right track.

    The outline of the implementation

    I went on and hacked my initial code together and got it running both in some unit tests and in a minimal version of a user interface. Basically it's a class based on BTreeFolder2 as the main container, which by default contains a ZCatalog. Inside the main container sits the great mass of objects, based on OrderedFolder. And in those are the smaller subobjects, which are based on SimpleItem. Both the main mass of objects and all the subobjects are cataloged in that ZCatalog.

    Something is wrong

    Then I wrote another method. This "mass_bash" method would create 1000 objects, each with between 0 and 28 subobjects. There is some variation in that, so the contents of the fields are not all the same, but I did not care much about real randomness. My first tests started well enough: the mass_bash method ran through and my main bucket still worked. But with each run of mass_bash (and another 1000 objects), the time required to add new objects increased a lot. At first mass_bash ran really fast, after a while it took half an hour, then an hour, later even one and a half hours to run! Something was clearly wrong.
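A bulk-fill method of this kind could look roughly like the sketch below. It is not the original mass_bash: plain dicts stand in for the BTreeFolder2-based bucket and its OrderedFolder/SimpleItem contents, and the field names, the `rng` parameter and the ID scheme are made up for illustration.

```python
import random


def mass_bash(bucket, count=1000, max_subobjects=28, rng=None):
    """Add `count` main objects, each with 0..max_subobjects subobjects.

    `bucket` is a plain dict standing in for the real container (assumption).
    """
    if rng is None:
        rng = random.Random()
    start = len(bucket)
    for offset in range(count):
        main_id = str(start + offset + 1)
        subobjects = {}
        # randint is inclusive on both ends, so 0 and 28 are both possible.
        for sub_no in range(1, rng.randint(0, max_subobjects) + 1):
            subobjects[str(sub_no)] = {
                "title": "sub %d of %s" % (sub_no, main_id),
                "value": rng.randint(0, 9999),  # some variation, not real data
            }
        bucket[main_id] = {"title": "main %s" % main_id, "subobjects": subobjects}
    return bucket


bucket = {}
mass_bash(bucket, rng=random.Random(42))   # first run: 1000 objects
mass_bash(bucket, rng=random.Random(43))   # second run: another 1000
print(len(bucket))
```

Running the method repeatedly against the same bucket, and timing each run, is what makes a slowdown like the one described here visible.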

    At first I suspected my extensive use of metadata columns in the catalog. The general wisdom is that metadata columns in a ZCatalog make retrieval of object attributes a lot faster, but make writes to the catalog take a lot longer too. So out went the metadata. To my dismay that did not make much of a difference. The workday in question came to an end and I had to sleep on it. Which was a good idea, as with the dawn my mistake also dawned on me. I had made a stupid coding mistake.

    Conclusion: Who needs SQL anyway?

    My code for assigning new IDs to new objects went through all possible integer IDs, starting from the beginning, till it found a "free" one. Nice strategy when you expect 20 objects (or even 100) in a container. But looping through 5000 numbers each time you add another object can't really be called optimal. I'd call that "fucking stupid" on my part. It took me a short while to devise a new strategy. My first idea was to ask the catalog for the highest ID, i.e. make a simple query, sort by ID, convert to integer and take the next one. That didn't work, because the catalog's idea of sorting isn't the same as an integer's idea of sorting: the IDs are strings, so sorted as text "10" comes before "9". But given the fact that I allocate all my IDs in the same way, even through the same method, I had another option. I asked the catalog for the count of objects of that given meta_type in my "bucket". That count is my starting point for finding the next "free" integer to be turned into an ID. That worked fine.
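The difference between the two strategies can be sketched in a few lines. This is an illustration under assumptions, not the original code: a plain dict stands in for the container, `len()` stands in for the catalog count query, and both function names are invented.

```python
def next_id_by_scanning(container):
    """The slow approach: try 1, 2, 3, ... until an unused ID turns up."""
    candidate = 1
    while str(candidate) in container:
        candidate += 1
    return str(candidate)


def next_id_from_count(container):
    """Start just past the object count and step forward only if needed."""
    candidate = len(container) + 1
    while str(candidate) in container:
        candidate += 1
    return str(candidate)


# 5000 existing objects with IDs "1" .. "5000".
container = {str(n): object() for n in range(1, 5001)}
scanned = next_id_by_scanning(container)   # 5000 membership tests
counted = next_id_from_count(container)    # a single membership test

# Why "sort by ID and take the last one" goes wrong with string IDs:
ids = ["9", "10", "100", "99"]
as_text = sorted(ids)             # text order
as_numbers = sorted(ids, key=int)  # numeric order
print(scanned, counted, as_text, as_numbers)
```

Both functions return the same ID here; the second one just gets there without walking over every ID that is already taken. If objects have been deleted, the count is lower than the highest ID in use, which is why the forward-stepping loop stays in place.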

    Update: d2m informed me on #zope that BTreeFolder2 offers its own method generateId, which should do what I need. Well, yeah, reading the API is always a good idea :-). Thanks d2m!

    The result? Even with 12000 or 13000 objects, the script that adds another 1000 objects takes about 1.5 minutes instead of 1.5 hours. Optimizing stupid code gives instant gratification. I also learned that a BTreeFolder2 and ZCatalog combination doesn't break a sweat with 15000 objects in the BTreeFolder2 and about 80000 - 100000 in the ZCatalog. The user interface, which displays items in batches of 50 and adds and edits such objects, is responsive even when both the Zope server and the browser live on my own workstation (a dual G4 1 GHz with a constantly overfilled, fragmented HD). I expect acceptable performance on our production server, though I might still do some testing with ab or siege. But SQL is out of the game for now.

    Posted by betabug at 17:33 | Comments (4) | Trackbacks (0)
    Comments
    Re: Pushing The Limit On BTreeFolder2

    Nice post! The question for me (who always uses PostgreSQL to store data) is: what about searching inside such data? I usually use an SQL database to store data because managing such data (running queries on it, for example) is a lot easier. I know you can search the ZCatalog too, but I don't know how it compares to an SQL database.

    Another interesting question is how you back up all that data. With PostgreSQL, for example, you can use pg_dump to create a simple SQL file with all the contents of the database in it. You can later use that file to recreate the database on this server, or on another one...

    Posted by: Wu at April 20, 2006 10:56
    Re: Pushing The Limit On BTreeFolder2

    Hey Wu! Thanks for the comment! I think I'll write another post to answer your questions. Searching is really what the ZCatalog is all about, but the object oriented background makes things a lot different than SQL...

    Posted by: betabug at April 26, 2006 11:47
    Re: Pushing The Limit On BTreeFolder2

    Interesting. One problem I've had with BTrees vs. SQL is sorting. Dieter Maurer's AdvancedQuery might have the answer but I haven't looked at it yet. Have you?

    Posted by: Peter Bengtsson at July 07, 2006 16:45
    Re: Pushing The Limit On BTreeFolder2

    I'm using BTreeFolder2 for the first time. I followed the test code and was able to add BTreeFolders and study them. If I now want to add something else, say a dictionary or a string:
    f = BTreeFolder2('sufest')
    f2 = BTreeFolder2('somefolder')
    f3 = BTreeFolder2('somefolder2')
    f._setObject(f2.id, f2)
    f._setObject(f3.id, f3)
    uuid = f.generateId()
    str1 = '111'
    f._setObject(uuid, str1)

    It throws an error: Error Value: 'str' object has no attribute '__of__'.

    When you say you can add items to a BTreeFolder, how do you add other objects like dictionaries, strings or OOBTrees to a BTreeFolder?

    Posted by: sz at November 25, 2008 16:52