Is Domino The Best Tool - Scalability & Readers Fields
Maybe you can help me decide how suitable Domino is for a project I've been approached about.
The client is a university with about 20,000 students. They want to give each student the ability to blog. Sounds simple enough, right? Well, the more I think about it, the more involved it becomes. You see, each entry in each student's blog is to be access controlled. Using people and/or groups, the blogger can control who sees what they've written.
The most obvious solution is to give each student their own NSF version of the blogging template and stick some readers fields in there. Not so fast though.
For obvious reasons the client doesn't want 20,000 NSFs to look after. Not only this but the whole thing should be searchable, so using one vast NSF is the only real way to go.
My concerns, which I've expressed to the client, are about how well Domino will cope with so many readers fields in a database that could potentially be quite large. However, they have an investment in Domino and most in-house admin skills are Domino-based, so they are really keen to see it work on the infrastructure they have. Hence I've come to you guys for help. What are your thoughts? How well do readers fields perform in large views? Not just views, but in the inevitable LotusScript searches needed to build content on the fly.
Even assuming half the students take up the chance to blog and only half of them do it regularly - say twice a week - we could still be looking at 1,000,000 entries per year.
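A quick sanity check of that estimate, as a sketch in Python (the take-up and posting rates are the post's own assumptions, not measurements):

```python
# Back-of-the-envelope check of the figures above. The take-up and
# posting rates are the guesses stated in the post, not measurements.
students = 20_000
bloggers = students // 2            # half take up the chance to blog
posts_per_year = bloggers * 2 * 52  # two entries a week, 52 weeks
print(posts_per_year)               # 1040000 - roughly a million
```

If only the regular half of those bloggers actually keep posting, the figure halves to around 520,000, still well into six figures per year.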
How much of a factor is the server's hardware? Apparently it's quad-core and has oodles of RAM.
What do they mean by "look after 20,000 nsf's"? You could create an admin interface that creates and/or archives the NSFs, and use domain search. Use an agent to expire a blog after x number of months/years. That way they could lock down blogs at the DB level and at the entry level. Also, this allows for easy URL naming (mySchoolDomain/blogs/studentName.nsf).
I can think of better tools, but 20,000 databases is not that big of a deal.
I don't really see the problem with using many databases. They can be searched using Domain Search on DBs marked as 'blog'. They'll all be template-based, and the benefits of retaining access should one go wrong or need restoring, etc., outweigh any of the pitfalls I can see.
As I type this, I see Jeff has replied and I totally agree: an admin interface will simplify/automate many of the administrative burdens.
Interesting. Without any evidence to speak of I'd assumed that domain search was a bit of a joke and didn't really work.
Are you both telling me it's worth my time to investigate it further?
Does domain search take readers fields into account?
Another requirement is that there be a main homepage for it all where you see the most recent entries from across all students and a list of most updated blogs. Maybe each blog could "ping" one main NSF where a list of all new updates is kept and monitored? Interesting thought.
Sorry, I agree with Ian and Jeff: build the right tools to manage the 20,000 NSFs and admin becomes less of an issue. Let's face it, they are students and someone is bound to do something wrong and end up deleting all their blog entries; at that point, restoring a single blog NSF is far easier than unpicking one huge NSF and putting back the missing entries...
We have a mobile phone record database. Within one year the database grew to 500,000 records, with approximately 200 people able to edit their records.
We have now introduced an archiving system to speed up the database and remove archived documents. The whole thing was a pain in the @$$, so speed was needed: opening a view that needed indexing took 4-5 minutes before the view was visible.
I would say, go for the 20,000 databases, put them in a category and do a category search.
I can't speak for domain search, but think of it this way: would you rather have one database that is slow for all 20k users while its view rebuilds, or a handful of databases being rebuilt while the others stay static?
From my experience, large databases in domino can be VERY fast, but they have to be fairly static (or you have to have DOZENS of drives to stripe the data over).
Graham makes a similar point with backup/restore issues.
I don't see how 20,000 blog files differ from the 20,000 mail files that they already have to manage. I'd just use the ND7 blog template - it would be nice if IBM made this a check box when creating users.
OK, I did a huge project a few years back where the client only wanted one DB for their records. It ended up with over 1.5 million documents, all with Readers/Authors fields everywhere. The database topped out at over 9.5 GB and, due to the view indexes, was slower than treacle on a cold day.
I would recommend using separate DBs. It's not so bad. The only thing you really need to write is the creation/setup routines. Obviously, with Domino's shared template system, the actual DBs wouldn't even have the design in them - only data.
Putting everything in one DB would be a serious mistake in my experience. The indexes needed to keep the database working would seriously cripple a server.
I'm not sure why you would need readers fields though, unless you are specifically locking individual entries. Surely ACLs would be easier to handle?
Hmmm, if they want to display a list of the newest entries while also letting the students control who can see their entries, then it's going to be fun and games with Readers fields in your "main NSF" anyway. I was going to suggest something RSS-related, but again you lose the control the students have over access if you centrally sweep the feeds up for a "Latest blogs" display.
I still think the idea of separate DBs is the best way to go, though. As the last two entries mention, view indexes will kill the single-DB solution. I hadn't thought of that one, but from experience with large DBs they take significant time to update, and the potential rate of updates here makes it worse.
Taking your figures of 20,000 students, with a 50% take-up and each blog updated twice a week, that's 20k entries per week, or just under 3k per day; assuming 24-hour access, that's one entry every 30 seconds.
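The arithmetic behind those rates, spelled out as a sketch (same assumptions as the post above):

```python
# The posting-rate arithmetic from the post above, spelled out.
entries_per_week = 20_000 * 0.5 * 2   # 50% take-up, two posts each
per_day = entries_per_week / 7
gap_seconds = 86_400 / per_day        # seconds per day / daily posts
print(round(per_day))       # 2857 - just under 3k per day
print(round(gap_seconds))   # 30 - one new entry roughly every 30s
```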
Yep 20,000 dbs is the way to go.
From the help file for Domain Search {Link} it says:
"Use Domain Search for less active databases such as archives and product specifications."
Is there a reason for this? Maybe Domain Search only updates very infrequently? Anybody know how often? The help file doesn't say.
Students are going to expect their new entry to appear in the search results almost instantly. A delay of minutes/hours would probably be an issue.
Jake
Back in the last century I worked on a Notes R4.5/6 newspaper editorial system. Every document had reader & author fields containing nested groups. The business had insisted on replicating the legacy mainframe system which translated into a single instance with 2 views of everything.
Despite the fact that the server was Sun's biggest and fastest box, maxed out with super-fast I/O etc., the server used to freeze at peak production times. It turned out that it was down to the sub-editors. They only had access to the Subs basket, which was way down the view index. With 160+ concurrent users creating and modifying stories that had complex script in the QueryOpen/QuerySave/QueryClose events, Domino died trying to find the 64K chunk of view to return.
Once the business accepted replacing their dependence on 2 monster views of everything with a series of 'news desk' views, each containing 4-12 baskets, the servers didn't freeze. There was still a view of everything, used for searching by a small number of 'super users'.
Shortly after, R5 arrived with major view indexing and threading improvements.
Creating separate databases for each desk was not a practical solution - everybody was working on a single title. In the case of your 20,000 blogs there are 20,000 independent 'titles'.
>For obvious reasons the client doesn't want 20,000 NSFs to look after
That's what we use computers for... Rather than creating a monolithic database, let Domino do what it does best - serve lots of small-to-medium-sized databases. Rather than waste time designing the mother of all I/O bottlenecks, develop a toolset for creating and deleting individual blogs.
Consider creating a separate domain, 'theblogsphere', and set up domain-wide searching. As every database uses the same main form, it should be pretty seamless.
Managing user space is 'out of the box' Domino functionality. Users can customise the look and feel of their blog and even inherit from different templates.
If you want to be really lazy you could con Domino into thinking that the .nsfs in the 'theblogsphere' domain were mail files (it won't know that the mail template inherits from the blog template...). That way the admins could apply their skills in creating policies and leveraging AdminP. And you won't need to create mail-in documents...
>Apparently it's quad-core and has oodles of RAM
Application design has big impact on i/o (see {Link} - old but still relevant).
Go with the obvious solution and address/ameliorate the client's concerns (obviously nobody wants to have to do any work...). At the end of the day, managing 20,000 users is going to require some resource. They would be deluding themselves to think that having them all in one database eliminates that requirement. Ask Exchange admins about the problems of having a single database for everybody's mail.
I hope that I have dissuaded you from going for a monolithic solution.
Gerry
This may be the system Gerry refers to above -> {Link}
I have hit the problem with view performance and reader names fields as well - although not quite on that scale.
Multi-databases is the way to go.
Jake,
The Domain Search "feature" is provided by the Domain Indexer Domino task. I believe the task is not active on a default Domino installation, so one would have to enable it.
It can be configured - when and how often it runs - and that basically means that a new blog entry would not be "searchable" until the next time the Domain Indexer task runs.
One problem you might have lurks somewhere else: there can be (as far as I know) only one domain index per domain and it will include all the DBs that are flagged to be indexed. And anybody doing a search would get results from ALL the DBs he has access to (ACL and Readers fields would be used to show only the docs that a user should see), but you could not start a "Blogs only" search. So the server and the domain hosting the blogs should have no other DBs in the Domain Index list.
Multiple DBs.
Not only for performance, but for manageability.
With a multiple-DB model, you can simply create/remove the databases as needed. You get one central directory of databases and can keep stats on them - size, access, etc.
You can also manage departmental blogs through such an interface if planned well - let it do all the ACL gruntwork.
Oh, and if you have to put any kind of re-write, redirect or configuration items in for the NAB for the webserver (subdomain per blog, for instance) then your management app can handle that too.
The only possible downside is the extra size of all those little DBs, which can be worked around by making your blogging template a Single Copy Template anyway. As these DBs will each be a single instance, with no replicas or other hassles, the normal downsides of Single Copy Templates don't really apply here. It's perfect for the job!
Oh, and on Domain Indexer...
It's slow. And you can only have one per domain, as Matjaz says. It's not what I'd use if I had to search multiple Domino Dbs.
Frankly, you'd almost be better off searching content with a more Google-esque engine. I'm being absolutely serious here - the Domain Indexer will be the weakest point in a multiple DB solution.
The Wikipedia article on web crawlers has a list of GPL search engines.
{Link}
I've not used any of them, but frankly if you want faster searching I'd try one of those on Apache (on a different port on the same server, linked to by the blogging template). It's got to be faster and more manageable, and probably easier to set up than Domain Indexing too.
Just point the spider at a list generated by your management app, and away it goes!
Hi Jake,
As for the searching, why not create a separate database that contains only searchable information about the documents? This will reduce the number of fields and documents that the FT index has to cover.
Each one of the search documents will map directly to a document within your blogs. You will of course have to write code to create, update and delete the search documents as needed.
The advantages of a search database, is that it will be quicker and easier to index, as you are only indexing documents and fields that you want as searchable.
Another plus to this, is that it will help you solve your other problem:
"Another requirement is that there be a main homepage for it all where you see the most recent entries from across all students and a list of most updated blogs."
Because this Search document is created, updated and deleted along with its parent document, you should be able to show this information easily enough.
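The mapping between a full blog entry and its stripped-down search document could look something like this sketch (field names here are illustrative guesses, not a real schema):

```python
# Minimal sketch of the "search database" idea: each blog entry is
# mirrored as a lightweight search document holding only the fields
# worth indexing. All field names here are illustrative assumptions.
def make_search_doc(blog_doc: dict) -> dict:
    """Build the stripped-down mirror of a full blog entry."""
    return {
        "parent_unid": blog_doc["unid"],     # link back to the real entry
        "blog_db":     blog_doc["db_path"],  # which student NSF it lives in
        "author":      blog_doc["author"],
        "subject":     blog_doc["subject"],
        "body_text":   blog_doc["body"],     # plain text only, no rich text
        "created":     blog_doc["created"],
    }

entry = {
    "unid": "ABC123", "db_path": "blogs/jsmith.nsf",
    "author": "John Smith", "subject": "First post",
    "body": "Hello world", "created": "2007-05-01",
}
print(make_search_doc(entry)["parent_unid"])   # ABC123
```

The create/update/delete code keeping the mirror in sync would live in the blog submission agent.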
Hope that helps.
Later
Patrick Niland
P.S. I implemented this Search database solution a couple of months ago, in a database of just over 2 million documents. The FT index went from over 2GB in size to about 250MB. It also removed over 1GB of view indexes. To add to that, searches improved by 200-400%. It also reduced the processing load on the server, etc.
P.P.S. The one thing that I have noticed with the FT index in Domino is that it indexes all forms and fields. You cannot selectively tell Domino which forms and fields you want indexed.
Hi Jake,
A couple more points:
1) Why not remove the "Readers" and "Authors" fields from the Search documents (i.e. allow all users to search this information)? This gets around the issue with the "Readers" and "Authors" fields.
Only when the user tries to open the actual document in the blog database will the "Readers" and "Authors" fields kick in.
2) Set the updating of the views and FT-Index of the Search database to schedules (e.g. hourly, daily, etc...)
3) You can remove the FT index from all of the blog databases!
Later
Patrick Niland
>> For obvious reasons the client doesn't want 20,000 NSFs to look after. Not only this but the whole thing should be searchable, so using one vast NSF is the only real way to go.
No, it's not the only real way to go.
If your problem is only searchability, you can also use DOMGLE (by Julien) to search through your 20,000 blogs... but what is the scalability of DOMGLE with 20,000 NSFs?
DOMGLE is here: {Link}
and Jake talks about DOMGLE here: {Link}
Jake,
From my experience, huge views with readers fields will definitely slow down the performance of your application. I agree with all the others on the case for multiple databases. The only problem I see is the domain search. We have also tried Google appliances for searching Domino data. The problem is that they don't respect Domino security 100%. If you have individual user names listed in the readers field, you are safe. But if you have a role in the readers field, the appliance doesn't know how to handle it and may present a link to the user even if they don't have access to the document. Users won't be able to open the document, though.
Check out the Google Mini search appliance... my client has it and I am just beginning to learn more about it, although they currently have it searching multiple NSFs with ease.
Go the sep. db route. The only real viable way to handle million+ documents with reader / author fields is to use separate views. And lots of them. In an app I had, I separated out the docs by unique id, which distributed pretty well over the letter/numbers allowed. It reduced the update operations from 15+ minutes to just over a minute. That's still not acceptable for what you're doing, which is why building an admin tool to create/manage/track databases is infinitely better. And, as luck would have it, very advantageous for yourself going forward.
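The "separate views, lots of them" trick above could be sketched like this; the view-naming scheme is purely illustrative:

```python
# Sketch of the "split one big view into many" trick described above:
# route each document to a view keyed by the first character of its
# unique ID, so no single view index carries the whole million
# entries. The naming scheme is an illustrative assumption.
def view_for(unid: str) -> str:
    return "entriesBy" + unid[0].upper()

print(view_for("4c81a7f0e2"))   # entriesBy4
print(view_for("f30b99d214"))   # entriesByF
```

With hex unique IDs that gives 16 smaller indexes, each updating in a fraction of the time of one monolithic view.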
@Matjaz, you CAN restrict domain searches to only certain categories of databases, so categorise each as 'blog' and only search against that category. However, if possible, I do like the idea of having a blogs domain.
You have two possibilities for the Searching. Database Search and Domain search.
If each database is FT indexed (set to immediate) then all students will be able to search their own blogs within a few minutes of entering a blog entry. That is bog-standard stuff.
The Domain Search takes much longer due to the sheer amount of data. In my experience the Domain Search will index about 500MB of databases per hour. That's a lot of stuff to sift through. Heck, even Google only revisits websites every couple of weeks to index/re-index pages.
Now you might be able to implement some kind of ping update, so that when a new blog entry is created a separate entry is added to a central database and indexed immediately. But there is only so much that this central DB could hold, and it will need pruning occasionally. At least it won't have the mucky views that need re-indexing - it should only be used for searching purposes. Once you have the resultant doc, you send the searcher on to the real record. But to my mind this is a really tacky search hack.
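The ping-and-prune mechanism could be as simple as a fixed-size recent list; a sketch, with the limit of 50 being an arbitrary illustration:

```python
from collections import deque

# Sketch of the central "ping" database idea above: each new public
# entry is pushed in, and only the most recent N survive, so this DB
# stays small and never needs heavy per-reader view indexes.
# The limit of 50 is an arbitrary illustration.
MAX_RECENT = 50
recent = deque(maxlen=MAX_RECENT)   # oldest entries drop off the end

def ping(url: str, title: str) -> None:
    """Called by a blog DB whenever a public entry is saved."""
    recent.appendleft({"url": url, "title": title})

for i in range(60):                  # simulate 60 incoming pings
    ping(f"blogs/student{i}.nsf/entry1", f"Post {i}")

print(len(recent))         # 50 - pruned automatically
print(recent[0]["title"])  # Post 59 - newest first
```

In Domino the pruning would be a scheduled agent rather than an in-memory structure, but the shape of the idea is the same.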
DB indexing and Domain indexing are two separate tasks and should be treated as such. Short of buying a specialist movabletype kind of setup, I think this is your better option if you want to pursue the Domino option.
After all, Domino R7 has all the necessary blogging templates straight out of the box. All you need to do is some fancy admin work and you're sorted.
Ok, I am a bit rusty on the Domain search, but Patrick's idea of a central mini-index database has some merit to it. It solves the search challenge, and makes the "most recent posts" requirement even easier.
Possibly just post content, student name, and the URL to the post would be enough.
On the subject of multi-DB searching, Nathan Freeman has posted an alpha version of "Haystack" on OpenNTF ({Link}), which, although still alpha code, may be a good place to start. I haven't personally played with the code but wanted to make sure you were aware of it.
Jake,
It's a big Domino proposal - but not that big. It will all depend on how you build and optimise the infrastructure, but totally do-able. If you want to mail me about the admin side of it, feel free. Sounds cool
(and a whole lot better than Lotus Connections...)
Quick thought...take a look into the VIEW_REBUILD_DIR NOTES.INI setting from the perspective of moving view index processing to a potentially more robust set of drives.
@Matjaz and @Philip
You CAN have more than one Domain Index within one domain.
In our production environment, we have one domain across several locations. Each location has a separate domain index on its server, indexing locally available data(bases) only. This and the index update interval can be configured in the server document(s).
Provided the searching user is authenticated, the search result list will only contain results which the searching user has access to. The indexing server must also have read access to the indexed documents. Consider a GlobalReaders role in the reader field, assigned to the LocalDomainServers group.
I expect performance should be OK if you dedicate a server to the blogs and their index, activate SCT for the blog template, and/or move static content (images and so on) to a single resources NSF or even to the file system.
It looks as though Declan is thinking about a system with some similar ideas
{Link}
As for searching across multiple databases, check out Andrew Pollack's NCTSearch:
{Link}
It's a Notes-centric solution and is up and running in no time.
An alternative:
IBM OmniFind Yahoo! Edition {Link} It's a crawler (therefore, I don't know how well it works with Notes), and it's free.
I know this may be a very unpopular idea (both with the hardened domino people and the client) but I would consider proposing an alternative solution using a relational database backend and writing the middleware in java.
This would give you the scalability and manageability of the data while allowing complete control over how database searches are performed. The memory and CPU overheads inherent in maintaining so many B-tree indexes are also something you wouldn't have to worry about, which would greatly assist the application's performance. And getting the last n blog entries is simplicity itself.
You can still leverage Domino's NAB for authentication, but you may even want to swap out the HTTP stack for Apache, which could also act as a proxy and pass *.nsf requests to the Domino server. Authorisation would come from the relational database, as per the end users' blog entry-level specification.
I have been developing on and extending Domino for a long time now - and please don't get me wrong, it's a great product - but sometimes it's better to use the most suitable tool(s) to address a problem, and that's not always Domino. It may not be what your client wants to hear either, but it's better than not giving them the option and getting stuck in a potentially bad position. I hope this helps.
Garth. I thought SQL almost straight away and have talked about it with them. If all else fails they're willing to go down that route, but want to try and keep it Domino because they have backup routines in place etc.
A big thanks to the rest of you! It looks like the consensus is to have multiple DBs. This was what I'd originally thought too, but the client suggested they preferred a single DB. Based on what I've learnt here (Domain Search and "data-only" DBs) I'm going to strongly suggest otherwise.
I'll let you know in due course what happens and how I get on with Domain search.
Jake
Jake, with this customer I would personally not have recommended Notes/Domino anyway, since there are many very nice blog platforms on the internet for no pounds/euros... so simple, no admins required, and Google does the rest...
Jaap. It's never that easy. I didn't recommend Notes. As with all my clients, they chose to come to me as they already knew they wanted a Notes solution.
I guess you know this already, but as a university your client can use Google search for free, and possibly set up an account just for the blogs. (Though I've noticed that Google can find its way into parts of a Domino app that you didn't think would be public...)
{Link}
With reader fields and a million documents, forget about a single database. I've worked with a commercial product that did that, and performance became a bitch at 40,000 documents.
The vendor implemented a fix, but at the cost of enormous view indexes (> 2 GB). This fix basically involves resolving all the groups in reader fields and sticking all the names into a single field. The views used by users would then be categorized by that field and users would access the view with &RestrictToCategory=<user's name>.
The problem is due to the fact that the server needs to find a sufficient number of documents the user can see before any data is returned to the user. Now imagine the poor soul who has read access to only a handful of your 1 million documents. If these are spread out all over the db, the server basically needs to scan the entire million documents before the user gets his view.
The fix mentioned above causes each document to appear many, many times in the view, once for each person who can read the doc. That's why the view index becomes so large. The 2 GB view index was in a DB with 40K docs.
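The index blow-up described above is easy to quantify: the categorized view holds one row per (document, resolved reader) pair, so a modest average readership multiplies the row count. A sketch with illustrative numbers:

```python
# Why the flatten-readers-into-a-category fix bloats the view index:
# every document is repeated once per resolved reader name, so the
# row count is documents x average readers, not just documents.
# The average of 25 readers per doc is an illustrative assumption.
def view_rows(docs):
    """Total categorized view rows: one row per (doc, reader) pair."""
    return sum(len(readers) for readers in docs)

# 40,000 docs, each readable by an average of 25 resolved names
docs = [["name"] * 25 for _ in range(40_000)]
print(view_rows(docs))   # 1000000 - a million rows from 40K docs
```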
To help with your ft issues, you can have a look at an open source project called Lucene. This is the same engine that IBM/Yahoo uses in their omnifind product.
I admit I have never used it (yet), but I have seriously looked at it for implementation in searching several multi GB databases (in my case they are mostly static). It does seem very easy to integrate it with domino, either by using Java agents, or a scheduled java program using DIIOP to access the server.
There is also a specific filter class that can be used to filter data based on access control, but you would obviously have to build this filter yourself.
Jake, if you take a look at Lotus QuickPlace's structure, it creates a DB for each place, based on a "template" DB. I'm not that familiar with it, but I think the Lotus team spent time on this very "multi-DB or not" question. I think this is a nice argument to give your client.
Reader names are no concern in a blog, I suppose. If you blog, you want to share PUBLICLY, so I suggest NOT using RN, only AN; that way you have no performance problem.
I think the main concern is disk space: you would want to enforce quotas when you have many databases. Also, if you full-text index each DB, that takes disk space as well. Don't forget that view indexing runs faster if given its own disk partition, but make sure it is not smaller than a few GB.
IBM had tens of thousands - something like 11K - of mail files on one server when Lotusphere had a large conference attendance. I have a friend with 40,000 on Linux - he is an ASP solution provider. Remember that if everyone has the same database design, use a single template source, as that saves space as well.
HTH
I'll probably have the unpopular opinion as well, but this application is well beyond the groupware/loosely-structured/disconnected/RAD fit points that Domino excels at.
Some databases (SQL Server 2005 and Oracle that I know of) can automatically partition the database into separate physical files based on criteria you specify, such as blog owner (or month, or anything else). This typically simplifies archive, delete and restore requests in addition to the performance gain. If you issue a statement to drop all records by owner X, the database is often smart enough to know that the statement can be handled by merely dropping the partition file rather than going after individual rows. Why try to write functionality that is already available?
My guess is that the users are going to be editing with an online editor rather than a Notes client, and therefore Notes doesn't provide any advantage in terms of content editing, or storage for that matter. If anything, Notes has limitations on text fields (for HTML storage) that you'll wind up having to code around.
While domain searching is fast, you cannot narrow the search based on field criteria (such as NOT [status]=draft), and you can't manipulate the results since it's not returning a collection.
A use case like this could benefit from some amount of partial-page caching, and you won't get that from Domino (by itself).
I think sometimes Domino gets a bad rap when we shoehorn it into a use case that does not fit the platform. What the customer wants is completely separate from what the right choice is. Knowing full well that you are a respected Domino developer, if you tell them this is not a good Domino fit, they should respect the fact that you are trying to steer them towards the best solution and not towards one that puts pounds in your pocket. You don't pull up a roadway with a chisel, right?
Hi Jake,
Almost everyone is talking about the necessity of splitting. In this case I think the golden mean would be managing the granularity. For example, your solution could be designed according to the following formula:
20,000 students = 2,000 students * 10 dbs
In addition, the most active students should be distributed evenly among those databases.
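One way to get that even spread with no manual balancing is to hash each student to a database; a sketch, where the file-naming scheme and hash choice are illustrative assumptions:

```python
import hashlib

# One way to realise the "20,000 = 2,000 x 10" split suggested above:
# hash each student's name to one of 10 shared databases. Hashing
# spreads active and inactive students roughly evenly with no manual
# balancing. The file-naming scheme is purely illustrative.
DB_COUNT = 10

def db_for(student: str) -> str:
    digest = hashlib.md5(student.encode("utf-8")).hexdigest()
    return f"blogs/shared{int(digest, 16) % DB_COUNT}.nsf"

assignments = [db_for(f"student{i:05d}") for i in range(20_000)]
print(len(set(assignments)))   # 10 - every database gets used
```

Because the assignment is a pure function of the name, any agent can recompute which DB holds a given student's entries without a lookup table.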
Bogdan
This sounds like a cool project!
I tend to agree with the consensus and would go the route of individual databases, building an admin tool and landing page as a front end.
Warren, I'm interested in your comment on Lotus Connections. What is it you have seen that leads you to the conclusion that it wouldn't be up to the job? Surely this is the type of situation that IBM would be looking to cover with the product? To avoid hijacking Jake's topic, I have posted more on this here: {Link}!
2cts:
1ct: I love the stable and simple design of Ferdy Christant's blog template (XHTML).
2ct: Why bother with reader fields? Why not just store the login name (the email address) and the password in the database, show/hide subforms with content and functionality, and enable SaveOptions with a cookie?
My suggestion would be to have a main central 'registration' database where the students would sign-up for their blog. Code in that registration process would add the user to an existing shared blog database (say 10 users per) or a new one if the limit was exceeded (maintained in profile docs).
And for the 'latest entries' issue, make an agent part of the blog submission that qualifies a blog entry (access for all to read) and then sends every new entry to that Main registration db.. From there you can keep the last 50 submitted and delete the rest on a scheduled basis.
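The sign-up flow described above boils down to "fill the current shared DB until it hits the limit, then open a new one"; a sketch, with the limit and naming as illustrative assumptions:

```python
# Sketch of the sign-up flow described above: add each new student to
# the current shared blog DB until it holds 10 users, then create a
# fresh one. The limit and naming scheme are illustrative.
USERS_PER_DB = 10
databases = []   # each inner list stands in for one shared blog NSF

def register(student: str) -> str:
    """Assign a student to a shared blog DB, creating one if full."""
    if not databases or len(databases[-1]) >= USERS_PER_DB:
        databases.append([])                  # open a new shared DB
    databases[-1].append(student)
    return f"blogs/shared{len(databases) - 1}.nsf"

for i in range(25):
    register(f"student{i}")

print(len(databases))   # 3 - 25 students need three DBs of ten
```

In the real thing the "current DB and its member count" would live in the registration database's profile documents, as the post suggests.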
Hey there Jake.
As far as searching that many NSFs goes, I have to second Gerald's opinion: Andrew Pollack's NCTSearch is the only way to go: {Link}
-Devin.
I think Domino is not suitable for the needs you face. The great quantity of users, and the even bigger number of blog entries, will exhaust resources that would otherwise be enough for a different choice like an RDBMS plus Java/.NET/PHP - whether you create an NSF db for each user, or add much complexity to the design by creating one db and controlling rights with readers fields.