July 1996 Date: Tue, 02 Jul 1996 13:59:56 -0700 To: genweb@UCSD.EDU From: Jeff Murphy Subject: html generators Pam Carey is in the process of trying to compile a complete list of html generators, their authors, URLs for download, etc. If you know of one, or are working on one, I know she would like to hear from you to include you in the list. Jeff Murphy 735 NW 8th Redmond, Oregon 97756 h. (541) 548-4478 Specializing in the genealogy of Muhlenberg Co., Kentucky USA GenWeb Project: http://www.teleport.com/~jmurphy/states.html http://www.dsenter.com/lists/states.html to subscribe to mailing lists From: "George Waller at Home" Organization: University of Connecticut To: genweb@UCSD.EDU Date: Wed, 3 Jul 1996 23:54:43 -0400 Subject: Re: Proposed unique ID Reply-to: gwaller@lib.uconn.edu Priority: normal X-mailer: Pegasus Mail for Windows (v2.23) Message-ID: John Rigdon makes a very good proposal below. I would expand his ID with four more elements: Soundex for middle name, place of birth, place of death, and gender. So I would be: W525G333A12519460508IN??????????M (making up the soundex codes, lazy fellow that I am :-) Suggest ?? for unknown rather than XX Am assuming these codes would be computer generated so the cost would be storage space and processing time for the match. Also like adding the locations since the field could be used for searching too. George Waller, hbladm1@uconnvm.uconn.edu On 12 Jun 96 at 19:22, JohnR238@aol.com wrote: > In a message dated 96-06-12 10:21:24 EDT, you write: > > >How many elements do we need to make an id unique using only the > >elements in GEDCOM? surname+firstname+birthyear+ ....?? > > It's interesting that you make this first step in the direction of > defining the ID, because we've been moving in the same direction > with the KYGENWEB project and my Genealogist's Index to the World > Wide Web. 
>
> Here is my proposal which will generate an almost unique ID for
> datasets where we have complete data, and for incomplete data will
> still allow easy correlation.
>
> LAST NAME Soundex
> First Name Soundex
> BIRTHDATE
> DEATHDATE
>
> Thus a complete id for me would be 24 characters
>
> R235J50019530818xxxxxxxx
>
> and hopefully some kind soul will someday fill in the rest of the
> x's.
>
> For incomplete data there would be more likelihood of index
> collisions, but this could be handled within the search / indexing
> schemes.
>
> This 24 character id has several advantages.
>
> Almost unique
> Easy to decipher by humans
> easily sortable and manipulated by computers
>
> The second tier reference file(s) / indexes can identify where the
> particular id code occurs and the process then becomes manageable to
> drill down to a particular data object.
>
> Anders addresses some valid points regarding the existence of these
> objects in cyberspace or other universes, and the fact that the
> reference file can be structured to identify these other sources.
> For most of these object classes identification schemes have already
> been agreed upon (ISBN #, Library of Congress #, IGI reference, home
> addresses, SS# - whatever) and we should move in the direction of
> identifying each class that is pertinent to us and agreeing to call
> it the same thing in the GEDCOM tags.
>
> The fact that the current GEDCOM spec does not provide for this
> means that we should collectively agree to proceed with a standard
> TAG or NOTE or CONT (or whatever) followed by our object
> class:reference location:individual id tag.
> As we move forward, the GEDCOM spec and software vendors will
> incorporate it.
>
> Comments are welcome.
> > John Rigdon > GEN WEB Master > The Genealogist's Index to the World Wide Web From: "George Waller at Home" Organization: University of Connecticut To: Alf Christophersen Date: Thu, 4 Jul 1996 10:57:00 -0400 Subject: Re: Proposed unique ID Reply-to: gwaller@lib.uconn.edu CC: genweb@UCSD.EDU Priority: normal X-mailer: Pegasus Mail for Windows (v2.23) Message-ID: <16534C5889@libstaff.lib.uconn.edu> On 4 Jul 96 at 8:53, Alf Christophersen wrote: > At 23:54 03.07.96 -0400, you wrote: > >John Rigdon makes a very good proposal below. I would expand his > >ID with four more elements: Soundex for middle name, place of > >birth, place of death, and gender. > > > >So I would be: W525G333A12519460508IN??????????M > > > >(making up the soundex codes, lazy fellow that I am :-) > >Suggest ?? for unknown rather than XX > > > >Am assuming these codes would be computer generated so the cost > >would be storage space and processing time for the match. Also > >like adding the locations since the field could be used for > >searching too. > > > > I would also add a nation tag too, eg. soundex for country, written > in English. I try hard to not be a US centrist and failed. Of course there should be a country code plus a state/province/etc code. Sorry. --George Date: Thu, 04 Jul 1996 10:43:04 -0700 From: Pam Carey X-Mailer: Mozilla 2.0 (Win16; I) MIME-Version: 1.0 To: genweb@UCSD.EDU CC: kygenweb-l@teleport.com Subject: GED >> HTML Generator List Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit For those of you interested, here are the ged>>html generators that I have found so far. Features noted were 1)ability to link indiv. in remote databases 2) Indexable? 3) System Requirements 4) Static -vs- Dynamic (i.e. storage space req.) 5) the privacy issue 6) provisions for source citation 7) flexibility of output reports 8) photo capabilities 9) inclusion of biographical data, or notes. 
Not all features will be important to all of you as individuals, but for
the scope of our project, they all warrant discussion.

Space won't allow me to give you all the details on each program, so I've
opted to direct you to the information pages instead. I've also included
the URL for a sample of each. Check them out, read about them, then come
back with your comments. They all have something to offer. Of course, as
new programs are discovered, I'll let you know..........

If any of you are working on developing a program of your own, send us
some information and a working sample, and I'll add it to the list for
discussion.

Pam Carey Durstock
KY GenWeb Project

In no particular order:

Program: uFTi
Author: Nicholas Oughtibridge <100020.1117@CompuServ.COM>
Working Sample: http://gendex.com/~guest/oughnic/ufti/index.html
Description/Info: http://ourworld.compuserve.com/homepages/oughtibridge/ufti.htm
Download site: none yet - available thru CompuServ eventually

Program: Indexed GEDCOM Method (IGM)
Author: Tim Doyle
Working Sample: http://audio.edge.net/~gumby/genweb/Winch/Winch.html
Description/Info: http://sillyg.doit.com/genweb/igm.html
Download site: from info page

Program: WEBGEN
Author: Rex Myer
Working Sample: http://www.surfutah.com/web/webgen/
Description/Info: http://www.surfutah.com/web/webgen/tips.html
Download site: privately owned - waiting on word from Rex

Program: GED2HTML
Author: Gene Stark
Working Sample: http://pages.prodigy.com/rivette/surnames.htm
Description/Info: http://bsd7.cs.sunysb.edu/%7Estark/ged2html/
Download site: from documentation page

Program: Indexed File Method
Author: Vic Abell
Working Sample: http://www.halcyon.com/darinb/geneo/boesch/gedx.html
Description/Info: http://sillyg.doit.com/genweb/how-its-done.html
Download site: ftp://vic.cc.purdue.edu/pub/ged2WWW.tar.z (???)

Program: Derivative of Indexed File Method
Author: Brian Tompsett
Working Sample: http://www.dcs.hull.ac.uk/public/genealogy/royal
Description/Info: coming soon
Download site: none yet, contact Brian

Program: Lifelines
Author: Tom Wetmore
Working Sample: *DEMO* http://www.rahul.net/svpafug/demo1.html
Description/Info: http://genealogy.org/~ttw/lines/lines.html
Download site: ftp://ftp.cac.psu.edu/pub/genealogy/lines

From: "George Waller at Home"
Organization: University of Connecticut
To: mavrogeorge@genealogysf.com
Date: Thu, 4 Jul 1996 12:05:42 -0400
Subject: Re: Proposed unique ID
Reply-to: gwaller@lib.uconn.edu
CC: genweb@UCSD.EDU
Priority: normal
X-mailer: Pegasus Mail for Windows (v2.23)
Message-ID: <1778601C7C@libstaff.lib.uconn.edu>

On 4 Jul 96 at 8:49, mavrogeorge@genealogysf.com wrote (to me privately):

> Since we are designing something new why are we continuing to use
> the Russell Coding method for the soundex? That old X999 code has no
> redeeming virtues to justify keeping it. We should be using a more
> modern coding method that has better "hits" ratios and is not
> "English-centric".

Brian, good point... any ideas?

Thanks,
George

From: "George Waller at Home"
Organization: University of Connecticut
To: Rob Joyce
Date: Thu, 4 Jul 1996 13:27:54 -0400
Subject: Re: Proposed unique ID
Reply-to: gwaller@lib.uconn.edu
CC: genweb@UCSD.EDU
Priority: normal
X-mailer: Pegasus Mail for Windows (v2.23)
Message-ID: <18D78B2FA7@libstaff.lib.uconn.edu>

Rob,

Am equally ignorant, so our contributions may be causing the experts
some mirth.

The way I think of this is that we want to be able to link online
databases using a unique ID. My sense is that the IDs cannot be fixed
(as you indicate below). So, the linking mechanism (i.e. the URL in
conjunction with the browser) needs to be somewhat intelligent. During
the initial linking process, the URL will return matches based on the
searcher's definition of match.
For example, I might set a match to require match on surname, first name and birthdate within 10 years. From the matches returned, I might "lock in" to one of them. If the ID to which I locked changed then the linkage process would notify me. Easy to say, and probably inadequate.. but hey it's a holiday here :-) --George On 4 Jul 96 at 13:05, Rob Joyce wrote: > I'm jumping into the middle of a thread, so pardon me if it's an > ignorant question, but here goes: > > I agree that the proposals being floated will create unique or > nearly unique ID's, but how would one deal with ambiguous data? I > often have estimated or approximated dates for the people I am most > interested in (because I don't know all the facts yet). Also, as I > do learn more info about a person, or find out I had incorrect > information, their unique ID is now changed. Thoughts or comments? > > Rob > > > > On 12 Jun 96 at 19:22, JohnR238@aol.com wrote: > > > >> In a message dated 96-06-12 10:21:24 EDT, you write: > >> > >> >How many elements do we need to make an id unique using only the > >> >elements in GEDCOM? surname+firstname+birthyear+ ....?? > >> > >> It's interesting that you make this first step in the direction > >> of defining the ID, because we've been moving in the same > >> direction with the KYGENWEB project and my Genealogist's Index to > >> the World Wide Web. > >> > >> Here is my proposal which will generate an almost unique ID for > >> datasets where we have complete data, and for incomplete data > >> will still allow easy correlation. > >> > >> LAST NAME Soundex > >> First Name Soundex > >> BIRTHDATE > >> DEATHDATE > >> > >> Thus a complete id for me would be 24 characters > >> > >> R235J50019530818xxxxxxxx > >> > >> and hopefully some kind soul will someday fill in the rest of the > >> x's. > >> > >> For incomplete data there would be more likelyhood of index > >> collissions, but this could be handled within the search / > >> indexing schemes. 
> >> > >> This 24 character id has several advantages. > >> > >> Almost unique > >> Easy to decipher by hoomans > >> easily sortable and manipulated by computers > >> > >> The second tier reference file(s) / indexes can identify where > >> the particular id code occurs and the process then becomes > >> managable to drill down to a particular data object. > >> > >> Anders addresses some valid points regarding the existence of > >> these objects in cyberspace or other universes, and the fact that > >> the reference file can be structured to identify these other > >> sources. For most of these object classes identification schemes > >> have already been agreed upon (ISBN #, Library of Congress #, IGI > >> reference, home addresses, SS# - whatever) and we should move in > >> the direction of identifying each class that is pertinent to us > >> and agreeing to call it the same thing in the GEDCOM tags. > >> > >> The fact that the current GEDCOM spec does not provide for this > >> means that we should collectively agree to proceed with a > >> standard TAG or NOTE or CONT (or whatever) followed by our object > >> class:reference location:individual id tag. > >> As we move forward, the GEDCOM spec and software vendors will > >> incorporate > >> it. > >> > >> Comments are welcome. > >> > >> John Rigdon > >> GEN WEB Master > >> The Genealogist's Index to the World Wide Web > > > > > > --------------------------------------------------------------- Rob > Joyce rjoyce@clark.net 2400 Winding Ridge Road / Odenton, MD 21113 / > (410) 672-6670 > --------------------------------------------------------------- > Date: Thu, 4 Jul 1996 13:50:39 -0400 To: Pam Carey , genweb@UCSD.EDU From: Brian Davis Subject: Re: GED >> HTML Generator List Cc: kygenweb-l@teleport.com Thanks for compiling this information. You are missing Sparrowhawk, the Macintosh version of GED2HTML. See for more information. 
At 1:43 PM 7/4/96, Pam Carey wrote: > For those of you interested, here are the ged>>html generators that I > have found so far. Features noted were 1)ability to link indiv. in [bandwidth conservation snip...] ------------------------------------------------------------------------------ Brian Davis brdav1@comet.net AOL: BrianDavis Home page: http://www.med.virginia.edu/~bd2m/ Date: Thu, 04 Jul 1996 11:03:59 -0700 To: Pam Carey From: Jeff Murphy Subject: GED >> HTML Generator List Cc: genweb@UCSD.EDU At 10:43 AM 7/4/96 -0700, Pam Carey wrote: >For those of you interested, here are the ged>>html generators that I >have found so far. Features noted were 1)ability to link indiv. in >remote databases 2) Indexable? 3) System Requirements 4) Static -vs- >Dynamic (i.e. storage space req.) 5) the privacy issue 6) provisions >for source citation 7) flexibility of output reports 8) photo >capabilities 9) inclusion of biographical data, or notes. Not all >features will be important to all of you as individuals, but for the >scope of our project, they all warrant discussion. > >Space won't allow me to give you all the details on each program, so I've I wish you would go ahead and include the details in with your list. It would make it a lot more useful for me. But it is great to finally have a list to work from! One thing I'm curious about: have you considered doing some testing on a limited database to see the total space requirements? In other words, how much space is taken up by the pages generated by ged2html and uFTi pages, as opposed to LifeLines (which I assume would be the size of the gedcom)? Date: Thu, 04 Jul 1996 11:03:47 -0700 To: genweb@UCSD.EDU From: Jeff Murphy Subject: Re: Proposed unique ID At 11:54 PM 7/3/96 -0400, George Waller at Home wrote: >John Rigdon makes a very good proposal below. I would expand his ID >with four more elements: Soundex for middle name, place of birth, >place of death, and gender. 
> >So I would be: W525G333A12519460508IN??????????M > On 12 Jun 96 at 19:22, JohnR238@aol.com wrote: >> Here is my proposal which will generate an almost unique ID for >> datasets where we have complete data, and for incomplete data will >> still allow easy correlation. >> >> LAST NAME Soundex >> First Name Soundex >> BIRTHDATE >> DEATHDATE >> >> Thus a complete id for me would be 24 characters >> >> R235J50019530818xxxxxxxx It seems to me that this could be simplified even further, by running a program which would take your 33 character id and compress it. Out of curiosity, I created a file containing only your id, above, and tried using pkzip on it. Gave me 0% compression and stored it in a file of 139 characters. :-) I guess that's not the way, but surely a way could be devised.... With the development of PAF 3.0, which will have ID Number and Ancestral File Number in separate fields (I believe the gedcom standard will be 5.5 - is that right?), and similar improvements already in effect in other software, there will be a field in the gedcom available to carry this information. Then the compressed field could be contained in the field designated for the ID number. The only two problems will be to come up with a program which will generate the ID number, and one which will be able to find the ids in all gedcoms and index them. Then, as the html generator runs, it would have to hit against this index to determine whether or not the id exists in more than one gedcom. We're still limited by the access time over the Web. Jeff Murphy 735 NW 8th Redmond, Oregon 97756 h. (541) 548-4478 Specializing in the genealogy of Muhlenberg Co., Kentucky USA GenWeb Project: http://www.teleport.com/~jmurphy/states.html http://www.dsenter.com/lists/states.html to subscribe to mailing lists Date: Thu, 04 Jul 1996 11:04:03 -0700 To: genweb@UCSD.EDU From: Jeff Murphy Subject: Re: GED >> HTML Generator List At 10:43 AM 7/4/96 -0700, Pam Carey wrote: >... 
3) System Requirements 4) Static -vs- >Dynamic (i.e. storage space req.) .... One of the issues I would like to see addressed is whether or not the software in question can be run at one site against a gedcom at a different location. It seems to me this would reduce the amount of space required on any single site. The trouble is, it may not work very efficiently. But can a Unix-based program run against data on a Mac? Why not, if it's ascii data? Can Rex Myers' software be run dynamically against a gedcom and produce html for the user? Another question I had was regarding indexes. I would like to see the necessary index entries for both Gene Stark's and John Rigdon's indexes generated by whatever software product we end up choosing to use for the US GenWeb Project. Can you list what indexes, if any, are produced by the various software? Date: Thu, 04 Jul 1996 11:03:49 -0700 To: genweb@UCSD.EDU From: Jeff Murphy Subject: Re: List is ready Cc: smcgee@genealogy.org At 12:41 AM 7/4/96 -0700, Pam Carey wrote: >Hi Guys! >It's 12:40 a.m. EDT, and I finally have something presentable to the >genweb group. Only question now is, should I also post it to >KYGENWEB? I think I would like to wait on that until we've had a chance to talk over the pros and cons of the various packages. >Oops, a separate question for each of you (been wantin' to ask you >for a few days). Jeff, I don't have Scott McGee on my list. Should >he be? Scott uses Tom Wetmore's LifeLines, and I don't know if he has made improvements outside of Tom's program or not. I'll cc: this to him, and if he has modified things, hopefully he'll bring it up. :-) Jeff Murphy 735 NW 8th Redmond, Oregon 97756 h. 
(541) 548-4478
Specializing in the genealogy of Muhlenberg Co., Kentucky
USA GenWeb Project: http://www.teleport.com/~jmurphy/states.html
http://www.dsenter.com/lists/states.html to subscribe to mailing lists

Subject: Re: GED >> HTML Generator List
To: Jeff Murphy
Date: Thu, 4 Jul 1996 19:02:47 +0100 (BST)
From: Ben Laurie
Cc: genweb@UCSD.EDU
In-Reply-To: <1.5.4.32.19960704180403.008cd358@mail.teleport.com> from "Jeff Murphy" at Jul 4, 96 11:04:03 am
Reply-To: ben@algroup.co.uk
X-Mailer: ELM [version 2.4 PL24 PGP2]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Message-ID: <9607041902.aa25410@gonzo.ben.algroup.co.uk>

Jeff Murphy wrote:
>
> At 10:43 AM 7/4/96 -0700, Pam Carey wrote:
> >... 3) System Requirements 4) Static -vs-
> >Dynamic (i.e. storage space req.) ....
>
> One of the issues I would like to see addressed is whether or not the
> software in question can be run at one site against a gedcom at a different
> location. It seems to me this would reduce the amount of space required on
> any single site. The trouble is, it may not work very efficiently. But can
> a Unix-based program run against data on a Mac? Why not, if it's ascii
> data? Can Rex Myers' software be run dynamically against a gedcom and
> produce html for the user?

I'm interested to know how the HTML generator is supposed to get access to
the gedcom (at the other site)? NFS? FTP? HTTP?
Anyway - this is unlikely to work well - a system I'm working on needs to see the whole GEDCOM file, which can be several hundred k. I'd hate to have to wait for that to trickle down across the Internet in order to see one person's page. Unless, of course, we get into GEDCOM servers ... hmmm, now there's a thought... Cheers, Ben. > > Another question I had was regarding indexes. I would like to see the > necessary index entries for both Gene Stark's and John Rigdon's indexes > generated by whatever software product we end up choosing to use for the US > GenWeb Project. Can you list what indexes, if any, are produced by the > various software? > > -- Ben Laurie Phone: +44 (181) 994 6435 Freelance Consultant and Fax: +44 (181) 994 6472 Technical Director Email: ben@algroup.co.uk A.L. Digital Ltd, URL: http://www.algroup.co.uk London, England. Date: Thu, 04 Jul 1996 13:23:52 -0700 From: Pam Carey X-Mailer: Mozilla 2.0 (Win16; I) MIME-Version: 1.0 To: Alf Christophersen CC: kygenweb-l@teleport.com, genweb@UCSD.EDU Subject: Re: GED >> HTML Generator List References: <199607041558.RAA12810@ulrik.uio.no> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Alf, I'm sending a CC of this response to your messages to the groups: First, from an earlier message: Alf Christophersen wrote: > Did you put this in a html-file? Or would you like to have me do it? > If so, I can put it on http://www.sn.no/disnorge/htlmgeds.htm and you > can give me hints when new ones are detected. No, and yes. Fantastic idea! Then, the second message said: > > Try http://www.sn.no/disnorge/htmlgeds.htm > > It is not pointed to by any other ressource, until you say it is ok > to have it there. > It's OK to have it there. Have already checked it out. Looks great! Thank you, very much. OK everybody, we now have one central location to go to, to view all the different programs available, thanks to Alf. Check it out! 
Pam Carey Durstock KY GenWeb Project Date: Thu, 04 Jul 1996 14:38:25 -0700 From: Pam Carey X-Mailer: Mozilla 2.0 (Win16; I) MIME-Version: 1.0 To: Jeff Murphy CC: genweb@UCSD.EDU Subject: Re: GED >> HTML Generator List References: <1.5.4.32.19960704180359.008f5c30@mail.teleport.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Jeff Murphy wrote: > > I wish you would go ahead and include the details in with your list. > It would make it a lot more useful for me. But it is great to finally > have a list to work from! I originally planned on it, but I was afraid that the list would never hit the airwaves. I was working under a self-imposed deadline. Can and will do, but I'll need time. > > One thing I'm curious about: have you considered doing some testing on > a limited database to see the total space requirements? In other > words, how much space is taken up by the pages generated by ged2html > and uFTi pages, as opposed to LifeLines (which I assume would be the > size of the gedcom)? Yup. Started that, too. Haven't been able to test all of them yet, but I'll publish the results as soon as I've got them all together. I guarantee you'll all know me and my family *very* well by the time I'm through . Pam Date: Thu, 04 Jul 1996 14:26:07 -0700 From: Pam Carey X-Mailer: Mozilla 2.0 (Win16; I) MIME-Version: 1.0 To: Brian Davis CC: genweb@UCSD.EDU, kygenweb-l@teleport.com Subject: Re: GED >> HTML Generator List References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Brian Davis wrote: > > Thanks for compiling this information. You are missing Sparrowhawk, the > Macintosh version of GED2HTML. See > > for more information. > > Brian Davis brdav1@comet.net AOL: BrianDavis > Home page: http://www.med.virginia.edu/~bd2m/ Oh my gosh! You're absolutely right. A *huge* mistake on my part. Alf, this is another one for the list. 
I'll have to get with you tomorrow with the other URL's you'll need (on my way to a parade right now, then a cookout and fireworks). Brian - thank you so much for the reminder. Pam Date: Thu, 04 Jul 1996 14:02:37 -0700 From: Pam Carey X-Mailer: Mozilla 2.0 (Win16; I) MIME-Version: 1.0 To: mavrogeorge@genealogysf.com CC: kygenweb-l@teleport.com, genweb@UCSD.EDU Subject: Re: GED >> HTML Generator List References: <199607041547.AA05522@relay.interserv.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit mavrogeorge@genealogysf.com wrote: > > On Thu, 04 Jul 1996, Pam Carey wrote: > >For those of you interested, here are the ged>>html generators > >that I have found so far. > > you left out Family Gatherings and Roots IV. I hope you don't mind that I'm sharing your message and my response with the groups. For now, I've been concentrating on compiling a list of stand-alone html generators. The reason I've compiled the list is to help groups such as KY GenWeb and US GenWeb, to name only two, reach a consensus on which product will help us best to solve the problem of linking remote databases, keeping in mind the range of tastes, system configurations, capabilities and resources available to a VERY large group of individuals. Adding these to the list, in my opinion, would draw us away from our focus. I don't claim to be a guru about all the commercial programs available. I'm familiar with these programs (interpret that as 'I've heard of them'), but I'm not aware of their ability to link remote databases. If they can do that, please get back with me with more information, and I'll be glad to add them to the list at that time. 
Pam Carey Durstock KY GenWeb Project Subject: Re: Proposed unique ID To: Jeff Murphy Date: Thu, 4 Jul 1996 19:55:05 +0100 (BST) From: Ben Laurie Cc: genweb@UCSD.EDU In-Reply-To: <1.5.4.32.19960704180347.006fdba8@mail.teleport.com> from "Jeff Murphy" at Jul 4, 96 11:03:47 am Reply-To: ben@algroup.co.uk X-Mailer: ELM [version 2.4 PL24 PGP2] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID: <9607041955.aa25547@gonzo.ben.algroup.co.uk> Jeff Murphy wrote: > > At 11:54 PM 7/3/96 -0400, George Waller at Home wrote: > >John Rigdon makes a very good proposal below. I would expand his ID > >with four more elements: Soundex for middle name, place of birth, > >place of death, and gender. > > > >So I would be: W525G333A12519460508IN??????????M > > > On 12 Jun 96 at 19:22, JohnR238@aol.com wrote: > > >> Here is my proposal which will generate an almost unique ID for > >> datasets where we have complete data, and for incomplete data will > >> still allow easy correlation. > >> > >> LAST NAME Soundex > >> First Name Soundex > >> BIRTHDATE > >> DEATHDATE > >> > >> Thus a complete id for me would be 24 characters > >> > >> R235J50019530818xxxxxxxx > > It seems to me that this could be simplified even further, by running a > program which would take your 33 character id and compress it. Out of > curiosity, I created a file containing only your id, above, and tried using > pkzip on it. Gave me 0% compression and stored it in a file of 139 > characters. :-) I guess that's not the way, but surely a way could be > devised.... For short things, compression only works with a fixed compression table. The usual way to proceed would be to create a large example base of codes, and generate Huffman codes from it. These would then be cast in stone and used by all programs dealing with IDs. However, as I've pointed out before, the thing about IDs is that they shouldn't change. These do not fit that bill. 
> > With the development of PAF 3.0, which will have ID Number and Ancestral > File Number in separate fields (I believe the gedcom standard will be 5.5 - > is that right?), and similar improvements already in effect in other > software, there will be a field in the gedcom available to carry this > information. > > Then the compressed field could be contained in the field designated for the > ID number. The only two problems will be to come up with a program which > will generate the ID number, and one which will be able to find the ids in > all gedcoms and index them. Then, as the html generator runs, it would have > to hit against this index to determine whether or not the id exists in more > than one gedcom. We're still limited by the access time over the Web. I think you are missing the point here. Conformant sites would have a method by which pages could be retrieved from their ID. In other words, if I have hold of an ID from a "Genweb Compliant Database" I should be able to do something like to link to the relevant page. Substitute your favourite combinations of :s, /s, ?s and so on for "http://somewhere.com/ByID/" of course. That's the way I see it, anyway. Cheers, Ben. > > Jeff Murphy 735 NW 8th Redmond, Oregon 97756 h. (541) 548-4478 > Specializing in the genealogy of Muhlenberg Co., Kentucky > USA GenWeb Project: http://www.teleport.com/~jmurphy/states.html > http://www.dsenter.com/lists/states.html to subscribe to mailing lists > -- Ben Laurie Phone: +44 (181) 994 6435 Freelance Consultant and Fax: +44 (181) 994 6472 Technical Director Email: ben@algroup.co.uk A.L. Digital Ltd, URL: http://www.algroup.co.uk London, England. 
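
[Editor's sketch] The ID scheme debated in this thread can be tried out in a few lines. What follows is a minimal sketch, in Python, of John Rigdon's 24-character proposal (surname Soundex + first-name Soundex + birth date + death date). It assumes the common American Soundex rules and YYYYMMDD dates; the function names soundex and make_id are illustrative and do not come from any poster's software. Letters outside A-Z are simply dropped, which is one face of the "English-centric" complaint raised earlier, and Ben Laurie's objection still applies: the ID changes whenever a name or date is corrected.

```python
from datetime import date
from typing import Optional

# Soundex digit for each consonant; vowels, H, W and Y carry no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str, unknown: str = "x") -> str:
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return unknown * 4  # no usable letters: mark the field unknown
    out = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (out + "000")[:4]


def _stamp(d: Optional[date], unknown: str) -> str:
    """Format a date as YYYYMMDD, or eight placeholder characters."""
    if d is None:
        return unknown * 8
    return f"{d.year:04d}{d.month:02d}{d.day:02d}"


def make_id(surname: str, given: str, birth: Optional[date] = None,
            death: Optional[date] = None, unknown: str = "x") -> str:
    """Build the 24-character ID: two Soundex codes plus two dates."""
    return (soundex(surname, unknown) + soundex(given, unknown)
            + _stamp(birth, unknown) + _stamp(death, unknown))


print(make_id("Rigdon", "John", date(1953, 8, 18)))
# prints R235J50019530818xxxxxxxx
```

Passing unknown="?" gives the placeholder George Waller suggested in place of x.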
Date: Thu, 04 Jul 1996 16:32:56 -0400
From: Henry Mendenhall
X-Mailer: Mozilla 3.0b4 (Win95; I)
MIME-Version: 1.0
To: genweb@UCSD.EDU
Subject: auto-decompression with Apache web-server
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

There had been a discussion a week or two back on maintaining GEDCOM data
in compressed form. In case it's of any use to anyone, I thought I'd
post the strategy I use. It reduces the storage charges I pay to my
ISP by more than 50%. (There is a slight increase in the response time
to users retrieving my pages, but the pages are small, and the delay
for decompression is only a fraction of a second.)

As far as I know, the following approach only works if you are working
with a Unix Apache 1.1 server.

In my case, I first compress ("gzip -9") all of the .html files produced
by ged2html.

(This is actually a bit tricky. You have to tell ged2html that it
should use the suffix ".html.gz" in its construction of files so
that links from its "PERSONS.html" file etc. have the correct URLs.
But at that point you've got "html.gz" files that are not compressed,
so you have to rename them all back to just ".html" then finally
do the compression -- taking them to ".html.gz".)

(If this is too confusing, the good news is that this trickiness is only
described here as help for ged2html users. The general technique of
maintaining compressed files on an Apache server will work for anything
you can get into compressed ".gz" form.)

Then in each directory that has compressed files, I put a ".htaccess" file
containing the two lines:

--------------------------------------------------------------------------
AddType application/x-html-gzip gz
Action application/x-html-gzip /cgibin/gzcat-html.cgi
--------------------------------------------------------------------------

(In my case, I would have rather put "html.gz" to make the decompression
specific to only ".html.gz" files. But there's a bug in Apache, so for
now, any file ending in ".gz" will get passed to gzcat-html.cgi. Also,
you'll want to substitute the name of your cgi directory (normally "cgi-bin")
for "cgibin".)

Finally, I installed a "gzcat-html.cgi" script in my cgi directory. (And of
course I set the execution permission bits -- but all you unix-cgi people
knew that already. :-)) The details would differ slightly depending on where
your "gzcat" (or in some cases it's called "zcat") executable sits, but my
5 line version looks like:

-----------------------------------------------------------------------------
#!/bin/sh

. gedcom-defs

# CGI header, then the blank line that ends the headers
echo "Content-type: text/html"
echo

# decompress the requested file to standard output
"${GZipDir}/gzcat" "${PATH_TRANSLATED}"
----------------------------------------------------------------------------

Anyway, that's my approach for maintaining compressed genealogical data.

hope it helps,

-Henry Mendenhall
hhm@mendenhall.org

Date: 04 Jul 96 16:49:44 EDT
From: N Oughtibridge <100020.1117@CompuServe.COM>
To: GENWEB List
Subject: uFTi now at Penn State
Message-ID: <960704204944_100020.1117_EHV159-1@CompuServe.COM>

The 32 bit version of uFTi, my GEDCOM to HTML program, is now available at
Penn State. The 16 bit version will be available very soon.

I suggest reading the pages on the homepage,
http://ourworld.compuserve.com/homepages/oughtibridge/ufti.htm
before downloading.

The URL for uFTi32.ZIP is:
ftp://ftp.cac.psu.edu/pub/genealogy/windows/ufti32.zip

If you have any comments or suggestions for improvements, please let me
know. I would also appreciate it if you would let me know if you are
successful and like the results.

Nicholas Oughtibridge
_________________
Nicholas Oughtibridge is the author of uFTi, a Windows program to generate
World Wide Web pages from GEDCOM files.
See HTTP://ourworld.compuserve.com/homepages/oughtibridge Email 100020.1117@compuserve.com Subject: Re: auto-decompression with Apache web-server To: Henry Mendenhall Date: Thu, 4 Jul 1996 22:52:15 +0100 (BST) From: Ben Laurie Cc: genweb@UCSD.EDU In-Reply-To: <31DC2A77.3AC4@cyberenet.net> from "Henry Mendenhall" at Jul 4, 96 04:32:56 pm Reply-To: ben@algroup.co.uk X-Mailer: ELM [version 2.4 PL24 PGP2] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID: <9607042252.aa25777@gonzo.ben.algroup.co.uk> Henry Mendenhall wrote: > > There had been a discussion a week or two back on maintaining GEDCOM data > in compressed form. In case it's of any use to anyone, I thought I'd > post the strategy I use. It reduces the storage charges I pay to my > ISP by more than %50. (There is a slight increase in the response time > to users retrieving my pages, but the pages are small, and the delay > for decompression is only a fraction of a second.) > > As far as I know, the following approach only works if you are working > with a Unix Apache 1.1 server. > > In my case, I first compress ("gzip -9") all of the .html files produced > by ged2html. > > (This is actually a bit tricky. You have to tell ged2html that it > should use the suffix ".html.gz" in its construction of files so > that links from its "PERSONS.html" file etc. have the correct URLs. > But at that point you've got "html.gz" files that are not compressed, > so you have to rename them all back to just ".html" then finally > do the compression -- taking them to ".html.gz". > > (If this is too confusing, the good news is that this trickiness is only described > here as help for ged2html users. The general technique of maintaining > compressed files on an Apache server will work for anything you can > get into compressed ".gz" form.) 
> > Then in each directory that has compressed files, I put a ".htaccess" file > containing the two lines: > > -------------------------------------------------------------------------- > AddType application/x-html-gzip gz > Action application/x-html-gzip /cgibin/gzcat-html.cgi > -------------------------------------------------------------------------- > > (In my case, I would have rather put "html.gz" to make the decompression > specific to only ".html.gz" files. But there's a bug in Apache, so for > now, any file ending in ".gz" will get passed to gzcat-html.cgi. Also, > you'll want to substitute the name of your cgi directory (normally "cgi-bin") > for "cgibin".) Hey! If there's a bug in Apache we (that is, me and the rest of the Apache Group) want to hear about it. We have just released 1.1. I don't like to discover that there are known bugs in it. Cheers, Ben. > Finally, I installed a "gzcat-html.cgi" script in my cgi directory. (And of > course I set the execution permission bits -- but all you unix-cgi people > knew that already. :-)) The details would differ slightly depending on where your > "gzcat" (or in some cases it's called "zcat") executable sits, but my 5 line version > looks like: > > > ----------------------------------------------------------------------------- > #!/bin/sh > > . gedcom-defs > > echo Content-type: text/html > echo > > ${GZipDir}/gzcat ${PATH_TRANSLATED} > ---------------------------------------------------------------------------- > > Anyway, that's my approach for maintaining compressed genealogical data. > > hope it helps, > > -Henry Mendenhall > hhm@mendenhall.org -- Ben Laurie Phone: +44 (181) 994 6435 Freelance Consultant and Fax: +44 (181) 994 6472 Technical Director Email: ben@algroup.co.uk A.L. Digital Ltd, URL: http://www.algroup.co.uk London, England. 
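For readers who would rather not depend on a gzcat binary, the handler idea in the quoted script can be sketched in Python. This is not Henry's script: the function name render_gzipped_html is made up for illustration, and the sketch assumes, as the shell version does, that the server hands the path of the compressed file to the handler in the PATH_TRANSLATED environment variable.

```python
#!/usr/bin/env python3
"""Sketch of a CGI handler that serves a .html.gz file as plain HTML."""
import gzip
import os
import sys


def render_gzipped_html(path):
    """Return the full CGI response (header, blank line, body) as bytes."""
    with gzip.open(path, "rb") as compressed:
        body = compressed.read()
    # A CGI response is a header block, one empty line, then the document.
    return b"Content-type: text/html\n\n" + body


if __name__ == "__main__":
    # Assumption: the server sets PATH_TRANSLATED to the requested file,
    # as in the shell version quoted above. Do nothing if it is unset.
    target = os.environ.get("PATH_TRANSLATED")
    if target:
        sys.stdout.buffer.write(render_gzipped_html(target))
```

The response is built as bytes so that the decompressed page is passed through untouched, whatever its character set.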
Date: Thu, 04 Jul 1996 19:54:39 -0400
From: Henry Mendenhall
X-Mailer: Mozilla 3.0b4 (Win95; I)
MIME-Version: 1.0
To: ben@algroup.co.uk
CC: genweb@UCSD.EDU
Subject: Re: auto-decompression with Apache web-server
References: <9607042252.aa25777@gonzo.ben.algroup.co.uk>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Ben Laurie wrote:
>
> Henry Mendenhall wrote:
> > ...
> > (In my case, I would have rather put "html.gz" to make the decompression
> > specific to only ".html.gz" files. But there's a bug in Apache, so for
> ...
> Hey! If there's a bug in Apache we (that is, me and the rest of the Apache
> Group) want to hear about it. We have just released 1.1. I don't like to
> discover that there are known bugs in it.
>
> Cheers,
>
> Ben.
> ...

Ben,

I didn't mean to offend. I certainly appreciate the efforts of the Apache team. After all, without the Apache server and its advanced features, I wouldn't have a decent way to do the compression.

Maybe I should have used the word "problem"? See , where it says

-------------------------------------------------------
>... if you know of other
> non-fatal problems that belong here, let us know.
>
> Please also check the known bugs page.
>
> 1. AddType only accepts one file extension per line, without any dots (.) in the
>    extension, and does not take full filenames. If you need multiple extensions
>    per type, use multiple lines, e.g.
>
>      AddType application/foo foo
>      AddType application/foo bar
>
>    To map .foo and .bar to application/foo
-------------------------------------------------------

As far as I can tell, this means I can't say "AddType application/x-compressed-html html.gz", which is what I really want so that only files like "PERSONS.html.gz" would get decompressed.

-henry
hhm@mendenhall.org

P.S. Ben, I'm only cc'ing the genweb list on this reply so that I can publicly soften the word "bug" to "problem" (or maybe even "issue").
If you want to get into more detail about this, let's take the discussion off-line.

From: JohnR238@aol.com
Received: by emout16.mail.aol.com (8.6.12/8.6.12) id UAA22490 for genweb@ucsd.edu; Thu, 4 Jul 1996 20:15:33 -0400
Date: Thu, 4 Jul 1996 20:15:33 -0400
Message-ID: <960704201533_349098942@emout16.mail.aol.com>
To: genweb@UCSD.EDU
Subject: Re: Proposed unique ID

In a message dated 96-07-04 00:05:31 EDT, you write:

<< John Rigdon makes a very good proposal below. I would expand his ID with four more elements: Soundex for middle name, place of birth, place of death, and gender. So I would be: W525G333A12519460508IN??????????M (making up the soundex codes, lazy fellow that I am :-) Suggest ?? for unknown rather than XX >>

and later - of course there should be a country, state code.

George's addition then gives us a 45 character key and would uniquely identify almost anyone, even with incomplete data. I'm a bit concerned with the length this thing has become, but it's certainly not impossible to work with given the current technology.

Would the country and state/province be the place of birth? - And are we becoming too comprehensive and representing a person's data, rather than just generating a unique key?

I'm all for using a better soundex code. Can anyone elaborate on this? I know that AOL's designated screen names "soundex" to a number R238 for my last name rather than R235, but I've not seen any other than soundex numbers through 5.

I've got about 3 million names in a file now. I think I'll run it against the file and see what kind of duplicates I get.
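A minimal sketch of the kind of duplicate check described here, for anyone who wants to try it on their own data. It is not John's program: it assumes the 24-character layout from the original proposal (surname Soundex, given-name Soundex, birth date, death date), the classic Russell Soundex rules, and "?" as the fill character for unknown fields. The names soundex, rigdon_key and duplicate_keys are made up for illustration.

```python
from collections import Counter

# Classic Russell Soundex letter groups; vowels, H, W and Y carry no code.
CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name):
    """Return the 4-character Soundex code, or "" if the name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = code
    return (result + "000")[:4]


def rigdon_key(surname, given, birth=None, death=None, unknown="?"):
    """Build the 24-character key; dates are YYYYMMDD strings or None."""
    return "".join([
        soundex(surname) or unknown * 4,
        soundex(given) or unknown * 4,
        birth or unknown * 8,
        death or unknown * 8,
    ])


def duplicate_keys(people):
    """Map each key shared by more than one (surname, given, birth, death) record to its count."""
    counts = Counter(rigdon_key(*person) for person in people)
    return {key: n for key, n in counts.items() if n > 1}
```

With these assumptions, rigdon_key("Rigdon", "John", "19530818") gives R235J50019530818????????, which matches the example in the original proposal apart from the fill character. Records such as Smith/John and Smyth/Jon with the same birth date land on the same key, which is the kind of collision a run against a large file would count.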
John Rigdon

From: JohnR238@aol.com
Received: by emout17.mail.aol.com (8.6.12/8.6.12) id UAA01119 for genweb@ucsd.edu; Thu, 4 Jul 1996 20:15:11 -0400
Date: Thu, 4 Jul 1996 20:15:11 -0400
Message-ID: <960704201509_349098965@emout17.mail.aol.com>
To: genweb@UCSD.EDU
Subject: Re: Proposed unique ID

In a message dated 96-07-04 13:34:59 EDT, George Waller wrote

<< The way I think of this is that we want to be able to link online databases using a unique ID. My sense is that the IDs cannot be fixed (as you indicate below). So, the linking mechanism (i.e. the URL in conjunction with the browser) needs to be somewhat intelligent. >>

We are dealing with two different issues here. We need to (1) uniquely identify each person and then (2) represent where the data exists for that person. As was noted in an earlier thread, the data representation will for the foreseeable future consist of not just URLs (virtual objects) but may be extended to designate actual objects in the real world. I think we have agreed that the current ISBN / LOC / BATCH-PAGE or other numbering schemes are inadequate.

Regardless of the method used to designate the location of "objects", that file needs to be maintained separately from the individual's id so that when an object moves, we don't end up with ambiguous or invalid individual ids.

So we'll end up with a code for each individual that is used in a second lookup table to find the current location of the object(s).

John Rigdon

Date: Thu, 4 Jul 1996 20:41:40 -0400 (EDT)
Message-Id: <199607050041.UAA25343@Nimbus.CAM.ORG>
X-Sender: beaur@pop.hip.cam.org
X-Mailer: Windows Eudora Version 1.4.4
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: genweb@UCSD.EDU
From: beaur@CAM.ORG (Denis Beauregard)
Subject: Unique ID

There are many things I would like to observe.

Why an ID?

1- To match 2 persons without human intervention
2- To travel from GEDCOM 1 at site 1 to GEDCOM 2 at site 2

For me, 2 is the target.
In this case, the ID should not be based on an attempt to perform automatic matches between persons, but on a UNIQUE way of defining that person, so that 2 genealogists entering data about the same person with sufficient data would always match; if there is not enough data, an automatic match should not be attempted. An indirect handle could be used when the data is not enough.

Minimum match: full name tag, birth place tag and date
Alternative minimum match: a married couple with both full name tags, marriage place tag and date

If this data is not found, then relative matching should be used. XXX could be labeled as the son of a fully-tagged couple. But the automatic GEDCOM link should not work for a relative match when confusion is possible (i.e. parents of fully-tagged XXX is a correct link, but children of fully-tagged parents is not a correct link).

Now, what is a full name tag? IMO, it would be a unique standard name for a person name, independent of the language. For example, Dennis, Denis and Dionisio would have the same first name tag. Ditto for Elizabeth, Elisabeth and Betsy. Ditto for family names. Smith, Smythe and Schmidt would have the same family tag.

Why this and not a SOUNDEX? Soundex is a sorting help, not a matching help. Also, PRDH (a project of data entry for all Quebec vital records before 1800) is using this system after trying SOUNDEX and derivative systems.

So, what we need now is:

A central database with all possible first and family names, for a given context (the language in my opinion). That database would be maintained to have in column 1 any first or family name, and in column 2 the tag for that name, i.e.

Daniel   123
Daniels  123
David    124
Denis    125
Dennis   125
Denys    125

There are a few cases where a name could have 2 handles, so a priority should be used (lower tag?). Also, the tag should be prefixed with a language ID, that ID being actually that of the database for that language. We may define a minimum population of 50 million to define such a database, and a language population smaller than that would be expected to be linguistically near another one and to use the same tags (but with the proper names), i.e. northern Slavic countries of Europe could use the same database; Catalans could use that of Spanish/Castellano.

As for places, there are already codes for places that exist for many specific uses. For example, there are codes for towns in France (INSEE), in each Canadian province (used for census, and 2 letters must be added for the province), in each US county. Those codes are not enough. They should cover the place from record (religious parish, civil place name, etc.). A format should be defined.

I will now take my own example to see what the code would be.

Full data: Denis Beauregard, born 16 MAY 1956 in Lachine, QC

Suppose the standard codes are:

Denis        FR1234       (given name based on the French names database)
Beauregard   FR4567       (family name based on the French names database)
Lachine,QC   QC65090.000  (Quebec town no. 65090)

My full and unique ID would be:

FR1234FR456719560516QC65090.000

Denis

### Denis Beauregard, genealogiste amateur, Internet: beaur@cam.org
### Page web de genealogie: http://www.cam.org/~beaur/gen/index.html
### Genealogy Web page: http://www.cam.org/~beaur/gen/welcome.html
### Sujets: Quebec, France, Acadie, experts francophones, etc.
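The construction in the message above can be sketched as follows. This is only an illustration under stated assumptions: the lookup tables are tiny placeholders filled with the tag values from the worked example (no such central database exists yet), and the names GIVEN_TAGS, FAMILY_TAGS, PLACE_CODES, gedcom_date_to_ymd and unique_id are invented for the sketch. Following the "minimum match" rule, no ID is produced when any element is missing.

```python
# Month abbreviations as they appear in GEDCOM dates such as "16 MAY 1956".
MONTHS = {m: i for i, m in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# Placeholder tables: spelling variants share one tag, as proposed.
GIVEN_TAGS = {"denis": "1234", "dennis": "1234", "denys": "1234"}
FAMILY_TAGS = {"beauregard": "4567"}
PLACE_CODES = {"Lachine, QC": "QC65090.000"}


def gedcom_date_to_ymd(text):
    """Convert '16 MAY 1956' to '19560516'; return None for partial dates."""
    parts = text.upper().split()
    if len(parts) != 3 or parts[1] not in MONTHS:
        return None
    day, month, year = parts
    if not (day.isdigit() and year.isdigit()):
        return None
    return f"{int(year):04d}{MONTHS[month]:02d}{int(day):02d}"


def unique_id(given, family, birth_date, birth_place, lang="FR"):
    """Return the ID, or None when the minimum-match data is incomplete."""
    given_tag = GIVEN_TAGS.get(given.lower())
    family_tag = FAMILY_TAGS.get(family.lower())
    date = gedcom_date_to_ymd(birth_date)
    place = PLACE_CODES.get(birth_place)
    if None in (given_tag, family_tag, date, place):
        return None  # not enough data: no automatic match is attempted
    return f"{lang}{given_tag}{lang}{family_tag}{date}{place}"
```

Called as unique_id("Denis", "Beauregard", "16 MAY 1956", "Lachine, QC"), this yields FR1234FR456719560516QC65090.000, and the spelling "Dennis" yields the same ID because both spellings share a tag. A birth date given only as "1956" yields no ID at all.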
From: mavrogeorge@genealogysf.com
Received: from [205.162.14.124] (sf-124.sfo.com) by relay.interserv.com with SMTP id AA15036 (5.67b/IDA-1.5 for genweb@ucsd.edu); Thu, 4 Jul 1996 17:46:31 -0700
Date: Thu, 4 Jul 1996 17:46:31 -0700
Message-Id: <199607050046.AA15036@relay.interserv.com>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Subject: soundex
To: genweb@UCSD.EDU
X-Mailer: SPRY Mail Version: 04.00.06.21

I propose that the soundex code be composed of two elements - a character indicating which soundex method was used to generate the soundex, followed by the code itself. Then, rather than get into religious wars over which method is best, developers can continue with the old Russell one or the more modern ones like Metaphone, or the ATT routine, or Daitch-Mokotoff, or the one for French, etc.

From: mavrogeorge@genealogysf.com
Received: from [205.162.14.124] (sf-124.sfo.com) by relay.interserv.com with SMTP id AA15168 (5.67b/IDA-1.5 for genweb@ucsd.edu); Thu, 4 Jul 1996 17:52:26 -0700
Date: Thu, 4 Jul 1996 17:52:26 -0700
Message-Id: <199607050052.AA15168@relay.interserv.com>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Subject: Re: Proposed unique ID
To: genweb@UCSD.EDU
In-Reply-To: <960704201533_349098942@emout16.mail.aol.com>
X-Mailer: SPRY Mail Version: 04.00.06.21

On Thu, 4 Jul 1996, JohnR238@aol.com wrote:
>I'm all for using a better soundex code. Can anyone elaborate of this?

I have a soundex tutorial at http://www.genealogysf.com; it explains the Russell coding, Daitch-Mokotoff, Guth, Metaphone, and other algorithms.

Date: Thu, 04 Jul 1996 19:23:52 -0700
To: genweb@UCSD.EDU
From: Jeff Murphy
Subject: Re: Proposed unique ID
Cc: GEDCOM-L@LISTSERV.NODAK.EDU

At 07:55 PM 7/4/96 +0100, Ben Laurie wrote:

>> ID number. The only two problems will be to come up with a program which
>> will generate the ID number, and one which will be able to find the ids in
>> all gedcoms and index them. Then, as the html generator runs, it would have

>I think you are missing the point here. Conformant sites would have a method
>by which pages could be retrieved from their ID. In other words, if I have hold
>of an ID from a "Genweb Compliant Database" I should be able to do something
>like to link to the
>relevant page....

Well, yes, but how do you identify which "id goes here"? Somehow, we must identify which other individuals in other databases match the ones in our own database. We can't very well point to them if we don't know who they are. And dynamically generated html must be able to do it on the fly, unless that information is somehow provided. So in either case, there must be a process to identify matches in other databases over the web.

One thing that will *not* work is some kind of preprocessor which each person runs against their online database. Why? Because as with the ID number, the gedcom (or more specifically, the available gedcom generators) have not provided for it.

The LDS have processed 280 million names in their temples. I believe that most of them will be found in the LDS Ancestral File. Each of them has a unique id. Granted, there are a number of duplicates in that file, but each has a unique id, and they do it in 6 characters. Why not use that unique id for those individuals who already have them? It should work as well as any 33 character id we come up with.

Let's assume, hypothetically, that the assignment of the id is not a problem. Let us further assume that we have in place a mechanism to generate a unique id for individuals on the web, and that this mechanism involves a master index showing the individual and the databases in which he/she occurs. What would be a working procedure to use the index? The question almost answers itself.

I grew up in data processing in the era of Warnier-Orr and output-oriented design. Once you know the output that is desired, it is a relatively simple matter to design a system which accounts for the processing required to produce the output, and therefore the input is also defined.

In our case, the output is obvious: we want to be able to display to the user each occurrence of an individual in all the databases on the Web.

So, what are the possible procedures that can get that for us?

1. We could have a program that replaces the existing gedcom generator for our genealogy software. As it generates the gedcom, it takes the id number *or* the AFN and hits against the master index database, pulls in all the matches, and inserts the necessary code into the gedcom. The html generator then has only to read the gedcom, and no further external processing has to be performed.

2. We could require that all the information regarding links be included in the notes using something like the LINK command, and the html generators would have to do the work of searching the index, finding the matches, and including them in the html.

3. The magic is done with Java, Applets, and Cotlets. (Right, I have no idea what I'm talking about.)

And where do we want to place the responsibility for doing the real work?

Everyone I've talked to about this, who has existing software, has said that they would expect the user to define the links, then pass the information to the html generator. But this is not an efficient use of the user's time. Without a new html generator that is willing to take on this responsibility, I think that we will all agree that the optimum place for this definition to occur would be prior to the generation of the html. I also think that it would best take place before the generation of the gedcom. (Only two choices: before or after.)

So, if we create the links before the generation of the gedcom, we must decide how the links are created.
Should each individual be responsible for searching out the duplicates, or do we create the hypothetical master index, maintain it, and hit against it? A manual or a computerized solution?

I believe that only option 1, above, holds any promise for being able to meet the needs of a truly global solution to the problem.

Jeff Murphy 735 NW 8th Redmond, Oregon 97756 h. (541) 548-4478
Specializing in the genealogy of Muhlenberg Co., Kentucky
USA GenWeb Project: http://www.teleport.com/~jmurphy/states.html
http://www.dsenter.com/lists/states.html to subscribe to mailing lists

Date: Fri, 5 Jul 1996 01:13:27 -0400 (EDT)
Message-Id: <199607050513.BAA11635@Nimbus.CAM.ORG>
X-Sender: beaur@pop.hip.cam.org
X-Mailer: Windows Eudora Version 1.4.4
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: mavrogeorge@genealogysf.com
From: beaur@CAM.ORG (Denis Beauregard)
Subject: Re: Unique ID
Cc: genweb@UCSD.EDU

>On Thu, 4 Jul 1996, beaur@CAM.ORG (Denis Beauregard) wrote:
>>Now, what is a full name tag ? IMO, it would be a unique
>>standard name for a person name, independent of the language.
>
>How do we handle English names which may or may not have
>equivalence to the native language name. For example there isn't
>general agreement on the English equivalents of some Greek first
>names (drive me nuts!). You can have the one English language
>first name that different people will say maps to two different
>Greek first names.
>
>Your idea seems based on the idea that there is a one to one
>relationship between a name and its "common" name.

Not really. I proposed a prefix to indicate the language so that each names database would be independent. There is obviously some possibility of a common subset for first names (i.e. Dennis, Denis, Dionisio), and another set for untranslatable names. The advantage is that if someone changes country and genealogists translate the name (which is the case, for example, in a famous Detroit book about French settlers where all names are translated into English), then the code would be the same for the common subset. But I didn't think of a common subset when I wrote the message (perhaps my example was not a good choice...)

Denis

### Denis Beauregard, genealogiste amateur, Internet: beaur@cam.org
### Page web de genealogie: http://www.cam.org/~beaur/gen/index.html
### Genealogy Web page: http://www.cam.org/~beaur/gen/welcome.html
### Sujets: Quebec, France, Acadie, experts francophones, etc.

Subject: Re: Proposed unique ID
To: Jeff Murphy
Date: Fri, 5 Jul 1996 15:34:43 +0100 (BST)
From: Ben Laurie
Cc: genweb@UCSD.EDU, GEDCOM-L@listserv.nodak.edu
In-Reply-To: <1.5.4.32.19960705022352.008ec36c@mail.teleport.com> from "Jeff Murphy" at Jul 4, 96 07:23:52 pm
Reply-To: ben@algroup.co.uk
X-Mailer: ELM [version 2.4 PL24 PGP2]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Message-ID: <9607051534.aa28137@gonzo.ben.algroup.co.uk>

Jeff Murphy wrote:
>
> At 07:55 PM 7/4/96 +0100, Ben Laurie wrote:
>
> >> ID number. The only two problems will be to come up with a program which
> >> will generate the ID number, and one which will be able to find the ids in
> >> all gedcoms and index them. Then, as the html generator runs, it would have
>
> >I think you are missing the point here. Conformant sites would have a method
> >by which pages could be retrieved from their ID. In other words, if I have hold
> >of an ID from a "Genweb Compliant Database" I should be able to do something
> >like to link to the
> >relevant page....
>
> Well, yes, but how do you identify which "id goes here"? Somehow, we must
> identify which other individuals in other databases match the ones in our
> own database. We can't very well point to them if we don't know who they
> are.
And dynamically generated html must be able to do it on the fly, > unless that information is somehow provided. So in either case, there must > be a process to identify matches in other databases over the web. > > One thing that will *not* work is some kind of preprocessor which each > person runs against their online database. Why? Because as with the ID > number, the gedcom (or more specifically, the available gedcom generators) > have not provided for it. > > The LDS have processed 280 million names in their temples. I believe that > most of them will be found in the LDS Ancestral File. Each of them have a > unique id. Granted, there are a number of duplicates in that file, but each > has a unique id, and they do it in 6 characters. Why not use that unique id > for those individuals who already have them? It should work as well as any > 33 character id we come up with. > > Let's assume, hypothetically, that the assignment of the id is not a > problem. Let us further assume that we have in place a mechanism to > generate a unique id for individuals on the web, and that this mechanism > involves a master index showing the individual and the databases in which > he/she occurs. What would be a working procedure to use the index? The > question almost answers itself. > > I grew up in data processing in the era of Warnier-Orr and output-oriented > design. Once you know the output that is desired, it is a relatively simple > matter to design a system which accounts for the the processing required to > produce the output, and therefore the input is also defined. > > In our case, the output is obvious: we want to be able to display to the > user each occurrence of an individual in all the databases on the Web. > > So, what are the possible procedures that can get that for us? > > 1. We could have a program that replaces the existing gedcom generator for > our genealogy software. 
As it generates the gedcom, it takes the id number > *or* the AFN and hits against the master index database, pulls in all the > matches, and inserts the necessary code into the gedcom. The html generator > then has only to read the gedcom, and no further external processing has to > be performed. > > 2. We could require that all the information regarding links be included in > the notes using some like the LINK command, and the html generators would > have to do the work of searching the index, finding the matches, and > including them in the html. > > 3. The magic is done with Java, Applets, and Cotlets. (Right, I have no > idea what I'm talking about.) > > > And where do we want to place the responsibility for doing the real work? > > Everyone I've talked to about this, who has existing software, has said that > they would expect the user to define the links, then pass the information to > the html generator. But this is not an efficient use of the user's time. > Without a new html generator that is willing to take on this responsibility, > I think that we will all agree that the optimum place for this definition to > occur would be prior to the generation of the html. I also think that it > would best take place before the generation of the gedcom. (Only two > choices: before or after.) > > So, if we create the links before the generation of the gedcom, we must > decide how the links are created. Should each individual be responsible for > searching out the duplicates, or do we create the hypothetical master index, > maintain it, and hit against it? A manual or a computerized solution? > > I believe that only option 1, above, holds any promise for being able to > meet the needs of a truly global solution to the problem. I don't really see the difference between 1 & 2 above, except for timing. I don't understand why option 2 can't use the same source for the ID as option 1, either. 
Also, if links are to be up-to-date, they need to be generated at the moment the HTML is accessed. If there were a central, global database of UIDs (not an idea I'm hugely in favour of) then the obvious thing to do is to point at that database from your HTML, and have that database generate the current set of links. I don't actually see the advantage of a single database - so long as UIDs can be allocated in a way which ensures global uniqueness (which they can) and a mapping is available from a UID to the server which maintains that UID, then a distributed system would seem both practical and sensible. Cheers, Ben. > > Jeff Murphy 735 NW 8th Redmond, Oregon 97756 h. (541) 548-4478 > Specializing in the genealogy of Muhlenberg Co., Kentucky > USA GenWeb Project: http://www.teleport.com/~jmurphy/states.html > http://www.dsenter.com/lists/states.html to subscribe to mailing lists > -- Ben Laurie Phone: +44 (181) 994 6435 Freelance Consultant and Fax: +44 (181) 994 6472 Technical Director Email: ben@algroup.co.uk A.L. Digital Ltd, URL: http://www.algroup.co.uk London, England. Date: Fri, 05 Jul 1996 12:45:39 -0700 From: Pam Carey X-Mailer: Mozilla 2.0 (Win16; I) MIME-Version: 1.0 To: genweb@UCSD.EDU Subject: Re: Indexing References: <1.5.4.32.19960704180403.008cd358@mail.teleport.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Jeff Murphy wrote: > > At 10:43 AM 7/4/96 -0700, Pam Carey wrote: > >... 3) System Requirements 4) Static -vs- > >Dynamic (i.e. storage space req.) .... ...... snip ........ > Another question I had was regarding indexes. I would like to see the > necessary index entries for both Gene Stark's and John Rigdon's indexes > generated by whatever software product we end up choosing to use for > the US GenWeb Project. Can you list what indexes, if any, are produced > by the various software? 
Since I'm working on both 1) expanding the list to include features (one of which will be the index file produced) and 2) testing each program on a very small gedcom I've created, I think the answer to this one will have to wait until I've had time to complete both. I'll list the index produced by the program as a feature, and then provide both Gene and John with access to the actual index file created, so they can test it themselves with their indexing programs. I don't have the capability to do that part.

Pam

Date: Fri, 05 Jul 1996 21:19:41 -0700
To: genweb@UCSD.EDU
From: Jeff Murphy
Subject: Test database

John,

I am pretty sure that, as with our various county and state projects, unless we start with a basic database, we will be unable to get anything going regarding the development of an interactive system to identify duplicate individuals. So I am going to propose that we begin with a set of 4 databases where I know duplicates exist. I believe they all used PAF to create the gedcom; we will have to confirm that.

To put this into effect, we have to

1) adopt an algorithm which generates the individual web code; I suggest we use yours as currently defined, and alter as we see the need
2) write a program to hit those 4 gedcoms and extract the web codes to an indexed file
3) write a program to scan the index looking for duplicates, and generate html so we can bounce from one database to another looking at the data.

This will give us an idea as to how good the index algorithm is, since I know where most of the duplicates are between those databases.

The four databases, containing about 110,000 names, are at:

Dave Edwards
Bill Couch
Jeff Murphy
Connie Mack Crawley

I am assuming we can define the contents of the index file as

1-50    individual's web index key
51-150  URL to individual in database

In my case, the following code could be used to get to me:
  • and I believe Dave Edwards' data would be at
  • The INDEX= parameter refers to the individual's record identification number. I am not quite sure how to access the other two databases; perhaps someone will suggest a way. Jeff Murphy 735 NW 8th Redmond, Oregon 97756 h. (541) 548-4478 Specializing in the genealogy of Muhlenberg Co., Kentucky USA GenWeb Project: http://www.teleport.com/~jmurphy/states.html http://www.dsenter.com/lists/states.html to subscribe to mailing lists Date: Sat, 06 Jul 96 18:31:11 0200 From: "alf.christophersen" Organization: Universitet i Oslo X-Mailer: Mozilla 1.1N (Windows; I; 16bit) MIME-Version: 1.0 To: durp@one.net, genweb@UCSD.EDU Subject: New style on the Gedcom to html review homepage Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii I have added some information in both English and Norwegian and added two forms to be used for updating the information on the page. The form for adding new programs is http://www.sn.no/disnorge/htmladd.htm and for corrections it is http://www.sn.no/disnorge/htmlcorr.htm Please take a look on our page. We hope to be able to add more english text information during the summer. -------------------------------------------------------------------- Alf Christophersen, Computer engineer Nordic School of Nutrition, PO Box 1046, Blindern, N-0316 Oslo Norway Tel. +47 22 85 13 27 Roots-L@mail.eworld.com list owner Editor of 'Slekt og Data', Quarterly organ of DIS-Norge, PO Box 146, Manglerud N-0612 Oslo URL: http://www.uio.no/~achristo soc.genealogy.nordic, no.slekt and no.slekt.programmer proponent >From home account Date: Sun, 07 Jul 1996 03:00:19 -0700 From: Pam Carey X-Mailer: Mozilla 2.0 (Win16; I) MIME-Version: 1.0 To: genweb@UCSD.EDU Subject: Update on progress of 'The List' Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Work is continuing on the list of GEDCOM to HTML converters. I'm doing this in 3 steps. Step one is complete. The list has been made available to all, thanks to Alf Christophersen. 
If you don't already know, the URL is http://www.sn.no/disnorge/htmlgeds.htm If you have or know of any programs not on the list yet, you can now submit them by visiting http://www.sn.no/disnorge/htmladd.htm New entries are still being accepted. Step 2 is in the works. I've compiled a list of features, filled in what I could, and polled all the authors to supply any missing information. Once I've received all the information, this will be added to the list. No time estimate yet. It depends on how long it takes to hear back from everyone. The responses so far have been very enlightening! *** Note: the list of features is not on the submission form yet, so if you do use that method to submit a new entry, please be sure to drop me a line at durp@one.net so that I can get the list to you for completion. Step 3 is to test each program by running them on a small gedcom I created for this purpose. The storage size requirements for each will be reported and published on the list page. I'm not sure yet whether Gene Stark and John Rigdon will test their indexing programs against them, but that is also a possibility. For now, while things are still 'in the works', I encourage all to check the list page often. I'll make an announcement as each step is completed, but it's possible that there could be new entries all through the process. Thanks for your patience. For me, it's back to work. Pam Carey Durstock KY GenWeb Project Date: Sun, 7 Jul 1996 18:02:46 -0600 From: smcgee@sol.slcc.edu (Scott McGee (Personal)) Message-Id: <9607080002.AA05085@sol.slcc.edu.> To: durp@one.net, genweb@UCSD.EDU Subject: Re: GED >> HTML Generator List Cc: kygenweb-l@teleport.com I have been online intermittently lately and so must have missed the information gathering phase of the GED>>HTML list. I have two entries, both based on Tom Wetmore's LifeLines and custom reports. The first is my GenWeb Server which is a dynamic system producing reports on demand directly from the LifeLines database. 
It supports inclusion (automatic) of notes and allows embedding HTML in the database to be placed in the generated pages. It supports an image using a custom GEDCOM scheme I developed which is similar to the final GEDCOM DRAFT for Multimedia (I need to modify the reports to handle the draft method and then start changing my databases over.) This system already produces Gene Stark's GenDex form indexing info and can be adapted to produce others. My other software package (consisting of a lot of the same code, so producing pages very much like the GenWeb Server) takes a LL database and dumps HTML (static) pages. It supports most of what I mention above for the other package except the "on demand" part. It does not produce the ancestor and descendant pages (though it could - mostly a storage space issue). It can generate pages for an entire database, or can allow selection of a related set of people with inclusion of ancestors, descendants, and related lines to a user-specified distance if desired. Since neither operates directly on GEDCOM, it can be debated if they should be included, but since both require LifeLines anyway, it is a simple single extra step to load the GEDCOM into a LifeLines database before using them. Both are available for free. Go to my Genealogy Page at http://www.genealogy.org/~smcgee/ for more info. Scott PS Both are available to others for continued development or customization too. GENEALOGY | Do you know who your ancestors are? | Scott McGee -----------+---------------------------------------+--------------------- email: smcgee@genealogy.org | What? Me speak for web: http://genealogy.org/~smcgee/homepage.html | someone else? Nah!
---------------------------------------------------+--------------------- See my genealogy page at http://genealogy.org/~smcgee and my GenWeb page at http://genealogy.org/~smcgee/genweb From: mavrogeorge@genealogysf.com Received: from [207.33.216.26] (sf-026.sfo.com) by relay.interserv.com with SMTP id AA19428 (5.67b/IDA-1.5 for genweb@ucsd.edu); Sun, 7 Jul 1996 18:55:09 -0700 Date: Sun, 7 Jul 1996 18:55:09 -0700 Message-Id: <199607080155.AA19428@relay.interserv.com> Mime-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 7bit Subject: Re: GED >> HTML Generator List To: genweb@UCSD.EDU In-Reply-To: <9607080002.AA05085@sol.slcc.edu.> X-Mailer: SPRY Mail Version: 04.00.06.21 On Sun, 7 Jul 1996, smcgee@sol.slcc.edu (Scott McGee (Personal)) wrote: >Since neither operates directly on GEDCOM, it can be debated if they should >be included, Nice to see your messages again. She told me she was concentrating on "stand alone" HTML generators. Your neat processes with LifeLines seem to fall into the category (unnamed) that includes Family Gatherings and Roots IV - HTML generators in or associated with genealogy software. Date: Sun, 7 Jul 1996 22:10:40 -0400 Message-Id: <9607080210.AA18006@mvjok.mv.att.com> Original-From: ttw@mvjok.mv.att.com To: mavrogeorge@genealogysf.com Cc: genweb@UCSD.EDU Subject: Re: GED >> HTML Generator List Content-Type: text LifeLines can easily serve as a stand alone HTML generator. The steps: Run LifeLines. When asked for the database to open, create a new one. Read in a GEDCOM file. Run any HTML generating program. Quit LifeLines. Remove the database. Write that up as a single script program and LifeLines is a stand alone HTML (or any other kind of output file) generator. Might be like opening a walnut with a sledge hammer, but if the shoe fits. 
Yours, Tom Wetmore, ttw@shore.net From list-relay@UCSD.EDU Sun Jul 7 20:06:45 1996 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id UAA00965 for ; Sun, 7 Jul 1996 20:06:44 -0700 Received: from none.at.helo (bl-5.rootsweb.com [204.212.38.21]) by UCSD.EDU (8.7.5/8.6.9) with SMTP id TAA16980 for ; Sun, 7 Jul 1996 19:59:33 -0700 (PDT) Received: from bl-5.rootsweb.com (leverich@localhost [127.0.0.1]) by bl-5.rootsweb.com (8.6.12/8.6.9) with ESMTP id UAA14233; Sun, 7 Jul 1996 20:01:39 -0700 Message-Id: <199607080301.UAA14233@bl-5.rootsweb.com> To: genweb@UCSD.EDU cc: "Dr. Brian Leverich" reply-to: "Dr. Brian Leverich" Subject: Swap Genealogy Newsfeeds? Date: Sun, 07 Jul 1996 20:01:39 -0700 From: Brian Leverich Sorry about being off-topic, but GENWEB seems to be one of the spots where technically hairy folk hang out ... If anyone is a sysadmin at a site with a good Usenet newsfeed (or is friendly with such a sysadmin ... ), RootsWeb would very much like to swap a newsfeed with you. We are the host site of the s.g.african, s.g.methods, and (soon) s.g.surnames moderation teams, so by definition we have the best feeds available for those groups. We also send and receive feeds for alt.genealogy and the complete soc.genealogy.* hierarchy. We don't yet have a feed for the gatewayed FIDO echos, but we'd very much like to have one. The reason why we want to exchange feeds is to increase the speed and reliability of the propagation of genealogical posts through the Usenet. Only a few megabytes per day are involved and RootsWeb's news server is lightly loaded and quick, so exchanging feeds should have little impact on any of our NNTP peers. If you can swap feeds, please drop me a note. Thanks, B. -- Dr. 
Brian Leverich Co-moderator, soc.genealogy.methods/GENMTD-L RootsWeb Genealogical Data Cooperative leverich@rootsweb.com http://www.rootsweb.com/ Date: Mon, 08 Jul 1996 00:24:37 -0700 From: Pam Carey X-Mailer: Mozilla 2.0 (Win16; I) MIME-Version: 1.0 To: mavrogeorge@genealogysf.com CC: genweb@UCSD.EDU Subject: Re: GED >> HTML Generator List References: <199607080155.AA19428@relay.interserv.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit mavrogeorge@genealogysf.com wrote: > > On Sun, 7 Jul 1996, smcgee@sol.slcc.edu (Scott McGee (Personal)) > wrote: > >Since neither operates directly on GEDCOM, it can be debated if > >they should be included, > > Nice to see your messages again. She told me she was > concentrating on "stand alone" HTML generators. Did I neglect to add "non-commercial"? > Your neat processes with LifeLines seem to fall into the > category (unnamed) that includes Family Gatherings > and Roots IV - HTML generators in or associated with > genealogy software. ...................................................................... Excerpted from "LifeLines Basic Description" http://genealogy.org/~ttw/lines/basics.html LifeLines supports GEDCOM; it imports and exports GEDCOM data. LifeLines records are stored and edited in GEDCOM format. LifeLines does not enforce any specific GEDCOM standard; it can handle data in any GEDCOM form, and allows you to extend the standards for your own needs. ...................................................................... We deal with GEDCOMS. We're looking for GED to html generators. ---------------------------------------------------------------------- Excerpted from above-referenced message from Scott McGee: I have two entries, both based on Tom Wetmore's LifeLines and custom reports. ---------------------------------------------------------------------- The 'neither' you are referring to (above), are these. They are *based* on LifeLines. 
This message from Scott McGee was a very kind response to an inquiry that had been forwarded to him, on my behalf, in an attempt to determine whether or not his programs *should* be included in the list. Of course, you would have no way of knowing that. ======================================================================= Excerpted from "COMMSOFT: ROOTS IV" http://www.sonic.net/~commsoft/r4.html To Order ROOTS IV is available for $129.95, suggested retail, plus applicable sales tax and shipping. And from "COMMSOFT: Family Gatherings" http://www.sonic.net/~commsoft/famgath.html To Order Family Gathering is available from most major software outlets. Look for it in a store near you! Family Gathering is available from COMMSOFT for $49.95, suggested retail (plus applicable sales tax and shipping).Order your copy today! ======================================================================= I admit my lack of a good working knowledge of either of these programs, but I did take the time while visiting these pages to learn more about them. They sound like wonderful packages for the individual. However, I'm compiling this list of programs on behalf of the KY GenWeb and US GenWeb projects, to determine their suitability for incorporation into a project of this scale. As each county and state page administrator is a volunteer, it's not feasible at this time to require that they make such an investment as stated above. Of course, anyone is welcome to utilize this list for whatever purpose they see fit. It does not require, however, that their purpose become mine, nor that of the project. It was never my intention to debate this publicly, and I apologize to all of you subscribers who've had to muddle through this. Continued discussion on this topic would merely serve to provide an evening's enjoyment. I hope that this clears up any misunderstanding. 
Pam Carey Durstock KY GenWeb Project Date: 09 Jul 96 13:38:31 EDT From: N Oughtibridge <100020.1117@CompuServe.COM> To: "Scott McGee (Personal)" Cc: GENWEB List Subject: Re: GED >> HTML Generator List Message-ID: <960709173831_100020.1117_EHV55-1@CompuServe.COM> Scott What extensions to the GEDCOM spec are you using (so I can accommodate them too!) Nicholas Oughtibridge Date: Tue, 9 Jul 1996 18:56:32 -0400 Message-Id: <199607092256.SAA01847@mail.one.net> X-Sender: durp@mail.one.net (Unverified) X-Mailer: Windows Eudora Version 1.4.4 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: genweb@UCSD.EDU From: durp@one.net (Pam Carey) Subject: A Good Question Matt Brown (a new submission to the 'list') brought up a good question in a personal message to me yesterday, and I thought it should be shared with the group. > 1) You said you will be testing the various implementation with a > "small" GEDCOM file. How small is small? If the file contains only > a few people (<100), this will favor the GED2HTML type programs >(static converters), as their output size should vary directly with the > number of people/families (or size of the GEDCOM). Whereas, the > indexed GEDCOM implementations will vary like Ax+B, where A is > determined by the length of the index record(s) and B is the size of > the programs. The more people you test with, the less of a factor B > is. This had occurred to me, too, so I plan on reporting the findings not only in terms of total space required, but also as bytes-per-individual for the static generators and (bytes-per-individual + executables) for the dynamic. This may require 'testing the test' - using a second, larger, gedcom to see if I come up with the same factors. If there's any discrepancy, the formula can then be refined. This is the fairest way I could come up with. 
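A minimal sketch of the arithmetic behind this comparison, assuming a static converter's output grows as a*x and an indexed implementation's as A*x + B, where x is the number of individuals in the gedcom. The function names and the byte figures are hypothetical, chosen only for illustration; they are not measurements of any program on the list.

```python
def static_size(x, a):
    """Bytes needed by a static converter: a bytes per individual."""
    return a * x


def dynamic_size(x, A, B):
    """Bytes needed by an indexed implementation: A per individual plus B for executables."""
    return A * x + B


def break_even(a, A, B):
    """Number of individuals at which both approaches need the same space.

    Solves a*x = A*x + B, giving x = B / (a - A). Returns None when the
    static output is not larger per individual, since the two lines
    never cross for a positive x.
    """
    if a <= A:
        return None
    return B / (a - A)


# Hypothetical figures: 2000 bytes/person static, 200 bytes/person indexed,
# 450000 bytes of executables.
print(break_even(2000, 200, 450_000))  # 250.0
```

Below the break-even count the static pages take less room; above it the indexed implementation does. Measuring a second, larger gedcom gives a second point from which a, A and B can be checked.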
It would be a simple matter for anyone to apply the formula and know what kind of storage space would be required for any given gedcom (not just the test gedcom), and to determine the break-even point between any two of the programs being considered (in terms of the number of individuals contained in the gedcom). Thanks, Matt. This is something everyone should have been made aware of up front. I appreciate you bringing it to my attention. Pam Date: Fri, 12 Jul 1996 10:30:12 -0500 (CDT) From: Todd Tyrone Fries Sender: tfries@umr.edu Reply-To: Todd Tyrone Fries Subject: ...Unique ID To: genweb@UCSD.EDU Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; CHARSET=US-ASCII :There are many things I would like to observe. : :Why an ID ? : :1- To match 2 persons without human interventions :2- To travel from GEDCOM 1 at site 1 to GEDCOM 2 at site 2 : :My is 2 is the target. : :In this case, ID should not be based on an attempt to :perform automatic matches between persons, but on a :UNIQUE way of defining that person so that 2 genealogists :entering data about the same person with suffisient data :would match always, and if there is not enough data, :automatic match should not be attempted. An indirect :handle could be used when data is not enough. : :Minimum match: full name tag, birth place tag and date : :Alternative minimum match: a married couple with both full name :tags, marriage place tag and date : You're still, however, reverting back to an id format. For all practical purposes, computers removed from the equation, if two genealogists sit down to compare notes, so to speak, what do they do? First they check the person's name. Perhaps discovering that 'Aunt Kate' and 'Katherine' are the same person. They compare birth dates, if they both have the data, parents, children, spouses, etc.. Basically, they compare all info available and see if they have 'enough' info to make a match. 
Trying to make a cryptic key out of what will eventually become the entire database entry for a person is (imho) rather silly. Why not allow computers to do what they are good at -- manipulating data? It would be great to have a magic number for every person. Unfortunately, genetic code is not even unique per person. So what can we do? Ever hear of compression? Compression, for example the Huffman algorithm, takes all available data, enumerates it, and assigns it 'bits'. These bits are then used later to revert to the original data. If one wishes to 'add' to the available data, one adds more bits. In universities, students get...student id's. They don't have cards that spell out their entire college career. They get a number. This translates into a database that contains their information. It doesn't have my initials or my date of birth or anything like that on my id number. It isn't necessary. So, my question is, why must we devise elaborate, complex schemes to make it 'humanly readable' while at the same time adding more and more data into a key that we wish to use only to identify a person? An id is ok for each database to have, but unless there is one central 'id assigning authority' there cannot be unique keys assigned per person. Here is a suggestion. Have GenWeb (or some central genealogical authority) maintain a GEDCOM (or similar) database. Each person they identify as unique they give a new 'id' to. Say I bring in Matilda B. Bowers, and somebody else brings in Matilda C. Bowers, with similar but not quite the same data. Two numbers are assigned. In the event that I and the other person discover that the two people are instead one and the same, one of the id's can refer to the other, so as to not ever re-use an id, leaving a trail. In this way, as ancestors are 'discovered', they can be enumerated, and every person recorded can be guaranteed to have a unique id.
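The suggestion above can be sketched as follows. This is only an illustration under stated assumptions: the class IdRegistry and its method names are hypothetical, ids are plain integers handed out by counting, and the whole registry lives in memory rather than in a GEDCOM database.

```python
class IdRegistry:
    """Hypothetical central authority that hands out person ids by counting."""

    def __init__(self):
        self._next = 1      # zero is left unused
        self._alias = {}    # retired id -> id it now refers to

    def assign(self):
        """Issue a fresh id; ids are never reused."""
        new_id = self._next
        self._next += 1
        return new_id

    def merge(self, keep, retire):
        """Record that two ids turned out to be the same person."""
        keep, retire = self.resolve(keep), self.resolve(retire)
        if keep != retire:
            self._alias[retire] = keep

    def resolve(self, person_id):
        """Follow the trail of referrals to the id currently in use."""
        while person_id in self._alias:
            person_id = self._alias[person_id]
        return person_id


registry = IdRegistry()
matilda_b = registry.assign()   # Matilda B. Bowers
matilda_c = registry.assign()   # Matilda C. Bowers
registry.merge(matilda_b, matilda_c)
print(registry.resolve(matilda_c))  # 1
print(registry.assign())            # 3
```

The retired id stays in the alias table, so a page that still cites it can be led to the surviving record, and the counter never hands the same number out twice.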
If you use your discussed method of 'generating' a unique id from the name and birthdate, etc, it is not guaranteed to work. I learned of two families in the town I grew up in, astonishingly two generations went by where the birthdates, last and first names, etc, were all the same. However, they were in no way (to their knowledge) related. Thus, I suggest 'giving up' on trying to include enough personal data about a person to guarantee a unique id, and instead focus on guaranteeing a unique id which references the person in a database. This would require a central (or perhaps distributed) authority that assigns numbers.. Perhaps something similar to the dns system we have today? Give each site that wishes to assign numbers a range to work with, and if they run out give them another range not already assigned? Or would it be better to have 1 location do everything? (hopefully not).. Anyway, to generate the id, we could use a 62 'digit' number system for the id's which could be (in regex terms) [a-zA-Z0-9] ... this could allow separator characters, etc, to be used, and leading zero's could be dropped so some lucky person could be '1' (zero should most likely be reserved for something useful), another could be '2', .. it could go in order as: 1 2 3 .... 9 a b c ... z A B C ... Z 10 11 ... I say 'could' because I am only suggesting things. Perhaps someone will notice that 64 is a good power of 2, and so designate a couple more characters to 'round things out'. Anyway, my $.02... -- Todd Fries .. 
todd@miango.com Subject: Re: ...Unique ID To: todd@miango.com Date: Fri, 12 Jul 1996 17:36:29 +0100 (BST) From: Ben Laurie Cc: genweb@UCSD.EDU In-Reply-To: from "Todd Tyrone Fries" at Jul 12, 96 10:30:12 am Reply-To: ben@algroup.co.uk X-Mailer: ELM [version 2.4 PL24 PGP2] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID: <9607121736.aa20405@gonzo.ben.algroup.co.uk> Todd Tyrone Fries wrote: > Perhaps something similar to the dns system we have today? Give each site > that wishes to assign numbers a range to work with, and if they run out give > them another range not already assigned? I am essentially in agreement with you, but... I think you are confusing DNS with IP assignment. One of the nice things about DNS is that once you have been assigned a name you can generate an infinite (or near enough) number of unique IDs from it. This is the basis of the proposal I have made many times, but just to make sure I really am boring everyone, I'll make it again. Here is a way of assigning unique IDs that doesn't involve _any_ new infrastructure: 1. Each site assigns local unique IDs in any way it chooses (call this an LUID). 2. The globally unique ID is then formed as follows: luid@sites.dns.name, or similar. That's all there is to it. We might like to debate details (like, for example, the '@' might be ill-advised as it can be confused with an email address) but the scheme is, IMO, sound, practical and easy to implement. In case that isn't clear, here's an example of one of my own GUIDs. I classify my database by researcher. Each researcher assigns their own database names and UIDs within them. The general form of my LUID is as follows: -:. I have various DNS names to choose from, but will probably use links.org if this scheme is ever accepted. So, the full GUID of my son, Felix Laurie von Massenbach is, as it happens: camilla-laurie:massen69@links.org See? Cheers, Ben. 
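A small sketch of the scheme Ben describes, assuming the '@' separator from his example. The function names make_guid and split_guid are hypothetical; the sample values are the ones given in the message.

```python
def make_guid(luid, site):
    """Join a site-local id and the site's DNS name into one global id."""
    if "@" in luid:
        raise ValueError("LUID must not contain the separator '@'")
    return f"{luid}@{site}"


def split_guid(guid):
    """Split a global id back into (luid, site) at the last '@'."""
    luid, sep, site = guid.rpartition("@")
    if not sep or not luid or not site:
        raise ValueError(f"not a GUID of the form luid@site: {guid!r}")
    return luid, site


guid = make_guid("camilla-laurie:massen69", "links.org")
print(guid)              # camilla-laurie:massen69@links.org
print(split_guid(guid))  # ('camilla-laurie:massen69', 'links.org')
```

Global uniqueness here rests on two things only: DNS names are already unique, and each site keeps its own LUIDs unique. If a different separator were agreed on, only these two functions would change.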
> > Or would it be better to have 1 location do everything? (hopefully not).. > > Anyway, to generate the id, we could use a 62 'digit' number system for > the id's which could be (in regex terms) [a-zA-Z0-9] ... this could allow > separator characters, etc, to be used, and leading zero's could be dropped > so some lucky person could be '1' (zero should most likely be reserved for > something useful), another could be '2', .. it could go in order as: > > 1 > 2 > 3 > .... > 9 > a > b > c > ... > z > A > B > C > ... > Z > 10 > 11 > ... > > I say 'could' because I am only suggesting things. Perhaps someone will > notice that 64 is a good power of 2, and so designate a couple more characters > to 'round things out'. > > Anyway, my $.02... > -- > Todd Fries .. todd@miango.com -- Ben Laurie Phone: +44 (181) 994 6435 Freelance Consultant and Fax: +44 (181) 994 6472 Technical Director Email: ben@algroup.co.uk A.L. Digital Ltd, URL: http://www.algroup.co.uk London, England. Date: Sun, 14 Jul 1996 21:45:52 -0400 From: Alexandra Steele X-Mailer: Mozilla 2.0 (Win95; U) MIME-Version: 1.0 To: genweb@UCSD.EDU Subject: Orangeburg SC Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit looking for information on the following families from the Orangeburg county area of SC: Bolen, Bell, Smith, Bull, Clark, Myers, Murphy, Brown please email me at alexsteele@earthlink.net Alexandra To: genweb@UCSD.EDU Cc: todd@miango.com (Todd Tyrone Fries) From: ghoffman@UCSD.EDU (Gary Hoffman) Organization: IR/PS UC San Diego, La Jolla CA 92093-0519 Date: Mon, 15 Jul 1996 22:25:00 PDT Subject: Re: ...Unique ID Todd Tyrone Fries,todd@miango.com,Internet writes: Thus, I suggest 'giving up' on trying to include enough personal data about a person to guarantee a unique id, and instead focus on guaranteeing a unique id which references the person in a database. This would require a central (or perhaps distributed) authority that assigns numbers.. 
Perhaps something similar to the dns system we have today? Give each site that wishes to assign numbers a range to work with, and if they run out give them another range not already assigned? Or would it be better to have 1 location do everything? (hopefully not).. ------------------ Todd, This brings us back full circle. This discussion began with the concern that a "number" or "ID" that consists of a server name plus a RIN-like number could break if either the server must change names (very common on the Internet) or the sponsor recompiles the HTML from an original database with the result that the record number changes. There are a couple of projects underway that address your proposal. First is the GenDex project, first proposed on this mailing list. Each GenWeb site registers its index, or GenDex, with a central site that compiles the indexes together into a master GenDex. Gene Stark has a good GenDex running at URL http://www.gendex.com. The master site checks periodically with each registered site to see if the local gendex.txt file has been updated and updates the master GenDex as appropriate. It looks like GenDex is scalable to include several million names. GenDex, however, does not resolve multiple instances of the same person. The other project is what is called the URN or Universal Resource Number. I have just written about this in the premier issue of the Journal of Online Genealogy. Rather than repeat myself here, I recommend you check it out at URL http://www.tbox.com/jog/jog.html. Cheers, Gary *************************************************************************** *Gary B. 
Hoffman, Computing Services Manager e-mail: ghoffman@ucsd.edu* *Graduate School of International Relations and Pacific Studies (IR/PS)* *University of California, San Diego (UCSD) voice: (619) 534-1989* *9500 Gilman Dr., La Jolla, CA 92093-0519 USA fax: (619) 534-3939* *************************************************************************** To: GenWeb@UCSD.EDU Cc: scottw@azalea.mirc.gatech.edu From: ghoffman@UCSD.EDU (Gary Hoffman) Organization: IR/PS UC San Diego, La Jolla CA 92093-0519 Date: Mon, 15 Jul 1996 22:46:36 PDT Subject: Robots Searching GenWeb Sites Well, GenWebbers, I guess we have hit the big time. I was surfing the Web last night, looking for some more information about a Robert Redford-produced movie called "The Dark Wind" that I saw on video this weekend. (A Tony Hillerman story that was not released in theaters.) Well, I did a Yahoo search for "Robert Redford" and, lo and behold, item number 3 on the list of found items was an entry for "Robert Redford" in Scott Wilkinson's GenWeb database. It turned out to be a stale link, but I eventually found the right entry there. (Scott's GenWeb site is indexed by GenDex.) This brings up an interesting issue: should we allow Web 'bots to access GenWeb databases? Before you answer, please read all about Web robots at URL http://info.webcrawler.com/mak/projects/robots/robots.html Cheers, Gary *************************************************************************** *Gary B. 
Hoffman, Computing Services Manager e-mail: ghoffman@ucsd.edu* *Graduate School of International Relations and Pacific Studies (IR/PS)* *University of California, San Diego (UCSD) voice: (619) 534-1989* *9500 Gilman Dr., La Jolla, CA 92093-0519 USA fax: (619) 534-3939* *************************************************************************** From: Brian Tompsett Date: Tue, 16 Jul 96 09:14:49 BST Message-Id: <17716.9607160814@olympus.dcs.hull.ac.uk> To: genweb@UCSD.EDU, ghoffman@UCSD.EDU Subject: Re: Robots Searching GenWeb Sites My nine (or so) genealogical databases are perpetually being indexed by robots. Many (if not most) of the accesses come from a robot generated index. This causes interesting question of database design and page format, viz the first page that people see is at the bottom of the tree and they almost never visit the root. You have to think carefully about where you place information (often navigational) that you wish them to see. Robots also cause a fair bit of problems to my server. Whatever the implementors think some of them are a nuisance. One in particular tries to fill the partition that contains my accesss log. I get about 6000 hits a day for genealogical requests. I log everything, and I study the log to learn about how people use the resource - that is in fact why I have the data there! I study how the data is accessed. Brian (http://www.dcs.hull.ac.uk/public/genealogy/GEDCOM.html) From: "W. Wesley Groleau (Wes)" Subject: Re: Robots Searching GenWeb Sites To: B.C.Tompsett@dcs.hull.ac.uk Date: Tue, 16 Jul 96 8:44:31 EST Cc: genweb@UCSD.EDU In-Reply-To: <17716.9607160814@olympus.dcs.hull.ac.uk>; from "Brian Tompsett" at Jul 16, 96 9:14 am Mailer: Elm [revision: 70.85] :> My nine (or so) genealogical databases are perpetually being indexed by robots. :> Many (if not most) of the accesses come from a robot generated index. :> ..... :> Robots also cause a fair bit of problems to my server. 
Whatever the :> implementors think some of them are a nuisance. One in particular tries to :> fill the partition that contains my accesss log. If you go to the URL that Gary posted, one of the links addresses this problem. If I remember right, there is an HTML tag that most robots will honor that tells them to stay out. (Or you could ask your server admin to block a non-compliant 'bot). --------------------------------------------------------------------------- W. Wesley Groleau (Wes) Office: 219-429-4923 Magnavox - Mail Stop 10-40 Home: 219-471-7206 Fort Wayne, IN 46808 elm (Unix): wwgrol@pseserv3.fw.hac.com --------------------------------------------------------------------------- From: mavrogeorge@genealogysf.com Received: from [207.33.216.15] (sf-015.sfo.com) by relay.interserv.com with SMTP id AA19957 (5.67b/IDA-1.5 for genweb@UCSD.EDU); Tue, 16 Jul 1996 06:56:14 -0700 Date: Tue, 16 Jul 1996 06:56:14 -0700 Message-Id: <199607161356.AA19957@relay.interserv.com> Mime-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 7bit Subject: Re: Robots Searching GenWeb Sites To: genweb@UCSD.EDU In-Reply-To: <1996Jul15.224636.1387313@irpsbbs.ucsd.edu> X-Mailer: SPRY Mail Version: 04.00.06.21 On Mon, 15 Jul 1996, ghoffman@UCSD.EDU (Gary Hoffman) wrote: >This brings up an interesting issue: should we allow Web 'bots to access >GenWeb databases? YES! Date: Tue, 16 Jul 1996 10:29:09 EDT Message-ID: To: ghoffman@UCSD.EDU CC: genweb@UCSD.EDU, todd@miango.com In-reply-to: <1996Jul15.222500.1387299@irpsbbs.ucsd.edu> (message from Gary Hoffman on Mon, 15 Jul 1996 22:25:00 PDT) Subject: Re: ...Unique ID From: "Michael A. Patton" Reply-To: "Michael A. Patton, genealogy mail" From: Gary Hoffman Date: Mon, 15 Jul 1996 22:25:00 PDT The other project is what is called the URN or Universal Resource Number. An unfortunate choice of Acronym... 
The IETF is in the throes of standardizing something called a URN, which, if it succeeds in its goals, will become the main way people access the web (they are an advanced form of URL), and thus most people will be familiar with this use of the acronym and another use specific to one field is likely to be confusing. BTW, for those who don't already know, URNs are designed to solve several of the problems this list has been trying to find ad hoc ways to address; these problems are not isolated to our specific regime. A page referenced by URN can move from server to server without the URN changing, for example. -MAP Date: Tue, 16 Jul 1996 10:43:32 EDT Message-ID: To: ghoffman@UCSD.EDU CC: GenWeb@UCSD.EDU, scottw@azalea.mirc.gatech.edu In-reply-to: <1996Jul15.224636.1387313@irpsbbs.ucsd.edu> (message from Gary Hoffman on Mon, 15 Jul 1996 22:46:36 PDT) Subject: Re: Robots Searching GenWeb Sites From: "Michael A. Patton" Reply-To: "Michael A. Patton, genealogy mail" From: Gary Hoffman Date: Mon, 15 Jul 1996 22:46:36 PDT This brings up an interesting issue: should we allow Web 'bots to access GenWeb databases? My theory on how to approach this is to use the robot control stuff(*) to limit the robot to specific pages. These would include some kind of overview and also a local index. I'd make these pages well linked into the descriptive info. If the index (etc.) pages that you let the robot look at are sufficiently well constructed, this should allow most searchers to find any of your pages, without the need for the robot to actually look at each individual page. This is, in fact, one of the ideas in the design of the robot control stuff. Unless I had an individual with some specific interesting bit, I'd block the individual genealogy pages.
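One way to picture this approach is a robots.txt file at the server root that leaves the overview and index pages open and disallows the directory holding the individual pages. The sketch below checks such a file with Python's standard urllib.robotparser module. The directory layout and the robot name "ExampleBot" are assumptions made up for the example, and the rules only restrain robots that choose to honor them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site layout: overview and index pages at the top of
# /genealogy/, one page per individual under /genealogy/people/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /genealogy/people/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a compliant robot may fetch each kind of page.
for path in ("/genealogy/index.html", "/genealogy/people/I0042.html"):
    print(path, parser.can_fetch("ExampleBot", path))
```

An individual page worth exposing (a "famous" person, or one with searchable prose) would have to live outside the disallowed directory under this layout.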
Reasons for including an individual page might be because they were "famous" and might more likely be the object of a search (but note that if the only thing that someone is likely to hit on is the name, then the index page should be sufficient for that), or because there is some amount of prose text that contains info that a searcher might be interested in. -MAP (*) on the page Gary referenced. I must admit I haven't looked at it recently, but I have read it in the past, thinking about this specific problem, and will probably do so again in the next couple of days as my home web server finally comes online... Date: Tue, 16 Jul 1996 11:32:16 -0400 (EDT) From: Denis Beauregard To: GenWeb@UCSD.EDU Subject: Re: Robots Searching GenWeb Sites In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Tue, 16 Jul 1996, Michael A. Patton wrote: > From: Gary Hoffman > Date: Mon, 15 Jul 1996 22:46:36 PDT > > This brings up an interesting issue: should we allow Web 'bots to access > GenWeb databases? > > My theory on how to approach this is to use the robot control stuff(*) > to limit the robot to specific pages. These would include some kind > of overview and also a local index. I'd make these pages well linked There is no "local index". >From what I read, there is a list of files to read or not (robot.txt ?) for the whole server, i.e. it is \robot.txt and otherwise, the main page (not necessary the home page) should include a tag to tell to the robots not to read the internal links. So, you could send your main page (entry of your pages) to the robot and then the robot would index only that page. > into the descriptive info. If the index (etc.) pages that you let the > robot look at are sufficiently well constructed, this should allow > most searchers to find any of your pages, without the need for the This may work for rare surnames, but if you are looking for a smith, it won't help at all... 
Moreover, the robot may take time to visit all pages and you may update the pages in the meanwhile and move names from one file to another. I have the problem of having my reference pages be visible from main search tools. Denis ### Denis Beauregard, genealogiste amateur, Internet: beaur@cam.org ### Page web de genealogie: http://www.cam.org/~beaur/gen/index.html ### Genealogy Web page: http://www.cam.org/~beaur/gen/welcome.html ### Sujets: Quebec, France, Acadie, experts francophones, etc. Subject: Re: ...Unique ID To: ghoffman@UCSD.EDU (Gary Hoffman) Date: Tue, 16 Jul 1996 19:36:34 -0500 (CDT) Cc: genweb@UCSD.EDU, todd@miango.com In-Reply-To: <1996Jul15.222500.1387299@irpsbbs.ucsd.edu> from "Gary Hoffman" at Jul 15, 96 10:25:00 pm From: todd@miango.com (Todd Fries) X-Home-Page: http://www.umr.edu/~tfries/ X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit > Todd Tyrone Fries,todd@miango.com,Internet writes: > Thus, I suggest 'giving up' on trying to include enough personal data about > a person to guarantee a unique id, and instead focus on guaranteeing a > unique id which references the person in a database. This would require a > central (or perhaps distributed) authority that assigns numbers.. > > ------------------ > Todd, > This brings us back full circle. This discussion began with the concern > that a "number" or "ID" that consists of a server name plus a RIN-like > number could break if either the server must change names (very common on > the Internet) or the sponsor recompiles the HTML from an original database > with the result that the record number changes. Hrm, I'll read up as you suggest, but my suggestion is to have 1 authority that starts with zero, 1, etc... and numbers people. The number should not have any bearings based on the server name or anything just the person. Why? Because the person is the only thing unique to the person. Should be obvious, but it's not. 
Some people think that just because a person was 'entered' into a server, their records should forever be tied to that server. But this doesn't allow a very distributed caching scheme of data or any of that. When I said similar to DNS I meant very similar.

Consider the fallacies of what you said:

    id = server + server-database-id

If the server in this equation changes anything, including itself, or if the server goes out of business, or the database needs to change locations to another server, or someone wants to download the database and then serve the data from a local machine, what happens to the above equation? It breaks. Horribly.

The problem at hand is: We have, globally, a lot of data about a lot of people. How can we give a unique id to every person known to (have) exist(ed)? Well, let's see, we assign them... a unique id?

Then the problem becomes: How do we choose a unique id? Where do we store it? Who will assign it?

Well, the most logical way to assign a unique id seems to me to be to start counting. Just number the people. Perhaps use something other than base 10 to make the 'string' for each unique person short, but assign each person a number nevertheless.

Where the id is stored is related to how it is assigned. Allow me to side-track a moment. Currently, we have a vast number of computers on the Internet. Each one has a unique id. Each machine is a separate entity that must be kept track of, for we all must be able to 'reach' it. The ids are assigned first by a central authority, which delegates authority to groups that are given numeric ranges with which to work. If a group needs more numbers, it gets another numeric range. There is never any duplication on the Internet.

So why is it so hard to decide, hrm, we must give a unique id/number to each person so we can easily refer to them across all databases, so gendex, you get 1 - 1,000,000 and genweb, you get 1,000,001 - 2,000,000, etc... so start counting.

And if you discover you have the same person 'enumerated' twice?
Well, somehow agree on a way to have 'alias' numbers such that even though a person receives two numbers, both actually refer to the same individual. It is perhaps 'not good' that a database should have multiple id numbers referring to an individual, but IMHO it is a tradeoff: multiple ids per person or multiple people per id. I think it is obvious which is preferred... because no matter how perfect a system you devise, there are always going to be those two people who submit a 'new' individual to two different databases, and that new individual happens to be the same person.

I guess I must be missing something. What would be wrong with such a plan?

--
Todd Fries .. todd@miango.com

From: mavrogeorge@genealogysf.com
Received: from [207.33.216.21] (sf-021.sfo.com) by relay.interserv.com with SMTP id AA10434 (5.67b/IDA-1.5 for genweb@ucsd.edu); Tue, 16 Jul 1996 20:54:36 -0700
Date: Tue, 16 Jul 1996 20:54:36 -0700
Message-Id: <199607170354.AA10434@relay.interserv.com>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Subject: Re: Robots Searching GenWeb Sites
To: genweb@UCSD.EDU
In-Reply-To:
X-Mailer: SPRY Mail Version: 04.00.06.21

On Tue, 16 Jul 1996, "Michael A. Patton" wrote:

>Unless I had an individual with some specific interesting bit, I'd
>block the individual genealogy pages.

My comment on this would be that one never knows what is an "interesting bit" in the eyes of the surfer. The Greek genealogy pages I have mention specific villages on the island of Lesvos. I very much want my page to show up if a surfer does a search on that village's name. It also includes references to associations and organizations - again these are items I want to show up if someone is searching for them.

I say let the 'bot index all of my site. Ever try to find something in a genealogy reference or a town/state history and not see it in the index, but upon reading something related, happen on the exact thing you wished had been in the index?
Abusive 'bots are a topic in themselves and, I might submit, not entirely related to "should we let robots index our sites."

Date: Tue, 16 Jul 1996 23:45:08 -0700 (PDT)
From: Annelise Anderson
To: genweb@UCSD.EDU
Subject: Excluding Robots
In-Reply-To: <1996Jul15.224636.1387313@irpsbbs.ucsd.edu>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

Here's selected text from the URL Gary posted on robots, from the document on excluding robots. It's a little long, but to set up a file excluding the bots from some directories, all you need to read is the Examples section:

A STANDARD FOR ROBOT EXCLUSION

It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW server against unwanted accesses by their robots.

The latest version of this document can be found on http://info.webcrawler.com/mak/projects/robots/robots.html.
_________________________________________________________________

The Method

The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt". The contents of this file are specified below.

The Format

The format and semantics of the "/robots.txt" file are as follows:

The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is case insensitive.

Comments can be included in file using UNIX bourne shell conventions: the '#' character is used to indicate that preceding space (if any) and the remainder of the line up to the line termination is discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a record boundary.
The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.

User-agent

The value of this field is the name of the robot the record is describing access policy for.

If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have two such records in the "/robots.txt" file.

Disallow

The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example,

    Disallow: /help

disallows both /help.html and /help/index.html, whereas

    Disallow: /help/

would disallow /help/index.html but allow /help.html.

An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.

The presence of an empty "/robots.txt" file has no explicit associated semantics, it will be treated as if it was not present, i.e. all robots will consider themselves welcome.
_________________________________________________________________

Examples

The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/":
_________________________________________________________________

    # robots.txt for http://www.site.com/

    User-agent: *
    Disallow: /cyberworld/map/ # This is an infinite virtual URL space
    Disallow: /tmp/ # these will soon disappear
_________________________________________________________________

This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":
_________________________________________________________________

    # robots.txt for http://www.site.com/

    User-agent: *
    Disallow: /cyberworld/map/ # This is an infinite virtual URL space

    # Cybermapper knows where to go.
    User-agent: cybermapper
    Disallow:
_________________________________________________________________

This example indicates that no robots should visit this site further:
_________________________________________________________________

    # go away
    User-agent: *
    Disallow: /

Author's Address

Martijn Koster

[End of selected text]

My own robots.txt file looks like this:

    User-agent: *
    Disallow: /Test # a duplicate directory for testing

and while the bots have visited, none has visited the /Test directory.

Annelise
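
For anyone who wants to check a file like the ones above before putting it on a server, here is a minimal sketch using Python's standard urllib.robotparser module. It is one later implementation of the exclusion rules, not the code any particular robot runs, and the helper name allowed, the agent name "somebot", and the constant names are made up for illustration. The text it parses is the "cybermapper" example from the quoted document.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file's lines; blank lines separate records.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Second example from the quoted document: every robot is kept out of
# /cyberworld/map/ except the one called "cybermapper".
CYBERWORLD = """\
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
"""

# Hypothetical page inside the disallowed area.
MAP_PAGE = "http://www.site.com/cyberworld/map/index.html"

if __name__ == "__main__":
    for agent in ("somebot", "cybermapper"):
        print(agent, "may fetch map page:", allowed(CYBERWORLD, agent, MAP_PAGE))
```

The same helper can be used to try the "/help" versus "/help/" distinction from the Disallow section: with "Disallow: /help/" a page such as /help.html should still be reported as fetchable, while with "Disallow: /help" it should not.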