From list-relay@UCSD.EDU Wed Nov 8 20:40:00 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id UAA10360 for ; Wed, 8 Nov 1995 20:39:59 -0800 Received: from hoover.stanford.edu (hoover.Stanford.EDU [36.33.0.99]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id UAA02670 for ; Wed, 8 Nov 1995 20:37:08 -0800 Received: from HOOVER.STANFORD.EDU by HOOVER.STANFORD.EDU (PMDF V4.3-10 #13307) id <01HXEVOSG440003DJD@HOOVER.STANFORD.EDU>; Wed, 08 Nov 1995 20:36:43 -0800 (PST) Date: Wed, 08 Nov 1995 20:36:43 -0800 (PST) From: Annelise Anderson Subject: Re: (no subject) To: mbr@dadd.ti.com Cc: genweb@UCSD.EDU Message-id: <01HXEVOSGNEQ003DJD@HOOVER.STANFORD.EDU> X-VMS-To: IN%"mbr@dadd.ti.com" X-VMS-Cc: IN%"genweb@ucsd.edu",ANDRSN MIME-version: 1.0 Content-type: TEXT/PLAIN; CHARSET=US-ASCII Content-transfer-encoding: 7BIT From: IN%"mbr@dadd.ti.com" "Martin Roberts" 7-NOV-1995 17:05:19.42 To: IN%"jrothgteb@q.continuum.net", IN%"genweb@ucsd.edu" CC: Subj: (no subject) Martin Roberts wrote: >I have a general question about this with respect to Genweb: >Should we make available every name we have in our DB's, or only those we >have some confirmation for? I hate to think of all the people who might >ask questions about these connections by marriage when I really don't >know anything about the person except the name. I have more names like >this in my DB than I have names in the direct lines by about a factor >of 4! I did a calculation recently, with admittedly highly simplified assumptions, of the probability that any two people of European ancestry would have no ancestors in common at what we call the 11th generation (1024 possible ancestors). I'm never quite sure using probability theory whether I've got it right, but it came out to about .95, which I found surprising--that's a 5 percent chance of one or more ancestors in common at the 11th generation. Assuming both parties knew who all their ancestors were. 
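A rough sketch of how a figure in this neighbourhood can arise, under assumptions that are not taken from the calculation described above: each person has 1024 distinct ancestors in the 11th generation, and the second person's ancestors are drawn uniformly at random, independently of the first person's, from a pool of people alive at that time. The pool sizes and the function name `p_no_common_ancestor` are hypothetical and chosen only for illustration.

```python
import math

def p_no_common_ancestor(ancestors: int, pool: int) -> float:
    """Chance that two random sets of `ancestors` people share nobody.

    Assumes both sets are distinct members of the same `pool` and that the
    second set is chosen uniformly at random, independently of the first.
    This equals C(pool - a, a) / C(pool, a), computed in log space.
    """
    if pool < 2 * ancestors:
        return 0.0  # too few people: the two sets must overlap
    log_p = 0.0
    for i in range(ancestors):
        # The i-th pick must avoid the first person's `ancestors` people.
        log_p += math.log((pool - ancestors - i) / (pool - i))
    return math.exp(log_p)

if __name__ == "__main__":
    # Pool sizes are illustrative assumptions, not historical estimates.
    for pool in (5_000_000, 20_000_000, 100_000_000):
        print(pool, round(p_no_common_ancestor(1024, pool), 3))
```

Under these assumptions a pool of about 20 million gives a probability of roughly .95 of no shared ancestor; larger pools push the figure closer to 1, and real populations are far from uniformly mixed, so this shows only the shape of the calculation.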
Finding common ancestors--and thus information about other ancestors one doesn't yet know about--increases as more ancestors are identified. A possible measure of the depth of a data base is the sum of the known/possible ancestors in each generation. Through the 11th generation a perfect score would be 1 + 2/2 + 4/4 + ........ + 1024/1024 = 11 This weights recent ancestors more heavily than later ones; but if one is missing a recent ancestor, e.g., one's father, this reduces the possibilities in later generations by 1/2. Another measure is known ancestors/possible ancestors; total possible ancestors in the first 11 generations = 2047. My 80 known direct ancestors look a lot better by the first measure, which comes to 5.9 (of 11), than the second, 80/2047. If anyone else is to find this data base useful, breadth matters a great deal. I used to leave out people in whom I was not particularly interested, but I have concluded that they are useful to others in finding their ancestors (and will be a lot less likely to send me messages about possible relations if they can get to the actual data, or the report of it, than if I just post a list of surnames of interest). A measure of breadth might be people in the data base/direct ancestors in the data base. Of related non-direct ancestors, I think the most recently born are the least useful. Relatives by marriage in more distant generations may be very useful. Here's a case in point: A third great grandmother of mine, Susanna Itschner, married Johann Georg Denninger and immigrated to the United States with him and two infant daughters in 1832. Johann did not fare well on the voyage and died shortly after arrival. Susanna remarried twice; I know only their last names (Morlach and Granf). 
Assuming there were children of these marriages, there may well be Morlachs and Granfs out there who can find their third great grandmother--and her parents and grandparents, for I know a good deal about Susanna's ancestry; I've read the church records. So I've come to favor breadth in data bases--including just about everyone-- as useful to other people and, in making this information available without my personal intervention, a savings in time for me and others. Annelise From list-relay@UCSD.EDU Thu Nov 9 06:40:18 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id GAA12826 for ; Thu, 9 Nov 1995 06:40:17 -0800 Received: from gate.microware.com (gate.microware.com [198.17.151.51]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id GAA14590 for ; Thu, 9 Nov 1995 06:38:09 -0800 Received: by gate.microware.com; id AA05432; Thu, 9 Nov 95 08:36:10 CST Received: from mcrware.microware.com(192.52.109.32) by gate.microware via smap (g3.0.1) id xma005418; Thu, 9 Nov 95 08:35:42 -0600 Received: from wales (wales.microware.com) by mcrware.microware.com with SMTP id AA27423 (5.67a8/IDA-1.5); Thu, 9 Nov 1995 08:37:25 -0600 From: Scott McGee Received: by wales id ; Thu, 9 Nov 95 08:37:22 CST Date: Thu, 9 Nov 95 08:37:22 CST Message-Id: <9511091437.AA02126@wales> To: mbr@dadd.ti.com, smcgee@microware.com Subject: Re: (no subject) Cc: genweb@UCSD.EDU Martin comments on the potential for a LOT of email queries to people about whom you have no more info than shown in your GenWeb pages. Yes, it is true that there is a potential, as this technology matures, to generate an overwhelming amount of email or other queries. Right now, however, the technology is barely in its infancy, and I feel there is plenty of time yet to worry about such issues. Trying to solve them now is bound to fail as we don't have a really clear idea where GenWeb is going to lead us yet. We are really still in the "tinkerer in the basement" stage. 
Outside our little community, nobody has even heard of it. For an example, in the last week and a half, I have had about 1500 queries into my database. So far, only one question about the people there, and about four messages about the the layout of the GenWeb pages themselves. Basically, I agree with your concerns, but feel that it is a little early to worry about such things yet. Scott If at first, you don't succeed, | smcgee@microware.com (Scott McGee) go fry a hen. After all, fried | ----------------------------------------- chicken beats failure any time. | I was paid $5.00 to express these views! -------------> http://www.cc.utah.edu/~sam8644/homepage.html <------------- From list-relay@UCSD.EDU Thu Nov 9 09:20:29 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id JAA13220 for ; Thu, 9 Nov 1995 09:20:28 -0800 Received: from crash.cts.com (crash.cts.com [192.188.72.17]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id JAA04551 for ; Thu, 9 Nov 1995 09:08:44 -0800 Received: by crash.cts.com (Smail3.1.29.1 #5) id m0tDaSQ-00009vC; Thu, 9 Nov 95 09:08 PST Date: Thu, 9 Nov 1995 09:08:10 -0800 (PST) From: "V. Turner" To: Scott McGee cc: mbr@dadd.ti.com, smcgee@microware.com, genweb@UCSD.EDU Subject: How to limit excess email q's on genweb page In-Reply-To: <9511091437.AA02126@wales> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Thu, 9 Nov 1995, Scott McGee wrote: > Martin comments on the potential for a LOT of email queries to people about > whom you have no more info than shown in your GenWeb pages. > > Yes, it is true that there is a potential, as this technology matures, to > generate an overwhelming amount of email or other queries. Right now, however, Perhaps a simple disclaimer at the site would discourage all but the hardest of hard-core questioners. Something along the lines of: " Thank you for taking the time to browse my site. 
Compilation of this data has taken many hundreds, even thousands of hours. I hope it is of use to you in your own research. Please know that, at this time I simply *cannot* commit any additional hours in responding to *any* queries. If you have information that may help to update this data base, please feel free to send it. Someday I will be able to read it. However, the policy now is that no personal questions will be responded to until I hit the lottery :-) and hire a roomful of personal secretaries, assistants & and assorted go-fers. Until that time, thank you for your patience, and best of luck with *your* search." Fondest Woofers, * /\ -V * o---O~ \ / wizards@cts.com * >---/ \ ----------\/~~~~ "Be Resourceful. * * \ \ Be Resiliant. | * //( )____________|| You Are Brilliant." ____|__*|_____//_| |_______ //__||_____ * | * ___|_|____________||__*____|_ | * =====================|========= "In the Fall, Puppies Are Skiing In The Dandilions" From list-relay@UCSD.EDU Thu Nov 9 10:09:30 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id KAA13282 for ; Thu, 9 Nov 1995 10:09:29 -0800 Received: from mail1.digital.com (mail1.digital.com [204.123.2.50]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id KAA07246 for ; Thu, 9 Nov 1995 10:11:54 -0800 Received: from vanna.ljo.dec.com by mail1.digital.com; (5.65 EXP 4/12/95 for V3.2/1.0/WV) id AA01223; Thu, 9 Nov 1995 10:07:01 -0800 Received: from csac by vanna.ljo.dec.com; (5.65/1.1.8.2/10Oct94-8.2MPM) id AA04795; Thu, 9 Nov 1995 13:04:45 -0500 Received: by csac.ljo.dec.com; id AA31845; Thu, 9 Nov 1995 13:06:58 -0500 Date: Thu, 9 Nov 1995 13:06:58 -0500 From: JimIsaak Message-Id: <9511091806.AA31845@csac.ljo.dec.com> To: genweb@UCSD.EDU Subject: Web Tools and locating names There is an emerging set of web tools for locating information on the web. 
These include "crawlers" or "worms" that go off looking for specific stuff, and also "inverted indexes" that suck in all of the pages from the web and index them. In either case, it is possible to put in a persons name and find ALL web pages (or close to it) that contain a reference to that name. More sophisticated queries could be made as well, which would reduce the "hit" space for popular names. However, these tools will not identify names that are "generated" (CGI-BIN style) from Forms or other tools that create "pages on demand". One element we may want to include in the set of GenWeb types of pages would be very simple (tinytafle?) pages that list names and a few key dates or locations so these other tools will be able to locate them on "static" pages. Jim Isaak From list-relay@UCSD.EDU Thu Nov 9 10:10:21 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id KAA13288 for ; Thu, 9 Nov 1995 10:10:21 -0800 Received: from mail.ucsd.edu (ucsd.edu [132.239.1.1]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id KAA05709 for ; Thu, 9 Nov 1995 10:12:47 -0800 Received: from chopin.ucc.hull.ac.uk by mail.ucsd.edu; id KAA19698 sendmail 8.6.12/UCSD-2.2-sun via ESMTP Thu, 9 Nov 1995 10:12:39 -0800 for Received: from mailhub.dcs.hull.ac.uk (actually host bertie.dcs.hull.ac.uk) by chopin with SMTP local (PP); Thu, 9 Nov 1995 18:09:50 +0000 Received: from olympus.dcs.hull.ac.uk by mailhub.dcs.hull.ac.uk with smtp (Smail3.1.29.1 #3) id m0tDbJk-0003rYC; Thu, 9 Nov 95 18:03 GMT From: Brian Tompsett Date: Thu, 9 Nov 95 18:00:30 GMT Message-Id: <13592.9511091800@olympus.dcs.hull.ac.uk> To: genweb@UCSD.EDU Subject: Re: How to limit excess email q's on genweb page A topic close to my own heart. As I have a set of data frequently consulted by all and sundry (Royal Lineage), I attract quite a lot of such email. Yes! 
People really do expect me to reply by return and tell them their own personal descent from Charlemagne, Edward I and other sundry people. I also get quite a few people telling me my data is no good as they are not in it. I have even been contacted by people who claim descent (or claim to be!) the Romanov's who escaped Ekerinberg. I am expected to remember, by heart, every name in the database and have them trip accross my tounge. More than I bargained for when I was just trying to do some Computer Science and look at data organisation, search and access methods, record identification, database performance and distribution. I often wonder wether to make my next published paper one on history or Computer Science. :-) Brian PS: I can spell, I just can't type :-) From list-relay@UCSD.EDU Thu Nov 9 10:39:19 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id KAA13374 for ; Thu, 9 Nov 1995 10:39:18 -0800 Received: from gate.ti.com (news.ti.com [192.94.94.33]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id KAA06887 for ; Thu, 9 Nov 1995 10:42:26 -0800 Received: from dad_sun.dadd.ti.com ([156.117.138.45]) by gate.ti.com (8.6.12/) with ESMTP id MAA14094 for ; Thu, 9 Nov 1995 12:42:25 -0600 Received: from 156.117.138.61 (mbr.dadd.ti.com [156.117.138.61]) by dad_sun.dadd.ti.com (8.6.10/8.6.10) with SMTP id MAA10760 for ; Thu, 9 Nov 1995 12:38:40 -0600 To: genweb@UCSD.EDU From: mbr@dadd.ti.com (Martin Roberts) Subject: Re: breadth - renamed from [no subject] Date: Thu, 9 Nov 1995 11:38:43 Message-ID: Thanks to Annelise, Anders, and Scott for their replies. Scott, I will be patient, but I work for a company where planning is emphasized and, in particular, gathering and understanding customer requirements. I hope you accept my concern in the sense of a knowledgeable customer's requirement. Anders, I need to explain better. I have many names in my data base that have no parents and no children. 
They are just names that I copied out of a book. The book lists lineages and gives the spouses for each generation as they are known, and details about the writers' direct line. I'm sure you are familiar with this sort of reference. I have no information about these names - they are just place holders in the DB. Of course if someone has a genealogy for one of these spousal families, I may be able to ask them if they have a marriage for person x to a person in my lineage, etc. Obviously that is good to have. But I mean those entries as pointers for me to use to ask other people. The entries for which I have information are my DB, which I am willing to answer questions about. The purposes for storing these two types of name are different and I would like the difference recognized in any standardized interchange format like GEDCOM. I'll make the suggestion specific: If a name in a DB has no ancestors or descendants, it should be flagged as such with an identifier. This identifier should serve as a warning that there is no information in the DB about this person except possibly the spouse name. Annelise, I'm very interested in counting breadth. I counted directly the children of my 4 maternal ggparents, which is not a large family. I may not know of all the babies, but these four people had 5 children, 5 grandchildren, 13 ggchildren, 21 gggchildren, and at least (so far) 8 ggggchildren. For my father's family, which is larger, I don't know all of the relatives, but as far as I know my paternal ggparents had 7 children, 13 gchildren, 19 ggchildren, 17 gggchildren, and 10 ggggchildren. But I'm missing several people on one branch of this family. Using this as an example, let's try to define a breadth factor as follows: Breadth(person, ancestor level, generation) = the number of people in a generation that are descendants of a common ancestor. 
where person is a name in a DB, ancestor level is the number of generations back, and generation is the generation number relative to the persons generation. For example, I have five children, so Breadth(Martin, 0, +1) = 5 Also, Breadth(Martin, -1, 0) = 1 Breadth (Martin, -2, 0) = 5 Breadth(Martin, -3, 0) = 19+13 = 32 For my wife, Breadth(Joan, -2, 0) > 30 She has a great aunt that has over 100 descendants. It would be interesting to know the statistical properties of this number. We would need statistical data on number of children born in a family during different time periods. Also the probability that any individual had any children, which is related to the survival rate to puberty for children, and the marriage rates. If we had these rate factors we could write a monte carlo simulator to estimate the breadth numbers. Fun for the financially secure intellectually challenged. Martin From list-relay@UCSD.EDU Thu Nov 9 12:46:30 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id MAA13747 for ; Thu, 9 Nov 1995 12:46:29 -0800 Received: from newton.ccs.tuns.ca (newton.ccs.tuns.ca [134.190.1.4]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id MAA11345 for ; Thu, 9 Nov 1995 12:48:08 -0800 Message-Id: <199511092048.MAA11345@UCSD.EDU> Received: from parker.ced.tuns.ca by newton.ccs.tuns.ca with SMTP (1.37.109.6/15.6) id AA00107; Thu, 9 Nov 95 16:47:58 -0400 Sender: From: "Parker Barrington" Organization: TUNS, CED To: genweb@UCSD.EDU Date: Thu, 9 Nov 1995 17:46:29 +4000 Subject: Re: Web Tools and locating names Reply-To: parkerb@tuns.ca X-Confirm-Reading-To: parkerb@tuns.ca X-Pmrqc: 1 Priority: normal X-Mailer: Pegasus Mail/Windows (v1.22) This is similar to an idea I've been thinking of: The possibility of a robot that would do gedcom matching. A form could be set up at a site to collect certain information from interested genealogists including their email address and the URL of their gedcom. 
The gedcoms would have to be unzipped and therefore this may not work for large gedcoms, but bear with me. A program at that site could use the http protocol to "copy" this URL to a local file. The program would then check this gedcom against all other gedcoms in its database using a program similar to gensrch, that creates a report of its "match" findings. Each gedcom is individually copied and then deleted before "copying" the next gedcom. The problems of defining a "match" between gedcoms notwithstanding, perhaps we would have to settle for exact matches only to limit them. The program would also have to check that the submitted URLs are actually text files and that they are in proper gedcom format. The algorithm would have to keep track of which gedcoms have been checked against which gedcoms as well as a schedule for checking for updates to the gedcoms. A robot similar to the one used by the URL minder (http://www.netmind.com/URL-minder/URL-minder.html) could periodically check the URLs to see if there has been a change in the data. When a match has been found, the email, URL and details of the "match" could be emailed to each of the two URL owners. If comparisons are only made when the URLs are first submitted or when a change in a URL has been detected, the number of notifications should be few enough to handle and be useful. The advantages of this system include: 1. It could be fully automated. 2. It does not require a centralized database. Please, let's have some comments. Anyway, back to lurking..... -Parker. > There is an emerging set of web tools for locating information > on the web. These include "crawlers" or "worms" that go off looking > for specific stuff, and also "inverted indexes" that suck in all of > the pages from the web and index them. In either case, it is possible > to put in a persons name and find ALL web pages (or close to it) that > contain a reference to that name. 
More sophisticated queries could be > made as well, which would reduce the "hit" space for popular names. > > However, these tools will not identify names that are > "generated" (CGI-BIN style) from Forms or other tools that create > "pages on demand". > > One element we may want to include in the set of GenWeb > types of pages would be very simple (tinytafle?) pages that list > names and a few key dates or locations so these other tools will > be able to locate them on "static" pages. > > Jim Isaak > > Parker Barrington Technical University of Nova Scotia Continuing Education Division http://www.ced.tuns.ca From list-relay@UCSD.EDU Thu Nov 9 13:49:44 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id NAA13832 for ; Thu, 9 Nov 1995 13:49:43 -0800 Received: from relay-4.mail.demon.net (relay-4.mail.demon.net [158.152.1.64]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id NAA16731 for ; Thu, 9 Nov 1995 13:48:13 -0800 Received: from post.demon.co.uk by relay-4.mail.demon.net id msg.ae26021; 9 Nov 95 21:44 GMT Received: from apusapus.demon.co.uk by relay-3.mail.demon.net id sg.aa22843; 9 Nov 95 21:44 GMT From: Trevor Jenkins Organisation: Don't put it down; put it away! To: Jeri Steele Date: Thu, 9 Nov 1995 21:38:56 +0000 Message-ID: <4660.tfj@apusapus.demon.co.uk> Subject: Re: more about MARC Reply-To: tfj@apusapus.demon.co.uk cc: finleyc@sonoma.edu, GEDCOM-L@vm1.nodak.edu, genweb@UCSD.EDU Priority: normal X-mailer: Pegasus Mail for Windows (v2.01) via wpkGate v2.01 > >On 2 Nov 95 at 14:34, Jeri Steele wrote: > > > >> If others start to do such projects, we may well see software > >> programs output MARC format files too! > > > >Trevor wrote: > > > >Oh I hope not. :-| To which Jeri said in her reply, which I need to read throughly before answering, that: > However, I brought up MARC not as a perfect example but one that > presents a new way of handling information. MARC format has been around for years. 
:-( I do not have the dates to hand but it has certainly been in use by the on-line host community and for on-line library catalogues since the early 70s, maybe even before. I had to process MARC formatted tapes in 1978/79. Even the suggestion to use MARC for handling genealogical data is not new---I recall seeing another posting about that several years ago. I would also like to comment immediately upon: > I brought up the MARC format as an example of how to get SOURCE information > into a form that can be shared with others and viewed on the WWW. MARC is a binary format and GEDCOM is a textual format, neither of which is handled by any of the WWW browsers available. Some brave souls have converted GEDCOM files into an equivalent HTML form so that all browsers can display them. As to whether the latest draft version of GEDCOM is capable of recording source information I have not checked and I have stated, to GEDCOM-L, why I will not review this draft. My personal view, because I served on the ISO committee that ratified SGML, is that SGML can subsume the roles that you separately ascribe to GEDCOM and MARC. I said that to Bill Harten when I met him several years ago (about the time of GEDCOM 5.0). Further, I believe that GEDCOM can adequately describe source information even in draft 5.1 by encoding it in a NOTE element, which is how I record those details in my LifeLines database. Still, enough already; I will read the remainder of your arguments and comment after giving them the thought they deserve. Regards, Trevor. -- Procrastinate Now! 
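The practice of keeping source details in a NOTE element, mentioned in the message above, can be illustrated with a minimal sketch. It assumes the usual GEDCOM line shape of level, optional @id@, tag and optional value, and that CONT and CONC lines continue a NOTE. The sample record, its wording and the function name `notes_by_record` are hypothetical and are not taken from anyone's database.

```python
import re

# level, optional @xref@, tag, optional value
LINE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?: (.*))?$")

def notes_by_record(gedcom_text):
    """Map each level-0 record id to the text of its level-1 NOTE lines."""
    notes = {}
    current = None    # id of the record being read, if it has one
    in_note = False   # True while level-2 lines may continue a NOTE
    for raw in gedcom_text.splitlines():
        m = LINE.match(raw)
        if not m:
            continue  # skip lines that do not look like GEDCOM
        level, xref, tag = int(m.group(1)), m.group(2), m.group(3)
        value = m.group(4) or ""
        if level == 0:
            current, in_note = xref, False
        elif level == 1:
            in_note = tag == "NOTE" and current is not None
            if in_note:
                notes.setdefault(current, []).append(value)
        elif level == 2 and in_note and tag in ("CONT", "CONC"):
            # CONT starts a new line of the note, CONC joins without a break.
            notes[current][-1] += ("\n" if tag == "CONT" else "") + value
    return notes

# Illustrative sample only; names and wording are made up.
SAMPLE = """0 HEAD
0 @I1@ INDI
1 NAME Example /Person/
1 BIRT
2 DATE 1800
1 NOTE Source: parish register, entry seen by submitter
2 CONT Legibility: good
0 TRLR"""

if __name__ == "__main__":
    print(notes_by_record(SAMPLE))
```

A reader that ignores the lineage tags and looks only at such notes is taking what the thread calls the informational view of the same data stream.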
From list-relay@UCSD.EDU Thu Nov 9 14:15:41 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id OAA13882 for ; Thu, 9 Nov 1995 14:15:41 -0800 Received: from gate.microware.com (gate.microware.com [198.17.151.51]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id OAA14570 for ; Thu, 9 Nov 1995 14:16:08 -0800 Received: by gate.microware.com; id AA10555; Thu, 9 Nov 95 16:15:01 CST Received: from mcrware.microware.com(192.52.109.32) by gate.microware via smap (g3.0.1) id xma010550; Thu, 9 Nov 95 16:14:29 -0600 Received: from wales (wales.microware.com) by mcrware.microware.com with SMTP id AA07486 (5.67a8/IDA-1.5); Thu, 9 Nov 1995 16:16:12 -0600 From: Scott McGee Received: by wales id ; Thu, 9 Nov 95 16:16:09 CST Date: Thu, 9 Nov 95 16:16:09 CST Message-Id: <9511092216.AA04168@wales> To: genweb@UCSD.EDU, parkerb@tuns.ca Subject: Re: Web Tools and locating names Parker talks about a GEDCOM matching robot and says it solves the problem of having a central database. Parker, the biggest problem I see is this: what is it going to compare my GEDCOM with? It would have to collect GEDCOMs into a large database and do the comparisons with what it had collected so far. It would also have to flag collected GEDCOMs to determine if they needed replacing. Definitely a non-trivial problem. I like the idea, but am unsure how it would really work. 
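One way the comparison step might work, sketched under the "exact matches only" restriction proposed earlier in the thread: instead of keeping whole GEDCOMs, the robot keeps a small set of keys per submitted URL and intersects them pairwise. The choice of (name, birth date) as the key, the example URLs and the function names `person_keys` and `exact_matches` are assumptions for illustration. Fetching the files and checking for updates are left out.

```python
import re
from itertools import combinations

# level, optional @xref@, tag, optional value
LINE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?: (.*))?$")

def person_keys(gedcom_text):
    """Return the set of (lower-cased name, birth date) pairs for INDI records."""
    keys = set()
    name = birth = context = None
    in_indi = False
    # The extra "0 TRLR" makes sure the last record is flushed.
    for raw in gedcom_text.splitlines() + ["0 TRLR"]:
        m = LINE.match(raw)
        if not m:
            continue
        level, tag = int(m.group(1)), m.group(3)
        value = (m.group(4) or "").strip()
        if level == 0:
            if in_indi and name:
                keys.add((name.lower(), birth))
            in_indi = tag == "INDI"
            name = birth = context = None
        elif in_indi and level == 1:
            context = tag  # remember which event a level-2 DATE belongs to
            if tag == "NAME" and name is None:
                name = value
        elif in_indi and level == 2 and tag == "DATE" and context == "BIRT":
            birth = value
    return keys

def exact_matches(submissions):
    """submissions maps URL -> GEDCOM text; returns shared keys per URL pair."""
    keyed = {url: person_keys(text) for url, text in submissions.items()}
    report = {}
    for a, b in combinations(sorted(keyed), 2):
        shared = keyed[a] & keyed[b]
        if shared:
            report[(a, b)] = shared
    return report
```

Each pair of URLs is compared once, and only the key sets need to be retained between runs. Whether such a key index still counts as a central database is the open question raised in the message above.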
Scott From list-relay@UCSD.EDU Thu Nov 9 15:19:51 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id PAA14007 for ; Thu, 9 Nov 1995 15:19:50 -0800 Received: from relay-4.mail.demon.net (relay-4.mail.demon.net [158.152.1.64]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id PAA17053 for ; Thu, 9 Nov 1995 15:18:48 -0800 Received: from post.demon.co.uk by relay-4.mail.demon.net id sg.aa12400; 9 Nov 95 23:06 GMT Received: from apusapus.demon.co.uk by relay-3.mail.demon.net id msg.aa09328; 9 Nov 95 23:05 GMT From: Trevor Jenkins Organisation: Don't put it down; put it away! To: Jeri Steele Date: Thu, 9 Nov 1995 23:02:15 +0000 Message-ID: <4665.tfj@apusapus.demon.co.uk> Subject: Re: more about MARC (long reply) Reply-To: tfj@apusapus.demon.co.uk cc: GEDCOM-L@vm1.nodak.edu, GENWEB@UCSD.EDU, finleyc@sonoma.edu Priority: normal X-mailer: Pegasus Mail for Windows (v2.01) via wpkGate v2.01 > >On 2 Nov 95 at 14:34, Jeri Steele wrote: > > > >> If others start to do such projects, we may well see software > >> programs output MARC format files too! > > > >Trevor wrote: > > > >Oh I hope not. :-| Jeri then replied: > I agree its nice to have the ascii form to edit and view if we have > to go into a GEDCOM file and modify the information. > > However, I brought up MARC not as a perfect example but one that > presents a new way of handling information. Here's a couple of the > problems I have had with GEDCOM as a format: > > 1) By expressing all information in terms of people, dates, and relationships > we are showing the RESULTS of our genealogy research. GEDCOM and MARC are both interchange formats. The groups that promulgated them are different and the two stanards reflect the interests of these groups. 
However, this does not restrict the use of either standard for the purpose; that would be akin to saying that C++ is a programming language and therefore cannot be used as a means of expressing an algorithm so that another person can study the algorithm. I find a contradiction here with the examples you give later. In both cases you are concerned with results. Conventional wisdom says that GEDCOM records facts (e.g. names, places, events) but not the provenance of those facts. This is true of GEDCOM version 4.0, which is the current standard, but is not true of GEDCOM version 5.0 or later, which is a draft standard. The GEDCOM standard defines four concepts. Firstly, there is an abstract syntax for describing record linkage. This uses the cross-reference id feature to link arbitrary parts of the data stream together. (In my reading of the standard I have never seen anything that suggests that I must restrict my definitions of pointer ids to level 1 lines.) I believe that the standard allows us to use the pointer concept as if it were a macro processor. Iterative substitution of a sub-structure for a cross-reference to it is acceptable. Secondly, there is a syntax for describing lineage-linking. This superimposes a partial genealogical view on the abstract syntax. Thirdly, there is a syntax for describing source provenances. This superimposes a partial informational view on the abstract syntax. It so happens that the informational view and the genealogical view of the abstract syntax are separate but cooperative. Fourthly, there is a particular set of tags defined. This restricts the use of GEDCOM to a genealogical application. (In my view this is a retrogressive step; in GEDCOM version 4.0 there were no prescribed tags. As long as the syntax of a tag was adhered to a programmer, or GEDCOM creator, could invent tags for any purpose they desired.) The one thing that is missing from GEDCOM, and rightly so, is any concept of a distinguished record. 
Imagine picking up a lattice by an intersection---whichever one you choose I will choose a different one. In a GEDCOM file which contains source records and lineage records you might favour the genealogical view, because you have accepted the conventional wisdom, whereas I would favour either the genealogical view OR the informational view depending upon how I wish to interpret the data stream at that instant of time. > 2) There is no complete way to exchange source level information via GEDCOM. > (Not just a footnote or bibliographic citation, but complete source analysis!) Oh but I believe that there is! Whilst the GEDCOM 5.0+ standards prescribe which tags can be used there is a way to define additional tags. Unlike GEDCOM 4.0 where the name of a tag could be "plucked" out of the air without any means of exchanging semantics, in the SCHEMA concept of 5.0+ the semantics are encoded in the data stream itself and not in the GEDCOM processor. > I brought up the MARC format as an example of how to get SOURCE information > into a form that can be shared with others and viewed on the WWW. As I wrote in my earlier reply to your message I do not believe that this example is valid. Either interchange format allows for interchange but neither of them can be viewed (directly) on the World-Wide Web. > For example, take a copy of the Probate Minutes on a specific case: > > The 'core' information can be abstracted, notes made about the > readability and accuracy of the information and its recorder. Then we > can infer additional information when we analyze this information in > conjunction with previous research AND in light of a particular research > goal. We will probably extract death and relationship information from > the probate document. But are you trying to prove a parent relationship, > the marriage event, or the death event? None. I would pick up this lattice with my informational spectacles on rather than with my genealogical ones. 
The events are the data extracted from the source but the source's provenance is there too. > With GEDCOM we are automatically thinking in terms of people, dates and places. No. This is merely because the conventional wisdom restricts our interpretation of the data stream. As genealogists we must think about the source information otherwise we are merely "train spotting". > With some system like MARC we would be emphasizing the abstraction of the sources > first. (Then recording notes, drawing conclusions, and finally recording the > actual and inferred information to be charted and displayed.) Neither MARC nor GEDCOM is a system. They are only interchange formats. Some applications accept GEDCOM, some accept MARC but it is WE who make the assumptions and give the emphasis. One of the weak points of MARC is that, unlike GEDCOM, there is no means to cross-reference records within the data stream. In MARC I can send you a source record which is dissociated even from me. All you know about it is that you have a copy. GEDCOM on the other hand requires that the submitter be included in the data stream and that this is linked to the other records in the data stream. Also, GEDCOM allows us to exchange several sets of data which refer to each other. MARC does not offer either of these features. > If we exchange source information, notes, AND conclusions this is recording > more of the genealogy research process. Do you mean to say that the source information and notes have any relevance to a genealogist if they are separated from the (genealogical) conclusions? > If we were to start over designing > an exchange format today with the premise that abstracting and analyzing > source information is the MOST important part of the process, would it > look like GEDCOM? Why not? It could also look like (an extended version of) MARC. It could look like SGML. It could look like PostScript. 
The problem with both GEDCOM and MARC that SGML does not have is that if you wish to enhance the standard you must enhance the standard. In SGML there is no such limitation. I could add new elements, describing how they fit within the structure of the document, determining what rules are to be applied to verify that structure. One of the reasons that SGML caught on and ODA (Office Document Architecture), the standard that is its near relation, did not is precisely this. For every new format that you wanted to add into ODA you have to extend the standard. With SGML you simply extend the description of your document's structure. GEDCOM 5.0+ goes some of the way to providing this with the SCHEMA concept but does not provide a formal model of the prescribed tags, which you would use to define your additional tag. > Should we think about source information analysis > data exchange as a different problem covered by different exchange formats > rather than using GEDCOM to express this information? In my opinion, GEDCOM is better than MARC because it can be extended without the standard having to be rewritten. > ...In reality, I am only viewing the information from an > event that I have concluded occurred at a particular place, time and > with certain people participating. I am not recording pure source > information that can be passed on so that others can draw their own > conclusions. Taking an informational view of a GEDCOM data stream would allow others to do this with your data. > This isn't a limitation of the program but in how > GEDCOM and genealogy programs are 'slanted' right now. Rather it is, as I have already said, solely because of the conventional interpretation of the GEDCOM standard. > Yes, I can > attach files that transcribe the documents or record my abstraction > notes. I want more. 
As someone who has criticised GEDCOM at great length (approximately 20 pages of comments on each of the version 5.0, 5.1, 5.2 and 5.3 drafts) I still say that it already provides you with what you want.

> If I examine and abstract 20 deeds in order to solve one of my genealogy
> problems, how do I exchange that information with others? They may want
> to use the same abstracts to make a different conclusion (They may be looking
> at a different family for instance).

I will harp on about it, but GEDCOM already allows you to do this.

> Is this a different genealogy application that doesn't exist in the
> software market? Maybe a different view on the source data recorded in
> GEDCOM is a solution?

If you wish to take an evental (sic) view of genealogy then GEDCOM will handle that too. There have been genealogical applications which adopted the event as the basic unit.

> Carmen Finley's example of using MARC gives the GEDCOM and GENWEB
> community some exposure and exploration at disseminating source
> information. I can't be the only 'serious' genealogist that has
> been frustrated by a lack of a 'standard' way to share this source
> information!

If you are interested in source information then the international standards community, which gave you both MARC and SGML and should be giving you GEDCOM too, had and probably still has experts working on source standardisation.

Regards, Trevor.
--
Procrastinate Now!
From list-relay@UCSD.EDU Thu Nov 9 15:19:58 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id PAA14013 for ; Thu, 9 Nov 1995 15:19:57 -0800 Received: from relay-4.mail.demon.net (relay-4.mail.demon.net [158.152.1.64]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id PAA20398 for ; Thu, 9 Nov 1995 15:11:31 -0800 Received: from post.demon.co.uk by relay-4.mail.demon.net id sg.af12282; 9 Nov 95 23:05 GMT Received: from apusapus.demon.co.uk by relay-3.mail.demon.net id msg.aa09326; 9 Nov 95 23:05 GMT From: Trevor Jenkins Organisation: Don't put it down; put it away! To: parkerb@tuns.ca Date: Thu, 9 Nov 1995 21:50:30 +0000 Message-ID: <4664.tfj@apusapus.demon.co.uk> Subject: RCPT: Re: Web Tools and locating names Reply-To: tfj@apusapus.demon.co.uk Priority: normal X-mailer: Pegasus Mail for Windows (v2.01) via wpkGate v2.01 X-Pet-Peeve: It is anti-social to put read receipts on messages sent to mailing lists Confirmation of reading: your message - Date: 9 Nov 95 17:46 To: genweb@UCSD.EDU Subject: Re: Web Tools and locating names Was read at 21:50, 9 Nov 95. Regards, Trevor. -- Procrastinate Now! 
From list-relay@UCSD.EDU Thu Nov 9 15:34:27 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id PAA14079 for ; Thu, 9 Nov 1995 15:34:27 -0800 Received: from gate.ti.com (news.ti.com [192.94.94.33]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id PAA17688 for ; Thu, 9 Nov 1995 15:36:00 -0800 Received: from tilde.csc.ti.com ([128.247.160.56]) by gate.ti.com (8.6.12/) with ESMTP id RAA23240; Thu, 9 Nov 1995 17:35:57 -0600 Received: from ticipa.Works.Ti.Com (ticipa.works.ti.com [128.247.112.8]) by tilde.csc.ti.com (8.7.1/8.7.1) with SMTP id RAA19878; Thu, 9 Nov 1995 17:35:24 -0600 (CST) Received: from census.works.ti.com by ticipa.Works.Ti.Com with SMTP id AA07996 (5.65c/IDA-1.4.4); Thu, 9 Nov 1995 17:34:56 -0600 Received: by census.Works.Ti.Com (5.x/SMI-SVR4) id AA08891; Thu, 9 Nov 1995 17:34:58 -0600 Date: Thu, 9 Nov 1995 17:34:58 -0600 From: steele@ticipa.Works.ti.com (Jeri Steele) Message-Id: <9511092334.AA08891@census.Works.Ti.Com> To: tfj@apusapus.demon.co.uk Cc: GEDCOM-L@vm1.nodak.edu, GENWEB@UCSD.EDU In-Reply-To: <4665.tfj@apusapus.demon.co.uk> (message from Trevor Jenkins on Thu, 9 Nov 1995 23:02:15 +0000) Subject: Re: more about MARC (short reply to the long reply) Reply-To: steele@census.works.ti.com You obviously have more time on your hands than I do! >As I wrote in my earlier reply to your message I do not believe that >this example is valid. Either interchange format allows for >interchange but neither of them can be viewed (directly) on the >World-Wide Web. Well of course that can't be viewed directly on WWW, since HTML(SGML) IS the language of the Web! However, A number of libraries have used forms to feed the queries. I have neither the time nor energy to go point by point in detail on specific specifications. I just want to remind developers there IS more than one way to organize data. Enough! 
Jeri

From list-relay@UCSD.EDU Thu Nov 9 20:16:57 1995
Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id UAA14606 for ; Thu, 9 Nov 1995 20:16:56 -0800
Received: from hoover.stanford.edu (hoover.Stanford.EDU [36.33.0.99]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id UAA01427 for ; Thu, 9 Nov 1995 20:12:59 -0800
Received: from HOOVER.STANFORD.EDU by HOOVER.STANFORD.EDU (PMDF V4.3-10 #13307) id <01HXG8VP2704003DBI@HOOVER.STANFORD.EDU>; Thu, 09 Nov 1995 20:12:33 -0800 (PST)
Date: Thu, 09 Nov 1995 20:12:33 -0800 (PST)
From: Annelise Anderson
Subject: Re: breadth - renamed from [no subject]
To: mbr@dadd.ti.com
Cc: genweb@UCSD.EDU
Message-id: <01HXG8VP2GNA003DBI@HOOVER.STANFORD.EDU>
X-VMS-To: IN%"mbr@dadd.ti.com"
X-VMS-Cc: IN%"genweb@ucsd.edu",ANDRSN
MIME-version: 1.0
Content-type: TEXT/PLAIN; CHARSET=US-ASCII
Content-transfer-encoding: 7BIT

From: IN%"mbr@dadd.ti.com" 9-NOV-1995 11:56:41.34
To: IN%"genweb@ucsd.edu"
CC:
Subj: RE: breadth - renamed from [no subject]

Martin Roberts writes:

>I'll make the suggestion specific:
> If a name in a DB has no ancestors or descendants, it should be
> flagged as such with an identifier. This identifier should serve
> as a warning that there is no information in the DB about this person
> except possibly the spouse name.

It would seem to be obvious that that's the case anyway. But I include such people not because I'm interested in them--or in knowing more about them--but because they may be of help to other people in tying in to the people about whom I have some information. For example, the sister of my immigrant Graebner ancestor married someone named Sebastian Erhard. I include him because maybe there are Erhards out there who are descended from Sebastian Erhard and Regina Graebner, and my data base provides considerable information on Regina's ancestry; and his name is the hook by which they can tie in.
On breadth, Martin, I was using the term in a different sense--the breadth of a data base, not the number of related people in a generation (the reality, I guess the demographics). I was just speculating that:

--The probability that we have ancestors in common is quite high

--The direct-line ancestors--those who show up on an Ahnentafel--are the ones I'm (most) interested in, but other people don't care about my ancestors, they care about their own

--The breadth of the data base--how many people are included other than the direct-line ancestors--determines its usefulness to other people, because these additional surnames make it possible for people who share these ancestors to find them.

In fact the number of people who have made data available beyond the TT format or the RAND list is not very great--Gene Stark's index is based on 50-some data bases (some people contributing more than one); I don't know how many Mickey Lane has; Cliff Manis's GENSERV had 1,100 contributors the last time he reported on it, as I recall. And there's overlap among these.

What's the probability that in a data base other than your own, you would be able to find ancestors on the 'net? I don't know how to figure that out, although I imagine programs could be written to determine if there are any matching people (duplicates) in any of these lists. A measure of breadth--related people in a population at a given time--of the type Martin Roberts is talking about may be important in making such a calculation.

In any event there are people inquiring about surnames in the newsgroups who clearly have a great deal of data about their own families as well as some historic figures in whom they are interested, and they resent it when people say "send me everything you've got about X." They should be encouraged to get it on a web site and get it indexed. Then they would be less bothered by such inquiries and could simply respond to questions with "everything I know is already up there."
>Annelise, I'm very interested in counting breadth. I counted directly
>the children of my 4 maternal ggparents, which is not a large family.
>I may not know of all the babies, but these four people had 5 children,
>5 grandchildren, 13 ggchildren, 21 gggchildren, and at least (so far)
>8 ggggchildren. For my father's family, which is larger, I don't know all
>of the relatives, but from what I know my paternal ggparents had 7 children,
>13 gchildren, 19 ggchildren, 17 gggchildren, and 10 ggggchildren. But I'm
>missing several people on one branch of this family. Using this as an example,
>let's try to define a breadth factor as follows:
> Breadth(person, ancestor level, generation) = the number of people
> in a generation that are descendants of a common ancestor.
> where person is a name in a DB, ancestor level is the number of
> generations back, and generation is the generation number relative to the
> person's generation.
>For example, I have five children, so Breadth(Martin, 0, +1) = 5
>
>Also, Breadth(Martin, -1, 0) = 1
> Breadth(Martin, -2, 0) = 5
> Breadth(Martin, -3, 0) = 19+13 = 32
>For my wife, Breadth(Joan, -2, 0) > 30
> She has a great aunt that has over 100 descendants.
>It would be interesting to know the statistical properties of this number.
>We would need statistical data on number of children born in a family during
>different time periods. Also the probability that any individual had any
>children, which is related to the survival rate to puberty for children, and
>the marriage rates. If we had these rate factors we could write a monte
>carlo simulator to estimate the breadth numbers.
>Fun for the financially secure intellectually challenged.
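
The Breadth definition quoted above can be read as a simple count over a child-to-parents table. The sketch below is only one such reading, not Martin's own method: the toy family, the `PARENTS` table, and the function names are all hypothetical, and it counts distinct people, so someone reachable through two ancestral lines is counted once rather than summed per line.

```python
from typing import Dict, List, Set

# Hypothetical toy database: each person maps to the parents known for them.
PARENTS: Dict[str, List[str]] = {
    "Martin": ["Dad", "Mom"],
    "Sis": ["Dad", "Mom"],
    "Cousin": ["Uncle", "Aunt"],
    "Dad": ["Grandpa", "Grandma"],
    "Uncle": ["Grandpa", "Grandma"],
    "Kid1": ["Martin", "Joan"],
    "Kid2": ["Martin", "Joan"],
}


def children_of(people: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Everyone who has at least one parent in `people`."""
    return {child for child, ps in parents.items() if any(p in people for p in ps)}


def breadth(person: str, ancestor_level: int, generation: int,
            parents: Dict[str, List[str]] = PARENTS) -> int:
    """Count distinct people in the requested generation who descend from
    the person's known ancestors at `ancestor_level` (0, -1, -2, ...).

    `generation` is relative to the person's own generation (0 = same,
    +1 = children, and so on).
    """
    if ancestor_level > 0 or generation < ancestor_level:
        raise ValueError("need ancestor_level <= 0 and generation >= ancestor_level")

    # Climb from the person to the known ancestors at the requested level.
    ancestors: Set[str] = {person}
    for _ in range(-ancestor_level):
        ancestors = {p for a in ancestors for p in parents.get(a, [])}

    # Walk back down to the requested generation.
    current = ancestors
    for _ in range(generation - ancestor_level):
        current = children_of(current, parents)
    return len(current)
```

With this toy table, `breadth("Martin", 0, 1)` counts Martin's children and `breadth("Martin", -2, 0)` counts everyone in Martin's generation descended from his known grandparents. Missing parents simply shrink the count, which is the same incompleteness the quoted message mentions.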
>Martin From list-relay@UCSD.EDU Thu Nov 9 21:08:25 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id VAA14694 for ; Thu, 9 Nov 1995 21:08:24 -0800 Received: from roxy.sfo.com (roxy.sfo.com [205.162.14.50]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id VAA02612 for ; Thu, 9 Nov 1995 21:10:44 -0800 From: mavrogeorge@genealogysf.com Received: from 205.162.14.120 (sf-120.sfo.com [205.162.14.120]) by roxy.sfo.com (8.6.12/8.6.12) with SMTP id VAA04370 for ; Thu, 9 Nov 1995 21:10:31 -0800 Date: Thu, 9 Nov 1995 21:10:31 -0800 Message-Id: <199511100510.VAA04370@roxy.sfo.com> MIME-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 7bit Subject: Re: How to limit excess email q's on genweb page To: genweb@UCSD.EDU In-Reply-To: X-Mailer: SPRY Mail Version: 04.00.06.17 Here is a big -IF-..... The tiny-tafel format allows you to indicate level of interest in exchanging information. If there was a high-level index (in tiny-tafel format) of all the GENWEB databases, then it would be apparent whether or not the link was interested in exchanging information. ... said it was a big if... 
From list-relay@UCSD.EDU Thu Nov 9 21:45:50 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id VAA14805 for ; Thu, 9 Nov 1995 21:45:49 -0800 Received: from UConnVM.UConn.Edu (uconnvm.uconn.edu [137.99.26.3]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id VAA03414 for ; Thu, 9 Nov 1995 21:48:40 -0800 Message-Id: <199511100548.VAA03414@UCSD.EDU> Received: from ppp02p05.ucc.uconn.edu by UConnVM.UConn.Edu (IBM VM SMTP V2R2) with TCP; Fri, 10 Nov 95 00:48:29 EST Comments: Authenticated sender is From: "George Waller" To: genweb@UCSD.EDU Date: Fri, 10 Nov 1995 00:49:34 -0400 Subject: Value of genealogy Reply-to: gwaller@lib.uconn.edu Priority: normal X-mailer: Pegasus Mail for Windows (v2.01) > Martin Roberts writes: (mucho genealogical trivia snipped) > >Fun for the financially secure intellectually challenged. I gather Martin is self (and other) mocking for getting into this level of genealogy discussion. If so, my sense is that the feeling is universal. We have all said to ourselves "Gee, millions are suffering and here we are doing genealogy." It's a tough question... and we all know the answer. 
--George
* George Waller, Univ of Connecticut, hbladm1@uconnvm.uconn.edu

From list-relay@UCSD.EDU Fri Nov 10 00:59:55 1995
Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id AAA14984 for ; Fri, 10 Nov 1995 00:59:55 -0800
Received: from wrcd1.urz.uni-wuppertal.de (wrcd1.urz.Uni-Wuppertal.DE [132.195.20.13]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id BAA06750 for ; Fri, 10 Nov 1995 01:00:30 -0800
Received: from wspo04.site.uni-wuppertal.de by wrcd1.urz.uni-wuppertal.de (5.61/1.34) id AA09949; Fri, 10 Nov 95 10:00:41 +0100
Date: Fri, 10 Nov 95 10:00:41 +0100
Message-Id: <9511100900.AA09949@wrcd1.urz.uni-wuppertal.de>
X-Sender: wieneke2@wrcd1
X-Mailer: Windows Eudora Light Version 1.5.2
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
To: genweb@UCSD.EDU
From: "Dipl.-Chem. A. Wieneke"
Subject: Cliff Manis

Does anyone know something about Cliff Manis's "Gene Server"?

http://soback.kornet.nm.kr/~cmanis/
CMANIS@Soback.Kornet.nm.kr

Before you can access the data, you have to send them a GEDCOM file, and then after approx. 1 or 2 months you can search his DB.

He writes:
"The only way to receive an Access-Code for this server is to send us a GEDCOM Datafile. PLEASE NOTE: GEDCOM datafiles are NOT currently accepted if sent via email to GenServ. Please check the documentation for data submission procedures. "

Andreas
--------------------------------------------------------------------------
Dipl.-Chem. A. Wieneke
d.: BUGH-Wuppertal                  p.:
    Gausstrasse 20                      Corellistrasse 36
    D-42119 Wuppertal                   D-40593 Düsseldorf
Tel. +49 - 202 - 439 - 32 22        +49 - 211 - 7 39 49 68
FAX  +49 - 202 - 439 - 20 68        +49 - 211 - 7 39 49 68
e-mail: wieneke2@wrcs3.urz.uni-wuppertal.de
homepage: http://www.uni-wuppertal.de/fachbereiche/FB14/pohl/potitel.html.
From list-relay@UCSD.EDU Sun Nov 12 10:41:22 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id KAA25344 for ; Sun, 12 Nov 1995 10:41:22 -0800 Received: from hoover.stanford.edu (hoover.Stanford.EDU [36.33.0.99]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id KAA13390 for ; Sun, 12 Nov 1995 10:40:25 -0800 Received: from HOOVER.STANFORD.EDU by HOOVER.STANFORD.EDU (PMDF V4.3-10 #13307) id <01HXJWVYECFK003CXB@HOOVER.STANFORD.EDU>; Sun, 12 Nov 1995 10:40:06 -0800 (PST) Date: Sun, 12 Nov 1995 10:40:06 -0800 (PST) From: Annelise Anderson Subject: Re: No Subject To: cdixon@turtle.apana.org.au Cc: genweb@UCSD.EDU Message-id: <01HXJWVYEVQA003CXB@HOOVER.STANFORD.EDU> X-VMS-To: IN%"cdixon@turtle.apana.org.au" X-VMS-Cc: IN%"genweb@ucsd.edu",ANDRSN MIME-version: 1.0 Content-type: TEXT/PLAIN; CHARSET=US-ASCII Content-transfer-encoding: 7BIT Here's a copy of a message from Gary Hoffman, who runs this list, on how to unsubscribe: Subj:You are tuned to GenWeb .... The purpose of this list is to facilitate the development of a linked, worldwide distributed genealogy database. If this topic is not of interest to you ...Here is how to unsubscribe: Send an e-mail message to listserv@ucsd.edu In the body of the message put the words: UNSUB GENWEB Do not reply to this message. Do not send these commands to genweb@ucsd.edu. Do not send me a message about unsubscribing. Just do it as outlined above. If you still want to read about the GenWeb, please point your WWW browser to the URL http://demo.genweb.org/genweblist/genweblist.html All current and archived messages are there for your perusal without cluttering your mailbox. Thanks, Gary *************************************************************************** *Gary B. 
Hoffman, Computer/Language Lab Director e-mail: ghoffman@ucsd.edu* *Graduate School of International Relations and Pacific Studies (IR/PS)* *University of California, San Diego (UCSD) voice: (619) 534-7733* *9500 Gilman Dr., La Jolla, CA 92093-0519 USA fax: (619) 534-5727* *************************************************************************** From list-relay@UCSD.EDU Mon Nov 13 06:15:33 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id GAA28562 for ; Mon, 13 Nov 1995 06:15:32 -0800 Received: from Mizar.DoCS.UU.SE (Mizar.DoCS.UU.SE [130.238.11.21]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id GAA06792 for ; Mon, 13 Nov 1995 06:16:22 -0800 Received: by Mizar.DoCS.UU.SE (Sun-4/260, SunOS 4.0) with sendmail 5.61-bind 1.5+ida/ICU/DoCS/mizar id AA29247; Mon, 13 Nov 95 15:16:18 +0100 Date: Mon, 13 Nov 95 15:16:18 +0100 From: Anders Andersson Message-Id: <9511131416.AA29247@Mizar.DoCS.UU.SE> To: wizards@cts.com Subject: Re: How to limit excess email q's on genweb page Cc: genweb@UCSD.EDU [GenWeb project groups: User Interfaces, Information Quality] V Turner writes: > Perhaps a simple disclaimer at the site would discourage all but >the hardest of hard-core questioners. Something along the lines of: > > " Thank you for taking the time to browse my site. Compilation of >this data has taken many hundreds, even thousands of hours. I hope it is [etc...] Good thinking, but will it work? Consider the numerous unsubscription requests and requests for information about specific families being sent to the GenWeb mailing list in spite of Gary Hoffman's explicit instructions made available on the GenWeb WWW pages. I wouldn't label the people behind those misdirected messages "hard-core questioners". I'd guess that their working context is simply different from ours. I'm not saying that people generally ignore disclaimers when they see them, but rather that we need to think carefully about the procedures for providing feedback. 
A standard e-mail address to the maintainer of a particular set of records is easy to provide, but it may become dissociated from its context (including any disclaimers) and end up being used for more general queries than those intended (or totally irrelevant ones, such as freak commercial mass-mailings). Even when used from the proper context, the user may take that context for granted and not provide it to the maintainer, who may then have trouble finding out what documents the feedback refers to. Assuming that the user has access to a WWW browser including forms, one could set up a specific form for the kind of feedback which the data maintainer would like to encourage, including a limited number of alternatives and some predefined fields taken from the user's context (if someone presses the "feedback" button in a document describing the Williams family, the form should automatically supply the string "Williams" or some other unique identifier such as the URL with the message to the data maintainer). The point of this is that the user never sees any e-mail address to abuse. There are other problems with this solution, such as the issue of what to do for users in certain environments where e-mail works while HTML forms don't, but I think we can solve them too, if there is a demand for it and we get sufficient information about those environments. With more and more data maintainers competing for the users' attention, the load of unwanted mail to any single maintainer is likely to decrease, but at the same time we may expect more inexperienced users, pushing the figures in the other direction. This is not an urgent matter which needs to be solved now, but I think we should work on this in parallel with all the other parts which will make up the GenWeb. -- Anders Andersson, Dept. 
of Computer Systems, Uppsala University Paper Mail: Box 325, S-751 05 UPPSALA, Sweden Phone: +46 18 183170 EMail: andersa@DoCS.UU.SE From list-relay@UCSD.EDU Mon Nov 13 06:57:18 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id GAA28609 for ; Mon, 13 Nov 1995 06:57:16 -0800 Received: from Mizar.DoCS.UU.SE (Mizar.DoCS.UU.SE [130.238.11.21]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id GAA29113 for ; Mon, 13 Nov 1995 06:55:27 -0800 Received: by Mizar.DoCS.UU.SE (Sun-4/260, SunOS 4.0) with sendmail 5.61-bind 1.5+ida/ICU/DoCS/mizar id AA29725; Mon, 13 Nov 95 15:55:21 +0100 Date: Mon, 13 Nov 95 15:55:21 +0100 From: Anders Andersson Message-Id: <9511131455.AA29725@Mizar.DoCS.UU.SE> To: isaak@ljo.dec.com Subject: Re: Web Tools and locating names Cc: genweb@UCSD.EDU [GenWeb project groups: Indexing, Data Maintenance] Jim Isaak writes: > However, these tools will not identify names that are >"generated" (CGI-BIN style) from Forms or other tools that create >"pages on demand". This is an incorrect conclusion due to oversimplification. The automatic tools you refer to have no access to information about the server's configuration beyond what is available to interactive user clients. It's true that no WWW crawler will fill out forms with random data and submit them simply to have the server return additional HTML pages, but it can't determine whether a particular URL returns the contents of a static page or causes a CGI program to be run (that's up to the server). > One element we may want to include in the set of GenWeb >types of pages would be very simple (tinytafle?) pages that list >names and a few key dates or locations so these other tools will >be able to locate them on "static" pages. I'd say that making the data available *only* via some form is a bad idea, unless you want to restrict access to users who know what they are looking for, or you simply have no natural point of entry into your data. 
Remember, forms (and buttons) imply that some "action" is taken on behalf of the user, while following a normal hypertext link should be considered "reading" existing material. We shouldn't destroy this distinction by using forms when it isn't necessary. Whether the HTML page containing the existing material sits on a disk or is generated on the fly is not obvious to the user, nor to the WWW crawler. However, having those generic crawlers search through your entire GenWeb database, perhaps causing your server's disk cache to fill up, isn't a good idea either. Decent crawlers are expected to obey explicit restrictions aimed at them and formed in accordance with a standard specification (I don't have a reference to it right now), and I think GenWeb server maintainers should employ such restrictions. Therefore, your suggestion is of interest anyway. I think it's a matter of whether we want the generic WWW crawlers to do the indexing for us, or we want to provide indices which are tailored particularly towards the GenWeb application. We can do both, of course. -- Anders Andersson, Dept. 
of Computer Systems, Uppsala University Paper Mail: Box 325, S-751 05 UPPSALA, Sweden Phone: +46 18 183170 EMail: andersa@DoCS.UU.SE From list-relay@UCSD.EDU Tue Nov 14 09:02:44 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id JAA03003 for ; Tue, 14 Nov 1995 09:02:43 -0800 Received: from gate.microware.com (gate.microware.com [198.17.151.51]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id IAA12430 for ; Tue, 14 Nov 1995 08:45:56 -0800 Received: by gate.microware.com; id AA09510; Tue, 14 Nov 95 10:43:46 CST Received: from mcrware.microware.com(192.52.109.32) by gate.microware via smap (g3.0.1) id xma009493; Tue, 14 Nov 95 10:43:13 -0600 Received: from wales (wales.microware.com) by mcrware.microware.com with SMTP id AA00253 (5.67a8/IDA-1.5 for ); Tue, 14 Nov 1995 10:44:56 -0600 From: Scott McGee Received: by wales id ; Tue, 14 Nov 95 10:44:53 CST Date: Tue, 14 Nov 95 10:44:53 CST Message-Id: <9511141644.AA10346@wales> To: genweb@UCSD.EDU Subject: Re: Web Tools and locating names Anders Andersson writes: > >[GenWeb project groups: Indexing, Data Maintenance] > >Jim Isaak writes: >> However, these tools will not identify names that are >>"generated" (CGI-BIN style) from Forms or other tools that create >>"pages on demand". > >This is an incorrect conclusion due to oversimplification. >The automatic tools you refer to have no access to information >about the server's configuration beyond what is available to >interactive user clients. It's true that no WWW crawler will >fill out forms with random data and submit them simply to have >the server return additional HTML pages, but it can't determine >whether a particular URL returns the contents of a static page >or causes a CGI program to be run (that's up to the server). 
In other words, once a crawler finds my genweb page, it will find the link to my entry in my database, and from there manage to crawl over the entire database by going from one entry to another despite the fact that it is CGI served.

>Therefore, your suggestion is of interest anyway. I think it's
>a matter of whether we want the generic WWW crawlers to do the
>indexing for us, or we want to provide indices which are tailored
>particularly towards the GenWeb application. We can do both, of
>course.

I particularly like Gene Stark's GENDEX format for such indices. I provide a GENDEX index to each of the databases I maintain. Using Gene's index site, I have found many matches (yesterday a whole major branch of my own McGee Database that is also in the Doyle database!) and made links for them.

One thing I can see, however, is that simple name and data matching is not sufficient. Many entries in the major branch I found yesterday have slightly different dates. They can be confirmed to be the same by looking at parents, spouses, and children.

I would love to see some web crawler or other electronic denizen that would crawl among Gene's index site entries and look for possible matches. Assign some value to a close/exact match on the following:

    name (soundex and spelling)
    birth date and place
    death date and place
    parents' names
    spouse name(s)
    children's names

It could then report to the database owners any entries above a given threshold value.

This would have the added side benefit of encouraging others to provide gendex indexes to their own databases. I suppose an index that listed the above items would be better for such searching, but it could be done easily enough now with Gene's site.

Scott

Scott McGee            | I do NOT want to be wo'd unto!
-----------------------+---------------------------------------------------
I speak for myself.    | email: smcgee@microware.com
Your milage may vary!
| web: http://www.cc.utah.edu/~sam8644/homepage.html -----------------------+--------------------------------------------------- Visit the ZION list homepage at http://www.cc.utah.edu/~sam8644/zion.html or at the shadow site: http://www.grfn.org/~smcgee/zion.html PS I though I posted this to the list the other day, but I ended up just sending it to Anders. Since I seem to be good at this, if any of you get what looks like a note to the whole list from me, but is addressed to just you in reply to something you posted, please send it back and feel free to tell me to get my act together and quit messing up! Thanks From list-relay@UCSD.EDU Tue Nov 14 21:07:49 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id VAA04754 for ; Tue, 14 Nov 1995 21:07:48 -0800 Received: from sbstark.cs.sunysb.edu (sbstark.cs.sunysb.edu [130.245.1.47]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id VAA25273 for ; Tue, 14 Nov 1995 21:07:47 -0800 Received: (from root@localhost) by sbstark.cs.sunysb.edu (8.6.12/8.6.9) with UUCP id AAA18134; Wed, 15 Nov 1995 00:05:23 -0500 Received: (from gene@localhost) by starkhome.cs.sunysb.edu (8.6.11/8.6.9) id VAA00601; Tue, 14 Nov 1995 21:44:25 -0500 Date: Tue, 14 Nov 1995 21:44:25 -0500 From: Gene Stark Message-Id: <199511150244.VAA00601@starkhome.cs.sunysb.edu> To: Scott McGee Cc: genweb@UCSD.EDU In-reply-to: Scott McGee's message of Tue, 14 Nov 95 10:44:53 CST Subject: Re: Web Tools and locating names References: <48b2ka$8s@starkhome.cs.sunysb.edu> >This would have the added side benefit of encouraging others to provide >gendex indexes to their own databases. I supose an index that listed the >above items would be better for such searching, but it could be done >easilly enough now with Gene's site. 
Searching on name, birth date/place, death date/place could indeed be done easily enough, but searching on parents, spouses, and children could not, as the GENDEX files (intentionally) contain no lineage-linking information. This information is stored in the individual databases themselves, in a format that varies from database to database. - Gene Stark From list-relay@UCSD.EDU Wed Nov 15 08:27:10 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id IAA07064 for ; Wed, 15 Nov 1995 08:27:09 -0800 Received: from gate.microware.com (gate.microware.com [198.17.151.51]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id IAA02889 for ; Wed, 15 Nov 1995 08:24:08 -0800 Received: by gate.microware.com; id AA19234; Wed, 15 Nov 95 10:20:37 CST Received: from mcrware.microware.com(192.52.109.32) by gate.microware via smap (g3.0.1) id xma019229; Wed, 15 Nov 95 10:20:09 -0600 Received: from wales (wales.microware.com) by mcrware.microware.com with SMTP id AA03020 (5.67a8/IDA-1.5); Wed, 15 Nov 1995 10:21:49 -0600 From: Scott McGee Received: by wales id ; Wed, 15 Nov 95 10:21:47 CST Date: Wed, 15 Nov 95 10:21:47 CST Message-Id: <9511151621.AA11573@wales> To: gene@starkhome.cs.sunysb.edu, smcgee@microware.com Subject: Re: Web Tools and locating names Cc: genweb@UCSD.EDU Gene Stark writes: >>This would have the added side benefit of encouraging others to provide >>gendex indexes to their own databases. I supose an index that listed the >>above items would be better for such searching, but it could be done >>easilly enough now with Gene's site. > >Searching on name, birth date/place, death date/place could indeed >be done easily enough, but searching on parents, spouses, and children >could not, as the GENDEX files (intentionally) contain no lineage-linking >information. This information is stored in the individual databases >themselves, in a format that varies from database to database. 
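
Scoring on the fields that are said above to be searchable easily enough (name, birth date and place, death date and place) might be sketched as follows. This is a minimal illustration under stated assumptions, not the GENDEX file format and not anyone's actual matching tool: the `Person` record, the weights, and the two-year tolerance are all hypothetical, and the Soundex routine follows the common American Soundex rules.

```python
from dataclasses import dataclass
from typing import Optional

# Letter groups for American Soundex; vowels, H, W and Y carry no code.
_SOUNDEX_CODES = {
    ch: digit
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6"))
    for ch in letters
}


def soundex(name: str) -> str:
    """Four-character Soundex code, e.g. a letter followed by three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


@dataclass
class Person:
    """Hypothetical index entry; not the real GENDEX record layout."""
    surname: str
    given: str
    birth_year: Optional[int] = None
    birth_place: str = ""
    death_year: Optional[int] = None
    death_place: str = ""


def _year_score(a: Optional[int], b: Optional[int]) -> float:
    if a is None or b is None:
        return 0.0
    if a == b:
        return 2.0                      # exact match
    return 1.0 if abs(a - b) <= 2 else 0.0  # close match (assumed tolerance)


def _place_score(a: str, b: str) -> float:
    return 1.0 if a and b and a.lower() == b.lower() else 0.0


def match_score(a: Person, b: Person) -> float:
    """Sum of hypothetical weights for close or exact field matches."""
    score = 0.0
    if a.surname.lower() == b.surname.lower():
        score += 2.0                    # same spelling
    elif soundex(a.surname) == soundex(b.surname):
        score += 1.0                    # same Soundex code only
    if a.given.lower() == b.given.lower():
        score += 1.0
    score += _year_score(a.birth_year, b.birth_year)
    score += _year_score(a.death_year, b.death_year)
    score += _place_score(a.birth_place, b.birth_place)
    score += _place_score(a.death_place, b.death_place)
    return score
```

Pairs scoring above some chosen threshold would be the ones worth reporting to the database owners. Confirming them through parents, spouses, and children would still require visiting the individual databases, since that information is not in the index.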
Right, but a well trained robot could use your site as a starting place, and browse through the various databases. Something like:

    Pick a GENDEX file
    Pick a person in that file
    Browse to that person and from there browse to and gather info on parents
    Compare gathered info with database of other gathered info
    Report any match found
    Store gathered info in database
    Continue

The tricky part would be to teach the poor robot to understand all the various formats well enough to be able to locate the parents and gather needed info (might be helpful to use the link to the parents, not to browse to them, but to identify them in the GENDEX file and take parent data from there.)

Scott

When in danger,    | If it has my name on it, it must be MY opinion!
or in doubt,       |______________________________________________________
run in circles,    | Email: smcgee@microware.com (Scott McGee)
scream and shout!  | Web: http://www.cc.utah.edu/~sam8644/homepage.html

From list-relay@UCSD.EDU Wed Nov 15 12:18:38 1995
Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id MAA07959 for ; Wed, 15 Nov 1995 12:18:38 -0800
Received: from hoover.stanford.edu (hoover.Stanford.EDU [36.33.0.99]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id MAA21966 for ; Wed, 15 Nov 1995 12:18:10 -0800
Received: from HOOVER.STANFORD.EDU by HOOVER.STANFORD.EDU (PMDF V4.3-10 #13307) id <01HXO70UWYEO00417I@HOOVER.STANFORD.EDU>; Wed, 15 Nov 1995 12:17:51 -0800 (PST)
Date: Wed, 15 Nov 1995 12:17:51 -0800 (PST)
From: Annelise Anderson
Subject: Re: Web Tools and locating names
To: smcgee@microware.com
Cc: genweb@UCSD.EDU
Message-id: <01HXO70UY0ZM00417I@HOOVER.STANFORD.EDU>
X-VMS-To: IN%"smcgee@microware.com"
X-VMS-Cc: IN%"genweb@ucsd.edu",ANDRSN
MIME-version: 1.0
Content-type: TEXT/PLAIN; CHARSET=US-ASCII
Content-transfer-encoding: 7BIT

I would think the place to start a search would be not with the gendex.txt files, but with the surname files.
They seem to include all the information from the gendex files. A robot could get them and sort them in various ways, e.g. by date of birth etc. An initial screen would be whether the names come from more than one database; if they don't, there are no duplicate people. The program would decide if there are possible matches-- and then go back to the surname index and from there get the data from the specific web sites. Annelise From list-relay@UCSD.EDU Wed Nov 15 20:14:04 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id UAA09078 for ; Wed, 15 Nov 1995 20:14:03 -0800 Received: from gate.microware.com (gate.microware.com [198.17.151.51]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id UAA00151 for ; Wed, 15 Nov 1995 20:15:37 -0800 Received: by gate.microware.com; id AA26309; Wed, 15 Nov 95 22:13:31 CST Received: from mcrware.microware.com(192.52.109.32) by gate.microware via smap (g3.0.1) id xma026307; Wed, 15 Nov 95 22:13:15 -0600 Received: from wales (wales.microware.com) by mcrware.microware.com with SMTP id AA22520 (5.67a8/IDA-1.5); Wed, 15 Nov 1995 22:14:56 -0600 From: Scott McGee Received: by wales id ; Wed, 15 Nov 95 22:14:54 CST Date: Wed, 15 Nov 95 22:14:54 CST Message-Id: <9511160414.AA12492@wales> To: ANDRSN@hoover.stanford.edu, smcgee@microware.com Subject: Re: Web Tools and locating names Cc: genweb@UCSD.EDU Well, after talking about it just this morning, I did it agian (I think) and posted a reply to Annelise alone. Let me try again. Annelise, I agree with you that your thoughts would be an improved search method, but must point out that surname indecise are not universal, even among sites indexed at Gene's site. The only two things that are common to all such sites is the fact that a web page yeilds info on people, and a GENDEX.txt file of those people. 
(Actually, this impacts my method too, but then I had already mentioned that problem) Outside of the sites Gene indexes, little can be assumed to be common. We can therefor define a standard by which sites must comply to be indexed by the robot, or index what we can with what we have already. I tend to think that like Gene's index site itself, a robot indexer would be an experiment, and thus using sites registered at Gene's site (thus sharing enough for my method) would be a valid place to start the experiment. If someone does create such a robot, and decides to use some other required information to index the site, I will happily modify my software to provide that information. I just wanted to point out what we could do with existing information. Scott If at first, you don't succeed, | smcgee@microware.com (Scott McGee) go fry a hen. After all, fried | ----------------------------------------- chicken beats failure any time. | I was paid $5.00 to express these views! -------------> http://www.cc.utah.edu/~sam8644/homepage.html <------------- From list-relay@UCSD.EDU Wed Nov 15 23:35:37 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id XAA09378 for ; Wed, 15 Nov 1995 23:35:37 -0800 Received: from hoover.stanford.edu (hoover.Stanford.EDU [36.33.0.99]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id XAA14930 for ; Wed, 15 Nov 1995 23:33:50 -0800 Received: from HOOVER.STANFORD.EDU by HOOVER.STANFORD.EDU (PMDF V4.3-10 #13307) id <01HXOUIRJRUO003ZJE@HOOVER.STANFORD.EDU>; Wed, 15 Nov 1995 23:33:35 -0800 (PST) Date: Wed, 15 Nov 1995 23:33:35 -0800 (PST) From: Annelise Anderson Subject: Re: Web Tools and locating names To: smcgee@microware.com Cc: genweb@UCSD.EDU Message-id: <01HXOUIRKB4Y003ZJE@HOOVER.STANFORD.EDU> X-VMS-To: IN%"smcgee@microware.com" X-VMS-Cc: IN%"genweb@ucsd.edu",ANDRSN MIME-version: 1.0 Content-type: TEXT/PLAIN; CHARSET=US-ASCII Content-transfer-encoding: 7BIT Scott says: >I agree with you that your 
thoughts would be an improved search method, but
>must point out that surname indices are not universal, even among sites
>indexed at Gene's site. The only two things that are common to all such sites
>is the fact that a web page yields info on people, and a GENDEX.txt file of
>those people. (Actually, this impacts my method too, but then I had already
>mentioned that problem)
>Outside of the sites Gene indexes, little can be assumed to be common. We can
>therefore define a standard by which sites must comply to be indexed by the
>robot, or index what we can with what we have already. I tend to think that
>like Gene's index site itself, a robot indexer would be an experiment, and
>thus using sites registered at Gene's site (thus sharing enough for my method)
>would be a valid place to start the experiment.
>If someone does create such a robot, and decides to use some other required
>information to index the site, I will happily modify my software to provide
>that information. I just wanted to point out what we could do with existing
>information.

Scott, I was not thinking of anything so complicated. I was thinking about a way to find identical people in the data bases that are being indexed by Gene, not any other data bases. It just seemed reasonable to start with a list that is already sorted by surname: take the surname list and write a program to find possible identities based on name or date of birth or whatever.
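
The surname-list screen described here might look roughly like the sketch below. It is only an illustration in Python: the row layout (database id, surname, given name, birth year) and the sample rows are assumptions made up for the example, not the actual layout of the surname or GENDEX files. It first drops names that appear in only one data base, then pairs up entries from different data bases that agree on birth year.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical index rows: (database id, surname, given name, birth year).
# Both the field layout and the people are invented for illustration.
ROWS = [
    ("db_a", "Smith", "John", 1805),
    ("db_b", "SMITH", "John ", 1805),
    ("db_a", "Jones", "Mary", 1810),
    ("db_a", "Brown", "Ann", 1700),
    ("db_b", "Brown", "Ann", 1650),
]


def candidate_matches(rows):
    """Return (name key, db1, db2, birth year) for possible duplicate people."""
    by_name = defaultdict(list)
    for db, surname, given, born in rows:
        # Normalise case and stray spaces so "SMITH" and "Smith" group together.
        key = (surname.strip().lower(), given.strip().lower())
        by_name[key].append((db, born))

    pairs = []
    for key, people in sorted(by_name.items()):
        # Initial screen: a name found in only one database cannot be a
        # cross-database duplicate.
        if len({db for db, _ in people}) < 2:
            continue
        for (db1, born1), (db2, born2) in combinations(people, 2):
            if db1 != db2 and born1 is not None and born1 == born2:
                pairs.append((key, db1, db2, born1))
    return pairs


if __name__ == "__main__":
    for match in candidate_matches(ROWS):
        print(match)
```

Anything this reports is only a candidate; the program would still have to go back to the individual web sites to confirm it.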
Annelise From list-relay@UCSD.EDU Thu Nov 16 06:39:21 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id GAA11537 for ; Thu, 16 Nov 1995 06:39:20 -0800 Received: from gate.microware.com (gate.microware.com [198.17.151.51]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id GAA12154 for ; Thu, 16 Nov 1995 06:36:57 -0800 Received: by gate.microware.com; id AA28635; Thu, 16 Nov 95 08:34:51 CST Received: from mcrware.microware.com(192.52.109.32) by gate.microware via smap (g3.0.1) id xma028633; Thu, 16 Nov 95 08:34:23 -0600 Received: from wales (wales.microware.com) by mcrware.microware.com with SMTP id AA03883 (5.67a8/IDA-1.5); Thu, 16 Nov 1995 08:36:04 -0600 From: Scott McGee Received: by wales id ; Thu, 16 Nov 95 08:36:02 CST Date: Thu, 16 Nov 95 08:36:02 CST Message-Id: <9511161436.AA13171@wales> To: ANDRSN@hoover.stanford.edu, smcgee@microware.com Subject: Re: Web Tools and locating names Cc: genweb@UCSD.EDU Annelise Anderson writes: > >Scott says: > >>I agree with you that your thoughts would be an improved search method, but >>must point out that surname indecise are not universal, even among sites >>indexed at Gene's site. The only two things that are common to all such sites >>is the fact that a web page yeilds info on people, and a GENDEX.txt file of >>those people. (Actually, this impacts my method too, but then I had already >>mentioned that problem) >Scott, I was not thinking of anything so complicated. I was thinking about >a way to find identical people in the data bases that are being indexed >by Gene, not any other data bases. It just seemed reasonable to start with >a list sorted by surnames already; take the surname list and write a program >to find possible identities based on name or date of birth or whatever. Oh, sorry for missing that, but I beleive my point is still valid. 
For instance, if someone were to hand build their genweb site, or write their own software to do it, it may well provide the GENDEX.txt but not a surname index. My own software (used by several sites) produces a GENDEX.txt in Gene's format, but I don't know if my index would have the same information you are referring to or not. I certainly don't have a SURNAME file like Gene's does (though providing one wouldn't be difficult).

On the other hand, the GENDEX.txt file seems to have the info you refer to. I have noticed, however, that going strictly by the info on a single person, MANY matches are missed. I found quite a few matches where there was a minor difference in dates (off by a few days, a month or even a year) but examining parents and children shows that the persons are the same.

Scott

If at first, you don't succeed, | smcgee@microware.com (Scott McGee)
go fry a hen. After all, fried  | -----------------------------------------
chicken beats failure any time. | I was paid $5.00 to express these views!
-------------> http://www.cc.utah.edu/~sam8644/homepage.html <-------------

From list-relay@UCSD.EDU Thu Nov 16 10:13:13 1995
Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id KAA11885 for ; Thu, 16 Nov 1995 10:13:12 -0800
Received: from gate.ti.com (news.ti.com [192.94.94.33]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id KAA18501 for ; Thu, 16 Nov 1995 10:14:34 -0800
Received: from dad_sun.dadd.ti.com ([156.117.138.45]) by gate.ti.com (8.6.12/) with ESMTP id MAA16473 for ; Thu, 16 Nov 1995 12:14:32 -0600
Received: from mbr.dadd.ti.com (mbr.dadd.ti.com [156.117.138.61]) by dad_sun.dadd.ti.com (8.6.10/8.6.10) with SMTP id MAA08097; Thu, 16 Nov 1995 12:10:28 -0600
To: genweb@UCSD.EDU
From: mbr@dadd.ti.com (Martin Roberts)
Subject: Re: Web Tools and locating names: Demand model
Date: Thu, 16 Nov 1995 11:10:53
cc: mbr@dadd.ti.com
Message-ID:

Hi, GENWEB people.
I'm an amateur genealogist, but a professional software development manager. I follow your postings with interest, but I notice a problem. Maybe I missed it, but I haven't seen either a statement of the customer requirements (you seem to be iterating to it) or a statement of the plan objectives, strategies, and tactics (you state and restate the mission very well). It seems to me that you would benefit from a little more organization, especially considering that you are widely scattered. If you have already done these things, please forgive me. I am a recent subscriber to GENWEB.

On this question of web crawlers, I am troubled by some assumptions. Annelise and Scott are exchanging ideas on automation of the process. Suppose I was going to use your future automated system to search for "loose ends" in my DB. I would run a script to generate the names and dates for all my "loose ends". How many is that? Well, with the assumption that all ancestor trees are binary, that is approximately 1/2 the names in the tree! So now I have my list of loose ends, call it L. For a typical data base, L will be around 2000. When I do this manually, as in submitting names to surname lists, I send in about 10. But with automation I will send them all!

Now I send my 2000 names to the web crawler to search the online data bases. How many data bases are there on line? Call it D, and let !D! be the length of each. !D! is the same order of magnitude as L: !D!=2L by definition. So for this discussion !D!=4000. Estimating D is more difficult. D is the number of genealogists who have their data in computers and also have it on line. Call the fractions for these two factors DG and DGO. Call the number of genealogists G. Thus D=G*DG*DGO. We can estimate DG from the share of the general population who have PC's and use them. In the US this is about 5% now. So DG=.05. For genealogists that have computers, the percentage who build data bases is probably fairly high. So DGO is close to 1. Hence D = .05*G.
G may be 300,000,000 worldwide, but I'll leave it as a variable.

So now the number of search requests is G*DG*L and the number of records to search is !D!*D or !D!*G*DG*DGO. The number of comparisons to be done by a web crawler is then the product: G*DG*L*!D!*G*DG*DGO or .5*(.05G)**2*!D!**2. With !D! = 4000, the total is .5(4000*.05G)**2 or 20000G**2 !!!!!

In this analysis G*DG (the number of genealogists with online computers) and !D! (the size of their data bases) are numbers that grow with time. The search time is proportional to the square of each of the growing numbers, so its growth is quadratic.

I haven't discussed the web traffic. It can be analysed the same way but will be somewhat smaller, by a factor of !D!.

Finally, I haven't considered how often one person would check. Probably only once or twice a year. But some will undoubtedly do more. An average might be 4 times per year. So the demand rate for searches will be approximately (10**6)(G**2) per year. With G = 10**8, you have 10**14 accesses per year or 10**9/hour.

I hope it is clear by now that I don't think web crawlers are a good idea unless people have to pay to use them.

Comments welcome.
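
The demand model in this message can be checked with a few lines of arithmetic. The sketch below is only a restatement of the formulas as given (requests = G*DG*L, records = !D!*G*DG*DGO, comparisons = their product); the Python names are illustrative, and DGO is taken as exactly 1, which the message only says is "close to 1".

```python
# Inputs taken from the message; the variable names are illustrative.
L = 2000                 # loose ends submitted per genealogist
DB_SIZE = 2 * L          # !D!, records per data base (4000)
DG = 0.05                # fraction of genealogists with a PC
DGO = 1.0                # fraction of those with data on line (assumed exactly 1)
CHECKS_PER_YEAR = 4      # average searches per person per year
HOURS_PER_YEAR = 365 * 24


def comparisons_per_pass(g):
    """Comparisons if every online genealogist submits all loose ends once."""
    requests = g * DG * L               # names submitted by everyone
    records = DB_SIZE * g * DG * DGO    # records held in all online data bases
    return requests * records


def coefficient():
    """Multiplier of G**2 in one pass (the message gives 20000)."""
    return comparisons_per_pass(1.0)


if __name__ == "__main__":
    g = 1e8
    per_year = CHECKS_PER_YEAR * comparisons_per_pass(g)
    print(f"coefficient of G**2 per pass: {coefficient():.0f}")
    print(f"comparisons per year at G = 1e8: {per_year:.1e}")
    print(f"comparisons per hour at G = 1e8: {per_year / HOURS_PER_YEAR:.1e}")
```

With these inputs the per-pass coefficient comes out at 20000, matching the message. At 4 checks a year and G = 10**8 the product is about 8 x 10**20 comparisons per year, which is far more than the 10**14 figure if "accesses" there means comparisons, so the closing numbers are probably best read as rough orders of magnitude. The quadratic growth in G holds either way.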
From list-relay@UCSD.EDU Fri Nov 17 14:30:27 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id OAA17664 for ; Fri, 17 Nov 1995 14:30:27 -0800 Received: from dragon.ti.com (dragon.ti.com [192.94.94.61]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id OAA26897 for ; Fri, 17 Nov 1995 14:21:34 -0800 Received: from dad_sun.dadd.ti.com ([156.117.138.45]) by dragon.ti.com (8.6.12/) with ESMTP id QAA23450 for ; Fri, 17 Nov 1995 16:21:03 -0600 Received: from mbr.dadd.ti.com (mbr.dadd.ti.com [156.117.138.61]) by dad_sun.dadd.ti.com (8.6.10/8.6.10) with SMTP id QAA22997 for ; Fri, 17 Nov 1995 16:17:19 -0600 To: genweb@UCSD.EDU From: mbr@dadd.ti.com (Martin Roberts) Subject: Re: Web Tools and locating names: Demand model Date: Fri, 17 Nov 1995 15:17:47 Message-ID: In article Annelise Anderson writes: Annelise, I am responding to the list because Anders and Scott also responded to me personally and I'd like my reply to go to the list. Please excuse me for posting your response to me. First I'd like to point out one very important assumption I made in my first posting and give an example. Also my view of where the demand will come from. I assumed that automation in this area will progress to the point where a DB product like FTM will have a one button selection to generate all loose ends in your DB. It will pop up a menu allowing you to select names, or time periods, or ancestor paths, or somesuch. So I concede the point that my "L" number is too large and will be customer driven rather than proportional to "D". But I think every user will eventually ask for all their loose ends, so this only smears the time scale, not the total demand. Second, I work in the seimiconductor industry and we do technical forecasts. We have noticed in the last two years two very interesting phenomena: 1. The number of PC's in use grows faster than ANY forecast. 2. 
The number of PC's etc connected to the net also grow faster than any forecast (this comes from Nick Negroponte at MIT - not me) Also, a new figure: most households in the US that have PC's have more than one. That's certainly true in my family. I have two, my children have 2,0,0,1,2. If you read tech forecasts in the communication area (or just read newspapers!) you will see that very large companies are competing to connect you to the net through your cable connection, your phone line, and who knows what else. The phone company is putting optical lines in my town as fast as they can. I just received a brochure from France telling how their net connections are expanding. They are ahead of us with 38% of households connected, planning to go over 50%. They don't even use PC's! Look at the countries represented by genealogy inquiries on the newsgroups. Recently I have seen India, Russia, Turkey. Negroponte told us that in Senegal in the public schools the average 15 year old knows as much about computers as the average 15 year old in Los Angeles. On another topic, the US, and particularly the Mormon church, are not the world leaders in family history studies. That honor goes to the Chinese who do not have many PC's YET. Other countries with a strong tradition of family history study are Japan, Indonesia, Italy, Spain. Annelise, I'm sorry that you have only 300 names in your DB. I assume you are much younger than I, and that you are a very recent immigrant, where my family has been here for 200 years or more. So I certainly have benefited more from published resources than you have. Why did I pick the number 4000? It is 2**12. I have grandchildren. My grandchildren have in my data base several lines going back 16 generations! I thought 4000 is a good average to allow some short lines and some sibling descendants. Personally I have benefited more from the knowledge and work of cousins than from individual research. 
By having other cousins in my DB, if I search for them in yours, and you know them personally, then I will get a new contact, not just a name match. I have many living cousins whose names I know but have I have no idea where they are. If they married one of your cousins I can locate them through you. Therefore I will serach on cousins, and not just the direct line as Annelise suggests. Finally, I think you have ignored the rapidity of how genealogy record can be put on line right now. The phone company CD's is an example. I would imagine that publishers like Clearfield will soon be offering disks instead of books. I have only two genealogy books that I own. I got about 15 names out of one. It has 5k-8k names. I got about 60 out of the other. It has 13,000 names. What I see is that before long the whole book will be online, and the number of such books is very large. I have a privately printed book, I can scan it today, turn it to text, and edit it into something resembling a GEDCOM file. I bet that before I get it done, someone would offer a service to do it. So I expect the amount of data on line to grow exponentially. Before long Lets look at the problem differently. How many US residents who were born before 1850 have their names in a genealogy book or DB? Say it is 5 million. How many is it possible to identify given all the records tha exist but have not been codified? Say it is only 10 million. Now consider how many descendants those 10 million have and do my calculation again. I think the number of these would reduce the total population size significantly, say by a factor of 4. But I would expect the percentage of these descendants who are interested in genealogy to be much higher than the amount I assumed. So the G number shouldn't change much. What happens when all the people who have an interest in genealogy and are descendants of the 10 million construct their DB's? How big will their DB's be? 
That gets you right back to the 4000 number by my previous calculations. Another complicating point: If I search a Chinese DB with my web crawler, it should quickly check all the names I am submitting and reply that none will be in the DB because it is in Chinese. But I am one of the descendants of the 10 million named above. What is the expected number of hits for me searching other DB's for other descendants of those 10 million? It is pretty high. And when I hit I can have a lot of name matches - especially when I share a line with someone else. every leaf in my ancestor tree will correspond to a leaf in their tree, and if it is the same there will be a match. Most of these matches will be unproductive because of using the same sources. But my assumptions about the size of the DB's still hold so the number of requests and comparisons is still valid. It is much worse to have a population of genealogists who are closely related than to have separate populations. If I have 10,000 names, and I connect someone to one of my lines, they could duplicate the whole 10,000. But Annelise, because she won't connect, won't be able to grow her DB as rapidly. But I expect the number of people who do what I am doing (look for previous) genealogies rather than basic research) to grow fairly rapidly with the on-line DB's because it will be so easy to do. Success breeds confidence and the willingness to keep on trying. Therefore the number of large DB's is likely to grow faster than the number of total DB's. I'm going to buy a scanner. I recommend it for all genealogists. >Hi--replying not for the list.... >>Hi, GENWEB people. I'm an amateur genealogist, but a professional software >>development manager. I follow your postings with interest, but I notice >>a problem. 
Maybe I missed it, but I haven't seen either a statement >>of the customer requirements (you seem to be iterating to it) or a >>statement of the plan objectives, strategies, and tactics (you state >>and restate the mission very well). It seems to me that you would benefit from >>a little more organization, especially considering that you are widely >We restate the mission well? I'd be curious what you have found to be >the mission. My sense is that this isn't really an *organization* that >has a statement of plan objectives or strategies or tactics, but rather >what the Economist magazine would call a "club"--people who get together >and talk but do whatever they like or are interested in doing. There >are actually a bunch of organization plans in the archives somewhere, but >they do not result in assignments, positions filled or vacant, etc.! Annelise, by the mission I mean to search an on line DB for names of interest and return information about the search to the requestor. This gets restated daily by one of you in some form or other. >Not sure who the customers are....if there are any! >>On this question of web crawlers, I am troubled by some assumptions. >>Annelise and Scott are exchanging ideas on automation of the process. >>Suppose I was going to use your future automated system to search for "loose >>ends" in my DB. I would run a script to generate the names and dates for all my >>"loose ends". How many is that? Well with the assumption that all ancestor >>trees are binary, that is approximately 1/2 the names in the tree! So >>now I have my list of loose ends, call it L. For a typical data base, >>L will be around 2000. When I do this manually as in submit names to >>surname lists I send in about 10. But with automation I will send them all! >>Now I send my 2000 names to the web crawler to search the online data bases. >>How many data bases are there on line? Call it D, and let !D! be the length >>of each. !D! 
is the same order of magnitude as L: !D!=2L by definition. So for >>this discussion D=4000. To get an estimate of D is more difficult. D is >>the percentage of genealogists who have their data in computers and also >>have it on line. Call these two factors DG and DGO. Call the number of >>genealogists G. This D=G*DG*DGO. We can estimate DG from the number of the >>general population who have PC's and use them. In the US this is about >>5% now. So DG=.05. For genealogists that have computers, the percentage who >>build data bases is probably fairly high. So DGO is close to 1. Hence >>D = .05*G. G may be 300,000,000 worldwide, but I'll leave it as a variable >>So now the number of search requests is G*DG*L and the number of records >>to search is !D!*D or !D!*G*DG*DGO. The number of comparisons to be done by >>a web crawler is then the product: G*DG*L*!D!*G*DG*DGO or >>.5*(.05G)**2*!D!**2. With !D! = 4000, the total is .5(4000*.05G)**2 or >>20000G**2 !!!!! >>In this analysis G*DG (the number of genealogists with online computers) >>and !D! (the size of their data bases) are growing numbers with time. >>The search time growth is proportional to the square of each of the growing >>numbers. So its growth is quadratic. >>I haven't discussed the web traffic. It can be analysed the same way >>but will be somewhat smaller by a factor of !D!. >>Finally I haven't considered how often one person would check. Probably only >>oce or twice a year. But some will undoubetedly do more. An average might be >>4 times per year. So the demand rate for searches will >>be approximately (10**6)(G**2) per year. With G = 10**8, you have 10**14 >>accesses per year or 10**9/hour. >>I hope it is clear by now that I don't think web crawlers are a good idea >>unless people have to pay to use them. >>Comments welcome. >Sounds awful! But, a few points: >Few data bases are 4000 people--although a few people involved in this have >such large ones. Or more than one. 
Second, most of us aren't interested >in following all these lines, but rather only the direct ones. So, e.g., >I've got 80-some direct-line ancestors "discovered"--or 40 loose ends to >follow, I guess. >The number of genealogists with computers *and* internet access is perhaps >not (now) very large. Gene Stark's index to on-line databases has 50-60 >contributors (although growing constantly). Matt Helm has indexed info >from upwards of 300 on-line data bases as far as I can tell. Mickey Lane, >who created Rootsbook, had quite a few people/data bases but isn't doing >it any more (I don't have the precise numbers). Cliff Manis, who runs >GENSERV, has something like 1100+ contributors of databases and over >1 million individuals in these databases. GENSERV is different in that >it is accessible by e-mail. There's a lot of overlap in these data >bases, i.e., my data base, a poor effort of some 290 people in its most >recent incarnation, is in all four places. I read somewhere about accessing >www sites by electronic mail and/or fax, but don't know where I saw it. >But the real point is that there's not that much data on the www yet--lots >of genealogists don't know much about computers except for their own >genealogy program and word processing (as you will find out many people >have trouble unsubscribing from this list), some of them have e-mail and >access to newgroups, fewer are able to put their data on a web server or >download and run the programs necessary to create html or present the >data in any other format. >I myself have been interested in the probability of finding one or more >(or no) ancestors on the net given.....I have found this intractable, >although I did work out the probability that for any two people of >European ancestry (and this is still a European/American endeavor, mostly) >there would be no ancestors in common at the 11th generation (where each >person has 1024 ancestors). (That assumes we know who they all are.) 
Thats something that interests me also. I have noticed that I see about two surnames per week that I recognize from my DB, and I find about one or two per month that have some sort of connection. I've only added about a dozen names to my DB through the net, but I think I have helped substantially more people. What I have learned is to get a better sense of how many people there are in the world! Looking through the surname list day after day, 200 names a day, and not seeing one common name for days. It gives you a sense of space. >If I were in the genealogy software business I'd try to license Gene >Stark's program and offer a site on the web for posting genealogical >data. But even then--the size of the market might not warrant the >expense. >Does that change your picture of things? >Annelise > From list-relay@UCSD.EDU Fri Nov 17 17:12:45 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id RAA18157 for ; Fri, 17 Nov 1995 17:12:43 -0800 Received: from pimaia2y.prodigy.com (pimaia2y.prodigy.com [192.207.105.55]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id RAA04319 for ; Fri, 17 Nov 1995 17:13:04 -0800 Received: from mail.prodigy.com ([199.4.137.13]) by pimaia2y.prodigy.com (8.6.10/8.6.9) with SMTP id TAA31948; Fri, 17 Nov 1995 19:54:10 -0500 Date: Fri, 17 Nov 1995 19:53:36 EST From: XCEE48A@prodigy.com (DR DOC BEGNAL-YOUNG) X-Mailer: PRODIGY Services Company Internet mailer [PIM 3.2-319.50] Message-Id: <013.05035964.XCEE48A@prodigy.com> To: jayhall@xmission.com, garcher@bix.com, genserv-info@progcons.com, lincust@bei.net, stark@cs.sunysb.edu, ctwitty@gate.byu.edu, genweb@UCSD.EDU, bnsnews@netcom.com, davids@lightlink.satcom.net Subject: Happy Thanksgiving! 
-- [ From: D.R."Doc" Begnal-Young * EMC.Ver #2.10P ] -- CC-To: inversion Expanded recipient data: To: Lucille Shea \ PRODIGY: (KARY75A) cc: Bonnie Scott \ PRODIGY: (UCAG62A) cc: Catherine Metzler \ PRODIGY: (BAPM37A) cc: Mr Gary D Hood \ PRODIGY: (KFRS95A) cc: Linda Doll \ PRODIGY: (EZFU85A) cc: Lorna Fanjoy \ PRODIGY: (FHZK02A) cc: Myra Gormley \ PRODIGY: (EXPT45C) cc: Nancy Curran \ PRODIGY: (MBFH73A) cc: Frank Byrne \ PRODIGY: (ZZHB53A) cc: John Belasco \ PRODIGY: (BHWB30A) cc: Ruth Belasco \ PRODIGY: (DUTM85A) cc: Brenda Sue \ PRODIGY: (ECKX18B) Happy Thanksgiving to you & your family. I applaud your genealogy research efforts for whatever your reasons. I give thanks for modern day technology, the computer, online services and access to the Internet. For these have saved time & money in my research. I want to thank those that have been instrumental in my genealogy research by sending copies of birth, death, marriage certificates, obituaries, census records, help & advise, etc. You truly have helped a lot. If you have Web capabilities, please visit my updated web pages; Genealogy Lineage Newsletter Main Page, http: //pages.prodigy.com/CA/xcee48a/ Links to Genealogy On theWeb, http: //pages.prodigy.com/CA/xcee48a/genalogy.html With this modern day technology I have bridged the other online services & Internet with my newsletter & Web homepages thus creating greater exposure for those researching. In indexing my first Annual Publication, Quarterly Issues & downloading postings from Prodigy's Genealogy BB, I was able to discover 4 people researching James Wilson, signer of the Declaration of Independence and 2 people researching Robert L. Wilson. 1 person on the Internet, 3 on Prodigy and 2 on AOL. Their paths may never have crossed and my guess would be, they are related. I was in the position to forward this information on. This makes me very happy, especially this time of year. I know the meaning of finding your past. 
My research has united several half brothers & sisters to my family. It has allowed my mother to see pictures and know who her father was that she never met. The same for my father, before his passing away this year. I never knew my father until I was 20. Once reunited, I had a half brother & 3 half sisters. I was reunited with my son after 11 years, after placing him up for adoption. My cousin was able to meet her brother, who each never knew existed. With the work I have accomplished is one of the greatest gifts I can give my family and children. You or your services has made this possible. Gratefully, D.R."Doc" Begnal-Young Genealogy Lineage Newsletters- For Begnal-Belasco-Pereira-Wilson 852 Webb Ave. Holtville, CA 92250 619-356-2651 Prodigy: XCEE48A Internet: xcee48a@prodigy.com Web: http://pages.prodigy.com/CA/xcee48a/ From list-relay@UCSD.EDU Fri Nov 17 18:23:30 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id SAA18255 for ; Fri, 17 Nov 1995 18:23:30 -0800 Received: from gate.microware.com (gate.microware.com [198.17.151.51]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id SAA05982 for ; Fri, 17 Nov 1995 18:18:00 -0800 Received: by gate.microware.com; id AA17286; Fri, 17 Nov 95 20:15:53 CST Received: from mcrware.microware.com(192.52.109.32) by gate.microware via smap (g3.0.1) id xma017284; Fri, 17 Nov 95 20:15:42 -0600 Received: from wales (wales.microware.com) by mcrware.microware.com with SMTP id AA23694 (5.67a8/IDA-1.5 for ); Fri, 17 Nov 1995 20:17:23 -0600 From: Scott McGee Received: by wales id ; Fri, 17 Nov 95 20:17:22 CST Date: Fri, 17 Nov 95 20:17:22 CST Message-Id: <9511180217.AA15282@wales> To: genweb@UCSD.EDU Subject: Re: Web Tools and locating names: Demand model mbr@dadd.ti.com (Martin Roberts) writes: > >Hi, GENWEB people. I'm an amateur genealogist, but a professional software >development manager. I follow your postings with interest, but I notice >a problem. 
Maybe I missed it, but I haven't seen either a statement
>of the customer requirements (you seem to be iterating to it) or a
>statement of the plan objectives, strategies, and tactics (you state
>and restate the mission very well). It seems to me that you would benefit from
>a little more organization, especially considering that you are widely
>scattered. If you have already done these things, please forgive me. I am
>a recent subscriber to GENWEB.

We welcome comments in our little community. Basically, genweb is in an early experimental stage. There are still relatively few genweb sites (though the number is growing) and we are _very_ early in the experimental stages of inter-database linking and database indexing. Little (that I know of) has even been done experimentally on distributed databases. Basically, we are exploring some of the possibilities of this concept and trying to help it past some of the rocks ahead.

>On this question of web crawlers, I am troubled by some assumptions.
>Annelise and Scott are exchanging ideas on automation of the process.

First, like I say, we are only experimenting so far. We were discussing some ways a web crawler could traverse the databases online now, and possibly provide name-matching information to owners. While your math looks good enough for me, I think you are missing a key point or two.

Currently, the GENDEX site maintained by Gene Stark indexes a few dozen databases. What we are talking about is something that will work in that environment to help us learn how to approach things when there are hundreds of thousands of databases online. In other words, a limited experiment to help locate what issues will need addressing.

Also, we weren't talking about a system where you would submit requests for searches on your end lines, but rather a system where the web crawler would slowly accumulate data on all the entries in each database, reporting any likely matches to owners of the databases.
I, for example, would "register" my database with the crawler site, and when it got to my database, it would start to accumulate information on the people in my database, and add it to the database it is building. If it finds a match with some other database it has seen, I would be notified along with the owner of the other database. A few points here: it need not do all of my database at once; it could schedule, say, 10 to 100 entries per night in as many databases as it needs to keep busy. Thus the impact on the database sites is very low. It also need not bring over all info in a database. For instance, each of my pages might have links to pictures, history/journal stuff, source references, geographical data, other report forms, etc. None of that need be transmitted. Just downlink the http file for one person, examine it for information on parents, spouses, and children, check to see if other individual entries are included in the same file, and move on. Better than even this is the possibility of self-indexed sites which would prepare all the information needed into a single file. The program would obtain this file for a database, and digest it, then move on. If the crawler does 10 individuals per database max each night, and has to examine 10 other pages to obtain info on all parents, children, and spouses, (I suspect more like 4 would be normal), a site might receive requests for 110 pages. For a 4000 person database (assuming caching of people looked up for spouse/parent/child searches), the database would be fully indexed in just over a month, having had 4000 pages requested. Assuming that the robot site has the resources to handle 10 such databases per night, it could index over 30,000 names a month for as long as it ran. I suspect this could, with some testing, be cranked up by a factor of 10 to 100 without too much trouble, allowing indexing of as many as what, 36 million names a year. 
That might strain the net resources of one site, (and certainly the storage ability) but would have an insignificant impact on the net in general. (Let me note here that I am thinking in terms of a tested robot. An experimental one should be able to do the 30,000 a month (360,000 a year) type once working well.) Does this make my thoughts on web crawler searching more reasonable? Scott Buttered bread always lands butter side * Would YOU mistake these as down (Unless it sticks to the ceiling!) * anyone`s opinions but my own? Email: smcgee@microware.com (Scott McGee) Web: http://www.cc.utah.edu/~sam8644/homepage.html From list-relay@UCSD.EDU Fri Nov 17 20:18:31 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id UAA18439 for ; Fri, 17 Nov 1995 20:18:30 -0800 Received: from desiree.teleport.com (desiree.teleport.com [192.108.254.21]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id UAA08223 for ; Fri, 17 Nov 1995 20:18:48 -0800 Received: from ip-bend1-16.teleport.com (ip-bend1-16.teleport.com [204.245.213.16]) by desiree.teleport.com (8.6.12/8.6.9) with SMTP id UAA24912; Fri, 17 Nov 1995 20:18:35 -0800 Received: by ip-bend1-16.teleport.com with Microsoft Mail id <01BAB52A.4F484460@ip-bend1-16.teleport.com>; Fri, 17 Nov 1995 20:21:42 -0800 Message-ID: <01BAB52A.4F484460@ip-bend1-16.teleport.com> From: Jeff Murphy To: "genweb@ucsd.edu" , "'Scott McGee'" Subject: RE: Web Tools and locating names: Demand model Date: Fri, 17 Nov 1995 20:18:34 -0800 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable We welcome comments in our little community. Basically, genweb is in an early experimental stage. There are still relatively few genweb sites (though it is growing) and we are _very_ early in the experimental stages of inter database linking, and database indexing. Little (that I know of) has even I'm in the process of trying to set up a "genweb" site. 
Most of the comments here have been on such an esoteric level that I haven't been quite sure whether to jump in with a question or two. I take your response to the other person as an invitation. In trying to find out how best to set up my web page, I've wandered around looking for examples. Someone introduced me to Gene Stark's software, ged2html, which in theory will (eventually) convert my gedcom to a set of html pages. I say eventually because I've tried running it 3 different times, and have finally terminated processing after 3-4 hours. My gedcom has 22,000 names, and I'm running on a 486-100. Since he reported a better time on a /33 with a larger database, I figured there was a good chance mine'd be done in an hour or two. No such luck. If anyone has experience with this program, I'd like to talk to them. Specifically, I don't know what the end result will look like; so what would be the effect of changing the number of people from 10 per file to, say, 100; or the pedigree number from 2 to 5? The docs are silent. I have a web page working, for those interested. It just doesn't contain the kind of genealogy data that I was hoping to get from this program. My page specializes in families with lines in Muhlenberg Co., KY, and can be seen at http://www.teleport.com/~jmurphy . I'm going to try a really small gedcom, so I can at least see the results. This program doesn't seem to want to work on this one. It appears to lock up with a bunch of disk i/o. The system clock stops, Windows95 stops, and if I were actually running anything else at the same time, I wouldn't be able to get to it either. Has anyone ever suggested the possibility of using a common icon on all genweb sites? I saw one that someone else had created. Would like to add it to my page, once I manage to get the data links for the families set up. Thought about using it for the background, but it seemed too busy when I tried it. 
Jeff Murphy jmurphy@teleport.com From list-relay@UCSD.EDU Sat Nov 18 07:56:23 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id HAA20884 for ; Sat, 18 Nov 1995 07:56:21 -0800 Received: from sbstark.cs.sunysb.edu (sbstark.cs.sunysb.edu [130.245.1.47]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id HAA18072 for ; Sat, 18 Nov 1995 07:55:50 -0800 Received: (from root@localhost) by sbstark.cs.sunysb.edu (8.6.12/8.6.9) with UUCP id KAA02487; Sat, 18 Nov 1995 10:55:29 -0500 Received: (from gene@localhost) by starkhome.cs.sunysb.edu (8.6.11/8.6.9) id KAA11385; Sat, 18 Nov 1995 10:54:29 -0500 Date: Sat, 18 Nov 1995 10:54:29 -0500 From: Gene Stark Message-Id: <199511181554.KAA11385@starkhome.cs.sunysb.edu> To: Jeff Murphy Cc: genweb@UCSD.EDU In-reply-to: Jeff Murphy's message of Fri, 17 Nov 1995 20:18:34 -0800 Subject: RE: Web Tools and locating names: Demand model References: <01BAB52A.4F484460@ip-bend1-16.teleport.com> >In trying to find out how best to set up my web page, I've wandered >around looking for examples. Someone introduced me to Gene Stark's >software, ged2html, which in theory will (eventually) convert my gedcom >to a set of html pages. I say eventually because I've tried running it >3 different times, and have finally terminated processing after 3-4 >hours. My gedcom has 22,000 names, and I'm running on a 486-100. Since >he reported a better time on a /33 with a larger database, I figured >there was a good chance mine'd be done in an hour or two. No such luck. > If anyone has experience with this program, I'd like to talk to them. Why don't you mail me directly? If you can get me your gedcom, I will try it myself and debug the problem, if I can reproduce it. How much RAM do you have? If you don't have enough, it will take an extremely long time to process a GEDCOM of the size you mention. You can probably do it with 8MB, but the more the better. If you have 4MB, I wouldn't even try it. 
The GEDCOM source probably already exceeds that size, so you will experience massive ``thrashing'' of the hard drive as the program attempts to build the database. >Specifically, I don't know what the end result will look like; so what >would be the effect of changing the number of people from 10 per file >to, say, 100; or the pedigree number from 2 to 5? The docs are silent. The easiest thing to do here is play with different options yourself, or else scout out some of the 50-odd databases on the net that have been constructed using this program, and see what they have done. The ones I know about are accessible from: http://bsd7.cs.sunysb.edu/~stark/genweb_index - Gene Stark From list-relay@UCSD.EDU Sat Nov 18 08:20:15 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id IAA20904 for ; Sat, 18 Nov 1995 08:20:15 -0800 Received: from roxy.sfo.com (roxy.sfo.com [205.162.14.50]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id IAA25662 for ; Sat, 18 Nov 1995 08:17:10 -0800 From: mavrogeorge@genealogysf.com Received: from 205.162.14.118 (sf-118.sfo.com [205.162.14.118]) by roxy.sfo.com (8.6.12/8.6.12) with SMTP id IAA09742 for ; Sat, 18 Nov 1995 08:15:04 -0800 Date: Sat, 18 Nov 1995 08:15:04 -0800 Message-Id: <199511181615.IAA09742@roxy.sfo.com> MIME-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 7bit Subject: Re: genealogy pages To: genweb@UCSD.EDU In-Reply-To: <199511152245.RAA00595@worf.worx.net> X-Mailer: SPRY Mail Version: 04.00.06.17 Just noticed in the latest Roots User Group newsletter that an update to R4 is going to include the ability to generate an HTML page from an R4 database. Anyone know any details? How they are going to do it? 
From list-relay@UCSD.EDU Sat Nov 18 20:13:17 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id UAA21951 for ; Sat, 18 Nov 1995 20:13:16 -0800 Received: from gate.microware.com (gate.microware.com [198.17.151.51]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id UAA05377 for ; Sat, 18 Nov 1995 20:13:28 -0800 Received: by gate.microware.com; id AA23622; Sat, 18 Nov 95 22:11:21 CST Received: from mcrware.microware.com(192.52.109.32) by gate.microware via smap (g3.0.1) id xma023620; Sat, 18 Nov 95 22:10:57 -0600 Received: by mcrware.microware.com id AA21778 (5.67a8/IDA-1.5 for genweb@ucsd.edu); Sat, 18 Nov 1995 22:12:38 -0600 Date: Sat, 18 Nov 1995 22:12:38 -0600 From: Scott McGee Message-Id: <199511190412.AA21778@mcrware.microware.com> To: genweb@UCSD.EDU Subject: My genweb software Content-Length: 1689 This is just a survey to see how much interest there is/has been in my genweb software. I have made it available in two forms. One consists of the lifelines report extract_html.ll or dump_html.ll and produces static HTML files from the LifeLines database. The other is a series of report programs and cgi scripts that work together to produce HTML files on demand from the LifeLines database. I know of at least one site using the first. If you are using or have used my software to create a genweb site, I would like to hear from you. Let me know how you like it, what you would like done differently, any other comments. Also, let me know if you are interested in updates to the programs. I would like to have a URL for your site if it is publicly available. Oh, I have made some nice additions to the features available with the CGI method software. I am getting ready to update the release package available for http downloading, and if demand justifies, I can make all my programs available for ftp too. Let me know. Thanks Scott PS Stop by my genweb page and check out my latest stuff. 
I have seven databases available (I am also offering to serve databases for others who don't have facilities to do so) and thanks to another user of my software, have added search capability and have also added logging. The URL is: http://www.emcee.com/~smcgee/genweb/genweb.html If at first, you don't succeed, | smcgee@microware.com (Scott McGee) go fry a hen. After all, fried | ----------------------------------------- chicken beats failure any time. | I was paid $5.00 to express these views! -------------> http://www.cc.utah.edu/~sam8644/homepage.html <------------- From list-relay@UCSD.EDU Sun Nov 19 04:37:29 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id EAA24191 for ; Sun, 19 Nov 1995 04:37:29 -0800 Received: from ProgCons.COM (flattop.fc.net [204.157.166.66]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id EAA06243 for ; Sun, 19 Nov 1995 04:36:10 -0800 Received: by ProgCons.COM (Smail3.1.28.1b #3) id m0tH8yK-0001CfC; Sun, 19 Nov 95 06:35 CST Message-Id: From: cmanis@ProgCons.COM (Cliff Manis) Subject: GenServ System Info and Homepage To: genweb@UCSD.EDU Date: Sun, 19 Nov 1995 06:35:48 -0600 (CST) Cc: cmanis@soback.kornet.nm.kr X-Mailer: ELM [version 2.4 PL24] Content-Type: text Content-Length: 1651 GenServ Genealogical GEDCOM Server System Over 1,400 different GEDCOM databases ON-LINE ! Surnames ON-LINE ! ! More than 1,700,000 names ON-LINE in GEDCOM data in the GenServ database World-Wide GEDCOM data GenServ contains genealogical data originally submitted as GEDCOM databases from an ever-growing number of genealogists in the USA and various other countries. More than 300 New GEDCOM data files were added to this system since 1 Oct. This SERVER may have the data you need now and be a tremendous aid to your family research. We are adding 150+ GEDCOM datafiles each month to this system. Many different reports are available for requesting data from GenServ. 
If interested - please read the latest GenServ DOCS dated 13 November 1995. You should receive the DOCS within two hours of your request to address below. The Docs are also available from the Homepage via ftp. Please read the DOCs before sending any GEDCOM datafiles. Many interesting and helpful points are available from the GenServ homepage. Please visit it anytime. We want your GEDCOM data - Please consider us a GEDCOM file Complete documentation with a description of report types is available by sending any email message to this address. For complete GenServ Docs to: genserv-doc@ProgCons.COM WWW - The GenServ Homepage http://soback.kornet.nm.kr/~cmanis/ -- Cliff Manis cmanis@progcons.com Seoul, Korea GenServ "Genealogical Server" a service for making GEDCOM data available. From list-relay@UCSD.EDU Mon Nov 20 21:16:28 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id VAA29669 for ; Mon, 20 Nov 1995 21:16:27 -0800 Received: from roxy.sfo.com (roxy.sfo.com [205.162.14.50]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id VAA09399 for ; Mon, 20 Nov 1995 21:15:13 -0800 From: mavrogeorge@genealogysf.com Received: from 205.162.14.104 (sf-104.sfo.com [205.162.14.104]) by roxy.sfo.com (8.6.12/8.6.12) with SMTP id VAA04210 for ; Mon, 20 Nov 1995 21:12:32 -0800 Date: Mon, 20 Nov 1995 21:12:32 -0800 Message-Id: <199511210512.VAA04210@roxy.sfo.com> MIME-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 7bit Subject: GEDCOM converter To: genweb@UCSD.EDU X-Mailer: SPRY Mail Version: 04.00.06.17 >>From: vgharris@onramp.net >>To: mavrogeorge@genealogysf.com >>I have checked your material, but I cannot find a >>copy of G2HWIN.EXE. Anyone know where I can find this? 
From list-relay@UCSD.EDU Thu Nov 23 16:33:10 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id QAA10818 for ; Thu, 23 Nov 1995 16:33:09 -0800 Received: from gate.microware.com (gate.microware.com [198.17.151.51]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id QAA18347 for ; Thu, 23 Nov 1995 16:28:26 -0800 Received: by gate.microware.com; id AA15946; Thu, 23 Nov 95 18:26:20 CST Received: from mcrware.microware.com(192.52.109.32) by gate.microware via smap (g3.0.1) id xma015944; Thu, 23 Nov 95 18:26:06 -0600 Received: from wales (wales.microware.com) by mcrware.microware.com with SMTP id AA15031 (5.67a8/IDA-1.5); Thu, 23 Nov 1995 18:27:47 -0600 From: Scott McGee Received: by wales id ; Thu, 23 Nov 95 18:27:45 CST Date: Thu, 23 Nov 95 18:27:45 CST Message-Id: <9511240027.AA02471@wales> To: elijah-l@emcee.com, genweb@UCSD.EDU, lines-l@vm1.nodak.edu Subject: My GenWeb site (Special note to those on Elijah: I am still not receiving the list so if you choose to respond, please do so by private email too.) I am happy to announce that I now serve eight databases on my GenWeb site with my LifeLines based CGI software. They have a total of nearly 40,000 names. In addition to my own databases and some served for others, I am serving a couple of databases of general interest too. These are the Royal92 database which features the royalty of Europe, and the Mayflower database which features many of the ancestors and descendants of the passengers of the Mayflower. I will happily help others set this software up on their own site (you'll have to be running unix to use Lifelines) or to serve your database on my own site. To see my GenWeb site, point your browser to the URL: http://www.emcee.com/~smcgee/genweb/genweb.html and check out my genealogy page at http://www.emcee.com/~smcgee too. It has genealogy resource pointers and my collection of historical/ biographical files on many of my ancestors and others of note. 
Scott Buttered bread always lands butter side * Would YOU mistake these as down (Unless it sticks to the ceiling!) * anyone`s opinions but my own? Email: smcgee@microware.com (Scott McGee) Web: http://www.cc.utah.edu/~sam8644/homepage.html From list-relay@UCSD.EDU Fri Nov 24 17:08:20 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id RAA14791 for ; Fri, 24 Nov 1995 17:08:19 -0800 Received: from smtp.surfutah.com (salmon.iserver.com [204.212.248.12]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id RAA24947 for ; Fri, 24 Nov 1995 17:04:23 -0800 Received: from myer.byu.edu by smtp.surfutah.com; Fri, 24 Nov 1995 18:03:28 -0700 Date: Fri, 24 Nov 1995 18:03:28 -0700 Message-Id: <199511250103.SAA25156@smtp.surfutah.com> X-Sender: rex@pop.surfutah.com X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: elijah-l@emcee.com, genweb@UCSD.EDU From: Rex Myer Subject: Webified Genealogy I have recently started a service to present genealogy on the Web in a sort of internet library. So far, the service has over 20,000 individual records available. It has surname search by exact name and soundex spelling. It also has a general search for any record in the databases online. It has graphical pedigree charts, outline individual records and family records, and descendant charts. The engine which generates the charts and does the searches is a composition of CGI scripts. This means that rather than have all the individual charts, etc. in separate files taking up space on your server, the charts are generated dynamically. The data it uses to generate the charts and do the searches are in GEDCOM files submitted by the subscribers to the service. I invite you to come and visit (if you haven't already), to do some searches, and/or just to see another method of making genealogical information available on the web. 
If you would like to subscribe, let me know at the e-mail address below. The URL for the service called Webified Genealogy (WebGen) is at: http://www.surfutah.com/web/webgen/ Thank You, Rex Myer (owner of Webified Genealogy) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Rex Myer Owner, WebGen rex@surfutah.com http://www.surfutah.com/web/webgen/ From list-relay@UCSD.EDU Sun Nov 26 23:27:07 1995 Received: from UCSD.EDU (mailbox1.ucsd.edu [132.239.1.53]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id XAA22313 for ; Sun, 26 Nov 1995 23:27:07 -0800 Received: from DB2.Stanford.EDU (DB2.Stanford.EDU [36.38.0.46]) by UCSD.EDU (8.6.12/8.6.9) with ESMTP id XAA27527 for ; Sun, 26 Nov 1995 23:23:20 -0800 Received: (from quass@localhost) by DB2.Stanford.EDU (8.7.2/8.7.2) id XAA47381; Sun, 26 Nov 1995 23:23:19 -0800 Date: Sun, 26 Nov 1995 23:23:19 -0800 From: Dallan Quass Message-Id: <199511270723.XAA47381@DB2.Stanford.EDU> To: genweb@UCSD.EDU Subject: Finding matches in genweb Cc: quass@db.stanford.edu Please let me summarize the recent set of messages about finding matching ancestors among genweb files to make sure I understand the issues and add my two cents. Problem: Given information about your ancestors, scan other genealogical data on the web to find possible matches. The goal is to set up a "matching service," where you give the URL of your genealogy data to the matching service and it determines possible matches among the other genealogy files on the web and e-mails you the results. Challenges: A. Access the genealogical data. B. Determine programmatically when two people are matches. Alternatives: A. Access the genealogical data. A1. Write a web crawler to parse genealogical data represented as HTML pages back into (gedcom) data elements so that the data can be compared against your data. A2. Compare your genealogical data in GENDEX format with other GENDEX files available on the web. A3. 
Compare your genealogical data in GEDCOM format with other GEDCOM files available on the web. B. Determine programmatically when two people are matches. B1. Match on exact names. B2. Match on a weighted combination of name, birth date and place, death date and place, (i.e., personal data fields). B3. In addition to matching on personal data, include in the match determination whether relatives of the person are matches. Is this an accurate representation of the issues? Advantages/Disadvantages of the alternatives for accessing genealogical data: A1. Parse HTML pages This approach has the potential to access the most data, since everyone makes their data available as HTML files (whether statically or dynamically generated). The difficulty with this approach lies in understanding the difference between current web crawlers (e.g., lycos, WebCrawler) and what we need to do here. Current web crawlers index the words in a document without looking at their context. For example, a date is just a date -- they don't distinguish between a birth date, death date, or marriage date. So you can't ask for people _born_ on a certain date. You can't usually even search for a particular date like 2 Feb 1900, since an HTML page would likely match your search result if the words "2" "Feb" and "1900" appeared anywhere on the page, not necessarily adjacent. And what if the date on the page was written as 2/2/1900? In order to take context into account when you parse an HTML page you need to have a grammar for the page; i.e., you need to know that the birth date is preceded by a "Born:" label, that it's in day, month, year format, etc. The problem is that different people could generate HTML pages with different grammars, so a web crawler would have to (1) understand all the different grammars in use, and (2) know which grammar is used at each site. I'm not saying that this approach can't be taken. 
In fact I'm involved in an effort at school to do exactly this -- define a high-level language for specifying grammars for HTML pages so that the data can be extracted from the pages into a (gedcom-like) representation where each of the data elements is given the appropriate label. However, this approach is a lot of work. A2. GENDEX files The advantages of this approach are: the data has already been parsed so the web crawler is much easier to write, and a list of GENDEX files is already being maintained. The disadvantages of this approach are: not everyone generates GENDEX files, and GENDEX files don't have enough information to determine matches using information about relatives (alternative B3 above). A3. GEDCOM files What if people provided the web crawler a URL to their raw gedcom files? Then the web crawler could read the gedcom files to determine matches. The advantage of this approach is that the web crawler is easy to write, yet it has access to the relative information as well as the personal data fields for determining matches. The disadvantage is that currently nobody puts raw gedcom files on the web (that I know of anyway). Is this a fair analysis of the approaches? Which do you think is best? Is anyone working toward one approach or another? Advantages/Disadvantages of the alternatives for determining matches: I don't have much experience with genealogy (I'm a student studying databases who is _interested_ in genealogy), so I have little idea about the best way to go about determining matches. One idea I had was to allow people to assign points for matching different data elements; e.g., 5 points for an exact name match, 1 point for matching the soundex coding of a name, 5 points for an exact birthdate match, then specifying a threshold for when two people are considered matches. 
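The point-scoring idea could be sketched roughly as follows. This is only an illustration under stated assumptions, not a worked-out matcher: the Person record, its field names, the threshold of 6, and the choice to award the soundex point only when the exact-name points were not earned are all hypothetical. Birth dates are assumed to have been normalized to one format already. The point values (5, 1, 5) are the ones suggested above.

```python
from dataclasses import dataclass

# Standard American Soundex letter groups.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name ("" if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so repeated codes can recur
    return (result + "000")[:4]


@dataclass
class Person:  # hypothetical record layout, for illustration only
    surname: str
    given: str
    birth_date: str  # assumed pre-normalized, e.g. "1900-02-02"; "" if unknown


def match_score(a: Person, b: Person) -> int:
    """Add up points for each data element on which the two records agree."""
    score = 0
    same_name = (a.surname.lower() == b.surname.lower()
                 and a.given.lower() == b.given.lower())
    if same_name:
        score += 5  # exact name match
    elif soundex(a.surname) and soundex(a.surname) == soundex(b.surname):
        score += 1  # surnames merely sound alike
    if a.birth_date and a.birth_date == b.birth_date:
        score += 5  # exact birth date match
    return score


def is_match(a: Person, b: Person, threshold: int = 6) -> bool:
    """Two people count as a match when the score reaches the threshold."""
    return match_score(a, b) >= threshold
```

With these made-up numbers, two records that share a sound-alike surname and an exact birth date reach 6 points and are reported, while an exact name with a different birth date scores only 5 and is not. Each database owner could tune the points and the threshold to taste.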
-Dallan http://www-db.stanford.edu/~quass From list-relay@UCSD.EDU Mon Nov 27 08:26:59 1995 Received: from UCSD.EDU (mailbox2.ucsd.edu [132.239.1.54]) by fuji.ucsd.edu (8.6.9/8.6.9) with ESMTP id IAA24665 for ; Mon, 27 Nov 1995 08:26:59 -0800 Received: from gate.microware.com (gate.microware.com [198.17.151.51]) by UCSD.EDU (8.6.12/8.6.9) with SMTP id IAA15088 for ; Mon, 27 Nov 1995 08:18:54 -0800 Received: by gate.microware.com; id AA07711; Mon, 27 Nov 95 10:16:47 CST Received: from mcrware.microware.com(192.52.109.32) by gate.microware via smap (g3.0.1) id xma007709; Mon, 27 Nov 95 10:16:46 -0600 Received: from wales (wales.microware.com) by mcrware.microware.com with SMTP id AA26145 (5.67a8/IDA-1.5); Mon, 27 Nov 1995 10:18:25 -0600 From: Scott McGee Received: by wales id ; Mon, 27 Nov 95 10:18:23 CST Dat