Distribution of distances in my one place study database

I have spent some time emulating the WBE distance calculator in my own database of more than 40,000 profiles. Here is the output of a query that shows the distribution of distances:

What strikes me first is the symmetry of the distribution. It looks like a normal distribution, which is a little unexpected.

Also, I'm amazed that I have connected more than 90 % of the profiles, from a compilation of parish records, probates and censuses of a handful of parishes roughly between 1650 and 1850. It seems like most of the unconnected profiles are servants entered from the 1801 census. Those are notoriously hard to identify for lack of context, and many of them have surely been misnamed.

I had a hunch that about 80 % were connected. ATM, I have put almost 32,000 profiles on Wikitree, so I still have some 6,000 to go before I can turn to other things.

asked Jan 7 in The Tree House by Leif Biberg Kristensen G2G6 Pilot (211k points)
retagged Jan 8 by Leif Biberg Kristensen

3 Answers

Thank you for your answer, Mark.

Yes, I am the focus person, because then I can easily compare the distance given by WBE and my own internal distance. I have excluded a class of profiles called "private"; they are mostly living close family, and a few poorly sourced names, currently 1,038 profiles from a total of 42,418.

There are a few of my own ancestors coming from outside of the focus area, but hardly enough to make a significant impact on the distribution. (They mainly fall within the 4.5 % "CC7" group.) Also, the degree of coverage varies between the parishes. For instance, I have full coverage of the 1801 census from Solum, Holla, Bamble, and Porsgrunn. For Skien, Gjerpen, Eidanger, Brevik, and Drangedal I have mostly included those with close relations in the first four "core" parishes.

I have never tried to keep track of the number of profiles from each parish; one of my main interests has been to uncover the internal migration in the area, which is considerable.

There are also, naturally, lots of profiles where either birth or death records are missing, or both. Most of those have evidently entered or left the area, and have been impractical to trace further, although in many cases you can read from probates that such-and-such are living in remote places. But the overwhelming majority actually stayed in their home parish from the cradle to the grave. That is true not least of those who died before the age of eighteen, of which I can count a whopping 10,618 profiles in my database.

commented Jan 8 by Leif Biberg Kristensen G2G6 Pilot (211k points)
edited Jan 8 by Leif Biberg Kristensen

Answer 1 · 2024-01-08T14:31:06+0000

Shawn, thank you for an interesting and informative answer. Yes, the similarity to a Gaussian distribution is probably incidental.

The study is from my own birthplace, the south-east corner of Telemark county, Norway. The area is called "Grenland", but with somewhat diffuse boundaries. In my first reply to Mark above, I have enumerated the parishes and the approximate time period, i.e. 1650-1850.

I started with genealogy in 1997 with my own ancestors, but very soon I bought film copies of parish registers and started to transcribe and enter them systematically. Today I have a tree-structured source table with 53,600 baptisms, 51,204 burials, and 21,090 weddings, among others.

You can read some more on my profile page, and in the description of my database "Yggdrasil" on Google Code, and on my "Solumslekt" page.

commented Jan 8 by Leif Biberg Kristensen G2G6 Pilot (211k points)

Answer 2 · 2024-01-11T16:43:52+0000

Interesting.

This is the way I would describe the results:

* It takes a certain number of "steps" to get to the "interior" of the population. For Lief's table, he's obviously in a part of the population on the "edge", which is to say he's (1) living and (2) any spouses, children, siblings, and probably 1st cousins, uncles, etc. are simply not within the data set. You have to go a few steps even to get into the time frame of most of the population in the data set. The table given later, for his gt-gt-gt grandfather, bypasses this.

* Once you get to a "critical mass" of sorts, up to about 100, then the numbers roughly double (or a bit more) every time - it's exponential growth.

* UNTIL you get to the point where further exponential growth is practically impossible. "dist=10" has over 9,000, with 18,000 of the population already assigned to the various circles. There's only about 20,000 people left - if "dist=11" were over double the "dist=9" number, pretty much all the rest of the population would be in it, and then "dist=12", and all the ones after, would practically have to be ZERO.

* What happens instead is that the lack of "unassigned" people in the population limits how many you get in "dist=11". In very rough terms, it looks to me like each successive "circle" includes about half of the population that is REMAINING. This is basically the exponential process again, only in REVERSE.

* It's commonplace to call a distribution that starts low, has a "hill", and then comes back down, a gaussian (or "normal") distribution. We're told that the normal distribution shows up in reality all the time, so it's an easy conclusion to jump to. Probably, though, that ought to be avoided in cases like this, where the range doesn't even extend into negative values (since the normal distribution goes from minus infinity to plus infinity). Just look at a binomial distribution, or a hypergeometric distribution, for example. There are lots of distributions that go up, and then come back down. Under certain conditions, they're even pretty symmetrical.

* Regarding the 100 circle discussion, it seems clear to me that when you have more than one peak, that represents the dominance of a somewhat isolated subpopulation within the greater World Tree population (which, although it is pretty big, it is STILL FINITE, and therefore you will get a "hill" in the distribution.) If that subpopulation is isolated solely because somebody has heavily researched their own family - and other profiles that would otherwise contribute to the curve have not been created, then that "hill" might diminish over time as new applicable profiles are added, but if it's because it's a genuinely endogamic population, then perhaps it won't.

* Cases where you see kind of a "bump" on one side of the "hill" or the other undoubtedly just mean that there are two "hills" that are too close together to be seen as distinct.
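
The growth and saturation pattern described in the bullets above can be checked with a few lines of Python. This is only a sketch under two assumptions: that the first 13 rows of the second table posted in this thread (2, 5, 31, ...) are the counts being discussed, and that the connected population is roughly 38,000 profiles, taken as a round figure.

```python
# Counts per distance, copied from the table posted in this thread (dist 1..13).
# Assumption: these rows are the distribution under discussion.
counts = [2, 5, 31, 86, 195, 408, 967, 2330, 5369, 9006, 8098, 5600, 3649]

# Assumption: roughly 38,000 connected profiles in total (a round figure).
TOTAL = 38_000


def growth_ratios(counts):
    """Ratio of each circle to the previous one (dist 2/1, 3/2, ...)."""
    return [b / a for a, b in zip(counts, counts[1:])]


def share_of_remaining(counts, total):
    """For each distance: (profiles still unassigned, fraction of them taken)."""
    rows = []
    remaining = total
    for n in counts:
        rows.append((remaining, n / remaining))
        remaining -= n
    return rows


if __name__ == "__main__":
    ratios = growth_ratios(counts)
    shares = share_of_remaining(counts, TOTAL)
    for dist, (n, (remaining, share)) in enumerate(zip(counts, shares), start=1):
        ratio = f"{ratios[dist - 2]:.2f}" if dist > 1 else "   -"
        print(f"dist={dist:2d}  count={n:5d}  ratio={ratio}  "
              f"remaining={remaining:6d}  share={share:.1%}")
```

The `ratio` column shows the step-to-step growth, and the `share` column shows how much of the still-unassigned population each circle takes. With a different total the shares shift, so the output is only as good as the 38,000 assumption.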

answered Jan 11 by Living Stanley G2G6 Mach 9 (91.9k points)

Frank, thank you for your thoughts on this. There are a couple of issues here on which I disagree.

A Gaussian distribution does not need to reach negative values. A simple example that contradicts this is the average height of a population, which (I believe) rarely reaches negative values. Yet it is a valid example of a Gaussian distribution. You can also roll six dice a few hundred times, count the points of each roll and write them down. You will find a perfect Gaussian distribution with 6 and 36 as end points, with the peak at 21.

There may be a real reason why my distribution looks "Gaussian", but it may also be coincidental.

"Normal" growth rates between circles do not apply in a one-place study. As its focus is a semi-rural, small-town area with a high degree of endogamy, you will find that most of the population is related to each other. The connection count will taper off at (roughly) the same rate that it rises, because there will be ever less connections to make as the coverage within the area is being saturated. There is no "remaining" population here. It seems like you assume that my study covers a small subset of the entire population in this area, which today counts about 120,000 persons, after an explosive growth in the period 1850-1950.

My study is not about my own family. It is about the entire population of a given area in a given time period. My own "family" plays a minor role in this: the seven first distances only make up four and a half percent of the entire dataset. The connection to my own family is of course not entirely incidental, as I chose for my study the area where both my parents had their roots.

Neither have I tried hard to hunt relatives outside of the given area, which of course might have boosted my CC7 - which I find rather uninteresting. Rather than spending endless hours leafing through church books all over the place, I find it infinitely more rewarding to transcribe them from end to end, and enter them into my database, where I can then create profiles just by "connecting the dots". That is how I have built up a database with 42,000 profiles, all sourced as fully as possible with transcripts of original records.

What I have done - and still do - is usually called "family reconstitution", i.e. the process of piecing together profiles and families from original sources such as church records, probates, censuses, etc. I have even developed my own database, both backend and frontend, for this study.

BTW my name is Leif, not Lief.

commented Jan 11 by Leif Biberg Kristensen G2G6 Pilot (211k points)

Hi Leif,

1) Sorry I spelled your name incorrectly - we don't have a lot of Norwegians around here.

2) Your example about the average height of a population is OK, but even then you need to keep in mind that it's only an approximation of a normal distribution. If it were a 100% true normal distribution, there would be some infinitesimal probability of those negative heights (which is, of course, impossible). The thing is, you can undoubtedly go many standard deviations away from the mean and not come close to zero, so you can EASILY get away with that normal distribution approximation, and be very accurate with it. Your "distance" distribution does NOT have that going for it, so it's on shaky footing to start with.

We had a similar discussion on here some years ago about the probability distribution of cM values for a given relationship level. I tried to point out that, while the author of a blog that some people were familiar with had assumed that the distribution would be a normal distribution, I had seen empirically what the distribution was in my own data set, and it was not even close to gaussian. (They wouldn't listen.)

I'm not saying that your distribution is "not even close", but I've also had occasion in an entirely different technical setting where there was an insistence on using a gaussian where the relevant properties of the distribution were simply nothing like that of a gaussian, aside from being big in the middle, and tapering off on the ends.

It's just unwise to tether yourself to that description, if that's what you're doing.

3) If you add even just three random variables together, each independent, with identical uniform distributions, it starts to look somewhat gaussian. Adding 6 dice is adding together 6 such random variables, so yeah, it might do pretty well. I would not call it "perfect", especially since it is a discrete distribution, but it might match up pretty well. It also probably has the problem of not going out very many standard deviations from the mean - so the "tails" of the distribution would have to be inaccurate. The normal distribution would probably tell you you have a one in a million chance (or some such low number) of rolling a negative number. So the approximation is OK, as long as the application you're looking at doesn't care about what it says about that one in a million (or whatever) case.
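
The dice example can be worked out exactly rather than by rolling. The sketch below builds the exact distribution of the sum of six fair dice by repeated convolution and sets it beside a normal curve with the same mean and variance. The function names are illustrative only.

```python
import math
from fractions import Fraction


def dice_sum_distribution(n_dice, sides=6):
    """Exact probabilities for the sum of n fair dice, as {sum: Fraction}."""
    dist = {0: Fraction(1)}
    for _ in range(n_dice):
        new = {}
        for total, p in dist.items():
            for face in range(1, sides + 1):
                new[total + face] = new.get(total + face, Fraction(0)) + p / sides
        dist = new
    return dist


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def normal_cdf(x, mean, sd):
    """Probability that a normal variable is below x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


dist = dice_sum_distribution(6)
mean = sum(s * p for s, p in dist.items())               # exact mean
var = sum((s - mean) ** 2 * p for s, p in dist.items())  # exact variance
sd = math.sqrt(var)

if __name__ == "__main__":
    for s in sorted(dist):
        print(f"{s:2d}  exact={float(dist[s]):.5f}  "
              f"normal={normal_pdf(s, float(mean), sd):.5f}")
    # Mass the matching normal curve places on impossible outcomes.
    print("normal P(sum < 0):  ", normal_cdf(0, float(mean), sd))
    print("normal P(sum < 5.5):", normal_cdf(5.5, float(mean), sd))
```

The last two lines print how much probability the fitted normal curve assigns to sums that six dice can never produce, which is the tail behaviour under discussion.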

4) It sounds like you're trying hard to disagree with me, with your talk of a "small town area", but you're not saying anything that's any different. The "population" I'm talking about refers to one of several things, depending on the context. First, is the 38,000 in your distribution. I also spoke of it in the context of other isolated populations that other people might be related to, resulting in an additional "hill" in their "100 circles" distribution. Finally, I'm referring to the 30M+ population of WikiTree, which results in the main "hill" on anybody's "100 circles" distribution.

You described your research as encompassing virtually all of the population of the area within a given timeframe. You say it's not about your own family, but you say you're in the database, and the people in it are heavily interrelated - and I didn't say anything about it being only about your family anyway. You say that there are 120,000 living there now - I would assume that most of them are NOT in your database, but that most of their ancestors ARE, so actually your database probably has about 1/4 of ALL the population. So your study DOES "cover a small subset of the entire population", but that's completely irrelevant to what I was saying anyway. I made no such assumption, nor was any such assumption implied.

The only relevance of your own family to the discussion is to explain the first few numbers in your distribution, which I describe as simply making your way to the main part of your population. Your "dist=1" count is 2, which apparently means it's just your parents. If you have a spouse, or siblings, or children, they are (hopefully) alive, but do not appear in the count because you haven't put many living people in your database. The "dist=2" count being 5 is likely your grandparents, plus some other relation. These first few numbers are clearly what they are because of who you have included in the database from your immediate family. That's not a criticism - that's just explaining why the first few numbers (which are all about your own immediate family) are kind of artificial, and just about getting us from you to the main part of the population. It's the numbers that come after that which are what the real discussion is about.

My own "degree 1" number is 3, but if all people who ever lived had a WikiTree profile, it would be 8. My "degree 2" number, which is 13, would be doubled. I just don't see much value to adding profiles that only I can see (normally), and which can cause problems too.

5) You say 'The connection count will taper off ... because there will be ever less connections to make as the coverage within the area is being saturated. There is no "remaining" population here.' Actually, the "remaining" population I'm referring to is what's left of the 38,000 after the rest of them have already been assigned to the various previous circles (or "distances"). "The coverage ... being saturated" that you referred to literally occurs when the "remaining" population I referred to is down to less than about half of the whole population (the "whole" being the 38,000).

It's all about how, when an individual within the database is assigned to a "distance", they are removed from consideration as far as being assigned to subsequent "distances". The people that have not been "removed" in that way are the ones that "remain" to still be considered for being assigned to the higher numbered "distances". I can't imagine what else you might have thought "remaining population" might mean in this context, but apparently you took it the wrong way.
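
The assignment process described here amounts to a breadth-first search. A minimal sketch, using an invented toy graph rather than anyone's real data or the actual WBE implementation: each profile is assigned to the first circle that reaches it and is never considered again.

```python
from collections import Counter, deque


def distance_counts(links, focus):
    """Breadth-first search from `focus`; returns {distance: number of profiles}.

    `links` maps each profile to the profiles one step away
    (parents, children, siblings, spouses).
    """
    dist = {focus: 0}
    queue = deque([focus])
    while queue:
        person = queue.popleft()
        for relative in links.get(person, ()):
            if relative not in dist:  # not yet assigned to any circle
                dist[relative] = dist[person] + 1
                queue.append(relative)
    # The focus person (distance 0) is left out of the counts.
    return Counter(d for d in dist.values() if d > 0)


# Hypothetical toy data: names and links are invented for illustration.
toy_links = {
    "me": ["father", "mother"],
    "father": ["me", "mother", "grandfather"],
    "mother": ["me", "father", "aunt"],
    "grandfather": ["father", "aunt"],
    "aunt": ["mother", "grandfather", "cousin"],
    "cousin": ["aunt"],
    "unlinked": [],  # never reached, so it appears in no circle
}

if __name__ == "__main__":
    for d, n in sorted(distance_counts(toy_links, "me").items()):
        print(f"dist={d}  count={n}")
```

Profiles that the search never reaches, like `unlinked` above, correspond to the unconnected part of a database.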

6) As you can see, this distribution has exponential growth leading up to the top of the "hill", and exponential decay once you're past it. A normal distribution may look somewhat exponential in the tails, but is fairly linear leading up to (and after) the "hill". It doesn't really fit what we're seeing here, aside from the very crude "being big in the middle and dropping toward zero at the tails".

commented Jan 11 by Living Stanley G2G6 Mach 9 (91.9k points)

Distance	Profiles
1	8
2	15
3	39
4	157
5	589
6	1700
7	3398
8	5457
9	9519
10	17613
11	34200
12	65829
13	121455
14	220555
15	373668
16	577725
17	842105
18	1215693
19	1775948
20	2444827

Distance	Profiles
1	2
2	5
3	31
4	86
5	195
6	408
7	967
8	2330
9	5369
10	9006
11	8098
12	5600
13	3649
14	3473
15	4886
16	8657
17	18388
18	43978
19	110547
20	262738
