Distribution of distances in my one place study database

+14 votes
307 views

I have spent some time emulating the WBE distance calculator in my own database with some 40K+ profiles. Here is the output of a query that shows the distribution of distances:

What strikes me first is the symmetry of the distribution. It looks like a normal distribution, which is a little unexpected.

Also, I'm amazed that I have connected more than 90 % of the profiles, from a compilation of parish records, probates and censuses of a handful of parishes roughly between 1650 and 1850. It seems like most of the unconnected profiles are servants entered from the 1801 census. Those are notoriously hard to identify for lack of context, and many of them have surely been misnamed.

I had a hunch that about 80 % were connected. ATM, I have put almost 32,000 profiles on Wikitree, so I still have some 6,000 to go before I can turn to other things.
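
For readers who want to try something similar, a distance count of this kind can be sketched as a breadth-first search outward from the focus profile. This is only an illustration, not the query actually used for the database: the `links` mapping, the toy profile names, and the function name `distance_distribution` are hypothetical, and it assumes that one step is a direct parent, child, sibling or spouse link.

```python
from collections import Counter, deque

def distance_distribution(links, focus):
    """Count profiles at each distance (in steps) from the focus profile."""
    distance = {focus: 0}
    queue = deque([focus])
    while queue:
        person = queue.popleft()
        for relative in links.get(person, ()):
            if relative not in distance:  # first visit gives the shortest path
                distance[relative] = distance[person] + 1
                queue.append(relative)
    counts = Counter(d for d in distance.values() if d > 0)
    return dict(sorted(counts.items()))

# Hypothetical toy data: each profile maps to its directly linked relatives.
links = {
    "me": ["father", "mother"],
    "father": ["me", "mother", "grandfather"],
    "mother": ["me", "father"],
    "grandfather": ["father", "granduncle"],
    "granduncle": ["grandfather"],
    "stranger": [],  # unconnected profile, never reached from "me"
}

counts = distance_distribution(links, "me")
connected = sum(counts.values())
total = len(links) - 1  # everyone except the focus profile
print(counts)
print(f"{connected} of {total} connected ({100 * connected / total:.1f} %)")
```

Profiles that the search never reaches are the "unconnected" ones, which is where the connected percentage comes from.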

in The Tree House by Leif Biberg Kristensen G2G6 Pilot (211k points)
retagged by Leif Biberg Kristensen

3 Answers

+8 votes
 
Best answer
Interesting analysis.

Is the reference person yourself? Or some centrally located person?

Does the database have any connections within it from outside the study area or study time period?
by Mark Dorney G2G6 Mach 6 (65.6k points)
selected by Leif Biberg Kristensen
Thank you for your answer, Mark.

Yes, I am the focus person, because then I can easily compare the distance given by WBE and my own internal distance. I have excluded a class of profiles called "private"; they are mostly living close family, and a few poorly sourced names, currently 1,038 profiles from a total of 42,418.

There are a few of my own ancestors coming from outside of the focus area, but hardly enough to make a significant impact on the distribution. (They mainly fall within the 4.5 % "CC7" group.) Also, there are varying degrees of coverage between the parishes. For instance, I have full coverage of the 1801 census from Solum, Holla, Bamble, and Porsgrunn. For Skien, Gjerpen, Eidanger, Brevik, and Drangedal I have mostly included those with close relations in the first four "core" parishes.

I have never tried to keep track of the number of profiles from each parish; one of my main interests has been to uncover the internal migration in the area, which is considerable.

There are also, naturally, lots of profiles where either birth or death records are missing, or both. Most of those evidently have entered or left the area, and have been impractical to trace further, although in many cases you can read from probates that such-and-such are living in remote places. But the overwhelming majority actually stayed in their home parish from the cradle to the grave. That is true not least for those who died before the age of eighteen, of which I can count a whopping 10,618 profiles in my database.

Just for fun, I reran the distribution count with my 3ggf Isach Abrahamsen as focus person:

38055 distances, average distance 7.863. 91.965% connected.

It seems like he is a lot closer to the center of the distribution.
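
The 91.965 % figure appears consistent with the counts quoted earlier in this reply, assuming the denominator is the total of 42,418 profiles minus the 1,038 private ones. That choice of denominator is an inference from the numbers, not something stated outright. A quick check:

```python
total_profiles = 42_418    # total quoted earlier in this reply
private_profiles = 1_038   # excluded "private" profiles
connected = 38_055         # distances found with Isach Abrahamsen as focus

# Assumption: the percentage is taken over all non-private profiles.
eligible = total_profiles - private_profiles
share = 100 * connected / eligible
print(f"{connected} of {eligible} = {share:.3f} % connected")
```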

And for completeness' sake, the distance count for profiles already entered into Wikitree with myself as focus profile again, ie. the intersection between my own database and Wikitree:

The percentiles are off because they are calculated from the entire dataset, not from the current selection. OTOH, they show that I'm 84 % "done".

Neat. The distribution around your 3GGF is more spherical than the more hemispherical distribution around yourself, resulting in the lower mean connectivity.

I often wonder what the theoretical connectivity is for populations of a certain size with certain characteristics, and what you have here is a dataset that could be used to validate mathematical models of connectivity.

I’m also curious to know if there is perhaps some subsection of the population that is largely self-contained and has few connections to the broader population.
There was certainly a sharp division between the upper class and the population at large, and it was rarely crossed, but for the occasional illegitimate child. I have bothered little with upper class profiles, but a few of them are connected in my dataset. For instance, the group of 14 profiles at distance 22 are the wife and children of one of the two distance 21 profiles, the bailiff Bendix Plesner (1769-1827). He is quite closely related to Henrik Ibsen and the entire Skien upper bourgeoisie. I could probably shrink the distance with a little research.

After all, the most unlikely relatives are only a wedding away.
As for the "theoretical connectivity" I have a hunch that you can connect more than 90 % of a random population within a radius of say 50 kilometers in 10 or 11 steps, at least if we are talking about the time before the great 19th century migrations.
I would agree with that, for the endogamic area of my paternal ancestors. https://cghp-poher.net/aire-d-etudes.html

My focus profile Vatant-5 was born in Paule, right in the middle of the map.
I’m also focused on thoroughly building out a discrete population, although a 19th century one with a fair bit of both in and out migration.

I can’t calculate counts by distance like you have, but I manually looked at the distance for a hundred people born in the same year and colony as my GGGM and the median distance to them is nine. (One hundred is a little higher than half the total births that year)

These are children born to parents who had arrived in the colony no more than fifteen years earlier, and the parents were almost all initially unrelated to each other.

Mark, I ran the 100 Circles query on your profile. You are better connected than Leif, and your distribution is unimodal with a peak at d=23. So, your "local cluster" is hidden behind the general steep growth.

1 8
2 15
3 39
4 157
5 589
6 1700
7 3398
8 5457
9 9519
10 17613
11 34200
12 65829
13 121455
14 220555
15 373668
16 577725
17 842105
18 1215693
19 1775948
20 2444827

 

Regarding the question of a self-contained subsection, are you referring to this OPS or more generally? People can be organized by religion, a good example being the Amish. Class distinction might be/have been true in Europe, but not really the case in the US (it exists, but a much fuzzier line).

My study hasn't gotten that far yet, but a Catholic vs. Protestant line might exist. We have immigrants from all over northern Europe, and marrying within one's own ethnic group only lasts for about one generation. After that, they're all speaking English, and Germans will marry Norwegians or British or whatever.
Rob, I was just curious to learn if Leif had identified such a group in his work.

In my work, and to echo something Leif said, even though I have an “upper class” that tends to marry outside the area, there’s just enough illegitimate births to keep everything well connected. Also, with a population that skewed male there was plenty of opportunity for women to “marry up” and cross between economic, social or religious groupings.
+7 votes

We have seen a similar shape with 100 Circles (which is roughly what you are doing, but for the whole DB). In my experience the distribution is not exactly a normal distribution, but is slightly steeper on the left and less steep on the right. I think it ended up fitting pretty well to a log normal distribution, IIRC. No idea why that distribution either though. There is still a lot that seems to be unknown about the structure of genealogy networks.

If you squint closely at your distribution it also seems to be slightly steeper at the top, so I think it matches the general pattern.

What place is this? You have left us with so much mystery :)

by Shawn Ligocki G2G6 Mach 2 (30.0k points)

Shawn, thank you for an interesting and informative answer. Yes, the similarity to a Gaussian distribution is probably incidental.

The study is from my own birth place, the South-East corner of Telemark county, Norway. The area is called "Grenland", but with somewhat diffuse boundaries. In my first reply to Mark above, I have enumerated the parishes and the approximate time period, ie. 1650-1850.

I started with genealogy in 1997 with my own ancestors, but very soon I bought film copies of parish registers and started to transcribe and enter them systematically. Today I have a tree-structured source table with 53,600 baptisms, 51,204 burials, and 21,090 weddings, among others.

You can read some more on my profile page, and in the description of my database "Yggdrasil" on Google Code, and on my "Solumslekt" page.

Shawn, the distribution of distances in a one place study (of a place I suspect to be quite endogamic) is hard to compare with the global distribution in the Big Tree. For the record and out of curiosity I ran the 100 Circles query on Leif's profile and it has a bimodal distribution, characteristic of thorough local work.

Here are the global figures I got right now for the 20 first circles. The first peak around d=10 is clearly mostly Leif's local profiles.

1 2
2 5
3 31
4 86
5 195
6 408
7 967
8 2330
9 5369
10 9006
11 8098
12 5600
13 3649
14 3473
15 4886
16 8657
17 18388
18 43978
19 110547
20 262738
Over C20 the peak is at d=27, which is quite typical of an active European WikiTreer (mine is at d=26).
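
The first bump can be read straight off the table above. As a small sketch, the counts are copied from the 20 circles listed, and the helper name `interior_extrema` is made up for this example; it simply looks for circles that are larger or smaller than both neighbours:

```python
# Counts for circles 1..20, copied from the table above.
circles = [2, 5, 31, 86, 195, 408, 967, 2330, 5369, 9006,
           8098, 5600, 3649, 3473, 4886, 8657, 18388, 43978, 110547, 262738]

def interior_extrema(counts):
    """Return the distances (1-based) of interior local peaks and troughs."""
    peaks, troughs = [], []
    for i in range(1, len(counts) - 1):
        prev, cur, nxt = counts[i - 1], counts[i], counts[i + 1]
        if cur > prev and cur > nxt:
            peaks.append(i + 1)
        elif cur < prev and cur < nxt:
            troughs.append(i + 1)
    return peaks, troughs

peaks, troughs = interior_extrema(circles)
print("local peaks at d =", peaks)
print("local troughs at d =", troughs)
```

Within these 20 circles the only interior peak is at d=10 and the only trough at d=14; the second, larger peak lies beyond the part of the table shown here.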

As for the shape of the global distribution, see examples in the 100 Circles page showing that, although a log-normal distribution seems frequent (like the QEII example), there are many cases with a "secondary bump". It's completely unclear at the moment if it's a bias of the current WT distribution, or if it's a "real world" pattern.
Bernard, thank you for the larger view of my "100 circles" count. There are a few other Wikitreers who have parts of their trees in the same area as me, but with nowhere near my numbers. As can be seen from my d=10 in the third table above, I have 8,647 Wikitree profiles compared to the 9,006 in your table, that is 96 %.

My initial thought was that a typical distribution in a fully researched and registered "ideal" world should rise monotonically, but I can also see reasons why there might be a double-bumped curve, with a "local" endogamic distribution overlaid upon a more open "global" distribution.
See the distribution for Mark in the other answer, which is indeed the typical one to expect for "well-connected" profiles.

The so far unexplained second bump I was speaking about is *after* the peak. See in the 100 Circles page the distribution for our top connected Patty LaPlante.

Your distribution is easy to interpret, the one of Patty is much more mysterious.

Ah, I see. I looked at Patty's graph on the 100 Circles page now, and can clearly see what you mean. Mine is probably more similar to that of François du Toit, with "a heavily connected local cluster" as you say.

Indeed. Or the one of Jean Joseph Vatant. The du Toit case is quite peculiar, the South African "local" cluster being quite big, in the 100k range.
+4 votes
Interesting.

This is the way I would describe the results:

* It takes a certain number of "steps" to get to the "interior" of the population. For Lief's table, he's obviously in a part of the population on the "edge", which is to say he's (1) living and (2) any spouses, children, siblings, and probably 1st cousins, uncles, etc. are simply not within the data set. You have to go a few steps even to get into the time frame of most of the population in the data set. The table given later, for his gt-gt-gt grandfather, bypasses this.

* Once you get to a "critical mass" of sorts, up to about 100, then the numbers roughly double (or a bit more) every time - it's exponential growth.

* UNTIL you get to the point where further exponential growth is practically impossible. "dist=10" has over 9,000, with 18,000 of the population already assigned to the various circles. There's only about 20,000 people left - if "dist=11" were over double the "dist=9" number, pretty much all the rest of the population would be in it, and then "dist=12", and all the ones after, would practically have to be ZERO.

* What happens instead is that the lack of "unassigned" people in the population limits how many you get in "dist=11". In very rough terms, it looks to me like each successive "circle" includes about half of the population that is REMAINING. This is basically the exponential process again, only in REVERSE.

* It's commonplace to call a distribution that starts low, has a "hill", and then comes back down, a gaussian (or "normal") distribution. We're told that the normal distribution shows up in reality all the time, so it's an easy conclusion to jump to. Probably, though, that ought to be avoided in cases like this, where the range doesn't even extend into negative values (since the normal distribution goes from minus infinity to plus infinity). Just look at a binomial distribution, or a hypergeometric distribution, for example. There are lots of distributions that go up, and then come back down. Under certain conditions, they're even pretty symmetrical.

* Regarding the 100 circle discussion, it seems clear to me that when you have more than one peak, that represents the dominance of a somewhat isolated subpopulation within the greater World Tree population (which, although it is pretty big, it is STILL FINITE, and therefore you will get a "hill" in the distribution.) If that subpopulation is isolated solely because somebody has heavily researched their own family - and other profiles that would otherwise contribute to the curve have not been created, then that "hill" might diminish over time as new applicable profiles are added, but if it's because it's a genuinely endogamic population, then perhaps it won't.

* Cases where you see kind of a "bump" on one side of the "hill" or the other, undoubtedly just means that there are two "hills" that are too close together to be seen as distinct.
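
The doubling and the "half of what remains" pattern described above can be checked roughly against the numbers in this thread. This is only a sketch under loud assumptions: the counts are the first 13 circles of the 100 Circles table posted for Leif's profile (which is WikiTree-wide, so later circles are dominated by profiles outside the study and are left out), and the population is taken as a round 38,000.

```python
# First 13 circles from the 100 Circles table posted in this thread.
circles = [2, 5, 31, 86, 195, 408, 967, 2330, 5369, 9006, 8098, 5600, 3649]
population = 38_000  # rough size of the connected data set; an approximation

rows = []
assigned = 0
for i, count in enumerate(circles):
    remaining = population - assigned          # not yet placed in any circle
    growth = count / circles[i - 1] if i > 0 else None
    rows.append((i + 1, count, growth, count / remaining))
    assigned += count

for d, count, growth, share in rows:
    growth_text = "  n/a" if growth is None else f"x{growth:4.2f}"
    print(f"d={d:2d}  count={count:5d}  growth {growth_text}  "
          f"share of remaining {share:6.1%}")
```

Under these assumptions the growth factor stays a little above 2 for circles 6 to 9, and circles 11 to 13 each take somewhere between roughly 40 % and 60 % of the profiles still unassigned.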
by Living Stanley G2G6 Mach 9 (91.9k points)

Frank, thank you for your thoughts on this. There are a couple of issues here in which I disagree.

A Gaussian distribution does not need to reach negative values. A simple example that contradicts this is the average height of a population, which (I believe) rarely reaches negative values. Yet it is a valid example of a Gaussian distribution. You can also roll 6 dice a few hundred times, count the points of each roll and write them down. You will find a perfect Gaussian distribution with 6 and 36 as end points, with the peak at 21.
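
The end points and the peak of the dice example can be checked without rolling anything, by enumerating all outcomes of six fair dice. This is a minimal sketch; it confirms the range and the most likely total, while how closely the shape follows a Gaussian is a separate question.

```python
from collections import Counter
from itertools import product

# Tally the total of every possible roll of six fair six-sided dice.
sums = Counter(sum(roll) for roll in product(range(1, 7), repeat=6))
total = sum(sums.values())          # 6**6 equally likely outcomes
mode = max(sums, key=sums.get)      # the most likely total

print("range:", min(sums), "to", max(sums))
print("most likely total:", mode, f"({sums[mode]} of {total} outcomes)")
```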

There may be a real reason why my distribution looks "Gaussian", but it may also be coincidental.

"Normal" growth rates between circles do not apply in a one-place study. As its focus is a semi-rural, small-town area with a high degree of endogamy, you will find that most of the population is related to each other. The connection count will taper off at (roughly) the same rate that it rises, because there will be ever less connections to make as the coverage within the area is being saturated. There is no "remaining" population here. It seems like you assume that my study covers a small subset of the entire population in this area, which today counts about 120,000 persons, after an explosive growth in the period 1850-1950.

My study is not about my own family. It is about the entire population of a given area in a given time period. My own "family" plays a minor role in this: the seven first distances only make up four and a half percent of the entire dataset. The connection to my own family is of course not entirely incidental, as I chose for my study the area where both my parents had their roots.

Neither have I tried hard to hunt relatives outside of the given area, which of course might have boosted my CC7 - which I find rather uninteresting. Rather than spending endless hours leafing through church books all over the place, I find it infinitely more rewarding to transcribe them from end to end, and enter them into my database, where I can then create profiles just by "connecting the dots". That is how I have built up a database with 42,000 profiles, all sourced as fully as possible with transcripts of original records.

What I have done - and still do - is usually called "family reconstitution", ie. the process of piecing together profiles and families from original sources such as church records, probates, censuses, etc. I have even developed my own database, both backend and frontend, for this study.

BTW my name is Leif, not Lief.

Hi Leif,

1) Sorry I spelled your name incorrectly - we don't have a lot of Norwegians around here.

2) Your example about the average height of a population is OK, but even then you need to keep in mind that it's only an approximation of a normal distribution. If it were a 100% true normal distribution, there would be some infinitesimal probability of those negative heights (which is, of course, impossible). The thing is, you can undoubtedly go many standard deviations away from the mean and not come close to zero, so you can EASILY get away with that normal distribution approximation, and be very accurate with it. Your "distance" distribution does NOT have that going for it, so it's on shaky footing to start with.

We had a similar discussion on here some years ago about the probability distribution of cM values, for a given relationship level. I tried to point out that, while the author of a blog that some people were familiar with had assumed that the distribution would be a normal distribution, I had empirically seen what the distribution was in my own data set, and it was not even close to gaussian. (They wouldn't listen).

I'm not saying that your distribution is "not even close", but I've also had occasion in an entirely different technical setting where there was an insistence on using a gaussian where the relevant properties of the distribution were simply nothing like that of a gaussian, aside from being big in the middle, and tapering off on the ends.

It's just unwise to tether yourself to that description, if that's what you're doing.

3) If you add even just three random variables together, each independent, with identical uniform distributions, it starts to look somewhat gaussian. Adding 6 dice is adding together 6 such random variables, so yeah, it might do pretty well. I would not call it "perfect", especially since it is a discrete distribution, but it might match up pretty well. It also probably has the problem of not going out very many standard deviations from the mean - so the "tails" of the distribution would have to be inaccurate. The normal distribution would probably tell you you have a one in a million chance (or some such low number) of rolling a negative number. So the approximation is OK, as long as the application you're looking at doesn't care about what it says about that one in a million (or whatever) case.
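
To put a rough number on that tail, here is a small sketch that fits a normal distribution to the total of six fair dice (same mean and variance) and asks how much probability it places below zero. The helper name `normal_cdf` is just for this example.

```python
import math

n_dice = 6
mean = n_dice * 3.5            # 21.0
variance = n_dice * 35 / 12    # 17.5, since a single die has variance 35/12
sd = math.sqrt(variance)

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution with mean mu and std dev sigma."""
    return 0.5 * math.erfc((mu - x) / (sigma * math.sqrt(2)))

p_negative = normal_cdf(0, mean, sd)
print(f"mean={mean}, sd={sd:.3f}")
print(f"zero is {mean / sd:.2f} standard deviations below the mean")
print(f"normal approximation: P(total < 0) = {p_negative:.2e}")
```

The fitted normal curve gives a tiny but nonzero probability to an impossible negative total, while the real distribution stops dead at 6 and 36.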

4) It sounds like you're trying hard to disagree with me, with your talk of a "small town area", but you're not saying anything that's any different. The "population" I'm talking about refers to one of several things, depending on the context. First, is the 38,000 in your distribution. I also spoke of it in the context of other isolated populations that other people might be related to, resulting in an additional "hill" in their "100 circles" distribution. Finally, I'm referring to the 30M+ population of WikiTree, which results in the main "hill" on anybody's "100 circles" distribution.

You described your research as encompassing virtually all of the population of the area within a given timeframe. You say it's not about your own family, but you say you're in the database, and the people in it are heavily interrelated - and I didn't say anything about it being only about your family anyway. You say that there are 120,000 living there now - I would assume that most of them are NOT in your database, but that most of their ancestors ARE, so actually your database probably has about 1/4 of ALL the population. So your study DOES "cover a small subset of the entire population", but that's completely irrelevant to what I was saying anyway. I made no such assumption, nor was any such assumption implied.

The only relevance of your own family to the discussion is to explain the first few numbers in your distribution, which I describe as simply making your way to the main part of your population. Your "dist=1" count is 2, which apparently means it's just your parents. If you have a spouse, or siblings, or children, they are (hopefully) alive, but do not appear in the count because you haven't put many living people in your database. The "dist=2" count being 5 is likely your grandparents, plus some other relation. These first few numbers are clearly what they are because of who you have included in the database from your immediate family.  That's not a criticism - that's just explaining why the first few numbers (which are all about your own immediate family) are kind of artificial, and just about getting us from you to the main part of the population. It's the numbers that come after that which are what the real discussion is about.

My own "degree 1" number is 3, but if all people who ever lived had a WikiTree profile, it would be 8. My "degree 2" number, which is 13, would be doubled. I just don't see much value to adding profiles that only I can see (normally), and which can cause problems too.

5) You say 'The connection count will taper off ... because there will be ever less connections to make as the coverage within the area is being saturated. There is no "remaining" population here.' Actually, the "remaining" population I'm referring to is what's left of the 38,000 after the rest of them have already been assigned to the various previous circles (or "distances"). "The coverage ... being saturated" that you referred to literally occurs when the "remaining" population I referred to is down to less than about half of the whole population (the "whole" being the 38,000).

It's all about how, when an individual within the database is assigned to a "distance", they are removed from consideration, as far as being assigned to subsequent "distances". The people that have not been "removed" in that way are the ones that "remain" to still be considered for being assigned to the higher numbered "distances". I can't imagine what else you might have thought "remaining population" might mean in this context, but apparently you took it the wrong way.

6) As you can see, this distribution has exponential growth leading up to the top of the "hill", and exponential decay once you're past it. A normal distribution may look somewhat exponential in the tails, but is fairly linear leading up to (and after) the "hill". It doesn't really fit what we're seeing here, aside from the very crude "being big in the middle and dropping toward zero at the tails".
