Unsourced Profiles Status Post Source-a-Thon

+14 votes
424 views

As promised in a previous question, I continued to track Unsourced profiles through and after the 2021 Source-a-Thon.  I've updated the free space page with the results.  As predicted, the SaT resulted in a net reduction of about 38,000 unsourced profiles.  Since then, the number of profiles marked {{Unsourced}} has returned to the low levels of growth or decline observed a few months before the event. 

I see no benefit to further tracking of the {{Unsourced}} template, so there will be no more regular weekly updates to the page.  The underlying problems remain:  members are free to add profiles without sources and to leave them that way indefinitely, and WikiTree has no way to count how many such profiles actually exist.

WikiTree profile: Space:The_Sourcing_Loophole
in Policy and Style by Living Tardy G2G6 Pilot (771k points)

4 Answers

+9 votes
Herbert, I also despair about the level of unsourced profiles.  But my greater despair is the weekly increase in Data Doctors suggestions.  I usually work 2 states, Georgia and Ohio.  The suggestions (errors) increase by more than 500 each week, excluding FindAGrave suggestions.  I try to correct that many per week but am obviously falling behind, because I don't usually have time to do more than location suggestions.

Thank you for your efforts to count and post about the unsourced.
by Kathy Zipperer G2G6 Pilot (480k points)
+13 votes
First and foremost, I don't think anyone should despair. Clearly there are a lot of profiles which lack sources, but what you are not measuring is the rate at which those are created. We know there are a huge swath from some time ago, but we don't know how many are "created" versus just "identified" on a daily basis.

Secondly, your efforts to bring this issue to the forefront have helped for visibility, and there are teams of people who are participating in Data Doctors or Sourcing and who are working to fix these issues. The more profiles identified with the template the better, as then they can be corrected.

Similarly, identifying suggestions does not mean they all need to be corrected; each one flags a possible correction or an opportunity to research additional information. As the tree grows, those numbers will also grow. Every suggestion that is cleared corrects an issue for an entire family of people.

Additionally, your selection of profiles from the group consisting of missing birth and death dates is perhaps skewing your statistics (aside from the other unknowns related to templates not on profiles, etc.). I am also of the opinion that sources behind a paywall are entirely valid, and I am not sure why you included that in your list. A paywall makes a source harder to check, but so does a book that is not digitized yet or exists only in a book shop or a library somewhere; neither is invalid.

So, thank you for your time and effort here. I am not in any way opposed to requiring that a source be entered when adding profiles, but since the fields are free text, distinguishing a valid source from "because Jon said so" is next to impossible. If the numbers decrease significantly after Source-a-Thons, remain a small percentage of total profiles, and newly created or identified profiles are sourced, then we are "directionally correct" and should be proud of the work everyone puts in to create accurate profiles.

*edit: typo
by Jonathan Crawford G2G6 Pilot (286k points)
edited by Jonathan Crawford

Herbert's exclusion was to trees or profiles behind paywalls, not to real sources behind paywalls. The former is useless unless you have a subscription, because there is no way to tell whether the tree or profile is based on real sources or just on fantasies. The latter is useful, because at least it confirms that there is a source and the reader can evaluate how much credibility to give the assertion based on that source. 

Ah, ok, thank you Stu!

Thank you Jonathan.  To address your points in order:

1.  We are not able to measure the rate of change in the number of profiles without sources, because we are not able to get an accurate count of same without extremely laborious, essentially manual, review.  Kay's BioCheck app helps, but the process is still time-consuming and produces statistical estimates, not actual counts.  That is the essence of the issue.

2. I agree the more profiles identified the better, and that we can't correct profiles unless we identify them. The number identified with the {{Unsourced}} template rises and falls independently of the actual (unknown) number of profiles without sources.  As long as that holds true, the number of profiles identified has no utility for monitoring progress toward reducing the problem.  The remedy for this is to identify all profiles without sources.  That's a task for a bot.

3.  The issue has nothing to do with Suggestions, except to the extent Suggestions can help identify profiles without sources.  See point #2, above.

4.  My main analysis of this issue did not filter for presence or absence of birth or death dates in any way.  In particular, the study of 160 profiles, that yielded the estimate of 4.9 to 8.1 million profiles, used "a sample taken from all Open profiles" regardless of whether the date fields contained data.  Your criticism about "selection of profiles from the group consisting of missing birth and death dates" has no foundation in fact.

5. I rejected as sources "links to trees or profiles behind paywalls," not records behind paywalls. Such a thing put the profile in my "unsourced" bin only if no other valid source was provided.  Selection criteria defining a valid source have long been discussed on G2G and need to be established in order to identify all profiles without sources (as I maintain WikiTree must do).  In fact, each member now adding or removing an {{Unsourced}} template applies his or her own subjective criteria, which may be more or less stringent than what I used.

6.  Difficulty in detecting sources within free text does not justify not attempting it.  The BioCheck app shows that it can be done with some success.  Once in place, a detection algorithm can be expected to improve over time.

7.  Your statement beginning "if the numbers decrease significantly" invites untested or meaningless comparisons.  I do not mean to diminish in any way the efforts of those involved in the Source-a-Thon.  38,000 profiles removed from the total is a very good thing in which the participants can and should take pride.  Is it "significant?"  Compared to the estimated total of several million profiles without sources, I would say No.  If comparing to newly created profiles without sources, no one knows that quantity, so who can assess the significance?  If comparing to newly identified {{Unsourced}} profiles, please note that about 22,000 profiles were "identified" in the month preceding the Source-a-Thon.  See point #2, above.  As to whether we are "directionally correct," that depends on whether your "If" applies, and no one can judge that with any level of objective accuracy.  That is the essence of the issue, which brings us full circle back to point #1, above.
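As a rough illustration of the comparison in point 7, the sketch below does the arithmetic using only figures quoted in this thread: the net reduction of about 38,000, the estimated 4.9 to 8.1 million profiles without sources, and the roughly 22,000 profiles identified in the month before the event. The function name `share_percent` is made up for the example, and the estimate range itself is a statistical estimate, not a count.

```python
# Figures quoted in this thread (all approximate).
SAT_NET_REDUCTION = 38_000        # net drop in {{Unsourced}} profiles during the Source-a-Thon
ESTIMATE_LOW = 4_900_000          # low end of the estimated profiles without sources
ESTIMATE_HIGH = 8_100_000         # high end of the same estimate
IDENTIFIED_MONTH_BEFORE = 22_000  # profiles "identified" in the month before the event


def share_percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    # The reduction as a share of each end of the estimated total.
    print(f"Share of low estimate:  {share_percent(SAT_NET_REDUCTION, ESTIMATE_LOW):.2f}%")
    print(f"Share of high estimate: {share_percent(SAT_NET_REDUCTION, ESTIMATE_HIGH):.2f}%")
    # The reduction measured against one month of newly identified profiles.
    ratio = SAT_NET_REDUCTION / IDENTIFIED_MONTH_BEFORE
    print(f"Reduction vs. prior month's identifications: {ratio:.1f}x")
```

On these figures the reduction is well under one percent of the estimated total at either end of the range, which is the sense in which it is not "significant" against that baseline.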

Edited to correct typo.

Thanks Stu!  You are absolutely correct.

3. I was actually referring to Kathy's comment, and should not have addressed them both in the same reply, sorry for the confusion. 

4. Convenient of you to dismiss my point and claim "no foundation in fact" so flippantly, when in your own words for the study of 40 profiles you "collected forty profiles from the Data Doctors report of 8 August 2021, using the text query “B0 D0 Open”."

That you did not then use those criteria for the following 160-profile analysis makes me feel better about your findings. Thank you for clarifying what I did not find to be obvious.

6&7. I do not think the doom and gloom approach is warranted, as it appears a very real effort is being made to correct the issues that exist today. Indeed, even if we could measure all of the unsourced profiles accurately, that wouldn't solve the problem, because they would still need to be found. The fact that you can't programmatically prevent people from entering invalid sources (instead of nothing) means that making that a requirement is a strawman as well. Calculating the total number of unsourced profiles across the entire dataset is meaningless to any one person's particular family line, if it is well-sourced. 

Therefore, we should instead celebrate the success of the platform as a whole, the policies that we have in place, and continue to encourage everyone to source their profiles and correct where they see issues.  

I cede the remainder of my time to the floor, Mr. Chairman, and thank you for listening. 

And this goes for Unknown profiles as well!!!
+3 votes
The monthly Sourcerers challenge has a number of dedicated, experienced, knowledgeable members who clear about 1000 or more a month.   But yes, for every one cleared we seem to get new ones.   There are also weekly Saturday sourcing sprints, so lots of opportunity to source.   We just posted the Nov challenge thread in G2G.
by Laura Bozzay G2G6 Pilot (843k points)
+5 votes
Herb, I wish you would continue to update the page.  I think it serves as a reminder of the problem, if not a reliable measure of it.

I am a person who concentrates mostly on my own ancestors, with an occasional diversion into matters of historical interest (like accused witches).  So, sure, I can work on my own ancestors and not concern myself with all the garbage profiles on the rest of the tree.

But why?  Don't those of us who spend our hours on WikiTree have some stake in the reputation of the tree as a whole?  And if visitors--or members--do a search and encounter more profiles than not that are...well, pretty lacking...doesn't that reflect on all of us and our credibility?

There isn't any question at all that new, very inadequate profiles are being created every single day on WT, either in defiance, or ignorance, of the sourcing requirements.  And right now, you seem to be the only one making an effort to track that.
by Living Kelts G2G6 Pilot (555k points)
I take umbrage at that statement Julie! It discounts all the work that Ales has done for WikiTree+ and integration here, the work of those who developed the template, the Sourcerers who habitually work them, the efforts of Bernard and his Belle Epoque within the 100 circles (I'm just abusing the lack of accents here, sorry everyone), every project that maintains a list of Unsourced profiles within the project, etc.

It *is* being tracked elsewhere, but what Herbert is doing is restating it differently, it's hard to do, and is just pointing out what we already know.

He does have a very good point in that requiring something to be entered would make it harder, but it wouldn't fix it if someone was inclined to make a completely bogus but well-formatted citation. I know for a fact at least one member goes in and adds/removes spaces in their profiles every month, and adds change notes of random letters, so that's not beyond the realm of possibility.
Sorry, Jonathan.  I did not mean to discount anyone's work.  I know that many people are working to make WT better.  All I meant was that Herb seems to be bringing our attention to the ongoing issue in a way that others aren't.

Isn't it kind of a case of some people bailing out the ship, while others--while appreciating their efforts--think it would help to plug the leak?
Again, why is his saying "we have a lot of unsourced profiles" different from others who know we have a lot of unsourced profiles? Herbert is even using Ales' application to do his analysis.

Instead of plugging the leak, this would change the leak somewhat but not stop it up completely. In the meantime, it's just standing and pointing at the leaks and saying "we probably have a lot more" without actually fixing the ones we have.

Thanks Julie.  To be honest, I consider continued weekly reports of the number of {{Unsourced}} templates a waste of time.  As mentioned earlier, the number of templates fluctuates independently of the real number of profiles without sources, which makes it worse than unreliable as a measure of the problem.  To expand on that statement:  When the number of templates increases, we can't distinguish between members adding templates to existing profiles (as they did before the SaT) and members creating new "promise to source" profiles that receive the template automatically.  When the number decreases, we can't distinguish between members adding sources to clear templates and members removing templates because they don't like having the template on "their" profile.  The real number also increases and decreases regardless of the presence of the template, with members creating "Unsourced family tree" and "First hand knowledge" profiles (which do not get an automatic template) or adding sources to profiles that never received a template.
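To make the bookkeeping above concrete, here is a minimal sketch of the flows just described. All names and numbers are hypothetical, chosen only to show that the template count and the real count can move in different directions; none of them are WikiTree data.

```python
from dataclasses import dataclass


@dataclass
class WeekFlows:
    """Hypothetical weekly flows; every field defaults to zero."""
    tagged_existing: int = 0          # template added to a profile that already lacked sources
    new_promise_to_source: int = 0    # new profile without sources, template added automatically
    cleared_by_sourcing: int = 0      # sources added, template removed
    removed_without_sources: int = 0  # template removed, profile still has no sources
    new_untemplated: int = 0          # new profile without sources and without the template
    sourced_untemplated: int = 0      # sources added to a profile that never had the template


def template_change(w: WeekFlows) -> int:
    """Net change in the number of {{Unsourced}} templates."""
    return (w.tagged_existing + w.new_promise_to_source
            - w.cleared_by_sourcing - w.removed_without_sources)


def real_change(w: WeekFlows) -> int:
    """Net change in the number of profiles that actually lack sources."""
    return (w.new_promise_to_source + w.new_untemplated
            - w.cleared_by_sourcing - w.sourced_untemplated)


if __name__ == "__main__":
    # Made-up weeks: one where members only tag existing profiles, and one
    # where templates are removed while untemplated profiles are created.
    tagging_week = WeekFlows(tagged_existing=500)
    quiet_week = WeekFlows(removed_without_sources=300, new_untemplated=400)
    for name, week in [("tagging week", tagging_week), ("quiet week", quiet_week)]:
        print(name, "templates:", template_change(week), "real:", real_change(week))
```

In the first made-up week the template count rises while the real number is unchanged; in the second the template count falls while the real number rises.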

I will probably do additional statistical studies, and I think Paul Gierszewski will too.  The numbers work against us there, because it takes large samples (and lots of work) to see small changes in large numbers.  Or, it takes a long time to accumulate changes large enough to see with smaller samples.
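The sample-size trade-off can be sketched with the textbook normal approximation for a sample proportion. This is not necessarily the method used for the studies discussed here; the worst-case proportion of 0.5 and the 95% confidence level are assumptions made for the example, and only the sample size of 160 comes from this thread.

```python
import math
from statistics import NormalDist


def z_score(confidence: float = 0.95) -> float:
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def margin_of_error(n: int, p: float = 0.5, confidence: float = 0.95) -> float:
    """Half-width of the confidence interval for a proportion p estimated from n profiles."""
    return z_score(confidence) * math.sqrt(p * (1 - p) / n)


def required_sample_size(margin: float, p: float = 0.5, confidence: float = 0.95) -> int:
    """Smallest sample size whose margin of error does not exceed `margin`."""
    return math.ceil(z_score(confidence) ** 2 * p * (1 - p) / margin ** 2)


if __name__ == "__main__":
    # Sample size of 160 is taken from the thread; p = 0.5 is a worst-case assumption.
    print(f"Margin with n=160: +/-{100 * margin_of_error(160):.1f} percentage points")
    print(f"Sample needed for +/-1 point: {required_sample_size(0.01)}")
```

Under these assumptions a sample of 160 gives a margin of about 7.7 percentage points either way, while narrowing it to 1 point takes roughly 9,600 profiles, so a change of a point or two would be invisible in small samples.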

Jonathan, on WT the work of ordinary members (without Leader or Admin superpowers) is pretty transparent.  You can check Herb's contributions list, or mine, or anyone's, and our G2G profiles, just as we can check yours, and we can all make our own judgments about who is making valuable contributions to WT.  So I'll just say that isn't all Herb does.
Jonathan, with respect, I think you are misstating the situation and deflecting the discussion into emotional territory.  I'm saying much more than "We have a lot of unsourced profiles."  I'm saying we don't know how many we have, we don't know whether Sourcerers and others are gaining or losing ground, and if we take any steps to improve the problem, we will have no idea whether said steps work.  I'm also saying that focusing on what happens with the number of templates provides an utterly unrealistic picture of the state of things.  None of which is a criticism of Aleš or Sourcerers or anyone else working to improve our tree, and I resent your repeated efforts to imply otherwise.  And no one has said anything about not fixing the deficient profiles we already have.  And before you suggest that I'm "just standing and pointing at the leaks" you might take a look at my Contributions.


...