What improvements would you make to Search/Matching? [closed]

+38 votes
876 views
Hi WikiTreers,

The next big project the tech team is going to work on is improving search and matching/duplicate finding.

There have been a lot of suggestions over the years, but I know not everyone posts when they encounter a bug or have an idea for how something can be improved. So I'm asking for your ideas and suggestions!

If possible, please include an example of the input data and what sort of results you expect to get/don't expect to get.

Thanks!
closed with the note: Thanks for all your feedback! We are going to start working on this now :)
in The Tree House by Jamie Nelson G2G6 Pilot (638k points)
closed by Jamie Nelson

25 Answers

+23 votes
I noticed that the redesign has moved the 'other last names' off the primary page so it's not possible for the search to be run on them.  I know plenty of times they're just alternate spellings which aren't going to make much difference, but when searching for married women it can be really important to use these since sometimes LNAB ends up being LNAFM (Last Name at First Marriage).
by Celia Marsh G2G6 Mach 6 (63.0k points)
Yes, especially with older periods where there are multiple different spellings.
True, *most* of the time it's used for Stevens/Stephens/Steven, but something like Aubinière/de l'Aubinière is going to make a big difference in search suggestions.
I was thinking more on the lines of Hutchinson/Hocheson/Huchynson etc !
As far as I know, the search that is done when we create a new profile has never looked at matches to "Other Last Names." Other Last Names and Other Nicknames are searched in Name Search, but not in Add Profile. Including Other Last Names on the profile creation form and also in the search for matches would decrease the instances of inadvertent duplicates.

I have also wished that multi-part names in the Other Last Names field (example: Van Allen) would be searched as if they were concatenated names (example: VanAllen), instead of being treated as two separate names (example: "van" and "allen"). I often deal with people who have far more than two last names to be entered (including multi-part names), and I hate the idea that the "other" names won't show up in search unless I incorrectly concatenate the names that are in two or three parts.
Seconding adding Other Last Names to the search list - I like to add both parents’ last names as the LNAB for the Mexican profiles I manage, and then put only the father’s last name in the other last names field. It’s more accurate and follows the Hispanic Naming Conventions page here on WikiTree, but makes it really difficult to find matches sometimes.
+28 votes
Eliminate all of the matches that are not even close to either the given name or the last name.  Some of those long lists, for comparison, have dozens of people who are bizarre anagrams that include wild card letters and share only a few of the same letters and make completely different words.   

Maybe restrict matches to names that differ by no more than two letters in each word, instead of selecting only a few of the letters in the name and adding a bunch more letters to make names that bear no resemblance to the name of the person being matched. Sometimes those very long lists have only a few or no people with the same given name.
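One way to read the "no more than two different letters" idea is as an edit-distance threshold. The following is only a minimal sketch under that assumption: the function names and the threshold of 2 are illustrative and say nothing about how WikiTree's matching actually works.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: each insertion, deletion, or substitution costs 1."""
    a, b = a.casefold(), b.casefold()
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


MAX_DIFF = 2  # hypothetical threshold, taken from the "two letters" suggestion


def is_close(name: str, candidate: str, max_diff: int = MAX_DIFF) -> bool:
    """True when the candidate is within max_diff single-letter edits of name."""
    return edit_distance(name, candidate) <= max_diff
```

With this threshold, Stevens/Stephens would be kept and Souza/Sáez would be dropped. Letter counting alone cannot tell a plausible variant from an unrelated name that happens to be one letter away, so it would probably need to be combined with a curated variants list.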
by Patricia Roche G2G6 Pilot (829k points)
Start off the list of matches with people whose name, location, and dates are closest.   Don't show people whose names are not a reasonable alternate spelling, e.g. for Miller show Millar, Muller, and Mueller, but not Mulder and Milner.
This amounts to dropping the connection to name alternatives kept at WeRelate, I think. It may have been a good idea at one time, but may not be as great in the long run.
It definitely doesn't get updated often enough, and it isn't broad enough as the base of names grows, so it ends up being not very helpful.

This amounts to dropping the connection to name alternatives kept at WeRelate, I think. It may have been a good idea at one time, but may not be as great in the long run.

It's likely we are going to do one last import from WR, and then create an interface for sysops to add/remove variants.

Sounds great, Jamie!
One issue I have had with werelate variant names is that they often do not bother to put a name pair in their database if their computer judges one name to be unusual and the two names have the same Soundex code. That may be OK on a site that uses Soundex, but it doesn't work here, and it means that the variants database doesn't include some name pairs that are chronic sources of duplicates here.

Jamie

The search function already works pretty well for me. It's the results of the searches that are my biggest problem. I won't sidetrack your post, but I suggest you open a topic on how to disengage from the WeRelate errata. I have many suggestions, including not having sysops do the extra work but making WikiTreers responsible for maintaining the related surnames.

Nick, since the WeRelate name variants are related to search, it's fine to post your suggestions here (although maybe in a new answer since Patricia's post didn't mention WeRelate).
+31 votes

I would love to see the relevant locations show up first. For example, this person is in Madeira, Portugal:

  • Maria Francisca De Souza Minas Gerais, Brazil - ~  [compare] (De Souza-213)
  • Maria Rita de Souza 1 Dec 1785 Valbom, Gondomar, Porto [Portugal] - ~ . [compare] (Souza-329)
  • Mary Souza - ~  [compare] (Souza-90)
  • Maria Rosa Suazo 16 Jan 1772 Embudo, San Juan, New Mexico - 1831 ~ . [compare] (Suazo-27)
  • María Antonia Sáez unknown - unknown ~  [compare] (Sáez-19)
  • Maria Joaquina deSousa - . [compare] (DeSousa-2)
  • María Antonia Sáez unknown - unknown ~  [compare] (Sáez-18)
  • María Antonia Sáez 1767 Tepehuánes, Durango, México -  [compare] (Sáez-20)
  • Maria Joaquina DeSousa - ~  [compare] (DeSousa-4)
  • María Sáez Hornos (Logroño) - ~. [compare] (Sáez-1)
  • Maria Mendes De Sousa - ~  [compare] (De Sousa-99)
  • Ana María Sosa 1775 - ~  [compare] (Sosa-141)
  • Maria Cecilia Sosa 1775 - ~  Flores. [compare] (Sosa-155)
  • Maria Alvares Teixeira Sousa - ~  Heinzelmann. [compare] (Sousa-25)
  • Maria Sousa Bandeiras, Madalena, Pico, Açores [Portugal] -
by Mindy Silva G2G Astronaut (1.1m points)
I agree heartily. Better location matching was my first thought when I saw this question.

I think there has been talk about using similar options in the matching step of the creation process to those available on the Search page; that would be nice. If I came across a Maria Suse (imaginary example) born in Sweden, it would be nice not to have the Marias born in Portugal, Brazil, et cetera show up at all. Maria Sousa with birth place unknown, I would have to check.

But when I think of the combinations needed to avoid duplicates my mind boggles...
+37 votes
Please please please filter out locations that are completely incompatible with the birthplace entered, based on continent, country and province/state/county.  It takes so long checking lists of predicted matches for people with common surnames whose birthplace renders them implausible as a match.
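A minimal sketch of the coarsest level of this filter (country only), assuming the country is the last comma-separated part of the place name. That assumption is a simplification: real place fields use abbreviations, historical names, and bracketed forms such as "Porto [Portugal]". The function names are hypothetical.

```python
def country_of(place: str) -> str:
    """Return the last comma-separated part of a place string, case-folded."""
    parts = [p.strip() for p in place.split(",") if p.strip()]
    return parts[-1].casefold() if parts else ""


def location_plausible(new_place: str, candidate_place: str) -> bool:
    """Keep a candidate unless both countries are known and they differ."""
    new_country = country_of(new_place)
    cand_country = country_of(candidate_place)
    if not new_country or not cand_country:
        return True  # unknown location: the candidate cannot be ruled out
    return new_country == cand_country
```

Candidates with no birthplace are deliberately kept, since an empty field says nothing about whether they match.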
by Corinne Morris G2G6 Mach 2 (25.9k points)
Amen and hallelujah to this! If my person was born in USA and I'm getting hits from England, New Zealand, and Australia, not helpful!
Beyond that is the problem of the large numbers of profiles which still have no location data at all. You often have to open those profiles to make sure they are not a match.

The comparison algorithm in the Create Profile code may not be able to help much here. It is good that, after a recent change, new profiles must have at least one location. However, in the medium term, frequent Data Doctor challenges to add locations to profiles that don't have any could go a long way to speeding up match checking during new profile creation.
I agree totally Jim.  This is why I've personally made it a priority for a couple of years or more to look for a birthplace for as many of the blank birthplaces that come up as suggested matches for me as I can, and I'm willing to carry on doing that.  It's actually been a great way to learn a little about sources in other parts of the world.  I'm sure there are also a lot of other people out there adding birthplaces - a lot of the blank ones I see were uploaded in GEDCOMs, and we have teams of people who focus on going through old GEDCOMs that are known to need work.  Of course any location information is valuable, but birthplace is the only location that's shown in predicted matches.
Emma, you’ll appreciate how those of us in New Zealand feel every single day especially when the managers of American profiles assume the rest of the world knows the abbreviations for American states.
+25 votes
In addition to enhanced location filtering, as already suggested:

Enhance date filtering, based on the specified (or a default) plus/minus years so that potential matches that have only birth or death dates are filtered out. For example (including the plus/minus):

Eliminate potential match died before birth

Eliminate potential match married before birth

Eliminate potential match born/married after death

Look inside the potential match persons at fields we can't see due to privacy to eliminate based on dates

Depending on available dates use things like birth or death plus/minus something like 150 to eliminate obvious non matches

Completely undated profiles are obviously more difficult, but you might be able to get approximate dates from connected relatives (much like the suggestions report does)

Any common name is good for testing. There are quite a few Samuel Reed. There are more than 20K John Smith and 240 of those are undated.
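The elimination rules above could be sketched roughly as follows. This is an illustration only: the `Person` record, the year-only dates, the ±2 slack, and the 150-year cap are assumptions made for the example, not a description of WikiTree's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    birth: Optional[int] = None     # year, or None when unknown
    marriage: Optional[int] = None
    death: Optional[int] = None


def _before(a: Optional[int], b: Optional[int], margin: int) -> bool:
    """True when both years are known and a is more than `margin` years before b."""
    return a is not None and b is not None and a < b - margin


def dates_compatible(new: Person, cand: Person,
                     slack: int = 2, max_span: int = 150) -> bool:
    """Return False only when the known years make a match impossible."""
    if _before(cand.death, new.birth, slack):      # candidate died before birth
        return False
    if _before(cand.marriage, new.birth, slack):   # candidate married before birth
        return False
    if _before(new.death, cand.birth, slack):      # candidate born after death
        return False
    if _before(new.death, cand.marriage, slack):   # candidate married after death
        return False
    # A birth and a death more than max_span years apart cannot be one person.
    if _before(new.birth, cand.death, max_span):
        return False
    if _before(cand.birth, new.death, max_span):
        return False
    return True
```

Completely undated candidates pass every check, so they would still need a separate approach such as estimating dates from relatives.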
by Kay Knight G2G6 Pilot (606k points)
Here as well it would be nice to have options like those on the Search page, to widen or narrow the date span.

If I have someone with an "international" name like Jan Johansson, who lived and died in Sweden with exact dates of birth and death known, I don't think I should need to compare people born ±2 years within his birth date. The same year would be enough. On the other hand, if I know that he emigrated and I don't know when or where he died, it's a lot more likely that there is an existing profile for him with a very vaguely estimated birth year, so that I might want to check years within a wider span than ±2.
As well as getting approximate dates from family members, for those without locations, you can often get approximate locations from family members, so if the new person was born in England and the suggested match without locations comes from a family of Americans, it's probably not a match. In this case, the suggested match can either be filtered out or relegated to the bottom of the list. If they are displayed (not filtered out), the person's family members' locations can be displayed instead of having a blank space where the locations should be.
Ian, I don't see how WikiTree should be making inferences like this. That takes human judgement. People should be better about putting in a general location and date (and the new profile creation system helps with that, but lots of old profiles exist with the problem).

Rob, it's all about inferences, isn't it? Some people are suggesting that - for example - Ian McLaurin Smith is probably not a good match for John William Smith. We can infer from the fact that the names are very different that this is probably not the same person. If the data shows that the John William Smith (for example) that I'm entering was born in England, and another John William Smith (without a birth location entered in the database) was the middle child of five siblings, all of whom were born in Kentucky, the system can infer that this is probably not the same person. In this case, I would suggest the system eliminate John William Smith of (very probably) Kentucky from the list of suggested matches; failing that, being a highly unlikely match, he should be relegated to the bottom of the list and the reason for his low position (the birth places of his family members) should be shown. Sorry if I just repeated myself, but this makes sense. What do you do when there's a suggested match with no location? You either ignore it as it's among a long list of other suggested matches, or you check out the person's profile and/or family members for hints of where they lived. The system can make this whole process much more efficient.

In the WBE, we have the Suggested Matches Filters feature, which - after clicking a button - filters out suggested matches so that you can look for the most likely correct match.  There are three buttons: Location, Name, and Date.

Location

1 click: Filter out people by country (if they were born in a different country, or if none of their family members were born in the same country as the new person);

2 clicks: Filter by other words in the location fields.

So clicking 'Location' twice should only leave people born in a place with the same name in the same country.

Name

1 click: Filter people out according to the 'middle name' status.  If our new person has 'no middle name' checked then anyone with a name in the middle name field is removed (I should probably revise this in light of the recent changes to the system). If 'no middle name' is not checked and a person has a middle name which is different from our new person's middle name, they are filtered out.

2 clicks: Filter out people with different given or family names.

Date:

1 click: Filter out people born 2 years or more before or after our new person.

2 clicks: Filter out people born in a different year from our new person.

A third click of any of the buttons removes the applied filter, restoring the original list of suggested matches.
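As a rough illustration of the click cycle described above (this is not the extension's actual code; the function name and the year-only input are assumptions), the Date button could behave like this:

```python
from typing import List


def date_filter(new_year: int, candidate_years: List[int], clicks: int) -> List[int]:
    """Apply the Date button after `clicks` presses; every third press resets."""
    level = clicks % 3
    if level == 0:
        return list(candidate_years)  # no filter applied
    # 1 click: drop anyone born 2 or more years away; 2 clicks: same year only
    limit = 2 if level == 1 else 1
    return [y for y in candidate_years if abs(y - new_year) < limit]
```

For a new person born in 1850 and candidates born in 1847, 1849, 1850, and 1852, one click keeps 1849 and 1850, two clicks keep only 1850, and a third click restores all four.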

This feature also adds the locations of family members to the information of the suggested match if they don't have any locations and their family members do.

This isn't a perfect system, but I think each of these ideas is a valid way of narrowing down the currently (sometimes) huge list of suggested matches.

Ian, I would be careful about filtering out with and without middle names.  Many Americans aren't familiar with the naming practices of combining all the first names used by many countries (I had to learn about that myself).  I think we would miss potential matches on new profiles made by newer people (and they make many of the duplicates).
+22 votes
Location and date filtering as already suggested are big ones.

Also when the input is only a name, like when searching from the home page or a profile page, etc., I'd like for the top results to be ones that match exactly what I put in, especially when I've given a first and middle name. If I search for John William Smith, there are probably 20 results that have either John or William as a first or middle name but are not actually John William Smith before getting to an actual John William Smith. The fifth result is Ian McLaurin Smith. The name variants can be helpful but I'd like to see exact matches first.
by Christy Melick G2G6 Pilot (110k points)
Yes! Exact name, date, and location matches should be at the top. When the exact duplicate is number 20 on the list that I just had to scan meticulously through, that's frustrating.
+13 votes

Many Anglo-Saxon first names start with Æ, which is the normally recognised spelling and is also the practice on Wikipedia, but some people will be searching for them using the initial letters Ae or just A. Is it possible to program the search engine so that search results for names beginning with Ae and A also bring up names beginning with Æ? This is what Wikipedia has done.

One example is https://www.wikitree.com/wiki/Wessex-29, one of the Anglo-Saxon kings.
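A minimal sketch of how such folding might work, assuming a simple prefix search over stored names. Unicode normalization does not split Æ into "Ae" on its own, so the ligature is handled explicitly; the function names and the example name Æthelstan are illustrative only.

```python
import unicodedata


def fold(name: str, ae: str = "ae") -> str:
    """Case-fold, replace Æ/æ with `ae`, and strip combining accents."""
    text = name.casefold().replace("æ", ae)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_keys(name: str) -> set:
    """Keys under which a stored name can be found: Æ read as 'ae' and as 'a'."""
    return {fold(name, "ae"), fold(name, "a")}


def prefix_matches(query: str, name: str) -> bool:
    """True when the folded query is a prefix of any key of the stored name."""
    folded_query = fold(query)
    return any(key.startswith(folded_query) for key in search_keys(name))
```

Under these assumptions, searches for "Æthelstan", "Aethelstan", and "Athelstan" would all find a profile stored as Æthelstan, while "Ethelstan" would not.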
by Michael Cayley G2G6 Pilot (233k points)
+16 votes
This is probably hard to implement and therefore not likely to happen, but it would be nice if search could take into account prepositions in the last name.

For instance if I search on van den Broek, I would like matches to van den Broek (duh) and Broek (so name without prepositions), but not van den Berg (which is a totally different name, but comes up now because of the matching on van den B).

And conversely if I search on Broek I would like to have van den Broek show up as well.
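A sketch of one way to do this, assuming a small hand-picked list of prefix words (the list here is illustrative and far from complete). Leading prefixes are set aside and only the remaining core of the name is compared; at least one word is always kept, so a surname that consists of just "Van" is left alone.

```python
from typing import Tuple

# Illustrative, not exhaustive: lowercase words treated as surname prefixes.
PREFIXES = {"van", "den", "der", "de", "het", "ter", "ten", "te"}


def split_surname(surname: str) -> Tuple[str, str]:
    """Split a surname into (leading prefixes, core), always keeping a core word."""
    words = surname.casefold().split()
    i = 0
    while i < len(words) - 1 and words[i] in PREFIXES:
        i += 1
    return " ".join(words[:i]), " ".join(words[i:])


def surnames_match(query: str, candidate: str) -> bool:
    """Match on the core of the name, ignoring any leading prefixes."""
    return split_surname(query)[1] == split_surname(candidate)[1]
```

This compares cores exactly, so "van den Broek" matches "Broek" in both directions but not "van den Berg". Spelling variants of the core would still need separate handling.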
by Joke van Veenendaal G2G6 Pilot (100k points)

From the page https://www.werelate.org/wiki/Special:Names?type=s&name=vandenbroek I get the impression that the WeRelate computer regards almost all names in the form vande*b*r* as variants. (And the computer also accepts vandenbos as a variant!)

ADDED: I just now submitted changes on that page, and the current display of the WeRelate page shows the effect of my changes. Here's what I submitted:

Change history

09:06, 5 April 2023 EllenSmith (Talk | contribs)

"Matches" rejected: - vandenberg vandenberge vandenbergh vandenberghe vandenberk vandenboer vandenbor vandenborn vandenbos vandenboss vandenbrand vandenbrande vandenbranden vandenbrekel vandenbrink vandenburg vandenburgh vandenhoek

"Matches" accepted: + vandebroek vandenbrock vandenbroeck vandenbroecke vandenbroeke vandenbroucke vanderbroek vanderbrook

Regarding "Broek" versus "van den Broek," werelate automatically treats them as equivalent. Their variant names pages for prefixed names say: "The unprefixed form of the name and its variants will also be included in searches."

WikiTree search does not have a similar feature. (But if "van den broek" is in the Other Last Names field, WikiTree search will look for matches to "van" "den" and "broek". It would be nice if it could return results for "broek" but ignore "van" and "den." Unfortunately, this could create other issues, as I know that there are some U.S. people who have "Van" as their given name or as their family name. I ran a search here for the name * Van and got a result of 80506 Matches for * Van, including people who have Van as their surname, plus thousands of matches that I assume are for people who have a name like "van Broek" in the Other Last Names field.)

+18 votes

It is certainly a tricky problem to address. A few thoughts:

  • We don't necessarily want the same results in a Search as we do when looking for duplicates/existing profiles in Add person. Search may want to show more.
  • It could be useful to have an exactness parameter on the location search.
  • I think the sorting of the results may be a harder problem to solve than the filtering. It is easy to come up with rules for filtering but the sorting is tricky. You don't want to just sort by best fit for one field and then by another field since, if there is a match that is slightly off on the name match but fits the birth and death dates perfectly, you probably want that higher than an exact match on the name and way off on the dates.
    There probably needs to be some "points" system for ranking them based on weightings for the match exactness for each field.
  • I use the WikiTree Browser extension features to improve the matches, especially the buttons to filter by location. Maybe some of these features could be built into the form.
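The "points" idea in the list above could be sketched as a weighted sum of per-field match scores. The weights and field names below are invented for illustration, and how each field's 0-to-1 exactness score would be computed is left open.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FieldScores:
    """Match exactness per field, from 0.0 (no match) to 1.0 (exact)."""
    name: float
    birth: float
    death: float
    location: float


# Hypothetical weights; they sum to 1.0 so the total also lies in 0..1.
WEIGHTS = {"name": 0.4, "birth": 0.25, "death": 0.2, "location": 0.15}


def total_score(scores: FieldScores) -> float:
    """Weighted sum of the per-field exactness scores."""
    return sum(weight * getattr(scores, field) for field, weight in WEIGHTS.items())


def rank(candidates: Dict[str, FieldScores]) -> List[str]:
    """Candidate ids ordered from best to worst total score."""
    return sorted(candidates, key=lambda cid: total_score(candidates[cid]), reverse=True)
```

With these weights, a candidate whose name is slightly off (0.8) but whose dates and location are exact scores 0.92, while one with an exact name, an exact location, and dates that are way off scores 0.55, so the first is listed higher, as the sorting bullet asks.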
by Rob Pavey G2G6 Pilot (213k points)
+17 votes
I'd probably add filters in place that you could optionally add or remove based on your comfort level of the profile you're creating. A lot like the Find function. So if you only want to look at (for example) exact Last Name matches, you can click that on and it will filter out all those that don't match. Only want matches within 2 years of the date fields? Click the filter and it will filter out all those that don't match the new criteria. Same with location. Add the filter for only those born in New York, and more fall out.

By default, I'd leave the filters off, but with all the variations people are asking for, I'd add the function, but give people the option to turn it on or off. I'd intentionally leave all filters off initially, as you do want people to look over the list, but when it comes up as these 100+ match the criteria, it's just too overwhelming unless you can pare it down.
by Scott Fulkerson G2G Astronaut (1.5m points)
+16 votes
Location!!

Most of my research is in Ireland, so don’t show me anyone not born in Ireland. I’m not going to look at anyone born in USA or Australia, and I miss actual matches that are displayed.

Once the location is more specific it would be nice to show a wider range depending if you know the date of birth or if it is an estimate. Obviously the match may have an estimated date so it still needs to cover that potential.
by L Greer G2G6 Mach 7 (78.0k points)
+6 votes
A few suggestions:

1. Ages listed oldest to youngest of those in family tree including siblings of ancestors etc. eg I’d like to find the oldest person I’m (closely) related to. I’m investigating if my cousin is the oldest ‘ever’ in our family.

2. Cause of death. Include a field for cause of death and provide some useful search options. Doctors always ask family medical history and this could be extremely useful to see unusual causes, common causes, causes at different ages etc.

3. Military service. Include a few fields for military service: service number, branch of service, country, years of service. For some this may be broken into several periods, etc. I would like to create a 3D tree showing ancestors, cousins, aunts & uncles who have served. In some countries it is an incredible source of pride to know who you are related to who has served. I’m only learning because of WikiTree the numbers of people I’m closely related to who have served but there is no way of visualising that information.
by Mark Johnston G2G Crew (610 points)
+10 votes
Most of the problems I run into are people who were born or died in the wrong country.

If I know they were born in Canada, getting results for born in England, United States or other locations can be frustrating.

If I don't have a birth or death place then any locations could be correct.

Similarly if I have a realistic birth and/or death date then dates that are more than 20 years off are very unlikely.

Though having said that the results with incorrect dates can be helpful in finding profiles that might be part of the same family.

My next suggestion is to encourage our members to search for people with similar names, dates and locations before creating a new profile.

https://www.wikitree.com/wiki/Special:SearchPerson

And search for maiden vs married names for those cultures that use both. And be careful with women that don't have a LNAB and may have been created with a married name.

I recently created a duplicate. The original profile had a first name, a last name, no second name, and a wife; the one I created had both first and second names and a last name, and I didn't have a wife for him then.
by M Ross G2G6 Pilot (749k points)
In Search.

Add an option to list spouse(s).

OOPS just noticed I can check on spouse in a search already.
I think this is the first time I heard anyone mention the specificity of the date. I would not want to see similar names that also had a specific date not like the new profile. A close name and very close date might reveal a duplicate (such as a birth vs. christening date).

However, estimated dates on the new or existing profiles need a wide range because the year might have been estimated incorrectly (such as a young wife in a second marriage for the husband), or death records that were a bad estimate on the age of the deceased.
+9 votes

Very happy that this is the tech team's next project :) 

I agree with the above suggestion that the top results should be the names that most closely match your input and that return results using all names entered (first+middle name, lnab+current/other last name). For example when searching for an Amy Smith who married a Davidson, we should be able to find her in a search using "Amy" (first name) "Smith Davidson" (last name). And a search for my ancestor Gabriel Phocas dit Raymond should be found in a search using "Gabriel" (first name) "Phocas Raymond" or "Phocas dit Raymond" (last name).

It would be great to search using spouse's name (similar to how we can search using mother's/father's names).

I would like the search results to display parents' and spouse's names. Ancestry, MyHeritage, and FamilySearch all display tree search results like this. MH and FS also display all the children's names, but I think that's overkill. I like how Ancestry just shows Father, Mother, and Spouse (although I wish it would show all spouses not just one). And then also show birth date with location, and death date with location, and maybe marriage date(s) with location. With the current search results it is only really easy to see the birth date & location. The results screen shows one line of information per result, but I would prefer a box of information for each person found (or the info could be displayed in a table), with the filters & search fields easily accessible on the side, if that makes sense. This would allow us to quickly evaluate the info in the profile and find the person we're looking for.

by Valerie Penner G2G6 Mach 7 (78.7k points)
+11 votes

I'd like to have the ability to selectively remove specific names from the results list. I often find myself searching for a person whose first or last name is known to have many spellings, so I don't want to restrict my search to one spelling. However, in reviewing the results lists, it's common for the unusual name(s) I am looking for to be surrounded by numerous instances of a few more common names. This makes it hard to skim the list. Therefore, I would often appreciate the ability to selectively filter out certain names that are abundant in the results but that can be ruled out as "definitely not the person I am looking for." For example, if I am looking for Margaret Doerr, I would be happy to see results for variants like Maggie Doer, but I might like to filter out the instances of Marie Doré and Mary Dyer that turn up in the search results.

by Ellen Smith G2G Astronaut (1.5m points)
Oh I like this one. It would be great to pare down the list so that you can see which ones are real and which ones are red herrings.
+13 votes

For populations that have large uncertainty in life dates (for example, pre-1700 and Black Heritage people who were born in slavery), I would like to have the ability to expand the date range in the Add Person search beyond the current +/- 2 years. I see too many duplicates getting created by members who trusted the list of matches to be comprehensive, when they were dealing with a person whose birth date is a rough estimate within a range closer to 10 years than 2 years. If there were an option on the Add Person match list to "Expand the date ranges because this person's life dates are uncertain," I expect there would be fewer inadvertent duplicates. Alternatively, a larger date range could be used by default for pre-1700 profiles. (There seldom are very many matches for pre-1700 profiles, so this should not be a big burden.)
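A minimal sketch of the two options described above, using the ±2 and roughly ±10 year figures from the post. The function and parameter names are made up for the example.

```python
def birth_year_window(year: int, uncertain: bool = False,
                      default: int = 2, wide: int = 10, cutoff: int = 1700) -> int:
    """Plus/minus years to search: wider for uncertain or pre-cutoff births."""
    return wide if uncertain or year < cutoff else default


def in_window(new_year: int, candidate_year: int, uncertain: bool = False) -> bool:
    """True when the candidate's birth year falls inside the search window."""
    return abs(new_year - candidate_year) <= birth_year_window(new_year, uncertain)
```

Here a candidate born in 1855 would be missed for a new person born in 1850 by default, but found once the "dates are uncertain" option is switched on; a pre-1700 birth gets the wider window automatically.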

by Ellen Smith G2G Astronaut (1.5m points)

Yes, this is a huge problem for us. And as our number of profiles grows, it's going to get bigger.

+11 votes
So so glad you're going to be working on this!

I have had several instances where a duplicate exists with the same exact name, same exact birth and death dates, and yet I get a list of wonky name variations that aren't duplicates, but not the actual duplicate. It would really be great to get these exact matches to show up 100% of the time.
by Emma MacBeath G2G Astronaut (1.3m points)
+8 votes
I have noticed that the first name is becoming more and more prevalent on the list.  For example, if I search on Mary Smith, I get lots of them, but also lots of Mary X where X has nothing to do with Smith. Please remove all these meaningless suggestions.
by Rick Morley G2G6 Pilot (168k points)
+6 votes
Yes to the requests for location filtering.  PLEASE!!!
by Nan Starjak G2G6 Pilot (386k points)
+8 votes
Hi Jamie, we have discussed this previously so I think you know my thoughts. Many others have made similar suggestions above. Importantly, I think we need to be able to search or filter by marriage date, location and spouse (or at least other name). For women, I often know when and where they married, the name of their spouse and their last name at marriage but not their LNAB (depends on the country). Also, I can’t be sure of their name at death (they may have remarried after the death of their husband) nor do I know their place of death. Currently, I find it easier to search for a duplicate first as I have more options to filter results. Strangely, a few weeks back, I actually created a duplicate for a profile I already managed which was weird. Perhaps this happened because I had locked the existing profile for privacy?
by Susan Stopford G2G6 Mach 4 (45.2k points)
Been there, done that too: I duplicated 3 generations. It's frustrating not to get a match to an almost identical genuine duplicate because of a small difference of detail, yet get masses of matches with major differences, e.g. wrong country or totally different middle names.
