Wikipedia talk:Link rot/FABLE


Initial tool discussion[edit]

First off, way to go Harsha and student(s)? for getting the Toolforge tool up and running! The interface looks great.

A couple initial suggestions:

  1. When clicking an external link, IMO it should open in a new tab rather than navigating away from the tool, since navigating away could (accidentally) result in loss of session information, i.e. loss of the tool user's work. It is also easier on the workflow.
  2. The Article column shows the full URL of the Wikipedia article. Typically this would simply be the name of the article, linked to the article; that would be cleaner and easier to read. Hovering over the name would show the full URL in the lower-left corner of the browser as usual.

-- GreenC 15:37, 17 October 2023 (UTC)[reply]

When I click the next page (> button) I get a "Feedback Uploaded Successfully!" popup even if I haven't done anything. — Qwerfjkltalk 16:51, 17 October 2023 (UTC)[reply]

Thank you User:GreenC and User:Qwerfjkl for the quick feedback. If you have any more comments, please keep them coming. I have some thoughts of my own for how to improve the Toolforge page. I'm a bit slammed this week, but I'll share my ideas here next week.

User:Anishnya did all the work in setting up the Toolforge page. He has now graduated. So, I plan to hire a new student to implement all the revisions that we compile here. HarshaMadhyastha (talk) 22:41, 17 October 2023 (UTC)[reply]

This is a somewhat minor thing, but it might be nice if it gave a preview for the URL. Something like this (the link may take a minute or two to load). — Qwerfjkltalk 07:09, 18 October 2023 (UTC)[reply]

Permutations[edit]

I see one where the broken link and new link are the same URL https://www.bbc.co.uk/sport/football/19690546 in Darren Bent - looking at the article, indeed it is marked with a {{dead link}} template. I guess the solution in this case is to flag it "Correct". I was confused about what to do in that situation, looking for a "Not dead" option, though that is probably redundant with "Correct".

These are the permutations:

  1. Broken link is not broken / New url is not broken (same url) -- Correct
  2. Broken link is not broken / New url is not broken (different url) -- ??
  3. Broken link is not broken / New url is broken -- ??
  4. Broken link is broken / New url is not broken -- Correct
  5. Broken link is broken / New url is broken (same url) -- Incorrect
  6. Broken link is broken / New url is broken (different url) -- Incorrect

I think that's all? For two of these (#2 and #3), it is unclear whether they should be Correct or Incorrect. A popup tooltip with this table (or similar) would help users decide when to choose Correct or Incorrect. It could reduce the number of Unsure responses. -- GreenC 15:45, 17 October 2023 (UTC)[reply]
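A minimal sketch of that decision table as code (Python purely as illustration; the function and field names are hypothetical, not part of the tool):

def suggested_verdict(old_broken, new_broken, same_url):
    # Map the permutations above to a suggested verdict; None means the
    # reviewer has to decide manually (cases #2 and #3).
    if not old_broken and not new_broken:
        return "Correct" if same_url else None   # cases #1 and #2
    if not old_broken and new_broken:
        return None                              # case #3
    if old_broken and not new_broken:
        return "Correct"                         # case #4
    return "Incorrect"                           # cases #5 and #6

Such a table could back the proposed tooltip, with None surfacing as "needs manual review".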

To reduce the number of cases to consider, before we post FABLE's output to the page on Toolforge, an easy thing for us to do would be to weed out cases where the broken link and the suggested replacement have the same URL. In these cases, there is no need to rewrite the broken link, so we might as well exclude them.
It should also be easy for us to identify and discard cases where the identified new URL is broken. Of course, there is the chance that the new URL was not broken when FABLE identified it as a replacement for the broken link but stopped working by the time a user got around to checking it on the Toolforge page. But, at a minimum, FABLE shouldn't be suggesting URL replacements where the new URL is broken.
So, if we filter FABLE's output appropriately before posting to the Toolforge page, all of the rows on it should ideally correspond only to cases #2 and #4. In both of those cases, determining whether the new URL is Correct or Incorrect will require manual inspection. HarshaMadhyastha (talk) 19:49, 17 October 2023 (UTC)[reply]
I should take back my claim that "It should also be easy for us to identify and discard cases where the identified new URL is broken." I expect that it will be easy for us to do so for 404s, DNS failures, etc. But, I suspect we'll miss some, if not many, soft-404s. HarshaMadhyastha (talk) 22:32, 17 October 2023 (UTC)[reply]
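To make the proposed filtering concrete, a rough sketch (Python; the row format and the simple status-code check are assumptions, and as noted above this only catches hard failures, not soft-404s):

import requests

def keep_row(broken_url, new_url):
    # Skip suggestions where the "replacement" is the same URL as the broken link.
    if broken_url == new_url:
        return False
    # Skip suggestions whose new URL is hard-broken (404, DNS failure, etc.).
    # Soft-404s (error pages served with status 200) will still slip through.
    try:
        resp = requests.head(new_url, allow_redirects=True, timeout=15)
        return resp.status_code < 400
    except requests.RequestException:
        return False

Rows for which keep_row returns False would be dropped before posting to Toolforge.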
The code found here may be of some use. — Qwerfjkltalk 07:13, 18 October 2023 (UTC)[reply]
HarshaMadhyastha: It looks like there are only 1,400 links to check? If that's the total, the interface probably doesn't need to be too elaborate. Soft-404s... those could impact #1 also. It's a reason this tool exists, to identify soft-404s. Maybe no filtering, but provide guidance/documentation/tooltips. Also content drift, where both URLs are the same but the new URL has different content, such as current weather, sports and market info. -- GreenC 19:16, 18 October 2023 (UTC)[reply]
@GreenC, note that that's roughly 10% of the total. (As a first step, we have run FABLE on 18,000 (i.e., roughly 10%) of all the links that have been marked permanently dead[2]. FABLE identified new URLs for about 1,400 of these links.) — Qwerfjkltalk 19:45, 18 October 2023 (UTC)[reply]
@Qwerfjkl is right. At the moment, the Toolforge page lists the URL replacements we have found by running FABLE on about 10% of all the links marked permanently dead. When we run the system on all such links, I expect that the total number of new URLs to check will increase by about 10x to ~14,000.
I didn't get your comment about content drift. If both the old and new URLs are the same, wouldn't they both lead to the same page? How can the content be different? Maybe I'm misunderstanding what you mean by "both URLs are the same" ... HarshaMadhyastha (talk) 20:24, 18 October 2023 (UTC)[reply]
@HarshaMadhyastha, re content drift, I presume GreenC means that the content of the webpage at the url can change over time. That can make the new url effectively dead without it actually being dead. — Qwerfjkltalk 20:40, 18 October 2023 (UTC)[reply]
It might even still have usable content, but with parts missing or changed. Compare old (via Wayback) with new. The content drifted because the new page has some of the same info but is also missing info. The classic example is weather and sports scores, which keep changing frequently at the same URL. (The example is not the same URL due to a redirect, but it's the same idea.) -- GreenC 23:43, 18 October 2023 (UTC)[reply]
Sure. I get that the new URLs we find can help/hurt when there is content drift. But, I still don't see the utility of including cases on the Toolforge page where the old and new URLs are identical. If the old URL is broken, then by definition, the new URL is broken too; it is the same URL after all. Whereas, if the old URL is not broken, then the old and new URLs will both lead to the same page. So, there is no need to replace the old URL. HarshaMadhyastha (talk) 18:56, 19 October 2023 (UTC)[reply]
In https://www.bbc.com/sport/football/19690546 it's telling us there is an error in the wikitext, a citation (possibly) incorrectly marked dead. However, that's not the purpose of this tool; it's an unintentional/emergent discovery. If you want to filter them, that's OK. If you want to search for them, make a list, and send it for further analysis by another tool/bot, that would be fine too. -- GreenC 19:08, 19 October 2023 (UTC)[reply]

Making it easier to review at scale[edit]

Extrapolating from the number of URL replacements that we have found so far, I estimate that we will have around 14,000 suggested URL replacements to review after running FABLE on all the links marked permanently dead. This number will increase further in the future once we expand to run FABLE on other types of dead links where archived copies are not appropriate, e.g., when the intent of a link is to point to whatever content is on the page when a user follows the link, not the content that existed at the time the link was created. I am worried that it will become intractable to manually review thousands of suggested URL replacements.

To simplify the review process, I am thinking it would be helpful to leverage the patterns in how URLs are transformed. For example, once a user reviews a few of the suggested replacements for a bunch of similar URLs on the same site and determines that they are correct, the Toolforge page should make it easy to mark all the remaining ones that follow the same URL transformation pattern as correct. Similarly, if the first few are incorrect, it should be easy to mark all the remaining ones as incorrect.

For this, the page should display all suggested URL replacements sorted by the "Broken link" column, so that similar URLs show up together.

In addition, the page could allow the user to enter a URL transformation rule (e.g., a regular expression for the broken link and a corresponding regular expression for the new URL) and specify whether to mark all rows which match this rule as Correct or Incorrect. We can get to this if we find that there are some sites with hundreds of broken links for which we have suggested replacements.
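A sketch of what such a rule could look like (Python; the rule format, field names, and example domains here are made up, not an existing feature of the page):

import re

# Hypothetical rule: if the broken link and the suggested replacement both match,
# and their captured path pieces agree, mark the row with the given verdict.
rule = {
    "old_pattern": r"^https?://news\.example\.com/article/(\d+)$",
    "new_pattern": r"^https://www\.example\.com/articles/(\d+)$",
    "verdict": "Correct",
}

def apply_rule(rows, rule):
    for row in rows:
        old_m = re.match(rule["old_pattern"], row["broken_link"])
        new_m = re.match(rule["new_pattern"], row["new_url"])
        if old_m and new_m and old_m.groups() == new_m.groups():
            row["feedback"] = rule["verdict"]
    return rows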

Thoughts? @GreenC @Qwerfjkl — Preceding unsigned comment added by HarshaMadhyastha (talkcontribs) 18:24, 26 October 2023 (UTC)[reply]

These are great ideas. I agree, once you discover a replacement it is often applicable to many other URLs. The ability to express it via regex is perfect. The patterns are useful information that could be saved and made available to anyone who wants to create a fully automated bot on other wikis without needing to rerun the manual review tool. -- GreenC 18:47, 26 October 2023 (UTC)[reply]
Great! Glad to hear that we are on the same page. And, you totally read my mind about the patterns being useful to fix dead links in other contexts!
I'll start the process of hiring a student to work on this, and we'll try to have all the ideas discussed on this page implemented soon. HarshaMadhyastha (talk) 00:53, 27 October 2023 (UTC)[reply]

New version[edit]

Hello @GreenC @Qwerfjkl

It has taken us a while, but we now have a new version of our site on Toolforge. Same link as before: https://fable.toolforge.org/

Apart from fixing a number of bugs in the original version, here are some of the new features which hopefully make it easier to review these links:

  • The first two columns are now sortable, so that one can review together either all the broken links that appear in the same article or all the broken links in the same domain.
  • We have added the ability to search for an arbitrary string. The site also provides the option to mark all the search results as either Correct or Incorrect. We envision this being helpful when, after reviewing a few links on a particular domain, you realize that the suggested URL replacements for that domain are either all correct or all incorrect. You can then search for that domain and, after a quick review, mark all the search results as correct or incorrect, as appropriate.
  • There is no explicit Submit button now. Any feedback is pushed to the server as soon as it is updated.

We had previously discussed having the user specify a regular expression for each domain after the user has reviewed the suggested URL replacements for that domain; these regular expressions can then be used to automate URL replacements on other wikis. We decided against this feature. Instead of putting the onus of coming up with the regular expression on users, we have developed a tool which can infer the underlying pattern based on the URL replacements that users have identified as Correct.

Please let us know if you have any suggestions for further improving the site. Please also try out the site and see if you find any bugs.

Thanks! HarshaMadhyastha (talk) 16:50, 4 April 2024 (UTC)[reply]

I love this simple and efficient interface. It's "snappy". Easy to go through quickly. A model for how to do things like this. Is the source available? I can't remember what percentage you believe are incorrect, but I think documenting it would help users better judge. I just did a dozen or so and they were all correct as far as I can see. If I know 1 in 10 is probably incorrect, that would help me evaluate whether I'm accepting too freely. It's very hard to judge content drift situations without seeing the original page. Would it make sense to also include a link to a Wayback version of the original? I can provide you with a GNU awk command-line tool that will return the Wayback URL if it exists, e.g.
./api -u 'http://www.fallingrain.com/world/PK/3/Muriali.html'
..returns:
{"results": [{"url": "http://www.fallingrain.com/world/PK/3/Muriali.html", "archived_snapshots": {}, "timestamp": "20070101"}]}
i.e., no result.
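Roughly the same lookup could be done in Python, for reference; this sketch goes straight at the Wayback Machine Availability API, which is what produces JSON in this format (the helper name is just a placeholder):

import requests

def wayback_snapshot(url, timestamp="20070101"):
    # Query the Availability API for the capture closest to `timestamp`.
    # Returns the snapshot URL, or None if nothing is archived.
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    snap = resp.json().get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None

For the fallingrain URL above, archived_snapshots is empty, so this would return None.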
-- GreenC 19:05, 4 April 2024 (UTC)[reply]
Thank you for the quick feedback, @GreenC! Glad to hear that you like the new version. I'll ask the students to clean up their code and share the link to the source here soon.
As per our previous analyses, we expect 1 in 10 to be incorrect. Where do you think would be a good place to put that information?
I agree that referring to the Wayback version of the original would be helpful. But, currently, all of the broken links for which we have attempted to find a URL replacement are ones which are marked as permanently dead. So, the vast majority of them likely have no non-erroneous copies on the Wayback Machine. We focused on these links to begin with since the value of finding a URL replacement is obvious. In our ongoing research, we are currently trying to determine, among the pages for which archived copies do exist, for which ones it is worthwhile to run FABLE to look for a URL replacement, e.g., is there some critical functionality on the page which does not work in the archived copy?
Also, we have so far run FABLE on only a 10% sample of all permanently dead links. Once you and others confirm that the current workflow for verifying the identified URL replacements makes sense, we can begin running FABLE on the remaining 90%. HarshaMadhyastha (talk) 19:16, 5 April 2024 (UTC)[reply]
Oh right, forgot: permadead. In that case Wayback links do not make sense. Only 10%. I started working through it and completed 25% (6 of 24 pages in 50-item view mode) and thought I was on a roll, but since this is only the 10% sample, it's really much less than 25% of the whole. So far, my experience has been positive.
  • Within domains, they are often all correct or all incorrect. This makes it faster, as I can make heuristic judgements without checking every page, other than spot checks.
  • I am seeing patterns in some domains that make them suitable for bot conversion. I am tracking them here Special:Diff/1217223113/1217388925 for later processing by WaybackMedic, which will convert not only links that are permadead, but also those that have an existing archive URL, converting them back to live links. The tool might only show 10 instances of a domain that are permadead, but there might be 150 instances on Wiki with archive URLs that can be converted to live links.
One other suggestion; it's kind of a big one, so if you don't want to do it, understood. Currently there is no way to track who is using the tool, thus it's unprotected. Often tools will use OAuth so users can log in with their Wikipedia ID and the tool can track which users are making the changes. The login process is seamless, like logging into https://iabot.org .. if you want information on how to set up OAuth on Toolforge, let me know. -- GreenC 19:44, 5 April 2024 (UTC)[reply]
Wow! You are fast! :) Great to see that the site is good enough to enable you to get through so many links so quickly.
I agree that it makes sense to track which users are using the tool. I'll ask my students to look into using OAuth. If you can please provide a link to any documentation on how to use OAuth on Toolforge, I can share it with my students.
BTW, did you indeed mean https://iabot.org/? At least for me, that link does not load. HarshaMadhyastha (talk) 21:07, 5 April 2024 (UTC)[reply]
https://iabot.toolforge.org is the correct link. * Pppery * it has begun... 21:46, 5 April 2024 (UTC)[reply]
iabot.org not loading is a new problem; I'll report it. It's an easier-to-remember shortcut for the Toolforge link. I found this guide to be helpful in a practical sense: "My first Flask OAuth tool". There's also one for Django and NodeJS. A general page on OAuth here. -- GreenC 23:08, 5 April 2024 (UTC)[reply]
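A rough sketch of the handshake itself, using the mwoauth Python library (which, I believe, is what the Flask guide is built around); the consumer keys are placeholders, and the session plumbing a real Flask tool needs to carry the request token between steps is omitted:

from mwoauth import ConsumerToken, Handshaker

# Placeholder credentials, obtained by registering an OAuth consumer on Meta-Wiki.
consumer_token = ConsumerToken("my_consumer_key", "my_consumer_secret")
handshaker = Handshaker("https://en.wikipedia.org/w/index.php", consumer_token)

def start_login():
    # Returns the URL to send the user to, plus a request token to stash
    # in the session until Wikipedia redirects the user back.
    redirect_url, request_token = handshaker.initiate()
    return redirect_url, request_token

def finish_login(request_token, callback_query_string):
    # callback_query_string is the raw query string appended to the callback request.
    access_token = handshaker.complete(request_token, callback_query_string)
    identity = handshaker.identify(access_token)
    return identity["username"]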
Looks to me like it's written in NodeJS, so try wikitech:Help:Toolforge/My first NodeJS OAuth tool. — Qwerfjkltalk 17:58, 9 April 2024 (UTC)[reply]
Hi @GreenC, I forgot to mention that my students finished adding login support via OAuth to https://fable.toolforge.org/ last week. The user who last edited any particular row is logged in the back-end database. If you think that info should also be visible in the user interface, let us know. Thanks! HarshaMadhyastha (talk) 16:49, 9 May 2024 (UTC)[reply]
User:HarshaMadhyastha, nice work by your students; OAuth can be difficult. Personally I would make login required, or automatic (similar to iabot.org), before changes to the db can be made permanent. It's possible the Login button might dissuade editors who believe they don't have an account; the page could somehow indicate that login is automatic when you are already logged into your Wikipedia account. Displaying something like "Logged in as GreenC" somewhere would help to confirm. -- GreenC 21:25, 9 May 2024 (UTC)[reply]
The current site enforces that any user who wishes to submit feedback be logged in; when a user attempts to update any row, the user is shown a notification to that effect. And, once the user logs in, they will be shown "Welcome, Username" next to the Logout button. HarshaMadhyastha (talk) 22:58, 9 May 2024 (UTC)[reply]

Update[edit]

Hello HarshaMadhyastha,

  1. https://fable.toolforge.org/ has been fully processed (27 pages in 50-item view). It was probably 15 hours or so.
  2. It fixes syntax errors caused by user entry. Cool.
  3. It uncovered a widespread problem of short dashes being replaced by long dashes by errant user scripts.
  4. I found 42 domains that can be further processed, listed here. Not sure what to call these, interpolation candidates? They will take time.
  5. The remaining 90% is still a lot of work. I know someone who can help; I would need to train and pay her. Or this could be turned over to the community, though I'm not sure how long that would take or what the quality of the results would be. Determining interpolation candidates is also not straightforward or documented.
  6. It would be helpful to have the remaining 90% done so the results can be sorted by domain, which makes checking faster.
  7. You can send me the results, then I'll upload to wiki and remove the {{dead link}} tags. On Toolforge, save file to /data/project/fable/www/static/filename.txt which makes it available at https://tools-static.wmflabs.org/fable/filename.txt

-- GreenC 01:54, 9 April 2024 (UTC)[reply]

Thank you for putting so much time into reviewing all the data! Your tireless efforts inspire us to continue our work.
As you requested, all the data is now available in CSV form at https://tools-static.wmflabs.org/fable/permdead-aliases-apr2024.txt
It is concerning that, on this dataset, the URL replacements identified by FABLE have an accuracy of only 60%. We'll first take the time to analyze all the entries marked Incorrect to see what improvements we can make to FABLE. Then, hopefully sometime next month, we'll get started on running FABLE on the remaining 90% of permanent dead links and have the identified URL replacements ready for review sometime in June/July.
BTW, it is pretty cool that the URL replacements identified for permanent dead links helped uncover URL rewrite patterns that can be applied to so many other articles. Thank you for documenting all of those rules! HarshaMadhyastha (talk) 00:07, 10 April 2024 (UTC)[reply]

HarshaMadhyastha. The links are updated on wiki. Here are the stats:

4% seems low, but every URL counts, and the interpolation information it uncovered will result in a lot more than 689.

While processing the fable tool, I noticed pages were often technically correct (status 200, displaying content, and using the same URL identifier number) but the content was completely different, probably because websites recycle identifiers after a migration. These are hard to identify because you need to look at the context of the citation and check whether it verifies. There are red flags when off-topic, like a wiki article about a sporting event paired with a URL about the biology of an animal. Maybe one way is to pull the article title from the Wikipedia citation and see if the string exists in the HTML of the new URL. If it exists, the chances of being correct are very high; if not, it still might be right, but it is a red flag. -- GreenC 17:08, 10 April 2024 (UTC)[reply]
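A rough sketch of that title-match heuristic (Python; how the citation title gets extracted from the wikitext is left out, and the plain substring match is deliberately crude):

import requests

def title_found(citation_title, new_url):
    # Red-flag check: does the citation's title string appear anywhere in
    # the HTML served by the suggested replacement URL?
    try:
        html = requests.get(new_url, timeout=30).text
    except requests.RequestException:
        return False
    return citation_title.casefold() in html.casefold()

A False result would not prove the replacement is wrong, but it would flag the row for closer review.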

Thank you @GreenC for sharing these stats! Once you get a chance to apply the URL rewriting patterns learned to other pages, it would be great if you are able to share the stats on those too.
Based on my brief review of some of the URLs marked Incorrect, I agree with you that it does look like utilizing the article title from the Wikipedia citation can be very useful. As we conduct a more thorough review of all the Incorrect samples, we will see what else could help as well.
Hopefully the next set of samples we put out for review will have a much lower Incorrect fraction, so that we can make the most of your and others' time. HarshaMadhyastha (talk) 20:29, 10 April 2024 (UTC)[reply]
User:HarshaMadhyastha: Interesting example. FABLE identified 4 links in wnbl.com.au that could be migrated, and they were fixed: Special:Diff/1198818425/1218254166 .. I saw a pattern, programmed the bot, and reran it on the same page, and it fixed 8 more: Special:Diff/1218254166/1219068132. I then ran it on all pages with this domain, and in total it converted 403 links, removed 29 dead link templates (links were made live again), likewise switched 29 |url-status=dead to live, and added 151 archive URLs (links that are dead and can't be migrated). Thus FABLE uncovered a big problem with this domain. The custom code required was easy, comprising one line:
sub("https?://wnbl[.]com[.]au", "https://wnbl.basketball", newurl)
Boilerplate code took over from there. Similar experiences with the other domains.
I think the lesson is that for every link FABLE identifies there are probably dozens more links within that domain that could be fixed by migration. The main thing is that FABLE discovered a problem that could be fixed. It did not determine the rewriting pattern; that was me doing a hit-or-miss check manually. I wonder if there is a way to automate checking FABLE results for rewriting patterns, including generating a regex statement(s) that can easily be plugged into the bot (perhaps I could automate the bot via some kind of interface for others to run it). The pattern determination could also be a good application for generative AI. Just some thoughts. -- GreenC 15:54, 15 April 2024 (UTC)[reply]
GreenC, I would assume that most (or at least some) cases will require more regex than simple substitution. — Qwerfjkltalk 16:09, 15 April 2024 (UTC)[reply]
@GreenC Thank you for sharing this example!
Exploiting patterns like the one you described is a key part of FABLE's design. Once FABLE finds the new URLs for about 5 dead URLs in the same directory, we feed these (old URL, new URL) pairs as example inputs into a prediction engine. We then ask the engine to learn the underlying pattern (if any), and use the pattern to predict the new URL for another dead link in the same directory.
In our original implementation of FABLE, we used Microsoft Excel's Flash Fill to do such predictions. More recently, a student has built a Chrome extension which detects when the user visits a broken link and suggests the new URL predicted by FABLE, if it finds one. In this extension, we use ChatGPT as the prediction engine. As long as we have some previously discovered mappings for URLs in the same directory as the link visited by the user, it does not matter whether we have previously attempted to find the new URL for the specific link visited by the user.
The main thing missing in all this is that, so far, we haven't written code to output the learned pattern. This is partly because Flash Fill and ChatGPT are black boxes. But it is also because we haven't figured out a generic representation for all the patterns that we have encountered. As @Qwerfjkl points out, many of the patterns we have observed are not simple substitutions. For example, generating the new URL often involves looking up the page title and date from an archived copy, and the title might have to be plugged into the new URL separated by hyphens or underscores, etc.
Taking inspiration from your ask, we will think about this again. Perhaps this is a case of the perfect being the enemy of the good. Rather than trying to cover all cases, we'll look into generating at least the simple patterns. HarshaMadhyastha (talk) 20:31, 15 April 2024 (UTC)[reply]
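For the simplest cases, inferring a prefix substitution from pairs already marked Correct could be quite small. A sketch (it only covers rules of the form "swap this URL prefix for that one", nothing involving titles or dates, and the example URLs are made up):

from os.path import commonprefix

def infer_prefix_rule(pairs):
    # pairs: (old_url, new_url) tuples marked Correct for one domain.
    # Returns (old_prefix, new_prefix) if a single prefix swap explains
    # every pair, otherwise None.
    rules = set()
    for old, new in pairs:
        suffix = commonprefix([old[::-1], new[::-1]])[::-1]  # longest common suffix
        rules.add((old[: len(old) - len(suffix)], new[: len(new) - len(suffix)]))
    return rules.pop() if len(rules) == 1 else None

For example, pairs like http://wnbl.com.au/news/123 and https://wnbl.basketball/news/123 would yield ("http://wnbl.com.au", "https://wnbl.basketball"), essentially the one-line substitution GreenC used above.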
Fascinating. One method I have used is searching the Wayback Machine for old redirects that are no longer on the live web. Organizations migrate URLs to a new system and create redirects, which are captured by the WM. After a few years, for whatever reason (policy/technical/error), the organization deletes the redirects. However, they still exist in the WM. The Wayback CDX API is a method to search for status 3xx captures for a given URL. -- GreenC 13:55, 16 April 2024 (UTC)[reply]
Here is an example. Given a dead URL: https://www.filmcompanion.in/madhumati-dil-tadap-tadap-ke-keh-raha-hai-song-inspired-by-18-century search the Wayback Machine with this command:
wget -q -O- 'https://web.archive.org/cdx/search/cdx?url=https://www.filmcompanion.in/madhumati-dil-tadap-tadap-ke-keh-raha-hai-song-inspired-by-18-century&MatchType=prefix' | awk -v u="https://www.filmcompanion.in/madhumati-dil-tadap-tadap-ke-keh-raha-hai-song-inspired-by-18-century" '/text\/html 30[12]/{a[++i]=$2}END{print "https://web.archive.org/web/" a[i] "/" u}'
The output will be:
https://web.archive.org/web/20200601080316/https://www.filmcompanion.in/madhumati-dil-tadap-tadap-ke-keh-raha-hai-song-inspired-by-18-century
Then search again using the output of the first command:
curl -ILs 'https://web.archive.org/web/20200601080316/https://www.filmcompanion.in/madhumati-dil-tadap-tadap-ke-keh-raha-hai-song-inspired-by-18-century' | /usr/local/bin/awk '/^[ ]*[Ll]ocation:/{sub("^[ ]*[Ll]ocation:[ ]*https?://web[.]archive[.]org/web/[0-9]{14}id_/", "", $0); a[++i]=$0}END{print a[i]}'
The output is:
location: https://web.archive.org/web/20200607035458/https://www.filmcompanion.in/music/madhumatis-dil-tadap-tadap-ke-kah-raha-was-inspired-by-an-18th-century-song/
The new URL is
https://www.filmcompanion.in/music/madhumatis-dil-tadap-tadap-ke-kah-raha-was-inspired-by-an-18th-century-song/
-- GreenC 14:03, 16 April 2024 (UTC)[reply]
Thanks @GreenC. When designing FABLE, we made the same observation. Leveraging historical redirections from the Wayback Machine is one of the first techniques FABLE attempts when trying to find the new URL for a broken link.
Please see Section 4.1.1 of our paper on FABLE: https://dl.acm.org/doi/pdf/10.1145/3618257.3624832 HarshaMadhyastha (talk) 16:52, 16 April 2024 (UTC)[reply]
Hi HarshaMadhyastha, I finished processing the 40 domains that have rewriting patterns. The results with diffs and stats are at Wikipedia:Link_rot/URL_change_requests; search on the tag "FABLE-0424". The total number of links converted is 32,000 for the 40 domains. This work is detailed and complicated, but I have tried to generalize the code as much as possible so that configuring the bot is not too costly time-wise. Each domain requires configuring the bot, including the rewriting code, compiling the bot, generating a list of target articles, running the bot, checking logs for errors, reprocessing pages as needed, uploading final diff results, spot-checking results, and redoing any problems, including modifying the code. That's the workflow. Typical problems include sites with redirects that go to the wrong page, bot-blocking or rate-limiting mechanisms, and other complications related to how the site is configured. -- GreenC 16:47, 27 April 2024 (UTC)[reply]
Hi @GreenC, Thank you so much for your work on this, and for sharing the stats with us! On the one hand, it is great to hear that, though the URL replacements identified by FABLE have so far helped fix only 689 permadead links, that input also helped in fixing 32,000 other dead links. But, on the other hand, I'm sorry that fixing all of these links has proved to be a cumbersome process.
To make it easier for you in the next iteration, we are going to spend time this summer working on automatic generation of the URL substitution patterns based on the "Correct"/"Incorrect" determinations made on FABLE's output. Is there anything else we can do that would help make things easier for you? HarshaMadhyastha (talk) 17:36, 28 April 2024 (UTC)[reply]