Thread: Royalroad duplicate chapters

Zajhein
posted on
06:58 on 24 April

Recently, many series from Royalroad have been flooded with duplicate chapters, with up to 7 copies of the exact same chapter in some instances, adding up to thousands of duplicates for some series.

If there is any way to automatically scan and remove all the duplicates that would be great, because I doubt anyone has the time to do it manually.

Thanks.

fake-name
posted on
04:04 on 25 April

Hmm, yeah, I think I can deal with a fair bit of the dupes, though a lot of them aren't strictly duplicates (there's a bunch of different ways to access the same chapter, which is annoying).

Lemme try to put a cleaner together.

Zajhein
posted on
01:48 on 27 April

Thanks, and yeah the different websites hosting the same chapters is understandable. Although it's weird that some websites are updated at completely different times than when they're released, sometimes months behind and out of order as well. Don't know if it's something wrong with their RSS feeds or what but it's confusing when old chapters pop up randomly as if they're new.

fake-name
posted on
10:18 on 27 April

> Thanks, and yeah the different websites hosting the same chapters is understandable. Although it's weird that some websites are updated at completely different times than when they're released, sometimes months behind and out of order as well. Don't know if it's something wrong with their RSS feeds or what but it's confusing when old chapters pop up randomly as if they're new.

Wait, is the issue duplicate chapters on the same site, or duplicate chapters on different sites?

RoyalRoadL is annoying because you can access chapters via <https://www.royalroad.com>, <https://www.royalroadl.com>, <https://royalroad.com> or <https://royalroadl.com>. From there, there are two ways to refer to the same chapter per site (there's a long form and a short form).
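For what it's worth, collapsing the hostname variants is mostly a string-normalization job. Here's a minimal sketch (function and constant names are made up for illustration; a real cleaner would also need site-specific rules for the long/short path forms):

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical sketch: collapse the four interchangeable RoyalRoad(L)
# hostnames down to a single canonical host so the same chapter never
# shows up under several URLs.
CANONICAL_HOST = "www.royalroad.com"
RRL_HOSTS = {
    "www.royalroad.com", "royalroad.com",
    "www.royalroadl.com", "royalroadl.com",
}

def canonicalize_rrl_url(url: str) -> str:
    """Rewrite any RoyalRoad(L) URL onto the canonical hostname."""
    parts = urlsplit(url)
    if parts.netloc.lower() in RRL_HOSTS:
        parts = parts._replace(netloc=CANONICAL_HOST)
    return urlunsplit(parts)
```

Non-RRL URLs pass through untouched, so the same function can run over every release row without a site filter.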

Duplicate chapters on different hosting sites is usually the author posting to multiple places. Ideally, this shouldn't be a problem, since the chapter parser should pull out a coherent chapter ordering, but that can break down if the author uses a bizarre or broken chapter numbering scheme.

TL;DR do you have an example series? If there's something beyond the RRL dupes going on, I'd like to know about it.

Zajhein
posted on
23:08 on 27 April

The duplicates, especially recently, are almost all from the different ways to access Royalroad like you linked, but a few series, like https://www.wlnupdates.com/series-id/63197/azarinth-healer, include missing and scattered Scribble Hub links that update randomly, are months behind, and even upload out of order at times. Nothing unusual about their chapter numbering that I've noticed.

It would be preferable to simply remove the extra feeds like Scribble Hub for certain series, since it's not like a different translation group is uploading an alternate version; it's merely the same content hosted on multiple sites. Perhaps just having a couple of links to the multiple sources in the "Homepage" section would be good enough.

fake-name
posted on
22:15 on 3 May

Ok, I've implemented a set of batch processes that should clean up the duplicate RRL releases. It'll probably take a day or two to run through all the releases (the DB server the site runs on isn't that fast, and I have ~1.4 million RRL releases ATM).
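A hedged sketch of what such a batch cleanup can look like: group releases by some canonical key and keep only the earliest-seen row per group. (The record shape and names here are made up for illustration; the actual process runs against the database.)

```python
# Hypothetical batch-dedup sketch: releases are dicts with at least an
# "id"; key() maps a release to its canonical identity (e.g. the
# canonicalized URL). Returns the rows to keep and the ids to delete.
def dedup_releases(releases, key):
    kept, drop = {}, []
    for r in sorted(releases, key=lambda r: r["id"]):
        k = key(r)
        if k in kept:
            drop.append(r["id"])   # duplicate: candidate for deletion
        else:
            kept[k] = r            # first occurrence wins
    return list(kept.values()), drop
```

Keeping the lowest-id row per group makes the pass deterministic, so re-running it after a partial failure is safe.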

With regard to scribblehub content, I don't see an automated way to handle chapter deduplication that'd be consistent. Additionally, I don't really see the point, since as long as the chapter number is extracted correctly, the reading list facility doesn't care that there are multiple instances of the same chapter when detecting new releases.

Is there a particular reason having links to the same chapter on several sites is a problem? In general, the reading list tracking mechanisms shouldn't care about how many instances of the same chapter there are, as it uses the actual volume/chapter numbers.
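To illustrate why duplicates are harmless under number-based tracking: if "new release" means "some release carries a (volume, chapter) above the reader's bookmark", the count of rows per chapter never enters into it. A toy sketch with made-up shapes:

```python
# Hypothetical sketch of number-based tracking. releases is an iterable
# of (volume, chapter) tuples; last_read is the reader's bookmark as a
# (volume, chapter) tuple. Python compares tuples lexicographically, so
# (1, 10.0) > (1, 9.0) works as expected.
def has_new_release(releases, last_read):
    return any(r > last_read for r in releases)
```

Three duplicate rows for chapter 10 give exactly the same answer as one row would.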

Zajhein
posted on
07:51 on 4 May

Well, I'm not sure if you're aware, but an app in development called Novel Library accesses this site and a few others to fetch links to updated chapters, giving notifications whenever old chapters are added as if they were new. So all the duplicates, both old and recent, slow down the update process (and probably your server), along with confusing which chapters are actually new versus just from different sites. I mentioned this problem to them as well, but you might want to get in touch with them too if it's an issue.

I don't see why you'd even want random duplicate chapters if they serve no purpose but to clutter up the database, especially if they don't update correctly or on time like the ones from scribblehub. Simply removing those unreliable feeds from certain series would probably be easiest and might relieve a bit of stress from the database server as well. Once all the RRL duplicates are gone, it might be simple for moderators to clean up any messy chapter lists sometime in the future if it's ever needed.

Also if some series need to be completely cleared and re-added due to confusing chapter numbers, you might try referencing how the add-on WebToEpub manages it, as it seems to be quite reliable for a number of sites. https://github.com/dteviot/WebToEpub

Anyway, thanks for all the hard work, and if you need the help of another moderator just let me know.

fake-name
posted on
02:09 on 5 May

> Well I'm not sure if you're aware but an app in development called Novel Library accesses this site and a few others to fetch links to updated chapters. Giving notifications whenever any old chapters are added as if they were new. So all the duplicates from before and more recently slow down the update process and probably your server, along with confusing which chapters are actually new or from different sites. I mentioned this problem to them as well, but you might want to get in touch with them too if it's an issue.

Wait, what? That's incredibly stupid. I'm not actually sure how they even manage that, the API tries quite hard to guide people to using proper chapter numbering.

> I don't see why you'd even want random duplicate chapters if they serve no purpose but to clutter up the database, especially if they don't update correctly or on time like from scribblehub.

Well, they're not duplicates, because they're on different sites.

The timing stuff is legitimately a problem, but I'm working on fixing the latency. Ideally, scribblehub and RRL should both update at a similar time-scale.

Also, the core issue here is: which version is the "correct" one? I can't come up with a way to determine that in an automated manner, and I'm unconvinced a general solution even exists. Given the automated nature of the site, requiring manual selection of the canonical source is not a viable option.

Considering that (if you use the site correctly) duplicate chapters are harmless, I don't see any reason to spend any effort dealing with the issue.

> Also if some series need to be completely cleared and re-added due to confusing chapter numbers, you might try referencing how the add-on WebToEpub manages it, as it seems to be quite reliable for a number of sites.

The way that works is that it processes an entire site in one pass, in a stateful manner.

Unfortunately, this isn't possible to do with RSS feeds (which are largely what power this site). The challenge of RSS content is that, basically, all you have is a URL and a title for a single chapter. I don't have any additional state to know where I am in a sequence of chapters, because that's not provided by the RSS input format.

I do do one thing which (as far as I can tell) no one else out there does, which is try very hard to pull proper chapter numbers. This involves a giant pile of heuristics and a lot of test cases and a custom tokenizing pipeline.
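A toy, regex-only illustration of the idea (the real pipeline is a custom tokenizer plus a large pile of heuristics and test cases, so this is nowhere near complete; all names here are hypothetical):

```python
import re

# Toy chapter-number extractor. Tries a few common title shapes and
# returns (volume, chapter), (None, chapter), or None when nothing
# recognizable is present. Patterns are checked most-specific first.
PATTERNS = [
    # "Vol. 2 Ch. 14.5", "v2c14", "Volume 2 Chapter 14" ...
    re.compile(r"\bv(?:ol(?:ume)?)?\.?\s*(\d+)\s*c(?:h(?:apter)?)?\.?\s*(\d+(?:\.\d+)?)", re.I),
    # "Chapter 102", "Ch 102", "ch.102" ...
    re.compile(r"\bch(?:apter)?\.?\s*(\d+(?:\.\d+)?)", re.I),
    # "Episode 7" ...
    re.compile(r"\bepisode\s*(\d+(?:\.\d+)?)", re.I),
]

def extract_chapter(title):
    for idx, pat in enumerate(PATTERNS):
        m = pat.search(title)
        if m:
            if idx == 0:
                return int(m.group(1)), float(m.group(2))
            return None, float(m.group(1))
    return None
```

A real version would also need to handle spelled-out numbers, part/arc prefixes, and titles where the number belongs to the series name rather than the chapter, which is where the heuristics pile up.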

Most other sites seem to generally be chronologically based, e.g. chapters are assumed to be monotonic and chapters added at a later date are more recent than chapters added earlier.

Critically, this is not a valid assumption, and undermines how probably 98% of the series on this site are actually numbered. The only time this breaks down is when a later translator starts re-translating an existing series. And for that case, I already have facilities for excluding releases from the counted chapter numbering, which makes the series sort correctly (in the on-site reading list mechanism, at least).
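A small illustration of why that exclusion facility matters, using made-up records: a chronological sort would place the re-translated chapter 1 after chapter 40, while sorting on the extracted number after dropping excluded releases yields the right reading order.

```python
# Hypothetical release records: posted date, extracted chapter number,
# and an "excluded" flag set on re-translation restarts.
releases = [
    {"posted": "2020-01-05", "chapter": 40, "excluded": False},
    {"posted": "2020-02-01", "chapter": 1,  "excluded": True},   # re-translation restart
    {"posted": "2020-02-10", "chapter": 41, "excluded": False},
]

def reading_order(releases):
    # Drop excluded releases, then order by the extracted chapter
    # number rather than the post date.
    return sorted(
        (r for r in releases if not r["excluded"]),
        key=lambda r: r["chapter"],
    )
```

Date-ordering the same records would interleave the restarted chapter 1 between chapters 40 and 41, which is exactly the failure mode described above.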

If an app is using the site as a backend, and yet completely ignoring the chapter numbers, that app is broken. This is 100% a bug in the app.

I'm kind of tempted to remove the post date from the API and page content entirely, so they can't continue to use the site wrong.

Do the people doing this have a github project? I should file an issue. Edit: Found it - https://github.com/gmathi/NovelLibrary

Further edit: Whoa, that thing's ridiculous. They're not using the API and are just scraping the HTML. WTF.

fake-name
posted on
02:40 on 5 May

Github Issue: https://github.com/gmathi/NovelLibrary/issues/106

Zajhein
posted on
04:49 on 5 May

Yeah, it's quite a strange app, but that's pretty much the same for every web/light novel app out there that I've tried; most are broken in one way or another, and finding the least broken one is all people can do atm. Hopefully this one improves and uses your API correctly.

I understand there's no automatic way to determine which chapters are 'duplicates' from RSS feeds between sites, but I was suggesting removing entire RSS feeds that are found to be exactly the same (from the author updating multiple sites) if there are issues with one. Whether by a moderator making note of which ones aren't working, or by comparing RSS feeds themselves and removing any that aren't from a preferred site (the one with the least issues overall). But it really doesn't matter that much.

As for series with unorganized chapter names that can't be parsed, I figured either a moderator or you might be able to trigger a failure mode that dumps the old data from the RSS feed and simply processes the series' page itself, then lets the RSS feed update new chapters by incoming date for those specific series, or reprocesses its page once a week/month. Although that would probably take a lot of work and be hard to determine automatically. Just throwing out crazy ideas though, so don't take it too seriously.

fake-name
posted on
05:16 on 5 May

> I understand there's no automatic way to determine which chapters are 'duplicates' from RSS feeds between sites, but I was suggesting removing entire RSS feeds that are found to be exactly the same from the author updating to multiple sites if there are issues with one. Whether by a moderator making note of which ones aren't working, or comparing RSS feeds themselves and removing any that isn't from a preferred site(the one with the least issues overall). But it really doesn't matter that much.

That could work, though FWIW, RoyalRoad and ScribbleHub aren't scraped using RSS, which is how I can do clever things with them (like the numbering heuristics that aren't possible in an RSS context). Their feeds come from my actual web-spider.

What has access to what metadata is still pretty limited. The spider is independent from the wlnupdates codebase, and doesn't have access to some of the metadata.

Right now, the force numbering system is managed manually by me editing a specific json file. I don't really have moderation tools to any substantial extent, and I've been implementing anti-abuse measures as I go in response to spammers/etc....

Is there a reason you're using an app in the first place? Outside of offline reading, I've tried to make wlnupdates pleasant and usable on phones. I certainly use it myself on my ancient iPhone 5, and it works fine, but I'm never sure how much of that is because I know exactly how to do whatever I want, given I created the whole site.

If it's for the (generally poor) reading experience on other websites, that would also make sense. It's the motivation behind the project that spawned this website.


Really, I think what it comes down to is that with the integrated reading list tracking, duplicate chapters on different sites are a non-issue, which is why I'm disinclined to try to mitigate them. Also, RoyalRoad and ScribbleHub offer different reading experiences, so I'd like to keep the option of choosing which site to read on available as well.

Zajhein
posted on
06:11 on 5 May

Trying to get a mostly consistent and organized reading experience is why I've tried so many apps in the past. Being able to scroll or swipe from one chapter to the next in a customized reading mode that doesn't try to load comments, author notes, or advertisements is the biggest part. Then getting notifications from automatic updates for selected series, automatically marking chapters as read, along with being able to change the sort order and cover image all in one place would be nice.

Basically it would be perfect if someone took the https://github.com/inorichi/tachiyomi code and remade a branch app for web/light novels.

I've tried reading in a browser in the past but even broken apps seem to be simpler most of the time, maybe that's just me, or Tachiyomi simply spoiled me by demonstrating what's possible.

Zajhein
posted on
18:45 on 15 May

Not sure if I should create a new thread for this, but recently RoyalRoad chapters aren't being updated at all, and scribblehub is now producing duplicates for certain series like https://www.wlnupdates.com/series-id/107213/he-who-fights-with-monsters so I'm unsure if it's connected to the recent fixes you made or something new is going on.

fake-name
posted on
04:01 on 16 May

> Not sure if I should create a new thread for this

This is fine. I'd rather know about problems in any event, so ¯\_(ツ)_/¯.

> and scribblehub is now producing duplicates

This was a derptastic issue of me "spelling scribblehub two different ways" in the feeder systems, which prevented them from deduping against each other. They should be merged now.

I r smrt.
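In effect, that fix amounts to normalizing the source-site label before the dedup comparison. A hypothetical sketch (names and alias table made up for illustration):

```python
# Hypothetical sketch: collapse spelling variants of a source-site name
# onto one canonical label before any dedup comparison, so "Scribble Hub"
# and "scribblehub" can't slip past each other.
SITE_ALIASES = {
    "scribblehub": "ScribbleHub",
    "scribble hub": "ScribbleHub",
    "scribblehub.com": "ScribbleHub",
}

def normalize_site(name: str) -> str:
    return SITE_ALIASES.get(name.strip().lower(), name)

def dedup_key(release):
    """Two releases collide iff same (normalized) site, chapter, and title."""
    return (normalize_site(release["site"]), release["chapter"], release["title"])
```

With the labels merged, the two feeder spellings produce identical keys and dedupe against each other.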

> but recently RoyalRoad chapters aren't being updated at all,

I'm not sure what happened here. I think I'm having a queue bloating issue due to performance issues in the feed feeder, and it's losing updates as a result. I'll have to try to see what I can do to deal with the perf issues.

This is likely affecting all series on the site, which isn't great. I'm kind of surprised everything is working as well as it is. I guess the at-least-once delivery approach I've taken is surprisingly robust.
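At-least-once delivery only stays robust if processing is idempotent, i.e. redelivering the same release has to be harmless. A minimal sketch of such a consumer (record shapes and names are hypothetical):

```python
# Hypothetical idempotent consumer for an at-least-once queue: each
# release is keyed, and deliveries whose key has already been applied
# are silently skipped, so redelivery after a queue hiccup is harmless.
class ReleaseConsumer:
    def __init__(self):
        self.seen = set()
        self.applied = []

    def handle(self, release):
        key = (release["url"], release["chapter"])
        if key in self.seen:
            return False          # duplicate delivery, safely ignored
        self.seen.add(key)
        self.applied.append(release)
        return True
```

In a real deployment the seen-set would have to be persisted (or derived from the DB itself) so restarts don't reset it.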