Stupid Bits From Waterfall's Code Over The Last Eighteen Months
I'm going to be honest, I know you're meant to use title case in titles but I don't actually know which words you're not meant to capitalise
I was dealing with a bug report in the Discord earlier and it hit me - all currently known bugs are artefacts of some code I wrote over a year ago and shoved in a file, never to be touched again.
Or, well, no, that's a little bit of an exaggeration. But not much - working on Waterfall has taught me a lot over the last year, so I thought, while I'm working on the API rewrite, it might be amusing to go over some of the more stupid bits of code I've written, the bad decisions, and just things that made me laugh.
All the dumb shit here is fixed in the API rewrite.
The bug report that spawned this post is this one, actually. Someone knows they've reblogged something, but the button never turns green. Annoying, since, other than XKit, we were the first Tumblelog site to implement that feature to my knowledge. Let's look at what could be the problem, shall we?
Notes are stored in their own table in the database, and it's currently over a million entries long. If something is meant to be displayed to a user, it comes from this table. In the case of detecting a reblog, it searches a combination - has [blog ID] reblogged [post source ID]? If yes, it'll be green.
The problem here arises from the simple fact that the notes table is not expected to be 100% consistent. In fact, it's the lowest priority part of the site - while obviously it's not desired, it's considered that if some notes don't get logged, that's fine. Avoid it if possible, but if it happens, no big deal. All the data to reconstruct the table is elsewhere anyway. We could nuke it and have every entry regenerated in less than an hour if we needed to.
And there's the problem. Searching the notes table is inefficient and stupid, especially since it's not consistent. What we SHOULD be doing is searching the Posts table, and asking whether that blog has reblogged a post with the same source post. About 5x faster, and significantly more accurate.
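For the curious, here's a rough sketch of what "search the Posts table instead" could look like. This is not the actual Waterfall code: the table name `posts`, the columns `blog_id` and `source_post_id`, and the function `has_reblogged` are all made up for illustration, and it uses an in-memory SQLite database purely so it runs anywhere.

```python
import sqlite3

def has_reblogged(conn, blog_id, source_post_id):
    """Return True if the blog has any post whose source is the given post.

    Table and column names are hypothetical, not Waterfall's real schema.
    """
    row = conn.execute(
        "SELECT 1 FROM posts WHERE blog_id = ? AND source_post_id = ? LIMIT 1",
        (blog_id, source_post_id),
    ).fetchone()
    return row is not None

# Tiny in-memory demo with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts ("
    "id INTEGER PRIMARY KEY, blog_id INTEGER, source_post_id INTEGER)"
)
conn.executemany(
    "INSERT INTO posts (blog_id, source_post_id) VALUES (?, ?)",
    [(10, 500), (11, 501)],
)

print(has_reblogged(conn, 10, 500))  # True: blog 10 has a post sourced from 500
print(has_reblogged(conn, 10, 501))  # False: only blog 11 reblogged 501
```

The point of the sketch is just that the answer comes from the posts themselves, which have to be consistent, rather than from a notes log that's allowed to have gaps.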
NSFW tags are mandatory. Always have been. Side ramble: I've started seeing a couple blogs putting "I define NSFW as..." in their blog descriptions. No, you're wrong, go and read the rules again.
Anyway, the code, right now, has hardcoded checks for specific tags. When we overhauled the tag system a few months ago, we pre-seeded two tags - "nsfw" and "NSFW". These are tag IDs 1 and 2 respectively (others were added to the database in the order they first appeared in posts - so the first actual tag on the site has an ID of 3). If you're logged out, a minor, or just have adult content turned off, the site runs a separate query to get posts for you that deliberately excludes posts tagged with either of those. Over time, the query has been amended to also exclude tag 938, because someone on mobile did "Nsfw" at some point.
You can see where this is going. Every tag variation, it needs to be added manually to the queries. The same is true of DNR and DNI tags - variations in casing keep appearing and have to be added.
This doesn't go into people who use the tag "nsfw gif" or some such, and not the plain nsfw tag - those are completely omitted from the queries and not caught. There's a really, REALLY obvious solution I missed here, and I'm upset it took me so long to realise it.
When the post is loaded, just. Have it check the first four letters of all the tags and see if they match "nsfw", and if any of them do, skip the post and get another one if appropriate to do so. It's so simple. I'm so dumb. Writing this I realise I haven't gotten it set up in the API rewrite to use the same tactic for DNR and DNI yet. I'm an idiot.
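A minimal sketch of that first-four-letters check, in Python rather than whatever the site actually runs on. The function names are invented, and lowercasing the tag before comparing is my assumption about how the casing variations would be handled.

```python
def is_nsfw_tag(tag):
    """True if the tag starts with "nsfw", ignoring case and stray spaces."""
    return tag.strip().lower().startswith("nsfw")

def post_is_nsfw(tags):
    """True if any tag on the post looks like an NSFW tag."""
    return any(is_nsfw_tag(tag) for tag in tags)

print(post_is_nsfw(["art", "Nsfw gif"]))  # True: casing and suffix don't matter
print(post_is_nsfw(["art", "sfw"]))       # False
```

One check covers "nsfw", "NSFW", "Nsfw" and "nsfw gif" without anyone adding tag IDs to a query by hand. It still won't catch something like "not safe for work", so it's a cheap first filter rather than a complete one.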
3. Upload Limits
This was just dumb on my part. Most people think of megabytes as being 1000 kilobytes. For file sizes, the convention computers generally use is 1024 kilobytes (strictly speaking, that's a mebibyte). By the same convention, a gigabyte is 1024 megabytes, and a kilobyte is 1024 bytes, etc. When I added the upload limits, I completely forgot about this fact, as well as, apparently, the fact that I have a degree in computer science, and used 1000 as the figure.
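To show how much difference that makes, here's a small sketch. The 10 MB limit and the file size are invented numbers for the example, not the site's real limits.

```python
KIB = 1024        # bytes in a kilobyte, binary convention
MIB = 1024 * KIB  # bytes in a megabyte, binary convention

def within_limit(size_bytes, limit_mb, bytes_per_mb=MIB):
    """True if a file of size_bytes fits under a limit given in megabytes."""
    return size_bytes <= limit_mb * bytes_per_mb

file_size = 10_300_000  # a file the OS would report as roughly 9.8 MB

print(within_limit(file_size, 10, bytes_per_mb=1000 * 1000))  # False: the old bug
print(within_limit(file_size, 10))                            # True: 1024-based limit
```

With 1000 as the multiplier, a 10 MB limit is really 10,000,000 bytes instead of 10,485,760, so anything in that last 4.6% or so gets rejected even though it looks like it's under the limit.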
This is fixed now, but there is still an issue relating to uploading stuff near the limit that's probably some weird base64 encoding issue. Oh yeah, speaking of.
This particular bug related to when you're editing a post rather than making a new one. If you uploaded a gif in the original and go to edit it, the GIF plays normally. But when you hit save, it turns into a static image at whatever frame it was on when you hit save.
And you know what? There's at least three more I should put here. I really want to. There's another dozen I could put here but aren't as interesting. But I've held my head in my hands three times already reading stuff, so I'll leave it there.
The point is - this has been one hell of a learning experience.
Art Theft 2.0 - Overhauling the System
As far as I'm aware, no other site has an art theft prevention system. Waterfall's is the first, and I'm quite proud of it. However, it's an imperfect system, and I have some improvements I want to make to it.
Art Theft 1.0 - An Overview
Art Theft 1.0 (or, to be more accurate, the current version is 1.3 or so) is an extraordinarily simple piece of code. Unfortunately, extraordinarily simple means extraordinarily easy to defeat.
When a piece is marked as art, three things happen in order. First, the image is hashed to MD5. Then, a list of MD5s for all uploaded art that doesn't belong to the user uploading is retrieved. Finally, the hash is checked against each one. Quick and painless.
MD5 is best described as a "signature" that can be applied to something to see if it's the same thing as something else. It's old, and collision attacks have been demonstrated against it, but that's not the real problem here. The real problem is that changing a single byte in the file results in a different MD5 signature.
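Here's a small sketch of that check and of why it's so easy to defeat. It uses Python's standard `hashlib`; the function names, the user IDs and the idea of keeping the known hashes in a dict are all invented for the example, not how the site stores them.

```python
import hashlib

def md5_hex(data):
    """MD5 "signature" of some bytes, as a hex string."""
    return hashlib.md5(data).hexdigest()

def find_original_owner(data, uploader_id, known_art):
    """Return the owner of a byte-identical upload, or None.

    known_art maps MD5 hex string -> owner ID (a hypothetical structure).
    The uploader's own art is ignored, as in the 1.0 system.
    """
    owner = known_art.get(md5_hex(data))
    if owner is not None and owner != uploader_id:
        return owner
    return None

original = b"pretend these are the bytes of an image file"
tweaked = original[:-1] + b"E"  # same file with the last byte changed

known_art = {md5_hex(original): 1}  # user 1 uploaded the original

print(find_original_owner(original, 2, known_art))  # 1: lazy repost caught
print(find_original_owner(original, 1, known_art))  # None: it's their own art
print(find_original_owner(tweaked, 2, known_art))   # None: one byte defeats it
```

Re-saving, cropping by a pixel or changing the metadata all alter the bytes, so any of them gets past an exact-hash check like this one.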
This means that lazy thieves get caught, but anyone else doesn't. I've been thinking about this a lot over the last 6 months, and I think that - after the app is done - it's time to address these shortcomings.
Before I continue - while the system is easy to defeat, you'll still be banned if you crop out a watermark and upload it as your own thing.
Art Theft 2.0 - Rise of the Machines
The current system is instantaneous. You press upload, it tells you in a couple of seconds whether you're naughty or not. After a brief bit of thinking, I realised this cannot be the case with any sufficiently advanced system that can be called "decent". So first things first - when Art Theft 2.0 rolls out, there'll be two states to art posts: unverified and verified. The main difference is that unverified posts will just have a yellow icon instead of green for the art symbol in the corner of the header. If a post passes verification - no problem, nothing happens. If a post fails verification, it works the same way as it does now: it'll be silently converted into a reblog of the artist's original post. The major difference is that since there's a chance that post will be reblogged while it's awaiting verification, any reblogs of that post will need to be converted too. Luckily, the way the site stores post chains means this is not a problem at all.
While a post is in the unverified state, the site will be running the process in the background. Let's go over what it'll be. It's SIGNIFICANTLY more complex than the current method, and requires some special hardware, so before the cutoff, I'm going to link our Patreon. Ordinarily it's hidden down in the site footer because we feel weird about taking money without giving anything in return (other than... the site I guess?) but the faster we get the hardware to run this (and the more of it), the faster we can improve things.
The first stage of the process will be to split the image into a grid. For each square of the grid, an MD5 (or perhaps a SHA256) is generated. It'll then search for images where there are matching hashes. In theory, there should never be any unless the image is a straight duplicate. This is the low hanging fruit part - it can stop the process if it finds a match here.
But what if we modify a square?
We drew a line in. That square now gets a different MD5. But, since collisions CAN happen - for example, if someone does a plain white background, or transparency, and whatever grid size we use ends up capturing fully transparent/white squares - that's enough to throw it off. So instead, it'll go off how many squares are the same. This is a good time to introduce confidence scoring.
If all but one square returns an MD5 match to an existing image, we can say pretty confidently it's a repost. Let's say 98% confident. The system will reclassify that post as stolen, and change it to a reblog.
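To make the grid idea concrete, here's a toy sketch, assuming the image is already decoded into a 2D list of grayscale values (0-255). It uses SHA-256 from `hashlib`, and it takes "confidence" to be simply the fraction of squares whose hashes match. That formula, the function names and the 2-pixel square size are my assumptions for illustration; the real scoring may well be different.

```python
import hashlib

def tile_hashes(pixels, tile):
    """Hash every tile x tile square of a 2D pixel grid, row by row."""
    height, width = len(pixels), len(pixels[0])
    hashes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            square = bytes(
                pixels[y][x]
                for y in range(top, min(top + tile, height))
                for x in range(left, min(left + tile, width))
            )
            hashes.append(hashlib.sha256(square).hexdigest())
    return hashes

def grid_confidence(candidate, original, tile):
    """Fraction of squares in the same position with identical hashes."""
    a = tile_hashes(candidate, tile)
    b = tile_hashes(original, tile)
    if len(a) != len(b):
        return 0.0  # different grid shapes: this stage can't compare them
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# A 14x14 test "image" gives a 7x7 grid of 2x2 squares (49 squares).
original = [[y * 14 + x for x in range(14)] for y in range(14)]
edited = [row[:] for row in original]
edited[0][0] = 255  # "draw a line" through one square

print(f"{grid_confidence(original, original, 2):.2f}")  # 1.00
print(f"{grid_confidence(edited, original, 2):.2f}")    # 0.98 (48 of 49 squares)
```

Because the score is a fraction rather than a yes/no, a few blank white squares that happen to match between unrelated images only nudge the number up slightly instead of triggering a false positive.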
But what about more complicated scenarios? The grid system isn't the only method we'll be using, and the others are less clear cut. We need to assign a cutoff or two - how confident should the system be to act autonomously? If it's less confident, it should ask a mod to review it manually. At the same time, at some point, the system should be confident enough that it's not a repost that it doesn't bother us. We'll settle on 90% confidence for autonomous action and 20% confidence for not bothering us until we've refined the system. Now, let's go over the other methods we'll be using.
Resizing something is a common way to get around filters of this kind, so we need to keep records of different sizes of the art too. We'll also need to check for images being flipped - the same way that people uploading TV episodes on YouTube do to get around the copyright filters. Adding borders is something we need to check as well.
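Those two cutoffs boil down to a three-way decision, sketched below. The 90% and 20% figures are the ones from this post; the function name, the return labels and what happens exactly on the boundaries are assumptions.

```python
AUTO_ACTION_AT = 0.90  # at or above this, act without a human
IGNORE_BELOW = 0.20    # below this, don't bother the mods

def decide(confidence):
    """Map a repost confidence score (0.0 to 1.0) to an action label."""
    if confidence >= AUTO_ACTION_AT:
        return "convert_to_reblog"
    if confidence < IGNORE_BELOW:
        return "mark_verified"
    return "send_to_mod"

print(decide(0.98))  # convert_to_reblog
print(decide(0.55))  # send_to_mod
print(decide(0.05))  # mark_verified
```

Everything in the wide middle band goes to a human, which is deliberate while the system is new: the band can be narrowed later once the scores have proven themselves.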
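As a toy illustration of the flip and border checks, here's a sketch that works on a 2D list of pixel values. It generates a few "undone" versions of an upload and compares each one to the original exactly. All the function names are invented, the border trimming is deliberately naive (it assumes a single flat border colour taken from the top-left pixel), and resizing is left out because it needs a proper imaging library.

```python
def mirror(pixels):
    """Flip the image left-to-right."""
    return [row[::-1] for row in pixels]

def flip_vertical(pixels):
    """Flip the image top-to-bottom."""
    return pixels[::-1]

def trim_border(pixels):
    """Strip edge rows/columns that are entirely the top-left pixel's colour."""
    border = pixels[0][0]
    rows = [row[:] for row in pixels]
    while len(rows) > 1 and all(p == border for p in rows[0]):
        rows = rows[1:]
    while len(rows) > 1 and all(p == border for p in rows[-1]):
        rows = rows[:-1]
    while len(rows[0]) > 1 and all(row[0] == border for row in rows):
        rows = [row[1:] for row in rows]
    while len(rows[0]) > 1 and all(row[-1] == border for row in rows):
        rows = [row[:-1] for row in rows]
    return rows

def matches_known(candidate, original):
    """True if undoing a flip and/or a flat border recovers the original."""
    trimmed = trim_border(candidate)
    variants = [candidate, trimmed]
    variants += [mirror(v) for v in (candidate, trimmed)]
    variants += [flip_vertical(v) for v in (candidate, trimmed)]
    return any(v == original for v in variants)

art = [[1, 2], [3, 4]]
bordered = [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]

print(matches_known(mirror(art), art))       # True: flipped copy
print(matches_known(bordered, art))          # True: border added
print(matches_known([[9, 9], [9, 9]], art))  # False: unrelated image
```

In a real pipeline each variant would more likely be fed back into the grid comparison to get a confidence score, rather than being tested for an exact match like this.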
Why so Blue?
Another common way of getting around filters is changing the colours of something - either the colour itself, or the saturation of it. Checking this will be a pain in the ass, but is essential to a comprehensive theft prevention system.
I See You
Finally, the part that'll take the longest - visual comparisons. If all the above fail, there's a chance the thief is skilled. So, we pass it to an AI to look at. This, once we've gotten it working right, is all the above on steroids. It'll be able to look at it and say whether it's seen it before, and give a confidence score. If it gets to this stage, unless it's 100% certain, a mod will likely be required for intervention - after all, we've seen what visual recognition is like with Tumblr's porn filter.
This step has some nice bonuses, however. It'll be able to see if it's been blurred, is a cropped version blown up to full size, or whether it's seen something similar before but watermarks have been removed or text altered. It might even be able to tell whether it's a trace of someone else's work or what's been used as a reference - however, we're consciously choosing not to intervene on that stuff, and it's a waste of resources to try.
Here's what it makes of our test piece (rendered in paint - right now, we get a text readout of pixel areas that I've had to translate into something easy to read). As you can see, it's far from perfect, which is why manual mod intervention will be mandatory for this stage.
The above is about half the system we're implementing, excluding the experiments that are more curiosity rather than something serious to include. We're not listing them all here because the post will drag on a bit, and because we want there to be some element of secrecy so you can't find holes to defeat it.
It's a pretty complex system and our aim is that any given art piece should take no longer than 20 minutes to verify. In an ideal world, it'd be 5 minutes - but we're unlikely to have the budget for that any time soon, and the more art is uploaded, the longer it'll inherently take.
Suffice it to say - our art theft prevention system, while unique, is flawed. We want to fix it, and we want to share with you what we're doing to improve on it.
Thanks for reading!
Going forward, all important dev news will be on this blog instead - that way you can keep up to date with "important" site stuff without seeing me ramble about stuff you don't care about.