Ok, so I've been working hard on this during the past week and the idea has evolved.
I still think it'd be good to develop some kind of cache system in the future, but that's not a priority.
The main idea is to allow fast access to ordered information. To do that, I've developed a website that allows anyone to search other people's entries on a particular topic and see only the information these people found to be relevant. Each entry also includes the source URL the information came from. For now the website is accessible through https://wired.sa.muel.coffee/
I'm also currently working on a Google Chrome extension that lets users:
- Add new entries simply by selecting text, images and links from pages and adding them to the new entry, then manually edit the entry if they want to and send it to the entry database. (not implemented yet, WIP)
- Highlight information on the current page that matches entries which have this same webpage as their source (or Origin). (implemented, see the screenshot)
- Search the entries database directly from the extension's popup. (implemented, see the screenshot)
I've changed my mind about the way to access information: I think the good old search bar is the best, but the problem is that there are too many results for the same search terms, because Google and the web in general are full of commercial ads. A search engine that only references useful knowledge (useful according to the entries' authors) can make searching more efficient.
However, I'm still looking for a way to filter the content and prevent people from posting ads. For now I'm thinking of using a machine learning algorithm to detect ad posts, and also of checking whether the content matches, even partially, the source (so that they'd have to create a real webpage to post ad content). So, if any of you have even the slightest idea of how to filter out spam posts based solely on their content, feel free to reply.
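To make the "content matches the source, even partially" check concrete, here is a rough sketch of one way it could work: split both texts into overlapping 3-word sequences (shingles) and measure what fraction of the entry's shingles also appear on the source page. The function names, the shingle size and the 0.3 threshold are all placeholders I made up for illustration, not what the site actually does, and fetching the source page's text is left out.

```python
import re


def shingles(text, n=3):
    """Return the set of n-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < n:
        # Text shorter than one shingle: keep it as a single short tuple.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def source_overlap(entry_text, source_text, n=3):
    """Fraction (0.0 to 1.0) of the entry's shingles that also occur in the source."""
    entry = shingles(entry_text, n)
    if not entry:
        return 0.0
    return len(entry & shingles(source_text, n)) / len(entry)


def looks_unsupported(entry_text, source_text, threshold=0.3):
    """Flag an entry whose text barely overlaps with its claimed source page.

    The threshold is an arbitrary assumption and would need tuning on real entries.
    """
    return source_overlap(entry_text, source_text) < threshold


source = "DBSCAN groups points that are closely packed together and marks outliers as noise."
entry = "DBSCAN groups points that are closely packed together"
ad = "Buy cheap watches now at our amazing online store"

print(source_overlap(entry, source))
print(looks_unsupported(ad, source))
```

An entry that was heavily edited by hand would score lower than a copy-pasted one, so a check like this could probably only be one signal among others rather than a hard filter.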
Finally, I'm still thinking about clustering the articles. I first thought of using Linear Discriminant Analysis, but it needs to know the class (cluster) labels in advance. Then I thought of k-means, but you need to specify the number of clusters, so for now I'm planning on using DBSCAN clustering (on a friend's advice). Once there are enough entries (i.e. when I've finished the extension and opened a "beta" stage so that people can start sending entries), I'll use the clustering to link each entry to the other entries in the same cluster, so when people browse the website and find an entry on a specific topic, they can easily access related entries and discover associated concepts they didn't know about.
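Here is a small sketch of what the DBSCAN idea could look like, mostly to show that it needs no cluster count up front and that entries which fit nowhere are simply marked as noise. It is a minimal pure-Python version that compares entries by Jaccard distance between their word sets. The distance measure, the eps and min_samples values and every function name here are my own assumptions for the example. A real setup would more likely use an existing implementation (e.g. scikit-learn's DBSCAN) on better text features.

```python
import re

NOISE = -1


def tokens(text):
    """Lowercased set of words in an entry."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; 0.0 for identical sets, 1.0 for disjoint ones."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(points, eps, min_samples, dist):
    """Minimal DBSCAN: returns one label per point, NOISE (-1) for outliers."""
    n = len(points)
    # Each point's eps-neighborhood, including the point itself.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep growing the cluster
    return labels


def related_entries(labels):
    """Map each entry index to the other indices in its cluster (none for noise)."""
    by_cluster = {}
    for i, label in enumerate(labels):
        if label != NOISE:
            by_cluster.setdefault(label, []).append(i)
    return {i: [j for j in by_cluster.get(label, []) if j != i]
            for i, label in enumerate(labels)}


entries = [
    "DBSCAN is a density based clustering algorithm",
    "DBSCAN clustering is a density based algorithm for data",
    "Sourdough bread needs flour water and salt",
    "Sourdough bread recipe with flour water salt",
    "Cheap watches for sale today",
]
labels = dbscan([tokens(e) for e in entries], eps=0.5, min_samples=2,
                dist=jaccard_distance)
print(labels)
print(related_entries(labels))
```

The mapping returned by related_entries is the "link each entry to the other entries in the same cluster" part: the website could show those indices as related entries, and noise entries would just have no links until similar ones show up.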
Anyway, if you want to talk about this project in more depth, check out arisuchan's Discord. I'm usually on it, just link this post and I'll show up.
I named the project "The Wired", because every entry is connected to another.