/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.

LynxChan updated to 2.5.7, let me know whether there are any issues (admin at j dot w).


Reports of my death have been greatly overestimated.

Still trying to get done with some IRL work, but should be able to update some stuff soon.

#WEALWAYSWIN


Welcome to /robowaifu/, the exotic AI tavern where intrepid adventurers gather to swap loot & old war stories...


Datasets for Training AI Robowaifu Technician 04/09/2020 (Thu) 21:36:12 No.2300
Training AI and robowaifus requires immense amounts of data. It'd be useful to curate books and datasets to feed into our models, or possibly build our own corpora to train on. The quality of data is really important: garbage in, garbage out. The GPT-2 pre-trained models, for example, are riddled with 'Advertisement' after paragraphs. Perhaps we can also discuss and share scripts for cleaning and preparing data here, and anything else related to datasets.
To start, here are some large datasets I've found useful for training chatbots:
>The Stanford Question Answering Dataset
https://rajpurkar.github.io/SQuAD-explorer/
>Amazon QA
http://jmcauley.ucsd.edu/data/amazon/qa/
>WikiText-103
https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
>Arxiv Data from 24,000+ papers
https://www.kaggle.com/neelshah18/arxivdataset
>NIPS papers
https://www.kaggle.com/benhamner/nips-papers
>Frontiers in Neuroscience Journal Articles
https://www.kaggle.com/markoarezina/frontiers-in-neuroscience-articles
>Ubuntu Dialogue Corpus
https://www.kaggle.com/rtatman/ubuntu-dialogue-corpus
>4plebs.org data dump
https://archive.org/details/4plebs-org-data-dump-2020-01
>The Movie Dialog Corpus
https://www.kaggle.com/Cornell-University/movie-dialog-corpus
>Common Crawl
https://commoncrawl.org/the-data/
Open file (48.25 KB 866x527 Selection_275.png)
seems interesting. > https://the-eye.eu/public/AI/
>>2303 >related xpost >>8773
>Danbooru2020: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset
>Danbooru2020 is a large-scale anime image database with 4.2m+ images annotated with 130m+ tags; it can be useful for machine learning purposes such as image recognition and generation.
https://www.gwern.net/Danbooru2020
Found a really useful tool, pdf2txt.py, for converting PDFs into text, particularly research papers:
https://github.com/euske/pdfminer
It's used by this project to turn Arxiv into a dataset, but I don't have 1 TB to mirror 1.5 million papers :^)
https://github.com/mattbierbaum/arxiv-public-datasets
It works a lot better than ebook-convert and performs best with the -V option to fix fragmented text. Just need a script to unwrap the text and clean it up. Once I finish my script I'll create a dataset of the 800 or so papers I have saved. There's a lot of great curated information in there focused on solving robowaifu AI. Hopefully it will be useful for training AI research assistants and mentors.
I've already gotten a ton of great ideas and learned some things I didn't know about just by discussing stuff with the standard GPT2-medium model and using some prompt engineering. Being able to finally train on Arxiv papers and get meaningful answers to specific questions will be amazing. I imagine tutorials will become a thing of the past once we can create AIs to answer questions around a given topic.
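For reference, a minimal sketch of the kind of unwrap/cleanup pass described above, assuming pdf2txt.py output where blank lines separate paragraphs and lines are hard-wrapped mid-sentence; the regexes and filenames are illustrative, not the anon's actual script:

# Minimal unwrap pass for pdf2txt.py output (assumption: blank lines delimit
# paragraphs and lines inside a paragraph are hard-wrapped).
import re
import sys

def unwrap(text):
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for p in paragraphs:
        p = re.sub(r"-\n(?=\w)", "", p)          # rejoin words hyphenated at line breaks
        p = re.sub(r"\s*\n\s*", " ", p).strip()  # unwrap hard-wrapped lines into one line
        if p:
            cleaned.append(p)
    return "\n\n".join(cleaned)

if __name__ == "__main__":
    print(unwrap(sys.stdin.read()))

Roughly: pdf2txt.py -V paper.pdf | python unwrap.py > paper.txt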
>>9240
>but I don't have 1 TB to mirror 1.5 million papers
I do. I'll look at it and see what I can do.
>>9240 So, give me some idea which of the 3 scripts you'd like to use Anon? The scripts in bin will then create any of the three subdirectories:
$ARXIV_DATA/tarpdfs # raw pdf files from Amazon AWS bucket
$ARXIV_DATA/fulltext # .txt from raw .pdf
$ARXIV_DATA/output # co-citation network, parsed author strings, etc
>>9240 I decided to go with fulltext and, unsurprisingly, Python broke on me. WHY DOES PYTHON HATE ME SO LOL? :^) Apparently you have to have some kind of account for authorization or something.
>>9246 The fulltext, but it's not much use to me if I can't train my models on it. I'll probably get a couple of new hard drives soon. Arxiv does provide a public API for grabbing papers, though: https://arxiv.org/help/api/
I don't have the resources to train on all 1.5 million anyway. Training on a GB of text takes about a day on my machine. It would be more useful to have a web crawler that goes through all the citations of selected papers, downloads the ones it can find, and also grabs any papers matching specific keywords. Arxiv citations are pretty easy to scrape from the fulltext papers.
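As a hedged illustration of that API (not the crawler itself), something along these lines pulls titles and PDF links from the public endpoint; the query string and result handling are just examples:

# Rough sketch of querying the arXiv API for paper metadata and PDF links.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def search_arxiv(query, max_results=10):
    url = ("http://export.arxiv.org/api/query?search_query=" +
           urllib.parse.quote(query) + f"&start=0&max_results={max_results}")
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    for entry in feed.findall(ATOM + "entry"):
        title = entry.find(ATOM + "title").text.strip()
        pdf = next((link.get("href") for link in entry.findall(ATOM + "link")
                    if link.get("title") == "pdf"), None)
        yield title, pdf

for title, pdf in search_arxiv("all:chatbot AND all:dialogue", max_results=5):
    print(title, "->", pdf)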
Open file (7.68 KB 146x159 dorothy-haze.png)
>>6737 I've fixed some errors in the generated anime transcripts. There was a common issue with two characters speaking on the same line, and some genius using lowercase Ls for Is. Having fixed that and included the missing files, there's 20% more data now, plus a validation and test set.
Text files for training: https://files.catbox.moe/b5d37o.xz
Raw data and scripts to rebuild transcripts: https://files.catbox.moe/mpx203.xz
Also transcribed an hour of Dorothy Haze's dialog from VA-11 Hall-A: https://files.catbox.moe/x0f3g5.xz
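The actual cleanup rules live in the linked scripts; as a rough illustration of the lowercase-L fix mentioned above (the regex and example string here are made up, not taken from those scripts):

import re

def fix_l_as_i(line):
    # Replace a lone lowercase 'l' used as the pronoun 'I' (covers l, l'm, l've, ...).
    # Coarse rule: it will also hit any legitimately standalone 'l'.
    return re.sub(r"\bl\b", "I", line)

print(fix_l_as_i("l'm sure l can fix it"))  # -> I'm sure I can fix it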
>>9408 Awesome. Thank you Anon, good work. Just think of what must be millions of man-hours of focused effort that have gone into our favorite Animus. Much of that effort was pre-production effort in the Story Depts. The script can be said to be a crystallization of all this human effort focused on creating scenarios & characters pleasing to males. So in effect, what you've put into our hands represents the consolidation of millions of hours of devoted efforts of extraordinarily hard-working artists from Japan. All of which is now ready to go back to work creating ever.more.devoted. robowaifus for all of us. I'd call that an important gift!
I prepared the Cornell Movie-Dialog Corpus into a text file for training, with <|endoftext|> tokens between conversations: https://files.catbox.moe/pvi2ef.xz
Website: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
If I messed something up or the file is taken down, here's the Python script to regenerate it:

import json

# movie_lines.txt: lineID +++$+++ characterID +++$+++ movieID +++$+++ character name +++$+++ text
lines = open("movie_lines.txt", "rb").read().decode("utf-8", errors='ignore').strip().split("\n")
line_db = {}
for line in lines:
    fields = line.split(" +++$+++ ")
    line_db[fields[0]] = (fields[3], fields[4])

# movie_conversations.txt: the last field is a list of line IDs making up one conversation
conversations = open("movie_conversations.txt", "rb").read().decode("utf-8", errors='ignore').strip().split("\n")
for conversation in conversations:
    conversation_lines = json.loads(conversation.split(" +++$+++ ")[3].replace("'", '"'))
    for line_id in conversation_lines:
        speaker, text = line_db[line_id]
        print(f"{speaker.title()}: {text}")
    print("<|endoftext|>")
>>9500 Thanks, got it. Especially appreciate you showing us how it's done, too!
Open file (78.98 KB 965x631 human feedback.png)
Anyone here able to run GPT-2 on their GPU and wanna help build a dataset? I've made a script for creating data to train a reward model by using GPT-2 to generate conversations between two random characters (bot-to-bot). Basically all you have to do is pick the best responses it generates until the max token length is reached, then it starts generating another conversation. You can also write in your own responses if the generation is particularly bad or stuck.
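A hand-wavy sketch of that generate-then-pick loop using the transformers library; the character names, sampling settings and pick-by-number interface are illustrative assumptions, not the actual script:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

def candidates(prompt, n=3, max_new_tokens=40):
    ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(ids, do_sample=True, top_p=0.9, temperature=0.8,
                             num_return_sequences=n,
                             max_length=ids.shape[1] + max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)
    # keep only the newly generated text, cut at the first newline
    return [tokenizer.decode(o[ids.shape[1]:], skip_special_tokens=True).split("\n")[0].strip()
            for o in out]

speakers = ["Dorothy", "Jill"]                  # hypothetical characters
history, turn = "Dorothy: Welcome back!\n", 1
while len(tokenizer.encode(history)) < 512:     # stop near the max token length
    prompt = history + speakers[turn] + ":"
    options = candidates(prompt)
    for i, opt in enumerate(options):
        print(i, opt)
    pick = input("pick 0-2, or type your own reply: ")
    reply = options[int(pick)] if pick.isdigit() else pick
    history = prompt + " " + reply + "\n"
    turn = 1 - turn
print(history)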
Open file (183.92 KB 850x650 1616562664087.png)
>>9503 Looks like an interesting project. Might even turn out to be a rudimentary beginning to anon's Robowaifu@home idea (>>8958), in that we start sharing our own efforts onto a common dataset. Do you have any plans to redistribute the results openly for everyone's benefit back here Anon? Also, this vaguely reminds me in a fashion of anon's >replacing rewards with examples post (>>9438 and following). I wonder if you might be able to kind of integrate that sort of approach into your project?
>>9512 The dataset will be maintained on GitLab. I think that's the easiest way for both contributing data and getting updates, and people can fork it to make different versions if they wish.
The recursive classification algorithm is similar to temporal difference learning and needs time steps. I can see this algorithm being useful for completing tasks in a conversation, but I don't think this dataset will be much help to it since there aren't any tasks being solved, unless the end results of conversations are reformulated into tasks somehow. Recursive classification still needs another paper or two to develop it into solving multiple tasks, and ones it has never seen before in training, for unsupervised learning.
There are other ways though to make this dataset useful beyond a human-feedback reward model. MuZero's dynamics model, which predicts the next state given an action and the current state, could be modified into seeking a goal state rather than trying to win a board game. Given enough processing power and hindsight experience replay (HER) to learn from mistakes, it might be able to learn how to lead a conversation to a target state. The recursive classification algorithm's results aren't quite as impressive as HER, which can learn multiple different tasks, and there has been a significant improvement to HER by combining it with expectation maximization: https://arxiv.org/abs/2006.07549
>>9517 OK, count me in if I can manage to run it with you. My best box has an i7 in it. It's for school, but I can probably set up a dual-boot on it. Is Ubuntu good enough, distro-wise (Python versions, etc.)? I would appreciate detailed, tutorial-style setup, operation, and results pushes, etc., if you would please.
I would recommend you consider approaching this effort the way the Foldit guys did (>>9028). Namely, fashion it sort of like a game that anons can 'play'. The fundamental premise seems to lend itself to this paradigm, and we're more likely to see ongoing participation by only vaguely interested anons that way IMO.
This also sounds like a project that should probably have its own thread. That way the long chain of (entirely unrelated) dialogue about development doesn't detract from this dataset thread, which seems more like it should be a 'library archive' type thread to me.
>>9520
>and we're more likely to see ongoing participation by only vaguely interested anons that way IMO.
Also, on this same general tack, what about the idea of setting up a server online and letting many, many anons 'play' along with the waifu this way? After all, in this scenario it's not the raw horsepower and number-crunching that is the valuable thing; rather, obtaining the human reasoning needed to assess and score the reasonableness of any particular output is the valuable bit. There's a lot more that could be said, but we should probably hold off on it until you make a decision about a new project thread or not.
>>9520 I don't recommend running GPT-2 on a CPU because it's so slow. As the conversation gets longer it will take over half a minute to generate responses, even with just the small model and a fast CPU. If there was already a crude T5 chat model it'd be a different story, but until then an Nvidia GPU with at least 3 GB and CUDA 7.0 is needed, or Google Colab.
The code should be able to run on Linux, Windows or Mac just fine. I recommend using Python 3.8 or 3.7 with PyTorch 1.8.x and CUDA 11.1. CUDA installation will depend on your platform. I'm not familiar with other distros, but on Debian Linux buster-backports provides CUDA 11.1. Older cards from 2018 or earlier should be fine with just CUDA 10.2, which PyTorch also supports. Make sure when installing Python on Windows to check 'Add Python 3.x to PATH', otherwise Python won't run from the terminal.
Then get PyTorch: https://pytorch.org/get-started/locally/ (it's easiest to install with pip)
Once Python and PyTorch are installed, install the transformers library from a terminal with:
python -m pip install --user transformers
This should be all you need.
>>9523 If I had the money for a server I'd rent a GPU instance, train it on extra data and have it done in a few hours. The motivation of this dataset is to train the model with as little compute as possible. I just need a little help for now to push it into a usable state so it can be used instead of GPT-2. T5 takes less than half a second on the CPU to generate a response. That'd make it much more approachable for anons to participate without expensive hardware. The validation perplexity is at 25 now, so there's hope.
Another idea I have is to train the reward model with sentences taken from other samples in the dataset that don't match the conversation, or by discriminating its own generated responses as a GAN. I think the former would help it learn more sensible replies; a GAN might be too unstable to train. I'll make a thread for it later so these posts can be moved there.
>>9526 Alright. I'll try to set up a dual-boot before the upcoming weekend is out. Debian, is it Anon?
>The validation perplexity is at 25 now so there's hope.
Good news. Please keep us up to date Anon. It would be absolutely marvelous to be able to run a reasonably responsive and competent chatbot on embedded hardware today. And I would say there's no need to make a new thread unless you yourself deem it reasonable.
>>9569 Yeah, personally I prefer it because they supported my ancient Pentium 4 for nearly two decades, but it tends to lag behind in updates because of that support for older systems. It took over a year before I could even use the full capabilities of the GPU I bought two years ago. Mint and Ubuntu are the easiest to install and use, and tend to have more recent updates if you don't need that long-term stability.
The chat model has been inching forward but has slowed to a crawl. I've been playing around with different hyperparameters but no luck. I think it has bottomed out, unless I start doing 2-hour optimizer steps with a ridiculous number of gradient accumulation steps. The validation set perplexity is at 20 now, which isn't bad but not quite good enough either. Once I start throwing other training tasks at it there should be further gains, and we might be able to skip GPT-2 completely. Ideally about 10 GB of data is needed for training, but I only have 40 MB of high-quality chat data, so the only option right now is to train on other data. I've written a Wikipedia and Stack Exchange scraper to soak up a ton of data; I just need to process it into tasks and train.
It all hinges on the reward model working well, really. Without a working reward model, creating this dataset won't make a big difference without a team of novelists to crank out 100 books' worth of data.
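For anyone unfamiliar with the gradient accumulation mentioned above, the generic PyTorch pattern looks roughly like this (a toy model, not the actual training loop): gradients from many small batches are summed before one optimizer step, simulating a large batch on limited VRAM.

import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(256)]

accum_steps = 64
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # scale so summed grads average out
    loss.backward()
    if (step + 1) % accum_steps == 0:   # one optimizer step per 64 micro-batches
        optimizer.step()
        optimizer.zero_grad()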
>>9575
>but I only have 40 MB of high-quality chat data
What data do you need? Could you use extracted dialogs from subtitles?
>>9595
>Could you use extracted dialogs from subtitles?
I think he's already doing that Anon: >>9408
>>9600 Ah, I see, but limited to the source of a single waifu. Finding out how she would say something, and then automatically rewording other dialogs with a script, might help create more data.
>>9595 Any data that can be formulated into a query and response can be used. At the moment I'm using subtitles from anime and movies with character names but I could potentially use ones without names and just predict the next sentence or line. Some of what it learns on other data and tasks will transfer to learning chat dialog. I could train it on a variety of other tasks like reversing the order of words in a sentence, labeling parts of speech, translation, determining whether recipes are highly rated or not, or go Jeopardy mode and predict the query from a response. The only limit of what you can teach a text-to-text transformer is what you can fit into text, your processing power, and the amount of data you have. The question is what would be the best data to train on? I have a very tiny amount of compute and can't test a hundred different things. It's not obvious what skills are necessary to comprehend a sentence either and what tasks will improve those skills. Basketball training drills for instance don't look anything like basketball but they significantly improve someone's performance.
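A small sketch of how such mixed tasks can be flattened into plain (input, target) text pairs for a text-to-text model; the task prefixes and examples here are made up for illustration:

def chat_example(speaker, query, response):
    return (f"chat: {speaker}: {query}", response)

def reverse_words_example(sentence):
    return ("reverse words: " + sentence, " ".join(reversed(sentence.split())))

def jeopardy_example(query, response):
    # "Jeopardy mode": predict the query from the response
    return ("what is the question: " + response, query)

examples = [
    chat_example("Jill", "How are you?", "I'm fine, just tired."),
    reverse_words_example("the quick brown fox"),
    jeopardy_example("What is the capital of France?", "Paris"),
]
for inp, tgt in examples:
    print(inp, "->", tgt)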
>>9603 Well, one thing I have is a metric boatload of raw JSONs of shitposting covering roughly a year and a half. Probably 60+ boards if I dug around. Now this is, again, shitposting, so YMMV. But many of these are deterministically post/response. I could imagine automatically going through it all, finding the many thousands of post/reply pairs, and then maybe getting a human involved to check that each is in fact a query/response pair. I've never tried to pull all the JSON out of these archives in their entirety, but I do it for /robowaifu/ here generally a few times a week. So it shouldn't be too difficult to pull those and push the archive file to catbox.
>>9605 BTW, I've already got a fairly well wrung-out mechanism for parsing the raw JSON into individual posts (BUMP, Waifusearch), so it might be wise if you can clearly specify how you'd like the data parsed out, and I could do a lot of preprocessing data-massaging in advance for you, instead of just handing you a big dump of unprocessed raw JSON files.
>>9607 Also BTW, if you haven't ever done so yet, you can examine exactly what the JSON I'm speaking of looks like for yourself Anon. For example: http://bhlnasxdkbaoxf4gtpbhavref7l2j3bwooes77hqcacxztkindztzrad.onion/robowaifu/res/2300.json https://alogs.theГунтretort.com/robowaifu/res/2300.json (Just replace the domain w/ the proper AlogSpace URI if you don't use Tor)
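A very rough sketch of mining (post, reply) pairs out of such a thread JSON; the field names below ("posts", "postId", "message") are assumptions about the LynxChan format and should be checked against the actual file:

import json
import re

def reply_pairs(path):
    thread = json.load(open(path, encoding="utf-8"))
    posts = {p["postId"]: p["message"] for p in thread.get("posts", [])}
    for pid, text in posts.items():
        for ref in re.findall(r">>(\d+)", text):
            ref = int(ref)
            if ref in posts:
                yield posts[ref], text   # (quoted post, reply to it)

for query, response in reply_pairs("2300.json"):
    print(query[:60], "->", response[:60])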
>>9603 That post reminded me that there's a fan-run website about the game show Jeopardy. Their archive has over 400,000 clue-question pairs: https://j-archive.com/
I think the more funny and intelligent a quiz show is, the less useful it is for building basic understanding. The gold standard of entertaining quiz games is You Don't Know Jack IMHO. That stuff is too witty and punny to build anything faintly resembling common sense from. The questions from YDKJ are also too fragile, by which I mean slightly changing the wording is likely to completely screw with their meaning. Jeopardy isn't quite like that. Still, something duller and easier than Jeopardy would be better. Maybe some ROMs from quiz games aimed at children have good, boring, common-sense stuff in plain text.
Open file (15.55 KB 494x198 yummytea.png)
Open file (37.57 KB 985x318 simulation.png)
>>9671 Yeah, it just becomes a database lookup at that point. But I speculate it can learn some useful information from reversing common queries and responses. Like "I'm fine" is usually a response to "How are you?" We take this prior knowledge for granted, but unless a model learns it, it will fail to make any connection. An interesting future research project might be having the model generate its own tasks to learn, exploring new ways of looking at data, unsupervised.
Update to >>9503
The past few days I've tried a dozen different experiments using the hidden state of the T5 encoder to discern whether a response matched a query, compared to a response taken from another random query. Nothing was able to learn anything, which is kinda depressing because it might mean it won't do well later with a recurrent network modulating the hidden states. I'm not really sure why, but I suspect it's because T5 was pretrained for 100 GPU-years or whatever on text-to-text, and trying to train it to be used in a completely different way with a pitiful GPU in a few hours isn't happening.
So I started feeding those queries and responses into the T5 model, asking if the response makes sense for the query, and having it output the labels yes or no. Surprisingly it had no struggle learning this way, and the responses are becoming much more sensible, even though it's only capable of discerning the right answer 70% of the time so far. In its current state it might even be usable in place of GPT-2 for generating a chat dataset, although the quality still has a long way to go. The first image shown is a conversation generated by picking from 3 responses by T5-chat as Dorothy and entering my own input for Jill; in the second image, T5-chat plays Haruhi. With this working, the chat dataset project can now be used to train T5-chat, so I'll be making a thread for it once everything is ready to go.
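A loose sketch of that yes/no setup: pair each query with its real response ("yes") and a response stolen from another random query ("no"), then train T5 on the ordinary text-to-text loss. The prompt wording and toy data are illustrative, not the actual format used:

import random
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

pairs = [("How are you?", "I'm fine, thanks."),
         ("What's for dinner?", "Spaghetti, probably.")]

def make_batch(pairs):
    inputs, targets = [], []
    for i, (q, r) in enumerate(pairs):
        # steal a response from a different query as the negative example
        wrong = pairs[(i + random.randrange(1, len(pairs))) % len(pairs)][1]
        for resp, label in ((r, "yes"), (wrong, "no")):
            inputs.append(f"query: {q} response: {resp} relevant:")
            targets.append(label)
    enc = tokenizer(inputs, padding=True, return_tensors="pt")
    labels = tokenizer(targets, padding=True, return_tensors="pt").input_ids
    return enc, labels

enc, labels = make_batch(pairs)
loss = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels).loss
loss.backward()   # one toy training step; wrap in an optimizer loop for real use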
>>9671
>Maybe some ROMs from quiz games aimed at children have good boring common-sense stuff in plain text.
Yeah, we're dealing with some odd juxtapositions in our endeavors here, and this is a fundamental one tbh. IMO, the only reasonable hope we have ATM of AI communications that will stand up to hours of engagement are ones that are necessarily 'dumbed down' to a child's level. (BTW, even that is already vastly beyond any of the animals, so a rather notable achievement.) However, we also plainly want to be able to engage with our waifus as adults, as well. A bit of a conundrum for now, I'd say.
>sage for off-topic
>>9678
>chats
lol. Well, sounds like you made a nice breakthrough by using a wonderfully simplistic 'trick'. That's encouraging. No doubt we'll all be able to tick off 100 GPU-years of compute on our 'smart' watches in a few years, but for now this is very likely our best kind of approach by far. That is, finding clever approaches that get right to the heart of the problem.
I just mentioned Chatterbot somewhere else; here's the corpus. The URL might get mangled by the forum software's word filter: https://github.com/gunthercox/chatterbot-corpus - this link https://chatterbot.readthedocs.io/en/stable/corpus.html might be better anyway, since it comes with some explanations.
>>9753
>conversations:
>have you read the communist
>yes, marx had made some interesting observations.
>stock market
>you can never really predict the stock market.
>stock market
>my lawyer said i shouldn't give stock tips online.
>stock market
>mutual funds might be better unless you are wealthy.
>stock market
>i'm not sure an individual alone can really beat the market.
>56 KB
Top-tier conversation quality
>>9765 I gave an answer on how to handle this, but I put it in the thread about chatbots, here: >>9780
Open file (40.28 KB 1112x1075 Selection_003.jpg)
Not sure if this is the right thread, OP; just let me know and I can delete it if not. In this video (>>10463), the author promotes the Weights and Biases papers page. It now redirects to a community page that seems like it might be interesting to the ML practitioners here on /robowaifu/.
Open file (174.76 KB 1196x828 archive.moe.png)
Some archives of 4chan posts from 2008-2015
SQL database: https://archive.org/download/archive-moe-database-201506
Files: https://archive.org/details/@archivemoe
Penfifteen Archive from 2004-2008: https://archive.org/details/studionyami-com_penfifteen-2012-03-05
And moar post archives: https://wiki.archiveteam.org/index.php/4chan
I'm working on some dataset-generating scripts for finetuning language models, including image-post pairs for multimodal training >>11731
It'll take a few months to download and process all the data. My plan is to compress the images to 384x384 webp files so each dataset isn't 200+ GB per board (/v/ is over 2 TB). SqueezeNet's input size is 227, AlexNet's is 256 and VGG's is 224, so I think that is sufficient and leaves room for data augmentation. If someone has the hardware to train StyleGAN2 at 512 or 1024, I'm sure they can download the archives and regenerate the dataset with the scripts. I'll release the image datasets and each board separately so people can pick what they want. Also, if anyone wants to help, I'll post the scripts when they're ready.
>>11778 Bandwidth is a real issue for me currently. I'll try to help out later.
>4chan posts from 2008-2015
Nice. Pretty classic era tbh.
>>11778
>My plan is to compress the images to 384x384 webp files so each dataset isn't 200+ GB per board (/v/ is over 2 TB).
Good thinking. How are you planning to situate each selection frame for each image, Anon? Seems quite impractical to do by hand, yet there's a need for accurate placement/pre-scaling to capture the vital essence of each image, humanly speaking.
>>11782 It took me three days just to download the database, 57.5 GB compressed.
>>11889 I will be resizing the largest dimension down to 384, or the smallest dimension up to 256. That way models can select any 256x256 crop of that, perhaps using a spatial transformer network to position the crop. However, GIFs and webms will pose a significant challenge; I will skip those for now.
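A rough Pillow sketch of that resize rule (largest side down to 384, otherwise smallest side up to 256); the quality setting and filenames are illustrative, and it approximates rather than reproduces the actual preprocessing scripts:

from PIL import Image

def resize_for_dataset(path, out_path, max_side=384, min_side=256):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    if max(w, h) > max_side:
        scale = max_side / max(w, h)      # shrink so the largest side is 384
    elif min(w, h) < min_side:
        scale = min_side / min(w, h)      # enlarge so the smallest side is 256
    else:
        scale = 1.0
    img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    img.save(out_path, "WEBP", quality=80)

resize_for_dataset("1616562664087.png", "1616562664087.webp")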
Wake the fuck up robofucker, we got an imageboard to burn: https://files.catbox.moe/6pslbq.xz These are /robowaifu/ posts up to July 2021 containing a post chain for each post. The script to regenerate it from the json files is included. The chains have been shuffled to avoid repetitions that would throw transformers into seizures. Now there's no excuse not to have a shitposting waifu while wasting time talking about making a robowaifu. Pray to God she motivates you into actually working.
>>12044
>It took me three days just to download the database, 57.5 GB compressed.
Lel. Please go easy on us assistants, and try to figure out how to slice that up into MUCH smaller parts so we can help you out here, Anon. Think 'push the compute load out to the edges of the network', not 'saturate the wireline'. :^) This is an exciting project idea, I sure hope you can pull it off. It should revolutionize the whole 'fun with waifus' paradigm!
>>12051 LOL. That was fast Anon, thanks!
>>12052 I can handle the database processing. The issue is the images. They're split up into 10 GB tar files which can be extracted with:
cpio -D output_path -ivd -H tar < images.tar.ab
However, some of the files will be lost doing it this way, since they're one tar file split into multiple.
Raiders of the Lost Kek
3.5 years of /pol/ posts, June 2016 - November 2019 (in JSON format)
Paper: https://deepai.org/publication/raiders-of-the-lost-kek-3-5-years-of-augmented-4chan-posts-from-the-politically-incorrect-board
Download: https://zenodo.org/record/3606810
sudo apt-get install zstd
unzstd pol_0616-1119_labeled.tar.zst
tar -xvf pol_0616-1119_labeled.tar
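A hedged sketch for skimming the extracted dump, assuming it unpacks to a newline-delimited JSON file where each line is one thread in 4chan-API style (a "posts" list with "no"/"com" fields); the filename and field names are guesses, so verify them against the actual file:

import json

with open("pol_0616-1119_labeled.ndjson", encoding="utf-8") as f:
    for line in f:
        thread = json.loads(line)
        for post in thread.get("posts", []):
            com = post.get("com", "")            # post body, raw HTML
            print(post.get("no"), com[:80])
        break                                    # just peek at the first thread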
>>12194 LOL. I'm shocked they published this publicly. Seems likely it's not earning them any brownie points that way? Regardless, better get it while you still can Anon. Save.Everything.
>>12194 Interesting project Anon. Vaguely curious about how much did they
Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board
>abstract
>This paper presents a dataset with over 3.3M threads and 134.5M posts from the Politically Incorrect board (/pol/) of the imageboard forum 4chan, posted over a period of almost 3.5 years (June 2016-November 2019). To the best of our knowledge, this represents the largest publicly available 4chan dataset, providing the community with an archive of posts that have been permanently deleted from 4chan and are otherwise inaccessible. We augment the data with a set of additional labels, including toxicity scores and the named entities mentioned in each post. We also present a statistical analysis of the dataset, providing an overview of what researchers interested in using it can expect, as well as a simple content analysis, shedding light on the most prominent discussion topics, the most popular entities mentioned, and the toxicity level of each post. Overall, we are confident that our work will motivate and assist researchers in studying and understanding 4chan, as well as its role on the greater Web. For instance, we hope this dataset may be used for cross-platform studies of social media, as well as being useful for other types of research like natural language processing. Finally, our dataset can assist qualitative work focusing on in-depth case studies of specific narratives, events, or social theories.
For speech recognition, Mozilla has the "Common Voice" corpus with multiple languages, if anyone is interested. English alone is over 2,000 hours of spoken phrases, about 65 GB. The other languages I looked at were around 20 GB each, but it varies. You can also contribute to the project yourself by recording or verifying phrases. https://commonvoice.mozilla.org/en
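A tiny hedged sketch for walking a downloaded Common Voice language pack, assuming the usual layout of a validated.tsv (with "path" and "sentence" columns) next to a clips/ folder of mp3s; the directory name is hypothetical and the columns should be checked against your release:

import csv
import os

cv_dir = "cv-corpus-en"                          # hypothetical extraction path
with open(os.path.join(cv_dir, "validated.tsv"), encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        audio = os.path.join(cv_dir, "clips", row["path"])
        print(audio, "->", row["sentence"][:60])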
>>14321 Thanks, that might be useful. I wonder if we can also use subtitle files with their related shows and movies.
>>14325 waifusearch clipchan
>>14326 Lol, how could I forget. Blackout, because of insufficient sleep.
