/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.



NLP General Robowaifu Technician 09/10/2019 (Tue) 05:56:12 No.77
AI Natural Language Processing general thread

>"Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."
en.wikipedia.org/wiki/Natural_language_processing
https://archive.is/OX9IF

>Computing Machinery and Intelligence
en.wikipedia.org/wiki/Computing_Machinery_and_Intelligence
https://archive.is/pUaq4

m.mind.oxfordjournals.org/content/LIX/236/433.full.pdf
>Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
https://rajpurkar.github.io/SQuAD-explorer/
Know What You Don't Know: Unanswerable Questions for SQuAD
https://arxiv.org/abs/1806.03822
>"Google Brain’s Neural Network AI has reportedly created its own universal language, which allows the system to translate between other languages without knowing them."
www.breitbart.com/tech/2016/11/26/google-ai-creates-its-own-language-to-translate-languages-it-doesnt-know/
https://archive.is/Ky2ng

research.googleblog.com/2016/11/zero-shot-translation-with-googles.html
https://archive.is/MJ0lN
>>909
Google's Neural Machine Translation paper itself.
arxiv.org/abs/1611.04558
https://archive.is/Y9WY7
Stanford Log-linear Part-Of-Speech Tagger

>"A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc…"
Written in Java.
nlp.stanford.edu/software/tagger.shtml
https://archive.is/ybYtO
>>911
SO tag.
stackoverflow.com/search?q=stanford-nlp
https://archive.is/qtysd
>>911
>>912
StanfordNLP's github. I'm currently playing with the GloVe repo.
github.com/stanfordnlp
https://archive.is/UbhQF
NLTK, a set of Python modules for learning and developing NLP applications.
www.nltk.org/

>"The book is well-written, with plenty of examples and explanations for key NLP concepts."
www.nltk.org/book/
>>914
I always see you post your lewd robot drawings, but you're pretty serious about making love to robots if you're on a board like this, aren't you?
>>915
If you aren't then why would you come to this place?
>>915
I guess so.
>making love
You couldn't pick a better choice of words.

Another interesting reference, closely related to NLP.
>"Computational linguistics is the scientific and engineering discipline concerned with understanding written and spoken language from a computational perspective, and building artifacts that usefully process and produce language, either in bulk or in a dialogue setting."
plato.stanford.edu/entries/computational-linguistics/

The SEP in general contains many such state-of-the-art surveys. This one even mentions vidya and VNs at some point (Section 10.6, Applications).
>>917
But sadly, you couldn't create a robowaifu that has true AI, as in it actually thinks.
It's technically possible today, but completely unfeasible. It would also be something more along the lines of GLaDOS than a normal-sized humanoid, with completely unconventional hardware and software that wouldn't function anything like what we use today.
>>918
I think as long as it's convincing enough I'd be happy with it
>tfw poor sap studying linguistics in college
>the more I learn, the more I become aware of what a fragile, complex and in the end absurd and ever-fluctuating system languages are that rely on second guessing yourself and your conversational partner at least half of the time
>grow more and more convinced that the fact we can communicate at all is either a big misunderstanding or a miracle as great as life itself
>tfw people are attempting to make machines understand the complexities of human language based on a binary system of truth values
>>920
well wha'am tryana say is 無理で、ごめん ("it's impossible, sorry")
>>920
>tfw people are attempting to make machines understand the complexities of human language based on a binary system of truth values
Better to have loved and lost, than to never have loved at all anon. Or something like that. I may never be able to understand everything in this life, but I'm going to take a whack at it!
:^)
>>920
>>919
>convincing enough
Maybe that's all we really are to each other.
>>920
The reality is that we're just brute forcing it.
While we do not exactly remember every sentence we've ever heard and read, we remember patterns and use them if we've seen them enough times to know the right context.
>>923
>Maybe that's all we really are to each other.
Heh, maybe. But I suspect intangibles are there that go deeper than just language tbh.

>>924
Yes, I think all physical life is very pattern-based and dependent on learning those patterns successfully. Fortunately the purely mathematical approach to machine learning that is now standard does this incredibly efficiently. It's the underlying basis of the recent improvements in both audio and video recognition systems, and it is the approach that will soon give us self-driving cars. All of this plays into bringing robowaifus about too ofc.
[[259
NLP course
[[1474
hackaday.com/2018/01/17/speech-recognition-for-linux-gets-a-little-closer/
>>928
Thanks Anon, great info!

blog.mikeasoft.com/2017/12/30/speech-recognition-mozillas-deepspeech-gstreamer-and-ibus/

github.com/mozilla/DeepSpeech

voice.mozilla.org/en

I'm adamantly opposed to the SJW shitfest Mozilla has become now, but the project license should allow us to do whatever we want with it AFAIK, including creating cuddly cute robowaifus. I'd like any legalanons to confirm that or no please.
github.com/mozilla/DeepSpeech/blob/master/LICENSE

on-topic related crosspost (possibly a dup ITT)
[[2908

https://www.invidio.us/watch?v=NtZipf0BxKg
via towardsdatascience.com/one-language-model-to-rule-them-all-26f802c90660?gi=29cf22c0e2d8 , which anon pointed out here. [[5085

Discusses an NLU approach by the OpenAI guys that apparently outperforms GPT-2, and is entirely unsupervised. Uses the WebText corpus.
Open file (452.37 KB 1210.6293.pdf)
I propose we approach teaching our waifus English the way an ESL student learns it: the student (casually or otherwise) needs only to learn a smaller subset of the language to become reasonably proficient, and from there can branch off into a more wide-ranging vocabulary. Since we would have to start somewhere with such an effort, I offer this list from Cambridge University as a reasonably good short list of English words to start from.
>>7005 Thanks, this might come in handy. But I think this makes more sense for the speech recognition part, since that is harder. Imagine we had a database of word definitions. We could then have a long list of words which are, for example, fruits or means of movement; I wouldn't filter out many or even any of them, but would put them into the model / graph database. Though a waifu could have smaller databases for more common things, things at home, things she experienced, more common words, ...
>>7005 Interestingly AI also learns a lot more efficiently when taught with simple lessons first but it's not really clear yet on how to structure teaching something from absolute zero knowledge beyond giving simple data first. There might come a day when we can analyze our models by seeing if they fail certain tests which will suggest which data needs to be trained on. Generative teaching networks might provide some insight here. ESL students also practice learning to discern the correct words to use by being given a sentence with masked words and multiple choices to pick from, which is similar to using a contrastive triplet loss function that learns much faster than trying to guess from scratch. I think it's going to be useful not to focus too much on language alone. We should try thinking outside the box to speed up learning. Input from people not familiar with AI might provide more useful ideas here. I've been training a tiny GPT model (80M parameters vs. the smallest 345M parameter model) and it plateaued pretty hard at a perplexity of 32. No amount of new English data could improve the model further so I started training it on C++ and Python code instead and now it's reaching 24 on English with 72 on C++ compared to 400 when it started. For those not familiar with word perplexity it just means it has the same probability of being correct as randomly picking a word from x number of words containing the correct word. Native-level perplexity is estimated to be about 12 on common English but it depends on the difficulty of the text. In 2-3 months I'll take a shot at generating a dataset with these ESL words and see if it's possible to achieve a minimal amount of perplexity using a minimal amount of data. The approach I take to learning Japanese is by learning the most frequent words first because it's the biggest return on investment and I can guess the meaning of most sentences even though I don't know that many words. 
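For anyone who wants the perplexity explanation above in concrete form, here is a minimal sketch. It assumes you already have the probability the model assigned to each correct word; the function name and the sample numbers are made up for illustration and are not from the model described in the post.

```python
import math

def perplexity(correct_token_probs):
    """Perplexity = exp of the average negative log-probability
    that the model assigned to each correct token."""
    n = len(correct_token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in correct_token_probs) / n
    return math.exp(avg_neg_log_prob)

# Hypothetical model that always spreads its guess evenly over 32 words:
# this is the "randomly picking from x words" case with x = 32.
uniform_32 = [1 / 32] * 10

# Hypothetical model that is more confident about the correct words.
sharper = [0.5, 0.25, 0.1, 0.05, 0.2]

print(perplexity(uniform_32))
print(perplexity(sharper))
```

The uniform case comes out at (approximately) 32, which matches the "picking from x words" reading, and the more confident model scores lower.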
This will be a much more intelligent approach than Big Tech's way of feeding an endless amount of uncurated data and GPUs in with no actual improvement to the models or training method. I threw an entire GB of English data at my model for example and it made no improvement at all but just 10 MB of C++ code to practice on and it started improving on English like magic. This is the kind of data we want to find. It gives me an idea to build a 'curative mentor network' that tests a student network to figure out which data the student needs to improve rapidly, rather than only attempting to invent new helpful data like a generative teaching network.
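A toy sketch of how the 'curative mentor' idea could look, under heavy assumptions: the "student" here is just a unigram word-count model with add-one smoothing standing in for a real network, and the lesson names, texts, and function names are all hypothetical. The mentor tries each candidate lesson on a copy of the student and keeps the one that lowers validation perplexity the most.

```python
import math
from collections import Counter

def train(texts):
    """'Student' stand-in: a unigram word-count model."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def perplexity(counts, text, vocab_size):
    """Perplexity of the count model on `text`, with add-one smoothing."""
    words = text.lower().split()
    total = sum(counts.values())
    log_prob = 0.0
    for w in words:
        p = (counts[w] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(words))

def pick_next_lesson(seen, candidates, validation, vocab_size):
    """'Mentor': train a trial student on each candidate lesson and
    return (validation perplexity, lesson name) for the best one."""
    scored = []
    for name, text in candidates.items():
        trial = train(seen + [text])
        scored.append((perplexity(trial, validation, vocab_size), name))
    return min(scored)

# Hypothetical data, for illustration only.
seen = ["the cat sat on the mat"]
candidates = {
    "stocks": "shares fell as markets closed lower today",
    "pets": "the dog sat on the rug and the cat slept",
}
validation = "the dog and the cat sat on the rug"

all_text = seen + list(candidates.values()) + [validation]
vocab_size = len({w for t in all_text for w in t.lower().split()})

best = pick_next_lesson(seen, candidates, validation, vocab_size)
print(best)
```

With a count model the "most helpful" lesson is simply the one closest to the validation text, so this cannot reproduce the surprising C++ effect described above; it only shows the test-then-select loop.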
>>7098
>We should try thinking outside the box to speed up learning.
OK, here's my take: As humans, we are innately wired for language in the very design of our neural tissue, probably even into the realm of the design of our DNA. This is part of us created inside us before our births even. Thereafter, as young children we begin picking up language by parroting our parents/siblings/etc. and gauging the responses socially-speaking. All of these training shots stored in our near & long-term memories are also guided by the context environmentally-speaking. In other words, you learn very young the right things to say and the right times to say them based on purely external factors. Parents auto-selecting 'baby-talk' and then progressively more complex speech may be a factor in how quickly we progress. But it's the astounding facility of the imaginations we possess that allow us to begin to quickly internalize these lessons, and then to creatively reassemble them into novel ideas (and effectively thence into an increased vocabulary) is probably an important key to understanding what differentiates us from mere animals. A materialist view that we are nothing but a mere specialization of them will be fruitless IMO, we are simply qualitatively different. This is part of what sets us apart, and it will behoove us in our efforts here to try to unravel what this distinction is and how to capitalize on it.
>perplexity
So just to check if I'm getting some idea of the meaning, a lower score is indicative of more accuracy inside the model?
>so I started training it on C++ and Python code instead
<[desire to know more intensifies]*
>The approach I take to learning Japanese is by learning the most frequent words first because it's the biggest return on investment and I can guess the meaning of most sentences even though I don't know that many words.
Seems sensible on the surface of it to me.
>This is the kind of data we want to find.
<again, gibb more details plox.
>This is the kind of data we want to find. It gives me an idea to build a 'curative mentor network' that tests a student network to figure out which data the student needs to improve rapidly, rather than only attempting to invent new helpful data like a generative teaching network.
This sounds both interesting and innovative, but I have little idea what that even means, much less how it would be implemented. I just hope we can create robowaifus that can interact immersively with us using nothing but their internal, mobile, onboard computation systems. Just like on my favorite Chinese Cartoons.
>>7103 >But it's the astounding facility of the imaginations we possess that allow us to begin to quickly internalize these lessons, and then to creatively reassemble them into novel ideas (and effectively thence into an increased vocabulary) is probably an important key to understanding what differentiates us from mere animals. This gives me an idea to train GPT2 on its own imagined predictions. A generative adversarial loss could also be added so predictions become more natural over time. >In other words, you learn very young the right things to say and the right times to say them based on purely external factors. Something I've been working to implement is giving rewards for each word generated. If the model generates a mostly good sentence but messes one word up, it can get direct and specific feedback on what was good and what was bad. Suggestions on which words could be used in place of a mistake could also be given as feedback rather than just a meaningless reward. Eventually I'd like to expand this to a completely non-linear reward function generated from human feedback so you can just say 'stop doing that' or 'I like it when you say that' and the model will understand and adjust itself accordingly, rather than the ridiculous approach in RL right now to maximize gathering as much low-hanging fruit as possible. I think it's a big mistake trying to automate everything. A model can only be as good as its loss function and there is no better source of data than a live human being to interact with. Machine learning has everything backwards by trying to generate context from rules created from lifeless data. The way people actually behave is by generating rules from the context. AI needs to be capable of adjusting itself on the fly according to the context rather than being stuck approaching everything the same way. Even with our own great capacity to learn and adapt, children raised in the wild are almost impossible to rehabilitate into society. 
AI behaves in a similar way. When a model becomes habituated or overtrained on a certain type of data it becomes very difficult to retrain it without re-initializing weights. Most of our intelligence isn't really innate but instilled in us through our culture and interactions with others. If someone lived in the wild all their life, they wouldn't even dream of landing a rocket on Mars or creating a society of catgirl meidos because the mind is limited to recombining and remixing the data it has gathered.
>So just to check if I'm getting some idea of the meaning, a lower score is indicative of more accuracy inside the model?
Yeah. The idea is to create a teacher network of sorts that arranges the data into a curriculum by testing different arrangements and seeing which order of data the student networks achieve the lowest perplexity/most accurate predictions with, similar to generative teaching networks but using pre-existing data instead of generated. AI models learn best when they learn simple skills first and progress to much harder ones. They also seem to benefit from learning unrelated skills such as training on C++ code. I'm not sure what it's actually learning from them to achieve a breakthrough but I imagine it's similar to how we learn to do new things like cook food in a microwave by someone noticing a chocolate bar melting in his pocket next to a radar tube, or how doing basketball drills looks nothing like basketball but improves a person's game.
>I just hope we can create robowaifus that can interact immersively with us using nothing but their internal, mobile, onboard computation systems.
Same, we got a lot of work to do though to make them that efficient. I think we will look back at these times and laugh at AI algorithms of today like they're vacuum tubes. Recently I figured out how to translate a dense n-dimensional encoding into any arbitrary new dense encoding with perfect accuracy. 
However ideally it requires O(2^n) parameters to solve it in a few thousand steps. I managed to reduce this down to O(n^2) and still eventually reach 100% but it learns far slower and is still too many parameters. For GPT2 to perform well the embedding needs at least 1024 dimensions. With my techniques I've managed to bring this down to 256 and achieve encouraging results so far but it's still going to take weeks to find where it bottoms out. Despite this setback, training vanilla GPT2 from scratch on this GPU would take over 1400 years, so I think I'm on the right track at least. I don't even like GPT2 because it uses so many parameters and has no useful holistic hidden representation of the context, but tinkering around with it and having so many constraints to work around has taught me a lot.
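The per-word reward idea from earlier in this reply could be read as a reward-weighted log-likelihood with one term per generated word. The sketch below is only one possible reading, not the poster's implementation; the sentence, probabilities, and rewards are invented for illustration.

```python
import math

def per_token_loss(token_probs, rewards):
    """Reward-weighted negative log-likelihood, one term per word.
    Minimising a term with reward +1 pushes that word's probability up;
    minimising a term with reward -1 pushes it down."""
    if len(token_probs) != len(rewards):
        raise ValueError("need one reward per generated token")
    return [-r * math.log(p) for p, r in zip(token_probs, rewards)]

# Hypothetical generated sentence with one bad word at the end.
sentence = ["I", "like", "green", "tea", "yesterday"]
probs = [0.9, 0.6, 0.3, 0.7, 0.2]        # model's probability for each word
rewards = [1.0, 1.0, 1.0, 1.0, -1.0]     # human feedback per word

losses = per_token_loss(probs, rewards)
for word, loss in zip(sentence, losses):
    print(f"{word:10s} {loss:+.3f}")
```

Only the penalised word gets a negative term, so the feedback stays specific to the mistake instead of punishing the whole sentence.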
I'm currently working on my project, which uses a scripted system as its outer layer >>7605 but of course will use other programs to analyse inputs and create its own responses. I'm currently looking into descriptions and tutorials of existing software and libraries which implement methods to parse and analyze language. I just want to put some notes in here, to keep others from reimplementing things which are already there and to show what's already easily possible.
Recently I've been going through Python NLTK. You can generally get help within a Python shell like ipython by using help(nltk) after "import nltk", and then for submodules with help(nltk.submodule).
For today, here are just some examples of things which are already solved:
- Calculating the frequency distribution of tokens, e.g. words.
- Categorized word lists such as stopwords ("the", "of", "a", "an") come with it, and these can be filtered out of the results of a frequency distribution.
- Tokenization is the term for splitting texts. This can be done per word or per sentence. Unlike a simple split function it recognizes things like "Mr." as a word of its own, not the end of a sentence. It therefore behaves differently from language to language.
- You can also get synonyms or antonyms (opposite words) from NLTK in very few lines, and the same goes for definitions and examples of a chosen term.
- Removing affixes from words and returning the root word is called (word) stemming. NLTK can also do that for us. E.g. dance is the stem of dancing.
Here is a tutorial describing these things in detail, often with only a few lines of code per problem: https://likegeeks.com/nlp-tutorial-using-python-nltk/
I will go on with my list here another day and write about my experiences.
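To show what the first two items on the list amount to, here is a standard-library-only sketch of a frequency distribution with stopword filtering. It is not NLTK's implementation: the tokenizer is a naive regex and the stopword set is a tiny hand-written assumption. In NLTK itself the corresponding pieces are nltk.FreqDist and nltk.corpus.stopwords (the latter needs its data downloaded first).

```python
import re
from collections import Counter

# Tiny hand-written stopword list, for illustration only.
STOPWORDS = {"the", "of", "a", "an", "and", "on", "is"}

def tokenize(text):
    """Naive word tokenizer: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_distribution(text, stopwords=STOPWORDS):
    """Count how often each non-stopword token occurs."""
    tokens = [t for t in tokenize(text) if t not in stopwords]
    return Counter(tokens)

text = "The cat sat on the mat. The dog sat on the cat."
freq = frequency_distribution(text)
print(freq.most_common(3))
```

A regex split like this would mishandle abbreviations such as "Mr.", which is exactly the case the tokenization item above says a proper tokenizer takes care of.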
>>7837
>- Tokenization is the term for splitting texts. This can be done per word or sentence. Unlike a simple split function it acknowledges things like "Mr." as a word on its own, not a sentence. Therefore it behaves differently from language to language.
Ahh, makes sense. Thanks for taking the time to explain that Anon.
>I will go on with my list here another day and write about my experiences.
Look forward to it, thanks for the link.
What is your opinion on using higher-order Markov chains for text generation? Would it be useful, or is it too simplistic?
>>8200
>using higher-order Markov chains for text generation
Sure why not?
>Would it be useful
Yes, ofc.
>or is it too simplistic?
Again, yes. All NN-based approaches are too simplistic for now. GPT-3 is the best we've done atm, yet it's entirely brain-dead and doesn't 'understand' a word it's saying. Its behaviour is simply a yuge statistical model based on actual humans' text output (well, mostly humans heh).
Just start where you're at Anon. You'll get better as you go along. If you have an implementation of higher-order Markov chains for text generation that you understand well enough to modify successfully to more closely align with your goals, then use it! You're already in the top 1/2 of 1 percent of the population in this regard. BTW, please share your progress and code here on /robowaifu/ as you go along, thanks.
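For anyone who wants a starting point, here is a minimal sketch of an order-n Markov chain text generator using only the standard library. The function names and the toy corpus are made up for illustration; a real run would want a much larger corpus.

```python
import random
from collections import defaultdict

def build_chain(words, order=2):
    """Map each run of `order` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, length=20, seed=None):
    """Start from a random state and repeatedly sample a follower word."""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    output = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state was only seen at the end of the corpus
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)

# Toy corpus, for illustration only.
corpus = ("the cat sat on the mat and the dog sat on the rug "
          "and the cat saw the dog on the mat").split()

chain = build_chain(corpus, order=2)
print(generate(chain, length=15, seed=0))
```

Raising `order` makes the output more coherent but closer to copying the corpus verbatim, which is the usual trade-off with higher-order chains.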
>ivan-bilan/The-NLP-Pandect
>A comprehensive reference for all topics related to Natural Language Processing
>This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.
https://github.com/ivan-bilan/The-NLP-Pandect
>>8496 Thanks Anon, repo cloned. That's a lot of links!
>>8496 Thanks, I discovered EleutherAI through your post. Apparently it's a group attempting to create a fully open-source alternative to OpenAI's GPT-3. We wish Godspeed to them, obviously. https://eleuther.ai/ https://github.com/EleutherAI
