>Do you mean like an n-gram model or something else?
No, not really as I only found out about the term from your post, heh.
I'm simply describing the approach of iteratively building ever-higher-dimensional data structures out of standard-library C++ containers like std::map, std::vector, std::set, etc.
There are also other structures like tries available elsewhere as well.
Not only are these data structures entirely generic in their design (they work exactly the same for a foo data object as they do for a bar data object), in general they are also highly tuned for efficient access and processing.
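To make the "generic" point concrete, here is a minimal sketch. The names tally and tally_example are made up for illustration; the only requirement on the element type is that it can be ordered with operator<.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Counts occurrences of each distinct item. T only needs operator<.
template <typename T>
std::map<T, std::size_t> tally(const std::vector<T>& items) {
    std::map<T, std::size_t> counts;
    for (const T& item : items) {
        ++counts[item];  // operator[] value-initializes a missing count to 0
    }
    return counts;
}

// Usage: the same code handles strings and ints alike.
void tally_example() {
    const auto words = tally(std::vector<std::string>{"foo", "bar", "foo"});
    assert(words.at("foo") == 2);
    assert(words.at("bar") == 1);

    const auto numbers = tally(std::vector<int>{3, 1, 3, 3});
    assert(numbers.at(3) == 3);
    assert(numbers.size() == 2);
}
```

Nothing in tally knows or cares what T is, which is the whole appeal.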
So for example, suppose I had a std::unordered_map (the standard library's hash map; std::map proper is an ordered tree) of 1'000'000 unique English words from, say, Wikipedia, all nicely compacted into hashmap form (probably a matter of some tens of megabytes of active RAM, once per-node overhead is counted).
Structurally, I could use the string hash as the key, and then a std::pair of std::sets of std::pairs of unsigned ints (which gives a range of 4 billion elements on either side of the association) as the overall value for the map. These indexes would store the words/counts of connected words. The outer pair would keep one set for all preceding words/counts, and one set for all following. (My apologies if this is confusing in text. In code it would be a matter of two, three lines.)
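A rough sketch of that layout, under a few assumptions of mine: the hash map is std::unordered_map keyed by the word itself (the container does the hashing internally), and each word gets a numeric id from a hypothetical Vocab helper. The names WordId, Vocab, intern, bump, and add_text are all made up for illustration. The type aliases are the "two, three lines"; the rest just fills the structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using WordId     = std::uint32_t;
using Link       = std::pair<WordId, std::uint32_t>;  // (neighbour id, count)
using LinkSet    = std::set<Link>;
using Neighbours = std::pair<LinkSet, LinkSet>;       // (preceding, following)
using WordMap    = std::unordered_map<std::string, Neighbours>;

// Hypothetical helper (not from the post): one numeric id per distinct word.
struct Vocab {
    std::unordered_map<std::string, WordId> id_of;  // word -> id
    std::vector<std::string> word_of;               // id -> word

    WordId intern(const std::string& word) {
        const auto it = id_of.find(word);
        if (it != id_of.end()) return it->second;
        const WordId id = static_cast<WordId>(word_of.size());
        id_of.emplace(word, id);
        word_of.push_back(word);
        return id;
    }
};

// Set elements are immutable, so raising a count is erase + reinsert.
void bump(LinkSet& links, WordId id) {
    std::uint32_t count = 0;
    const auto it = links.lower_bound(Link{id, 0u});
    if (it != links.end() && it->first == id) {
        count = it->second;
        links.erase(it);
    }
    assert(count < std::numeric_limits<std::uint32_t>::max());
    links.emplace(id, count + 1);
}

// Records, for every token, which word came before it and which came after.
void add_text(const std::vector<std::string>& tokens, Vocab& vocab, WordMap& index) {
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        vocab.intern(tokens[i]);
        Neighbours& around = index[tokens[i]];
        if (i > 0) bump(around.first, vocab.intern(tokens[i - 1]));
        if (i + 1 < tokens.size()) bump(around.second, vocab.intern(tokens[i + 1]));
    }
}
```

A std::map<WordId, std::uint32_t> per side would make the count update simpler than erase + reinsert; the set-of-pairs form is kept here only because it matches the description above.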
Then once I had searched that hashmap structure for any particular given word, accessing the index keys for all preceding words for that search word is just a matter of, say, 10 machine instructions or so, probably even fewer depending on ISA. So, quite efficient. Locating all the following words would use the same type of mechanism.
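The lookup side could look something like this. It's only a sketch: the type aliases and the names preceding, following, and most_frequent are my own, and the hash map is assumed to be std::unordered_map keyed by the word. The cost is one hash lookup, after which the preceding or following set is just a member of the value that was found.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>

using Link       = std::pair<std::uint32_t, std::uint32_t>;  // (word id, count)
using LinkSet    = std::set<Link>;
using Neighbours = std::pair<LinkSet, LinkSet>;              // (preceding, following)
using WordMap    = std::unordered_map<std::string, Neighbours>;

// One hash lookup; .first of the found value is the set of preceding words.
const LinkSet* preceding(const WordMap& index, const std::string& word) {
    const auto it = index.find(word);
    return it == index.end() ? nullptr : &it->second.first;
}

// Same mechanism, other half of the pair. Returns nullptr for unknown words.
const LinkSet* following(const WordMap& index, const std::string& word) {
    const auto it = index.find(word);
    return it == index.end() ? nullptr : &it->second.second;
}

// Entry with the highest count. Precondition: links is non-empty.
Link most_frequent(const LinkSet& links) {
    assert(!links.empty());
    return *std::max_element(
        links.begin(), links.end(),
        [](const Link& a, const Link& b) { return a.second < b.second; });
}
```

Note that most_frequent is a linear scan, since the sets are ordered by word id rather than by count.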
Not only is storing a big set of strings efficient in such a container, but accessing or striding into other, related dimensions on that data is also pretty efficient, too. And again, it's an entirely generic proposition; the data's type is pretty much irrelevant for it to all just werk.
None of this requires any deep expertise; you can have industrial-strength data processing in just minutes of simple effort. After that it's all creativity.