I'm not well versed in data structure but curious when do you use large quantity of small size data sets or low quantity of big size data sets for storing information as there are so many ways to store it. What are the pros and cons of using each and how do you see fit when to apply which?
I know this post is kinda old, but some resources suggests that having small blocks of data is better in terms of performance, since it is more likely to fit more data in memory pages and cache lines for access. Though I can see the performance impact being minimal using big blocks of data.
First of all, what exactly is a "block"? If you have a contiguous region of 4096 bytes, do you have a single 4096-byte block, or 128 32-byte blocks?
Second, for cache performance the size of your "blocks" don't matter as much as your access patterns. In particular, the cache works most efficiently when you
1. access the same data over and over again, or
2. access different but contiguous data (e.g. when incrementing all the bytes in a 1 GiB buffer).
The cache is most efficient when you have to jump around in memory with no pattern.
An example of a cache-optimize data structure is the unrolled list. https://en.wikipedia.org/wiki/Unrolled_linked_list
In the particular case of an unrolled list, as a rule of thumb the bigger (in terms of bytes) the nodes the more efficiently the cache behaves (during traversal). There are exceptions to this, so when optimizing cache performance the advice is to always profile.
Originally, I was meant to say given that if you have a fixed size of data that you need to store, would there be any differences, whether it be performance, if you use a smaller data type rather than larger ones to store it. (edit: I didn't know how to describe it when I first think about this, my first post is worded very poorly, I'm sorry)
I do understand that cache works most efficiently with spatial and temporal locality. I was thinking about if there is a clear difference like a guideline when to use which if you divide data into smaller chunks or larger chunks. I guess profiling code is the only way to know for sure if that impacts performance.
if you have a fixed size of data that you need to store, would there be any differences, whether it be performance, if you use a smaller data type rather than larger ones to store it.
Well, this is quite different from what you asked in the first post.
Generally speaking, its worthwhile to use the smallest data type capable of representing the full range of values you'll work with, particularly if we're talking about a data structure that may hold arbitrarily-many elements.
Sometimes there are exceptions to this. For example, if your data type is bool, timetimes it makes sense to use a bit-packed data structure (such as std::vector<bool>) because it reduces the space cost, and other times it makes sense to use whole bytes per bool, because it requires less memory bandwidth (changing every bit in a bit-packed octet may cost up to 8 times more accesses, depending on how smart you were when implementing your traversal function).
Thank you for your comprehensive answer. It really helps a lot. I'm sorry that I didn't specify clearly in my first post, I didn't know how to word it before.
I thought about it some more, and I realized what I said earlier about memory bandwidth is actually incorrect. std::vector<bool> and std::vector<unsigned char> should take about the same memory bandwidth to change the bits of an octet.
I realized I don't really know when one should bit-pack and when one shouldn't.
A case might be made for bit-packing if you've got something like a 65-byte structure and a 64-byte cache-line.
If there was a way to shrink that structure to fit in a single cache line, you could quite possibly reduce memory pressure and eliminate false sharing. Ideally you could do this by rearranging members, but if not, perhaps bit-packing is an option. For example, there is often space for a flag (or a few) in the low bits of a pointer. (Of course, this ancient hack is strongly discouraged.)
A useful tool for optimizing such things is pahole.
While you're busy packing your data, you might be expanding the number of instructions(RISC) or more complicated instructions(CISC) you need to fetch to deal with the compressed data. https://en.wikipedia.org/wiki/Space%E2%80%93time_tradeoff
> large quantity of small size data sets or vice versa?
Choose what is logically consistent with the problem to be solved, so you have the most chance of arriving at a bug-free solution in the least amount of time.
It is premature optimisation disease to even think about this until you have all these things in place:
- a complete program (to make sure all the elephants are in the room)
- a set of test cases (so you know that any attempts to improve things don't break anything)
- real world data (it's what your users care about)
- a source control system (so you can easily revert if your optimisation attempts fail).
- a profiler (so you can find actual elephants).
See also https://shed.bike/
Which is in summary is focussing obsessively over the trivial whilst neglecting the important.