If you use the same time reference for both, you can measure each speaker response separately and then calculate the sum in REW. I'm not quite sure what you are describing as a hypothesis, though, so let's break this down.
Do you understand and agree with my key point? That is, if you were able to EQ the amplitude response of each speaker at the listening position to be identically smooth (flat or whatever shape you want), the sum would not be flat if there were differences in the phase responses of the two. This is mathematical, and therefore acoustical, fact, not some sort of hypothesis.
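To put some numbers on that, here is a minimal sketch of the vector (complex) sum of two equal-amplitude signals at a single frequency. The function name `summed_level_db` and the phase differences tried are my own illustrative choices, not anything taken from REW; it just shows how the summed level depends on the phase difference even when each speaker on its own measures the same.

```python
import cmath
import math


def summed_level_db(phase_diff_deg, amp_left=1.0, amp_right=1.0):
    """Level of the left+right sum, in dB relative to the left speaker alone.

    Hypothetical helper for illustration: each speaker is treated as a
    single phasor (amplitude and phase) at one frequency.
    """
    left = complex(amp_left, 0.0)
    right = cmath.rect(amp_right, math.radians(phase_diff_deg))
    magnitude = abs(left + right)
    if magnitude < 1e-12:
        # Complete cancellation: the level in dB is minus infinity.
        return float("-inf")
    return 20.0 * math.log10(magnitude / amp_left)


# Same amplitude from each speaker, different phase relationships.
for diff in (0, 60, 90, 120, 150, 180):
    print(f"{diff:3d} deg -> {summed_level_db(diff):6.2f} dB")
```

In phase, the sum is about 6 dB above one speaker alone; at 120 degrees it is no louder than one speaker; at 180 degrees it cancels completely. If the phase difference changes with frequency, the summed response follows those changes even though each individual response is smooth.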
To take an extreme example just to maybe help explain the point - if you have an equal amplitude from each speaker but they end up being perfectly out of phase at the listening position, then they will sum to zero. Consider panning the signal level from all on the left, through equal on each channel, to all on the right. What would be heard at the listening position would be that the sound level will start at one level and gradually drop as the signal moves to the center, and then rise again as it's moved to the right to get back up to the starting level. For the low-bass frequencies I'm talking about this is purely a level change, not something that affects the direction the sound appears to be coming from.
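The panning example can be sketched the same way. This assumes a simple linear pan (left gain 1 - pan, right gain pan), which is my assumption for illustration only; real mixers use various pan laws. The name `panned_sum_magnitude` is likewise hypothetical.

```python
import cmath
import math


def panned_sum_magnitude(pan, phase_diff_deg=180.0):
    """Magnitude of the summed signal at the listening position.

    pan = 0.0 is all left, 0.5 is centre, 1.0 is all right.
    Assumes a linear pan law (an illustrative assumption) and a fixed
    phase difference between the two speakers at the listening position.
    """
    left = complex(1.0 - pan, 0.0)
    right = cmath.rect(pan, math.radians(phase_diff_deg))
    return abs(left + right)


# Sweep the pan from left to right with the speakers perfectly out of phase.
for pan in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"pan {pan:4.2f}: out of phase {panned_sum_magnitude(pan):.2f}, "
          f"in phase {panned_sum_magnitude(pan, 0.0):.2f}")
```

With the speakers out of phase the level falls from 1 at hard left to 0 at centre and climbs back to 1 at hard right, whereas with them in phase it stays at 1 throughout under this pan law.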
You very likely don't normally look at the phase responses but, for info, when the measured amplitude responses for the two speakers are different it is basically guaranteed that the phase responses will differ too. It's a consideration I've likely looked at more as it affects my choice of crossover frequency between main speakers and subwoofer. I first realised the issue when considering EQ for my own speakers but then switched to using a subwoofer, which removed the complication.
For clarity, let me also stress that I have not dogmatically said something like 'everyone must EQ low-bass based on the sum of left and right speaker signals'. What I've done is explain that there is reason to consider both the individual and summed signals. That said, the nature of low-bass signals in what I suspect is most recordings does make me lean towards prioritising the combined response. What this certainly means is that I won't pronounce as 'wrong' someone who does this and likes the result.
I'm afraid what I've shared above has come from my own brain, based on application of my own knowledge and experience, rather than from some particular source that I can point you to.