A few things here (from someone who works in the industry and uses ChatGPT and ChatGPT-like models daily).
First, GPT-3.5 (the free version) is very, very different from GPT-4. GPT-4 solves all kinds of problems that 3.5 can only guess at. If you have $20 to spare, sign up for the Plus membership and give it a shot. I think you might be surprised at the difference in quality/skill of the model. If you post an anonymized version of the problem here I’m happy to check it for you.
Second, I know it seems unintuitive, but logic/math problems are actually the worst fit for this type of model. I don’t think there’s a super satisfying non-technical explanation for why, but it basically comes down to the fact that the model learns about logic entirely through language. It has no built-in “logic circuit” of its own. The way these models are trained is “given part of a sentence, predict the next word”. In some cases that next word is hard to guess just from looking at the words in the sentence, for example when it’s a sentence the model has never seen before. So the model does, to some extent, learn the meaning of words and how concepts relate to each other, because it has to if it wants to guess the next word correctly. You could imagine an example like “If I have a feather on a plate above a bed, and turn the plate upside down and right side up again, the feather is on the…”. The correct answer, “bed”, is really only obvious if you already know what plates are, what feathers are, what gravity is, and so on.
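If you want to see what “predict the next word” actually looks like, here’s a rough sketch using a small open model (GPT-2 via the Hugging Face transformers library). To be clear, this is just my illustration of the training objective with a model you can run locally, not how GPT-4 works internally, and the prompt is made up:

```python
# Rough sketch: next-word prediction with a small open model (GPT-2).
# Illustrates the "predict the next word" objective, nothing more.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "I turn the plate upside down, so the feather falls onto the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Look at the distribution over the very next token and print the top 5 guesses.
next_token_logits = logits[0, -1]
top = torch.topk(next_token_logits, k=5)
for token_id, score in zip(top.indices, top.values):
    print(f"{tokenizer.decode(int(token_id))!r}  (score: {score.item():.2f})")
```

Everything the model “knows” about plates, feathers, and gravity has to be squeezed into making that one guess as good as possible, over and over, across a huge amount of text.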
Another thing that can help these types of models is asking them to “reason about the problem step by step”. This is because the models have no “internal monologue”: what you see on the screen is their entire “thought process”. So if you ask the model to “think out loud”, it will sometimes do better. A step further is to ask GPT to write a program that figures out the answer. Sometimes it can write a correct program even when it couldn’t work out the answer directly.
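Here’s a minimal sketch of what I mean, using the openai Python package (the v1-style client; the exact interface depends on which library version you have installed, and “gpt-4” is just a placeholder model name). The only difference between the three calls is the wording of the prompt:

```python
# Minimal sketch of the prompting styles described above.
# Assumes the openai package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

question = ("If I have a feather on a plate above a bed, and turn the plate "
            "upside down and right side up again, where is the feather?")

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# 1. Direct question.
print(ask(question))

# 2. Ask the model to "think out loud" before answering.
print(ask(question + " Reason about the problem step by step before answering."))

# 3. One step further: ask for a program instead of an answer.
print(ask("Write a short Python program that determines the answer to this "
          "question, then show its output: " + question))
```

Nothing fancy is going on there; in my experience the step-by-step wording alone is often the difference between a blind guess and a correct chain of reasoning.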