OpenAI’s gpt-image-2 Has a Secret World

The Accident

I would love to say this started with a careful research protocol, a crisp hypothesis, and a whiteboard full of sober little boxes.

It did not.

I was testing my Shellsensor virtual-world project, and I was annoyed. The idea was simple enough: treat an image model like a slow-motion spatial renderer. Move forward a little, render the next view, carry the world state, repeat. A kind of tiny Google Maps for a fictional place, except the map is generated one frame at a time and the model occasionally decides that doors are more of a vibe than a contract.

The app version was wobbling. Rooms drifted. Objects migrated. I could not tell whether my harness was failing, whether the model was failing, or whether both of us had entered into an unhelpful partnership.

So I did the lazy test. I opened the ChatGPT mobile app, grabbed a generated suburban image, made sure image creation was selected, and typed the kind of instruction that sounds too dumb to be useful: step forward 20 steps.

It worked. Not perfectly, but enough. The scene moved forward. The geometry mostly held. The model did not dissolve into abstract wallpaper. It kept giving me more of the place.

Generated seed image showing a suburban garage and a fictional man in the scene. — Exploration 00, frame 0000. The generated garage seed that became the baseline world.

Then John Appeared

The strange part was not that the image could move. Image models can already preserve more local structure than people expect. The strange part was that the world started becoming socially specific.

There was a man in the scene. I did not name him at first. The harness eventually settled on John, which is exactly the kind of name a generated suburban man would have if a model were quietly trying to avoid calling attention to itself.

John had a house. John had a garage. John had doors that had to be opened and closed. He had a phone, and the phone had messages. He had errands. He had implied relationships. The whole thing started to feel less like walking through a static picture and more like bumping into the boring accounting layer of a small generated life.

That does not mean John is real. It means the model has learned an enormous amount of domestic pattern. If you ask it to keep a world coherent long enough, it does not drift toward pure weirdness. It often drifts toward the most statistically overbuilt environment imaginable: suburbia.

The world did not become magical. It became a place with a garage door, a phone reminder, and the looming threat of errands.

Generated close-up of John holding a phone in the garage with a reminder visible. — Exploration 00, frame 0002. The phone becomes part of the action surface, not just a prop.

The Harness

After the mobile test, I stopped treating this as a cute party trick and wrote an autonomous world explorer. The harness keeps a world-state summary, picks a plausible next action, renders a frame, analyzes the resulting image, and folds that analysis back into the world state before the next step.

That loop sounds clean when written in one sentence. In practice it is a pile of small arguments with reality. What location is John in? Is the phone visible? Did he actually send the message, or did the model merely imply that a message was sent because phones usually lead to messages? Is the truck in the driveway, the garage, the parking lot, or a theological superposition of all three?
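For readers who think better in code, here is a minimal sketch of that loop. It is not the actual harness: the names WorldState, explore, pick_action, render, and analyze are hypothetical, and the three demo functions are toy stand-ins for what would really be a planner, an image-generation call, and a vision-analysis call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WorldState:
    """Hypothetical world-state summary carried from frame to frame."""
    location: str = "garage"
    facts: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


def explore(
    state: WorldState,
    pick_action: Callable[[WorldState], str],
    render: Callable[[WorldState, str], bytes],
    analyze: Callable[[bytes], Dict[str, str]],
    steps: int,
) -> WorldState:
    """Run the pick -> render -> analyze -> update loop for a fixed number of steps."""
    for _ in range(steps):
        action = pick_action(state)      # choose a plausible next action
        frame = render(state, action)    # assumed: an image-model call in a real harness
        observed = analyze(frame)        # assumed: a vision-model call in a real harness
        # Trust what the frame shows, not what the action intended.
        state.location = observed.get("location", state.location)
        state.facts.update(observed)
        state.history.append(action)
    return state


# Toy stand-ins so the sketch runs without any model access.
def demo_pick(state: WorldState) -> str:
    return "open garage door" if not state.history else "walk to driveway"


def demo_render(state: WorldState, action: str) -> bytes:
    return f"{state.location}|{action}".encode()


def demo_analyze(frame: bytes) -> Dict[str, str]:
    location, action = frame.decode().split("|")
    if action == "walk to driveway":
        location = "driveway"
    return {"location": location, "last_seen_action": action}


final = explore(WorldState(), demo_pick, demo_render, demo_analyze, steps=2)
```

The one design choice worth noting is that the state is updated from the analysis of the rendered frame rather than from the requested action. That is where questions like "did he actually send the message?" get decided.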

The interesting work is not just generation. It is accounting. The harness has to remember completed steps, suppress repeated actions, preserve actor identity, track objects, and prevent the model from skipping directly to the emotionally convenient part of the story. A world is not a pretty picture. A world is a ledger of boring commitments.
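To make "ledger" concrete, here is a small sketch of the bookkeeping described above, under the assumption that actions are plain strings and preconditions are just earlier actions. The Ledger class and its propose and commit methods are hypothetical names for illustration, not the harness's real interface, and actor identity is left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Ledger:
    """Hypothetical record of completed steps and object locations."""
    completed: List[str] = field(default_factory=list)
    objects: Dict[str, str] = field(default_factory=dict)  # object name -> location

    def propose(self, action: str, requires: Tuple[str, ...] = ()) -> str:
        """Classify a candidate action before any frame is rendered."""
        if action in self.completed:
            return "repeat"  # suppress actions that already happened
        missing = [step for step in requires if step not in self.completed]
        if missing:
            # The model is trying to jump ahead of a boring transition.
            return "skipped:" + ",".join(missing)
        return "ok"

    def commit(self, action: str, moves: Optional[Dict[str, str]] = None) -> None:
        """Record a finished action and any object relocations it caused."""
        self.completed.append(action)
        self.objects.update(moves or {})


ledger = Ledger(objects={"truck": "garage"})

# Driving away before the door is open is the emotionally convenient shortcut.
early = ledger.propose("drive to store", requires=("open garage door",))

ledger.commit("open garage door")
allowed = ledger.propose("drive to store", requires=("open garage door",))
ledger.commit("drive to store", moves={"truck": "parking lot"})

# Opening the same door again should be flagged, not rendered.
again = ledger.propose("open garage door")
```

In this sketch, early is rejected for a missing precondition, allowed passes once the door step is committed, and again is flagged as a repeat. The truck ends up in exactly one place, which is more than the model will promise on its own.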

Working definition: John’s World is not a game engine. It is a generated continuity experiment where each frame tries to preserve enough state that the next frame has something to inherit.

Why This Is Worth Writing Down

The tempting version of this story is, “Look, the model invented a little world.” That is fun, but it is also too soft. The useful version is more specific: image models can sometimes hold a surprising amount of implicit world grammar, but they do not automatically know which facts matter across time.

They remember texture better than obligation. They preserve vibes better than object identity. They know a grocery errand has a store, a basket, a parking lot, and a car, but they will happily fold the boring transitions into a single jump if the harness lets them.

That makes John’s World useful. It is a weird, domestic stress test for continuity. The failures are not random noise. They are pressure points: object tracking, transition accounting, repeated preconditions, vehicle identity, phone-state drift, and the model’s habit of completing a story beat before the system has earned it.

So this section is the lab notebook. The images are generated. The man is fictional. The suburbia is synthetic. The failure modes, however, are real enough to be useful.