kliepatsch

Brute force Double PETSCII search with Python


Hi,

following up on the brief discussion in this thread

I wanted to see how far I can get with image to double PETSCII conversion. As a first step I wrote a converter to single PETSCII in Python. It works very well and takes about 6 seconds on my machine to find a very good solution. It can pick a custom color palette that works well for the input image.

As the second step, I wanted to write a brute force double PETSCII search and see how far I could optimize it. My brute force method tries all possible combinations of foreground and background characters, picks the closest colors from the palette for each of them, and keeps the best combinations. Although it is very inefficient, this is the gold standard to compare other double PETSCII conversion algorithms against.
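To illustrate the idea, here is a toy sketch of the brute-force search for a single 8x8 tile, with a tiny made-up font (16 characters instead of 256) and a random palette. All names and sizes here are illustrative, not taken from the attached script:

```python
import numpy as np

# Toy brute-force search for one 8x8 tile: try every foreground/background
# character pair and pick the closest palette color for each pixel region.
rng = np.random.default_rng(0)
charset = rng.integers(0, 2, size=(16, 8, 8)).astype(bool)  # character bitmasks
palette = rng.integers(0, 256, size=(8, 3)).astype(float)   # 8 RGB palette colors
tile = rng.integers(0, 256, size=(8, 8, 3)).astype(float)   # pixels to match

def region_error(pixels):
    """Squared error of the best single palette color for these pixels."""
    if len(pixels) == 0:
        return 0.0
    d = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2)
    return d.sum(axis=0).min()

best_err, best_pair = np.inf, None
for fg in range(len(charset)):
    for bg in range(len(charset)):
        fg_mask = charset[fg]
        bg_mask = ~fg_mask & charset[bg]   # background char shows through holes
        rest = ~fg_mask & ~charset[bg]
        err = (region_error(tile[fg_mask])
               + region_error(tile[bg_mask])
               + region_error(tile[rest]))
        if err < best_err:
            best_err, best_pair = err, (fg, bg)
```

The real search does this for all 256x256 character pairs and for every tile, which is where the cost comes from.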

A naive single-threaded implementation takes about 25 minutes to convert an image. I am afraid that none of the parallelization techniques I tried worked as well as I had hoped. Even with 8 processes on my machine with 8 CPUs, it still takes more than 10 minutes to complete. It seems that on my laptop the search algorithm is severely memory-bandwidth limited. (I tried the Python modules "threading", "multiprocessing", and also "ray", and found no significant difference in performance between them.)
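For reference, the kind of splitting I mean looks roughly like this (not the attached script; the per-character "cost" is a deterministic placeholder). multiprocessing.dummy provides the same Pool API backed by threads; swapping the import to multiprocessing gives real processes:

```python
from multiprocessing.dummy import Pool  # thread-backed Pool with the same API

def search_chunk(char_range):
    # (c * 37) % 101 is a stand-in for the real per-character matching error
    best = min(char_range, key=lambda c: (c * 37) % 101)
    return ((best * 37) % 101, best)

chunks = [range(i, i + 32) for i in range(0, 256, 32)]  # 8 chunks of 32 chars
with Pool(8) as pool:
    results = pool.map(search_chunk, chunks)
err, char = min(results)  # overall best across all chunks
```

With the real workload, each worker would run the matching on its chunk of the character space and only the small (error, index) tuples travel back.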

Anyway, find the script plus supporting files in the attachment. To run it, you need Python3 with the following packages installed: numpy, pillow, scipy, matplotlib

brute_force_double_petscii.zip


The foreground layer can use any of the 256 colors, while the background layer can only use the first 16 colors; both foreground and background must be set. My script determines a palette and finds the best solutions using that palette. Improvements could be made by finding a better or different palette.
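A toy illustration of this color constraint (the palette here is random; the real script derives one from the image):

```python
import numpy as np

# Foreground may use any of the 256 palette entries; background only the
# first 16. Picking a color is just a nearest-neighbor lookup in RGB space.
rng = np.random.default_rng(1)
palette = rng.integers(0, 256, size=(256, 3)).astype(float)

def nearest(color, n_allowed):
    """Index of the closest palette color among the first n_allowed entries."""
    d = ((palette[:n_allowed] - color) ** 2).sum(axis=1)
    return int(d.argmin())

target = np.array([200.0, 50.0, 50.0])
fg_index = nearest(target, 256)  # foreground: full palette
bg_index = nearest(target, 16)   # background: first 16 colors only
```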

See a few examples below:

double_petscii_brute_force_4.png

im03.jpg

double_petscii_brute_force_5.png

im02.jpg


Hey, great stuff!

Yeah, multiprocessing can be a bit iffy; it could be that much of the time is spent passing the data set across the different processes and passing the results back.

To avoid this, multithreading could help, but then you're sometimes constrained by Python's global interpreter lock (GIL).

I want to take a look at your code and see if I can tweak it a bit to make it faster! (For reference, it runs in 5 minutes 30 seconds on my Ryzen 2700X.)


What times can you get if you increase the "N_processes" variable? Or did you already do that to achieve that time?


Yeah, it didn't matter much. I suspect there is a lot of overhead somewhere in pickling/unpickling the data passed between the processes. And indeed, switching to multiprocessing.pool.ThreadPool didn't improve things.

I see no obvious big improvements to the code as-is.

But I think it would be worthwhile to investigate what happens if you split the problem up differently. Rather than dividing it over the tokens to search for, I suggest trying an approach where you split up the source image in strips of 8 pixels high and process each strip in a worker process. So one worker processes one line of characters.

I realize this will require quite a bit of rewriting...
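Just to make the strip idea concrete, a rough sketch (process_strip is a stand-in for the real per-strip character search, and the serial loop at the end is where a Pool of workers would go):

```python
import numpy as np

# Cut a 640x480 image into 60 strips of 8 pixel rows: one strip per row of
# characters. Each strip is an independent unit of work for one worker.
image = np.zeros((480, 640, 3), dtype=np.uint8)
strips = [image[y:y + 8] for y in range(0, image.shape[0], 8)]

def process_strip(strip):
    return strip.shape  # placeholder for the actual character matching

shapes = [process_strip(s) for s in strips]  # serial; pool.map(process_strip, strips)
```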


I doubt the limiting factor is data transfer between the processes, as the parallelization is almost trivial. In fact, I think each process holds a copy of all the relevant variables (they may even be initialized in every process, e.g. the image is loaded and the palette determined, though I am not totally sure about that), so no communication is necessary except for the final results. I have also tried parallelization with the "ray" library, which claims to solve issues with large numeric datasets and with accessing identical data from multiple processes, but it didn't help.

I have thought the same about splitting the problem up differently, i.e. doing more in-place computation and fewer memory-intensive tasks. One obvious way to achieve this would be to compute each 8x8 pixel tile individually. That way, all necessary variables could fit into the CPU cache, so memory accesses should be a lot faster. With that in mind, I ran a few tests on a very simple problem earlier, but couldn't find any noticeable difference between different ways of splitting it up. I guess the present version is roughly as efficient as I can make it; improvements will come from better algorithms.

Well, if there *is* an example where you can actually see different performance for different ways of splitting it up, I will be happy to see it!

Edited by kliepatsch
clarification of argument

Wow, really cool. Did I understand correctly that you try all combinations of pixels against all combinations of characters, or do you do this per 8x8 square? The per-8x8-square method is what I do; it's not as sophisticated as yours. Your double PETSCII images look really close to the original. I needed to click on one and check it twice to see that it was PETSCII and not a bitmap 🙂. I have not done 256-color ones yet, but in your results I can see that the improvements are significant.

Really fun to see. I'll test your program when I have some time; I'd like to see how it runs on my laptop.

To optimize, a really fun experiment for me would be to go all out and use the graphics card for massively parallel processing, e.g. with CUDA. This is how deep learning is sped up, for example. Maybe some day 🙂

Great work so far 🙂

Edited by CursorKeys


Are you starting with pictures that are bigger than 640x480 resolution?  The ones I see above are 1000x750.  

This is great.  If we can do this with the PETSCII font, imagine what we can do with a custom font.  PETSCII wasn't originally designed for flipping vertically and horizontally (i.e. we don't need four tiles to be short curves, only one), and the default font on this machine has been modified for easier display on CRT terminals like TV screens; there are very few tiles with a ...010... anywhere in the current font.  A custom font (at least for layer 1) could improve these images significantly.  I think I could re-do the video demos with this technique and cut the video image data by 75% while doubling the resolution; I might be able to get 24fps for a 320x136 video with this technique.

Edited by Ed Minchau


Thanks. Yes, the input image is scaled down (and cropped if necessary) to 640x480 pixels. It is then subdivided into 8x8 pixel areas, and the best match for each 8x8 tile is sought. My implementation handles all the 8x8 areas at once, making heavy use of numpy expressions. NumPy is one of the most important Python libraries, if not the most important, offering highly optimized numerical routines and great flexibility.
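The key trick for handling all areas at once is a reshape that exposes every 8x8 tile as its own axis, so vectorized numpy expressions can score all tiles together. A minimal sketch (the real script's variable names may differ):

```python
import numpy as np

# Turn a 480x640 image into a (4800, 8, 8) stack of its 8x8 tiles:
# 60 tile rows x 80 tile columns = 4800 tiles, no Python loop needed.
img = np.arange(480 * 640).reshape(480, 640)
tiles = img.reshape(60, 8, 80, 8).swapaxes(1, 2).reshape(-1, 8, 8)
```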

Yes, I also thought about using the GPU but haven't tried it yet; it could provide a significant speedup. The operations that do the heavy lifting are plain arithmetic, with few comparisons and little branching, so I think it should be possible to accelerate the method on a GPU.

I think it would be desirable to find an algorithm that finds decent solutions with a fraction of the computational effort. My idea is a two-stage algorithm that tries all the foreground bitmasks, selects a few of them, say 10, according to some clever criteria, and then systematically tries all background combinations for each of the 10 foreground candidates. This would already reduce the computational effort by a factor of 20 to 25 compared to the dumb brute-force method.
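In outline, for one tile the two-stage search could look like this (both the cheap score and the exact cost are placeholders; the clever stage-1 criterion is exactly the open question):

```python
import numpy as np

# Stage 1: score all 256 foreground characters cheaply, keep the 10 best.
# Stage 2: run the exact cost only for those candidates x all backgrounds.
rng = np.random.default_rng(2)
n_chars = 256
cheap_scores = rng.random(n_chars)           # stand-in cheap per-foreground score
candidates = np.argsort(cheap_scores)[:10]   # the 10 most promising foregrounds

def full_cost(fg, bg):                       # stand-in for the exact tile error
    return ((int(fg) * 31 + bg * 17) % 997) / 997

best = min((full_cost(fg, bg), int(fg), bg)
           for fg in candidates for bg in range(n_chars))
evaluated = len(candidates) * n_chars        # 2560 instead of 256*256 = 65536
```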


Not sure if it helps, but one other speed optimization I've been thinking of is slashing the resolution in the computation phase.

So, let's say we take a 640x480 picture and divide it into 8x8 pixel blocks. But instead, we halve the resolution to 320x240 and compare with "pretend" 4x4 pixel blocks (scaling down the font as well). When done, we still have roughly the same result, but the calculation time is cut to about a quarter.

I think it may work; the pixel-perfect comparison is probably not needed, since the end result will never be a perfect match anyway (except in some special cases).
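The downscaling step itself is a one-liner in numpy, e.g. averaging 2x2 pixel blocks so an 8x8 comparison becomes a 4x4 one (toy sketch, single-channel tile):

```python
import numpy as np

# Average each 2x2 block of an 8x8 tile to get a 4x4 half-resolution tile.
tile = np.arange(64, dtype=float).reshape(8, 8)
half = tile.reshape(4, 2, 4, 2).mean(axis=(1, 3))  # downscaled 4x4 tile
```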

Anyway, interesting little project; I'll keep following it to see where it leads 🙂

