2nd International Conference on Computer Science and Engineering, UBMK 2017, Antalya, Türkiye, 5-8 October 2017, pp. 34-37 (Full-Text Paper)
In this article, we present a novel word-based lossless compression algorithm for text files that uses a semi-static model. We call it the Multi-stream Word-based Compression Algorithm (MWCA) because it stores the compressed forms of the words in three separate streams, depending on their frequencies in the text. It also stores two dictionaries and a bit vector as side information. In our experiments, MWCA achieves a compression ratio of 3.23 bits per character (bpc) on average and 2.88 bpc on files larger than 50 MB. If a variable-length encoder such as Huffman coding is applied after MWCA, these ratios drop to 2.63 and 2.44 bpc, respectively. Thanks to its multi-stream structure, MWCA could be a good solution, especially for storing and searching large volumes of text data.
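To help interpret the bpc figures quoted above, the following is a minimal sketch of how bits per character relates to file sizes. It is not part of MWCA itself: the function names are hypothetical, and it assumes one byte per character of the original text (so 8 bpc means no compression).

```python
def bits_per_character(compressed_bytes: int, original_chars: int) -> float:
    """Return the compression ratio in bits per character (bpc).

    Assumption: the original text uses one byte per character,
    so an uncompressed file corresponds to 8 bpc.
    """
    if original_chars <= 0:
        raise ValueError("original_chars must be positive")
    return 8 * compressed_bytes / original_chars


def compressed_size_bytes(original_chars: int, bpc: float) -> float:
    """Estimate the compressed size in bytes for a given bpc value."""
    return original_chars * bpc / 8


if __name__ == "__main__":
    # Illustrative input only: a hypothetical 50,000,000-character file
    # compressed at the 2.88 bpc reported for large files.
    size = compressed_size_bytes(50_000_000, 2.88)
    print(f"Estimated compressed size: {size / 1_000_000:.1f} MB")
```

Under these assumptions, 2.88 bpc corresponds to a compressed file of roughly 36% of the original size (2.88 / 8), and 2.44 bpc to about 30.5%.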