Welcome to Dream.In.Code
Become a C++ Expert!

Join 137,185 C++ Programmers for FREE! Get instant access to thousands of C++ experts, tutorials, code snippets, and more! There are 2,381 people online right now. Registration is fast and FREE... Join Now!




Writing file compression tool

 
Reply to this topicStart new topic

Writing file compression tool

aristeiaguy
28 May, 2008 - 09:40 PM
Post #1

New D.I.C Head
*

Joined: 28 May, 2008
Posts: 3

Hello all,
Just signed up and had a question for you guys. I am planning on taking a stab at writing a compression tool and I had a question. My question is this:
- How do you read the individual bytes that make up a file, and create a new file from a binary structure you make up?
In other words:
How can you tell that hello.txt's binary representation on the hard drive is 000101011, and then how do you create a file whose binary representation is 00101.

This might be a little hard to follow, and may need a whole lot of explanation, but a point in the right direction (links,tuts,etc) would be a lot of help.

Thanks in advance!

-Aristeiaguy
User is offlineProfile CardPM
+Quote Post

aristeiaguy
RE: Writing File Compression Tool
28 May, 2008 - 09:47 PM
Post #2

New D.I.C Head
*

Joined: 28 May, 2008
Posts: 3

I found another tutorial that seems to be a bit of help, but if someone could take it a bit further that would be much appreciated.

http://www.dreamincode.net/forums/index.ph...;hl=compression

I understand reading it a byte at a time, but does anyone have the best way ( most efficient to memory is what I'm thinking atm ) to implement this. I don't want to write a file compression tool that never compresses a file because it takes too long ;p Also, sorry for the double post on my first thread ><;
User is offlineProfile CardPM
+Quote Post

joske
RE: Writing File Compression Tool
29 May, 2008 - 01:15 AM
Post #3

D.I.C Head
**

Joined: 4 Sep, 2007
Posts: 158



Thanked: 12 times
My Contributions
First try to get something working. Next you can see if you can optimize things more. How to read, process, and write the compressed file can depend on how your compressing algorithm works, so it is somewhat difficult to say what the most efficient combination is right now. Maybe you have to experiment with a couple of solutions to see what works quickest.
User is offlineProfile CardPM
+Quote Post

aristeiaguy
RE: Writing File Compression Tool
29 May, 2008 - 05:37 AM
Post #4

New D.I.C Head
*

Joined: 28 May, 2008
Posts: 3

Thanks for the quick reply whatsthat.gif
Well, I have a problem with reading my file, or at least with formatting the output. Let's say I have a text file with the word hello stored in it. How can I see the binary composition of the file itself? I've tried opening the file and reading it in using an unsigned char, but all I can output is exactly what is stored inside the file ><. I'm just trying to see it's binary representation, and I know that there should probably be more to the file, like some type of header data that signifies what type of file it is, and I want to be able to see this in binary. Let me know what I'm missing if you can. Again, sorry if this is a dumb question, but I'm really looking forward to tinkering around with data compression.
User is offlineProfile CardPM
+Quote Post

joske
RE: Writing File Compression Tool
29 May, 2008 - 06:43 AM
Post #5

D.I.C Head
**

Joined: 4 Sep, 2007
Posts: 158



Thanked: 12 times
My Contributions
A text file does not have any header, it is a plain format. Files like images do have headers containing information about the size and the type of the image.

First: to get familiar with the binary content of any file, you can use a text editor with support for HEX representation. For example PSpad can do this: open any file, then select menu "View", "HEX edit mode". You will see each byte of the file represented as a hexadecimal value. (two hexadecimal digits per byte)

Next, you can try to read any file binary, byte by byte, in your own c++ program. You can start with printing the values of the bytes you read and checking if you are correctly reading the file. Then, you can try to store the bytes you have read in a new file (without changing the content in the first instance), and see if the file is still not corrupted. If that goes well, then you are reading and writing the data correctly smile.gif.

This post has been edited by joske: 29 May, 2008 - 06:46 AM
User is offlineProfile CardPM
+Quote Post

perfectly.insane
RE: Writing File Compression Tool
29 May, 2008 - 03:45 PM
Post #6

D.I.C Addict
Group Icon

Joined: 22 Mar, 2008
Posts: 558



Thanked: 46 times
Dream Kudos: 25
Expert In: C/C++

My Contributions
If you're doing prototyping, it might be helpful to use an interface that allows you to read a bit at a time. C++ makes this easy, but it's not necessarily the most efficient way of doing it. However, for testing purposes, it's probably adequate.

Here's an example that maps a given file into memory on MS Windows and allows one to flip bits one by one:

cpp

// bitvector_impl_win.hpp
#ifndef __BITVECTOR_IMPL_WIN_HPP__
#define __BITVECTOR_IMPL_WIN_HPP__

#include <windows.h>
#include <stdexcept>

#define MMAP_PAGE_SIZE (1 << 30)

namespace util {
class mmbitset_impl
{
public:
typedef mmbitset_impl self_type;
typedef unsigned long long size_type;

mmbitset_impl(char* filename, size_type len);
~mmbitset_impl();

bool isset(size_type bit);
void set(size_type bit, bool value);
size_type size() { return size_; }

protected:
void create_mapping(char* filename);
void destroy_mapping();

size_type correct_mapping(size_type bit);
void guard_bit_operation(size_type bit) { if(bit >= size_) throw std::range_error("bit >= size_"); }

private:
size_type size_;
size_type bytes_;
HANDLE fh_;
HANDLE mh_;
void* memmap_;
size_type view_start_;
size_type view_end_;
};


mmbitset_impl::mmbitset_impl(char* filename, mmbitset_impl::size_type len)
: size_(len), bytes_(len >> 3), fh_(NULL), mh_(NULL), memmap_(NULL),
view_start_(0), view_end_(0)
{
create_mapping(filename);
}

mmbitset_impl::~mmbitset_impl() { destroy_mapping(); }

void mmbitset_impl::create_mapping(char* filename)
{
fh_ = CreateFileA(filename, GENERIC_READ | GENERIC_WRITE,
FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

if(fh_ == INVALID_HANDLE_VALUE)
throw std::runtime_error("Error opening file");

mh_ = CreateFileMapping(fh_, NULL, PAGE_READWRITE, bytes_ >> 32, bytes_ & 0xFFFFFFFF, NULL);

if(mh_ == NULL) {
destroy_mapping();
throw std::runtime_error("Error creating file mapping.");
}

}

void mmbitset_impl::destroy_mapping()
{
if(NULL != mh_) {
if(NULL != memmap_) {
UnmapViewOfFile(memmap_);
}
CloseHandle(mh_);
}

if(INVALID_HANDLE_VALUE != fh_)
CloseHandle(fh_);
}

mmbitset_impl::size_type mmbitset_impl::correct_mapping(mmbitset_impl::size_type bit)
{
if(NULL != memmap_) {
if(bit < view_start_ || bit >= view_end_) {
UnmapViewOfFile(memmap_);
memmap_ = NULL;
}
}

if(NULL == memmap_) {
// Find page lower bound.
view_start_ = bit & ~(MMAP_PAGE_SIZE - 1);
view_end_ = std::min(view_start_ + MMAP_PAGE_SIZE, size_);

memmap_ = MapViewOfFile(mh_, FILE_MAP_ALL_ACCESS, view_start_ >> 35, (view_start_ >> 3) & 0xFFFFFFFF,
((view_end_ - view_start_) >> 3) & 0xFFFFFFFF);

if(memmap_ == NULL)
throw std::runtime_error("Failed to create view of memory mapped file.");

}

return bit & (MMAP_PAGE_SIZE - 1);
}

void mmbitset_impl::set(mmbitset_impl::size_type bit, bool value)
{
guard_bit_operation(bit);
mmbitset_impl::size_type sbit = correct_mapping(bit);
unsigned char* pbyte = reinterpret_cast<unsigned char*>(memmap_) + (sbit >> 3);
if(value)
*pbyte |= (1 << (sbit & 0x7));
else
*pbyte &= ~(1 << (sbit & 0x7));
}

bool mmbitset_impl::isset(mmbitset_impl::size_type bit)
{
guard_bit_operation(bit);
mmbitset_impl::size_type sbit = correct_mapping(bit);
unsigned char* pbyte = reinterpret_cast<unsigned char*>(memmap_) + (sbit >> 3);
return (*pbyte & (1 << (sbit & 0x7)));
}
}

#endif

// bitvector.hpp
#ifndef __BITVECTOR_HPP__
#define __BITVECTOR_HPP__

namespace util {
class mmbitset_impl;

class mmbitset
{
public:
typedef mmbitset self_type;
typedef unsigned long long size_type;

mmbitset(char* filename, size_type len);
~mmbitset();

bool isset(size_type bit);
void set(size_type bit);
void reset(size_type bit);
void set(size_type bit, bool value);
size_type size();

private:
mmbitset_impl* pimpl_;
};
}

#endif

// bitvector.cpp
#include "bitvector.hpp"
#include "bitvector_impl_win.hpp"

namespace util {

mmbitset::mmbitset(char* filename, size_type len)
: pimpl_(new mmbitset_impl(filename, len))
{
}

mmbitset::~mmbitset()
{
delete pimpl_;
}

bool mmbitset::isset(size_type bit) { return pimpl_->isset(bit); }
void mmbitset::set(size_type bit) { pimpl_->set(bit, true); }
void mmbitset::reset(size_type bit) { pimpl_->set(bit, false); }
void mmbitset::set(size_type bit, bool value) { pimpl_->set(bit, value); }
mmbitset::size_type mmbitset::size() { return pimpl_->size(); }
}



Here is a piece that actually uses it:
cpp

#include <iostream>
#include <stdexcept>
#include "bitvector.hpp"

int doit()
{
using namespace util;
mmbitset bs("g:\\hugeset.bin", 0x8000000ull);
mmbitset::size_type n = 0;

for(mmbitset::size_type x = 2; x != bs.size(); x++) {
if(!bs.isset(x)) n++;
}

std::cout << n << " primes found." << std::endl;

return 0;
}

int main()
{
try {
return doit();
}
catch(std::exception& e) {
std::cout << "Exception caught: " << e.what() << std::endl;
}
return 1;
}


That basically reads the first 16MB of the hugeset.bin (which is 512MB total) and counts every bit in the file that is a 1. Each 1 bit indicates a prime number in a sieve of eratosthenes, so the 23rd bit in the file will be 1 as the number 23 is prime. So, the result is the number of primes below 134,217,728.

Here is the time it took:
$ time ./countprimes.exe
7603553 primes found.

real 0m21.718s
user 0m0.031s
sys 0m0.000s

Using mmbitset_impl directly will shave a second or two off this timing.

Of course, for your functionality, you'd need to make a wrapper that cares about how large a file already is (wasn't necessary for my goals).
User is offlineProfile CardPM
+Quote Post

mikeblas
RE: Writing File Compression Tool
31 May, 2008 - 02:27 PM
Post #7

D.I.C Head
**

Joined: 8 Feb, 2008
Posts: 155



Thanked: 1 times
My Contributions
QUOTE(perfectly.insane @ 29 May, 2008 - 04:45 PM) *
Here's an example that maps a given file into memory on MS Windows and allows one to flip bits one by one:
Note that this example uses memory-mapped files, which have pretty serious performance problems and aren't always a great idea. For somoene just starting out, I think they're a little too complicated. Sticking with fread() is probably more appropriate.
User is offlineProfile CardPM
+Quote Post

Reply to this topicStart new topic
Time is now: 12/4/08 11:29AM

Live C++ Help!

C++ Tutorials

Reference Sheets

C++ Snippets

DIC Chatroom

Bye Bye Ads

Monthly Drawing

Thumb Drive

Top Contributors

Top 10 Kudos This Month