|  | Sorting a file with many already-sorted chunks |  | |
| | | Guest |  |
| Posted: Thu Aug 21, 2008 2:56 pm Post subject: Sorting a file with many already-sorted chunks |  |
What is the best sorting algorithm if the file consists of many chunks of various sizes, each chunk is already ordered, but they are out of order? Example:
38013 38014 38015 33016 33017 33018 44034 44035 44036 44037 44038 44039 44040 44041 44042 44043 44044 44045 44046 44047 34892 34893 34894 34895 34896 34897 34898 34899 34900 34901 34902 34903 34904 34905 34906 38016 38017 38018 38019 38020
Here we have five already-ordered chunks, length 3, 3, 14, 15 and 5. I would think QuickSort is the best in such situation, but I am not sure. |
| |
| | | Richard Heathfield |  |
| Posted: Thu Aug 21, 2008 2:56 pm Post subject: Re: Sorting a file with many already-sorted chunks |  |
ilya2@rcn.com said:
| Quote: | What is the best sorting algorithm if the file consists of many chunks of various sizes, each chunk is already ordered, but they are out of order?
|
Expect improvements from others on this idea:
start a new group while the read of the next datum succeeds compare it to the previous datum (if there is one) in this group if it's greater, add it to this group otherwise, start a new group, and add this datum to the new group, which becomes the "current" group, the "this" group wend merge the groups in the canonical way
-- Richard Heathfield <http://www.cpax.org.uk> Email: -http://www. +rjh@ Google users: <http://www.cpax.org.uk/prg/writings/googly.php> "Usenet is a strange place" - dmr 29 July 1999 |
| |
| | | Juha Nieminen |  |
| Posted: Thu Aug 21, 2008 2:56 pm Post subject: Re: Sorting a file with many already-sorted chunks |  |
ilya2@rcn.com wrote:
| Quote: | What is the best sorting algorithm if the file consists of many chunks of various sizes, each chunk is already ordered, but they are out of order?
|
Read all the sorted chunks (detecting where a chunk ends and a new starts is a trivial conditional) and then merge-sort all the chunks together. (In other words, merge pairs of chunks, then merge pairs of the resulting chunks, etc. until you have only one left.) |
| |
| | | stan |  |
| Posted: Thu Aug 21, 2008 2:56 pm Post subject: Re: Sorting a file with many already-sorted chunks |  |
| |  | |
Richard Heathfield wrote:
| Quote: | ilya2@rcn.com said:
What is the best sorting algorithm if the file consists of many chunks of various sizes, each chunk is already ordered, but they are out of order?
Expect improvements from others on this idea:
start a new group while the read of the next datum succeeds compare it to the previous datum (if there is one) in this group if it's greater, add it to this group otherwise, start a new group, and add this datum to the new group, which becomes the "current" group, the "this" group wend merge the groups in the canonical way
|
I think this application would require great care. The scan to find where a new sorted group starts and ends will most likely cost more than a simple straightforward sort. There are too many unknowns to give much advice. How large is the file? How large and how many sorted groups are there? Are any elements repeated?
This sure sounds like one of those cases where it could pay to make sure you are solving the right problem before jumping on what appears to be an obvious solution. On top of that this really seems like one of those times where it will be real easy to generate an O(n^2) solution when a standard library qsort or a standard mergesort would probably win without even considering any partial sort of the input.
How about more information about the input? |
| |
| | | Pascal J. Bourguignon |  |
| Posted: Thu Aug 21, 2008 4:32 pm Post subject: Re: Sorting a file with many already-sorted chunks |  |
| |  | |
Juha Nieminen <nospam@thanks.invalid> writes:
| Quote: | ilya2@rcn.com wrote: What is the best sorting algorithm if the file consists of many chunks of various sizes, each chunk is already ordered, but they are out of order?
Read all the sorted chunks (detecting where a chunk ends and a new starts is a trivial conditional) and then merge-sort all the chunks together. (In other words, merge pairs of chunks, then merge pairs of the resulting chunks, etc. until you have only one left.)
|
That's O(N^2).
You must do the merge of all chunks at once to reduce it to O(N).
As long as the number of chunks is not so big that at least one element from each chunk can be held in memory, it can be done in O(N).
Read the file once, keeping in memory the start and end position of each chunk, and the first element of each chunk.
Then scan all the chunks for the smallest element, and output it, advancing the chunk (read the next element, incrementing the current position for the chunk). Repeat until all chunks are empty.
-- __Pascal Bourguignon__ LINK
HANDLE WITH EXTREME CARE: This product contains minute electrically charged particles moving at velocities in excess of five hundred million miles per hour. |
| |
| | | Richard Harter |  |
| Posted: Thu Aug 21, 2008 5:16 pm Post subject: Re: Sorting a file with many already-sorted chunks |  |
| |  | |
On Thu, 21 Aug 2008 20:32:10 +0200, pjb@informatimago.com (Pascal J. Bourguignon) wrote:
| Quote: | Juha Nieminen <nospam@thanks.invalid> writes:
ilya2@rcn.com wrote: What is the best sorting algorithm if the file consists of many chunks of various sizes, each chunk is already ordered, but they are out of order?
Read all the sorted chunks (detecting where a chunk ends and a new starts is a trivial conditional) and then merge-sort all the chunks together. (In other words, merge pairs of chunks, then merge pairs of the resulting chunks, etc. until you have only one left.)
That's O(N^2).
|
No, its O(N*log2(M)) where M is the number of chunks. It's log2(M) passes with each pass being O(N).
| Quote: | You must do the merge of all chunks at once to reduce it to O(N).
|
It's still O(N*log2(M)). You need a data structure to hold headers for M chunks. Searching that data structure for the next item to merge in is O(log2(M)).
| Quote: | As long as the number of chunks is not so big that at least one element from each chunk can be held in memory, it can be done in O(N).
Read the file once, keeping in memory the start and end position of each chunk, and the first element of each chunk.
Then scan all the chunks for the smallest element, and output it, advancing the chunk (read the next element, incrementing the current position for the chunk). Repeat until all chunks are empty.
|
Richard Harter, cri@tiac.net LINK LINK Save the Earth now!! It's the only planet with chocolate. |
| |
| | | Richard Harter |  |
| Posted: Thu Aug 21, 2008 5:23 pm Post subject: Re: Sorting a file with many already-sorted chunks |  |
On Thu, 21 Aug 2008 16:13:56 GMT, Juha Nieminen <nospam@thanks.invalid> wrote:
| Quote: | ilya2@rcn.com wrote: What is the best sorting algorithm if the file consists of many chunks of various sizes, each chunk is already ordered, but they are out of order?
Read all the sorted chunks (detecting where a chunk ends and a new starts is a trivial conditional) and then merge-sort all the chunks together. (In other words, merge pairs of chunks, then merge pairs of the resulting chunks, etc. until you have only one left.)
|
If the chunks are of significantly unequal size it is more efficient to sort the chunks by size and then successively merge the two smallest. The principle is the same as in Huffman encoding.
Richard Harter, cri@tiac.net LINK LINK Save the Earth now!! It's the only planet with chocolate. |
| |
| | | Richard Heathfield |  |
| Posted: Thu Aug 21, 2008 5:45 pm Post subject: Re: Sorting a file with many already-sorted chunks |  |
| |  | |
James said:
| Quote: | On Aug 21, 12:13 pm, Juha Nieminen <nos...@thanks.invalid> wrote: il...@rcn.com wrote: What is the best sorting algorithm if the file consists of many chunks of various sizes, each chunk is already ordered, but they are out of order?
Read all the sorted chunks (detecting where a chunk ends and a new starts is a trivial conditional) and then merge-sort all the chunks together. (In other words, merge pairs of chunks, then merge pairs of the resulting chunks, etc. until you have only one left.)
This works well for the OP's example data, but consider the data...
1 2 5 6 9 3 4
Splitting this data into the minimum number of chunks does not seem so easy.
|
It's trivial, as both Juha and I have explained.
| Quote: | We can split it into 3 chunks (1,2) (3,4) (5,6,9), but on a single pass?
|
Yes, you can split it into three chunks just like that, in a single pass.
1 2 5 6 9 3 4
Then you merge the chunks:
1 2 5 6 9 3 4
Comparing 1, 5, 3, we take 1
2 5 6 9 3 4
(1)
Comparing 2, 5, 3, we take 2
5 6 9 3 4 (1, 2)
Comparing 5, 3, we take 3
5 6 9 4
(1, 2, 3)
Comparing 5, 4, we take 4
5 6 9
(1, 2, 3, 4)
Comparing 5 we take 5, and so on through the remainder of the single chunk.
-- Richard Heathfield <http://www.cpax.org.uk> Email: -http://www. +rjh@ Google users: <http://www.cpax.org.uk/prg/writings/googly.php> "Usenet is a strange place" - dmr 29 July 1999 |
| |
| | | Guest |  |
| Posted: Thu Aug 21, 2008 5:54 pm Post subject: Re: Sorting a file with many already-sorted chunks |  |
On Aug 21, 12:13 pm, Juha Nieminen <nos...@thanks.invalid> wrote:
| Quote: | il...@rcn.com wrote: What is the best sorting algorithm if the file consists of many chunks of various sizes, each chunk is already ordered, but they are out of order?
Read all the sorted chunks (detecting where a chunk ends and a new starts is a trivial conditional) and then merge-sort all the chunks together. (In other words, merge pairs of chunks, then merge pairs of the resulting chunks, etc. until you have only one left.)
|
Thanks -- it worked quite well. |
| |
| | | Richard Heathfield |  |
| Posted: Thu Aug 21, 2008 6:03 pm Post subject: Re: Sorting a file with many already-sorted chunks |  |
| |  | |
James said:
| Quote: | On Aug 21, 3:45 pm, Richard Heathfield <r...@see.sig.invalid> wrote:
Yes, you can split it into three chunks just like that, in a single pass.
1 2 5 6 9 3 4
I'm a rank amature in computational algorithms (I'm an engineer) but how are you splitting the data into those 3 chunks instead of one of the following?
1 2 5 6 9 3 4
|
Oops, good spot. Yes, the algorithm I described would indeed split it as 1, 2, 5, 6, 9 and 3, 4, not as three chunks. Sorry, just a braino there.
Which is fine. 1 2 5 6 9 3 4
gives you a very quick merge:
2 5 6 9 3 4 (1)
5 6 9 3 4 (1 2)
5 6 9 4 (1 2 3)
5 6 9
(1 2 3 4)
followed by a bolt-on move:
1 2 3 4 5 6 9
That's one pass (i.e. O(n)) to get your chunks, and a complete sort in only four comparisons, for nine items. I'd call that pretty darn good. :-)
-- Richard Heathfield <http://www.cpax.org.uk> Email: -http://www. +rjh@ Google users: <http://www.cpax.org.uk/prg/writings/googly.php> "Usenet is a strange place" - dmr 29 July 1999 |
| |
| Page 1 of 5 .:. Goto page 1, 2, 3, 4, 5 Next | |
|
|