Skip to content

Recent Articles

24
Jun

Fighting the memory limits, part 2

Display optimization

No doubt that for you the most important thing is the speed of displaying the website’s  content. The slower it works, more problems you’ll experience such as server overload and people complaining the website opens to slowly. We know it and this is why we put so much effort to increase the speed of display.

We want to affect your website as little as possible. The changes we made recently are a big step in that direction. The main change in the new script is some sort of prediction and optimisation. We try to find out if we need to make any changes in the database. If not, the script will avoid keeping in memory any data that is unnecessary during display command. With that we can stop reading the file sooner. It means that in the best case scenario we’ll be able to display our ads in with almost no overhead.

Unfortunately, when there is need to change some information in the file, the script still needs to rewrite the whole file and memory usage might be significant.

Here, on the plot, you can see the difference between the old and the new script version.

The super pessimistic variant is similar chart you can find in the next section of this post. As a summary for all of you, it means that more of your ads will be visible and we won’t affect your websites. On every chart you might notice that for huge amount of data there is no information about memory usage and execution time for the old script. This is because request died with the exceeded memory limit error.

Data push optimisation

Some of you might probably think that the problem is ours and not yours. It’s not true. First of all, during data push commands the script is locking exclusively the data file, and it means, that during push time every other query on your website has to wait.  We never connect to your website more than we need, and when we need it, it’s  not more often than every 2 hours but even then your clients might complain about your website’s speed when our operation takes too long time.  Another thing is that if we can’t push data, some of your links might not be visible.

In this area we’ve also gained some speed and lowered memory footprint.

After changes the is possible to push more data to your website. We have increased the stability of our script. Also, we’ve found some natural limitations in the number of pages that the old version of the script could work with. After some considerations we’ve decided to limit the number of pages to 500 (and it is possible to have up to 4000 pages) for every website that will use this script. In the end It’s not so bad because most websites have no more than 500 pages. If there is more, then it’s usually a portal, forum or website with duplicate content. For the whole system it’s better when ads are distributed across many different websites. Soon we’ll come up with a special script for websites with greater number of pages. So if your website is really big, and content you are providing is unique and you want to have more points it’ll be possible to get them so don’t be worried!

Let’s get back to the current solution. As I mentioned earlier, we have implemented some prediction mechanisms for data management. Just before database reading script tries to find the best options for the request to minimise execution time and memory footprint, it might decide to avoid the creation some special memory structures if they’re not needed for the current request.

New functionality and the crawler is coming

We have implemented some special commands that will be necessary when the crawler starts doing some work instead of your script. From the new versions we can disable page gathering in the script and we can clean up and shrink your database. Within the next few months we’re planning to enable the crawler for all websites. Then  the system will clean your databases and disable file rewriting during display time.

Summary

For sure, we did a lot but not everything we could. With the new php version there will be a possibility of using other data structures than array, but for now and taking into account the technology we’re using, I’d say that we did as much work as we could and we got some very nice results. It’s certain that we’ll come up with some more optimisations in the future. As an example I can say that once we start using the crawler for all of you, we’ll remove all unused code from the script and some locking mechanisms. With this another bottleneck will be removed.

Robert.

  1. Display optimization

 

No doubt that for you the most important thing is the speed of displaying the website’s content. The slower it works, more problems you’ll experience such as server overload and people complaining the website opens to slowly. We know it and this is why we put so much effort t o increase the speed of display. We want to affect your website as little as possible. The changes we made recently are a big step in that direction. The main change in the new script is some sort of prediction and optimisation. We try to find out if we need to make any changes in the database. If not, the script will avoid keeping in memory any data that is unnecessary during display command. With that we can stop reading the file sooner. It means that in the best case scenario we’ll be able to display our ads in with almost no overhead.

 

Unfortunately, when there is need to change some information in the file, the script still needs to rewrite the whole file and memory usage might be significant.

 

Here, in the diagram (on the plots) you can see the difference between the old and the new script version. The super pessimistic variant is similar chart you can find in the next section of this post. As a summary for all of you, it means that more of your ads will be visible and we won’t affect your websites. On every chart you might notice that for hu ge amount of data there is no information about memory usage and execution time for the old script. This is because request died with the exceeded memory limit error.

 

  1. Data push optimisation.

 

Some of you might probably think that the problem is ours and not yours. It’s not true. First of all, during data push commands the script is locking exclusively the data file, and it means, that during push time every other query on your website has to wait. We never connect to your website more than we need, and when we need it, it’s not more often than every 2 hours but even then your clients might complain about your website’s speed when our operation takes too long time. Another thing is that if we can’t push data, some of your links might not be visible.

 

In this area we’ve also gained some speed and lowered memory footprint. After changes the is possible to push more data to your website. We have increased the stability of our script. Also, we’ve found some natural limitations in the number of pages that the old version of the script could work with. After some considerations we’ve decided to limit the number of pages to 500 (and it is possible to have up to 4000 pages) for every website that will use this script. In the end It’s not so bad because most websites have no more than 500 pages. If there is more, then it’s usually a portal, forum or website with duplicate content. For the whole system it’s better when ads are distributed across many different websites. Soon we’ll come up with a special script for websites with greater number of pages. So if your website is really big, and content you are providing is unique and you want to have mo re points it’ll be possible to get them so don’t be worried!

 

Let’s get back to the current solution. As I mentioned earlier, we have implemented some prediction mechanisms for data management. Just before database reading script tries to find the best options for the request to minimise execution time and memory footprint, it might decide to avoid the creation some special memory structures if they’re not needed for the current request.

 

  1. New functionality and the crawler is coming

 

We have implemented some special commands that will be necessary when the crawler starts doing some work instead of your script. From the new versions we can disable page gathering in the script and we can clean up and shrink your database. Within the next few months we’re planning to enable the crawler for all websites. Then the system will clean your databases and disable file rewriting during display time.

 

  1. Summary

 

For sure, we did a lot but not everything we could. With the new php version there will be a possibility of using other data structures than array, but for now and taking into account the technology we’re using, I’d say that we did as much work as we could and we got some very nice results. It’s certain that we’ll come up with some more optimisations in the future. As an example I can say that once we start using the crawler for all of you, we’ll remove all unused code from the script and some locking mechanisms. With this another bottleneck will be removed.

 

 

Best Regards,

Robert Bocheński

1. Display optimization

No doubt that for you the most important thing is the speed of displaying the website’s  content. The slower it works, more problems you’ll experience such as server overload and people complaining the website opens to slowly. We know it and this is why we put so much effort to increase the speed of display. We want to affect your website as little as possible. The changes we made recently are a big step in that direction. The main change in the new script is some sort of prediction and optimisation. We try to find out if we need to make any changes in the database. If not, the script will avoid keeping in memory any data that is unnecessary during display command. With that we can stop reading the file sooner. It means that in the best case scenario we’ll be able to display our ads in with almost no overhead.

Unfortunately, when there is need to change some information in the file, the script still needs to rewrite the whole file and memory usage might be significant.

Here, in the diagram (on the plots) you can see the difference between the old and the new script version. The super pessimistic variant is similar chart you can find in the next section of this post. As a summary for all of you, it means that more of your ads will be visible and we won’t affect your websites. On every chart you might notice that for huge amount of data there is no information about memory usage and execution time for the old script. This is because request died with the exceeded memory limit error.

2. Data push optimisation.

Some of you might probably think that the problem is ours and not yours. It’s not true. First of all, during data push commands the script is locking exclusively the data file, and it means, that during push time every other query on your website has to wait.  We never connect to your website more than we need, and when we need it, it’s  not more often than every 2 hours but even then your clients might complain about your website’s speed when our operation takes too long time.  Another thing is that if we can’t push data, some of your links might not be visible.

In this area we’ve also gained some speed and lowered memory footprint. After changes the is possible to push more data to your website. We have increased the stability of our script. Also, we’ve found some natural limitations in the number of pages that the old version of the script could work with. After some considerations we’ve decided to limit the number of pages to 500 (and it is possible to have up to 4000 pages) for every website that will use this script. In the end It’s not so bad because most websites have no more than 500 pages. If there is more, then it’s usually a portal, forum or website with duplicate content. For the whole system it’s better when ads are distributed across many different websites. Soon we’ll come up with a special script for websites with greater number of pages. So if your website is really big, and content you are providing is unique and you want to have more points it’ll be possible to get them so don’t be worried!

Let’s get back to the current solution. As I mentioned earlier, we have implemented some prediction mechanisms for data management. Just before database reading script tries to find the best options for the request to minimise execution time and memory footprint, it might decide to avoid the creation some special memory structures if they’re not needed for the current request.

3. New functionality and the crawler is coming

We have implemented some special commands that will be necessary when the crawler starts doing some work instead of your script. From the new versions we can disable page gathering in the script and we can clean up and shrink your database. Within the next few months we’re planning to enable the crawler for all websites. Then  the system will clean your databases and disable file rewriting during display time.

4. Summary

For sure, we did a lot but not everything we could. With the new php version there will be a possibility of using other data structures than array, but for now and taking into account the technology we’re using, I’d say that we did as much work as we could and we got some very nice results. It’s certain that we’ll come up with some more optimisations in the future. As an example I can say that once we start using the crawler for all of you, we’ll remove all unused code from the script and some locking mechanisms. With this another bottleneck will be removed.

Best Regards,
Robert Bocheński

27
May

Fighting the memory limits, part 1

Our system is constantly growing and changing. Every day we see new possibilities for modification and, of course, new bugs. You might have noticed that on some sites a portion of our links doesn’t display correctly. There might be several reasons why they are not showing:

  • we haven’t published your ads yet,
  • the website’s owner removed our code and the website is pending removal,
  • the page’s cache hasn’t refreshed yet,
  • we couldn’t push data with our links.

The first three issues are acceptable if the situation will change soon (we push data, cache refreshes after while, the installation has been fixed or the site has been removed from the system). Unfortunately, the last point could be caused by errors that will never be solved, and all the time some of your links will stay invisible in the network. The main reason for such a situation is the memory limit connected with every PHP request. (Standard limit for PHP 5.2 and older is 8Mb for the rest 16Mb).

When we were designing our platform, the most important assumption was that our script has to work on most commonly used web hosting and work seamlessly with most regular websites. This is why we dropped algorithms that require other libraries to function (cURL, MySQL, PostgreSQL, SQLite, fsockopen). They have some downsides. On page rendering the website asks for data and would need to connect to our network (so we’d need to have cURL or socket functions enabled). Some caching system would be necessary in order not to slow the services down.

We didn’t throw away this idea completely, though. Soon we’ll start work on this solution and owners of bigger sites that want to get more volts will need to install the other version of the script.

With the current version of the script, because of the restrictions we have, it’s not possible to use common data management methods (such as databases). Therefore all the data we need are stored in a simple text database and this causes our main problem. When your text database is growing, the execution time is growing too. We can’t build a highly optimized database because of the common lack of multi-process mechanisms such as file locking (often disabled). Because of that, every time we need to read some data from the file we need to read almost the whole file. When we need to write something, we need to put all the data into memory, do changes and rewrite the whole file. Believe me, your servers don’t like these things. Even though we’re still trying to optimize the execution to shorten the time for file locking, etc. For the most common services this script is good enough. As always, though, it can be better. Below I described what we’ve changed in the newest version of the script and what we got instead.

Rebuilding the file

First thought that came to our minds was “how much data is stored in the biggest sites?”, and we found four representative data sets for the rest of the testing work. During data exploration we realized that there’s a high data redundancy and some meaningless information. Armed with this knowledge we’ve changed the structure of our file. The first thing we did was changing every URL into a hash (the only exception is the information about new pages). Additionally, we moved the information about ad words into a separate place and now we connect to ads information by reference. In the end we stopped URL encoding data. On changing these we ran some tests and you can see the difference in the table. Given data sets will appear on every plot in this article.

File sizes

Old sizeNew sizeRatio
Initial2942941
Small1350756742,3805
Medium8416333276752,5685
Large12334106794121,8154
Huge976721112938797,5488

Data sets explanation:

  • initial — almost empty, just initialised file
  • small — small blog, 16 pages every page with 4 ads. No info about new or old pages
  • medium — bigger blog. About 1040 pages with ads connected, 200 old pages, 200 new pages.
  • large — 600 pages with ads, 16000 known pages.
  • huge — 20k with ad words, no info about known pages, no info about new pages.

As you can see from the table, the size of the tested files dropped by up to 8 times. We believe can expect even better results.

Memory usage optimisation

PHP language, even though it has a lot advantages, might not be very good for some things, especially because its memory management or, to be more precise, for its lack of other data structures. The most common data structure that exists in every PHP version is an array (SPL is an addition in 5.x versions). Single array instance costs about 544 bytes. A very simple test could show us that adding to such an object 10 integers will cost around 2kB. It means that for every 32bit integer we need around 200 bytes. In the end, if you want to put 500 integers into an array you have to allocate more or less 72296 bytes. It’s a huge amount of memory for only 500 integers. Overhead is about 70000 bytes (500 integers * 4byte = 2000bytes), and we could say it is normal for hash tables. You could imagine what will happen when dealing with multidimensional nested arrays… well 8Mb intended for PHP request could be exhausted very quickly. In the table below you can see the results from our simple tests.

Results data

Items deltaMemory delta
Empty array544
0680
101520
101536
101408
101664
101432
101408
101920
101408
202768
101408
203792
101408
202768
101408
202768
101408
202768
101408
202768
103456
101408
101424
101408
202768
101408
202768
101408
202768
101408
202768
101408
202768
101408
202768
101408
202768
101408
Total72296
Avg.2013,56

Following this line we wanted to check if it’s possible to lower the memory usage by flattening the data structures and this turned out to be what we were looking for. We’ve tested some methods of flattening data sets and the results we got you can see in the table below. The numbers are talking for themselves. After the test we decided to change the method of storing data in memory, and now we’ll try to keep only serialised data structures, and when necessary, we just do a simple de-serialisation, update and serialisation. As a side effect we have some speed-up as we don’t need to unserialise nor serialise data in most cases.

Data structures:Memory usage:Ratio
Array of Array134687361,000
Array of serialized array43893363,069
Array of JSONed array35228083,823

We hesitated whether it would be a good idea to use JSON, as it was added only in versions PHP 5.2, but in the end we decided to detect the version and use it if possible. I’ll describe the effects we got and the explanation of some other changes in the second part of this post.

 

Robert.

File sizes

Old sizeNew sizeRatio
Initial2942941
Small1350756742,3805
Medium8416333276752,5685
Large12334106794121,8154
Huge976721112938797,5488
16
Mar

New publication algorithm

A long, long time ago, or maybe not that long but still quite some time ago, we decided to come up with a good algorithm for link distribution. In real life, though, it turned out to be a little too good but at the same time not that good at all.

The early days

The goal of the algorithm was to distribute based on priorities, some of you might remember that there were links and ads priority… But how were we to calculate how many links are to be distributed and how to avoid hurting any of them in the ever-changing environment (day limits, total limits, page limits, site limits and no more slots)?

Here’s a short explanation of how the original algorithm worked.

Based on priorities we calculated the publications target. For this explanation we’lll use the following notations:

Adword{priority,publications}

Where:

Adword is an ad word id
priority is calculated from LinkPriority * AdPriority
publications is the amount of publications starting from last change (adding new link, new ad, deleting ad or link, changing priority of ad or link)

Let’s make some assumptions. We have four ads:

A{2, 0}
B{2, 0}
C{4, 0}
D{6, 0}

Before the publication process the algorithm calculated the publication target:

base = Priority(a) + Priority(b) +
Priority(c) + Priority(d) = 14

weight = Priority(AD) / base

Wa = 2 / 14 = 0,142857142857143
Wb = 2 / 14 = 0,142857142857143
Wc = 4 / 14 = 0,285714285714286
Wd = 6 / 14 = 0,428571428571429

Now, there’s just one little problem, how do we calculate target values? We assumed that every time the algorithm starts, we’ll give the user some slots to use, ie:

slotsGiven = 1000

Here is how we found how many times an ad word is to be published (totally). Here we have the sum of the current publications:

TotalPubsBase = SUM(Publications(Ad))

And here is the value of publications we want to have at the end of algorithm:

TotalPubsTarget = TotalPubsBase + slotsGiven

And here’s our example again :

TotalPubsBase = 0
TotalTarget = TotalPubsBase + SlotsGiven = 1000

In the next step we calculated the publication target for each adword:

foreach AD:
Target(ad) = MAX(1, Wad * TotalPubsTarget)

We had a minimum value to avoid stopping publication early.

Based on our example we have:

Target(a) = Wa * 1000 = 142
Target(b) = Wb * 1000 = 142
Target(c) = Wc * 1000 = 285
Target(d) = Wd * 1000 = 428

In the next step we sorted ads with the largest underweight:

Underweight(ad) = Target(ad) - Publications(ad)

What gave us an order like this:

D, C, B, A

At last, the publication process would start.

Even if our algorithm was good in theory, in practice it didn’t work that well. The main issue was slotsGiven value. We were calculating this value as a ratio based on the number of pages with at least one free slot and the amount of unique links. To be sure that publication wouldn’t stop we had a minimum value of 1000 for every user. There was also an additional limit per user to provide a fair situation for everyone. However, with a small number of slotGiven and a large number of ad words we would often get a small amount of slots for publication.

The new algorithm

The real life proved that the theory is not always easy to be put into practice. Here’s what could happen and sometimes did happen. If ad word A reached day or total limit early in the process, the user could lose a lot of given slots (for example A had aim for 428 slots but stopped on 2 because of the day limit) it meant the the user was not happy and the system was not fair. Additionally, to keep the priorities we tried to publish the first ad word as many times as we could before publishing another and so on. With this, the user would sometimes spend all of their points on one ad and this one ad would be distributed across one site.

Our recent update of Manage Ads changed not only the user interface but also the algorithm itself. Now it’s much simpler, more natural, there are no priorities anymore and new natural limitations came along.

Here’s how the new algorithm works:

userAdsBucket = (A, B, C, D)

while userAdsBucket not empty:
{ad} = get next ad from {bucket}
find slot for {ad}

if no more slots or limit reached:
remove {ad} from {bucket}
continue algorithm
end if

... do some other stuff
end while

With this algorithm your ads are published in the following order:

A, B, C, D, A, B, C, D, etc.

With this your Volts are spent equally on every ad, ads are published on different websites and different ads (with the same link) might be published on the same website.

Robert.

A long, long time ago, or maybe not that long but still quite some time ago, we decided to come up with a good algorithm for link distribution. In real life, though, it turned out to be a little too good but at the same time not that good at all.

The early days

The goal of the algorithm was to distribution based on priorities, some of you might remember that there were links and ads priority… But how were we to calculate how many links are to be distributed and how to avoid hurting any of them in the ever-changing environment (day limits, total limits, page limits, site limits and no more slots)?

Here’s a short explanation of how the original algorithm worked.

Based on priorities we calculated the publications target. For this explanation we’lll use the following notations:

Adword{priority,publications}

Where:

  • Adword is an ad word id
  • priority is calculated from LinkPriority * AdPriority
  • publications is the amount of publications starting from last change (adding new link, new ad, deleting ad or link, changing priority of ad or link)

Let’s make some assumptions. We have four ads:

A{2, 0}
B{2, 0}
C{4, 0}
D{6, 0}

Before the publication process the algorithm calculated the publication target:

base = Priority(a) + Priority(b) +
       Priority(c) + Priority(d) = 14

weight = Priority(AD) / base

Wa = 2 / 14 = 0,142857142857143
Wb = 2 / 14 = 0,142857142857143
Wc = 4 / 14 = 0,285714285714286
Wd = 6 / 14 = 0,428571428571429

Now, there’s just one little problem, how do we calculate target values? We assumed that every time the algorithm starts, we’ll give the user some slots to use, ie:

slotsGiven = 1000

Here is how we found how many times an ad word is to be published (totally). Here we have the sum of the current publications:

TotalPubsBase = SUM(Publications(Ad))

And here is the value of publications we want to have at the end of algorithm:

TotalPubsTarget = TotalPubsBase + slotsGiven

And here’s our example again :

TotalPubsBase = 0
TotalTarget = TotalPubsBase + SlotsGiven = 1000

In the next step we calculated the publication target for each adword:

foreach AD:
    Target(ad) = MAX(1, Wad * TotalPubsTarget)

We had a minimum value to avoid stopping publication early.

Based on our example we have:

Target(a) = Wa * 1000 = 142
Target(b) = Wb * 1000 = 142
Target(c) = Wc * 1000 = 285
Target(d) = Wd * 1000 = 428

In the next step we sorted ads with the largest underweight:

Underweight(ad) = Target(ad) - Publications(ad)

What gave us an order like this:

D, C, B, A

At last, the publication process would start.

Even if our algorithm was good in theory, in practice it didn’t work that well. The main issue was slotsGiven value. We were calculating this value as a ratio based on the number of pages with at least one free slot and the amount of unique links. To be sure that publication wouldn’t stop we had a minimum value of 1000 for every user. There was also an additional limit per user to provide a fair situation for everyone. However, with a small number of slotGiven and a large number of ad words we would often get a small amount of slots for publication.

The new algorithm

The real life proved that the theory is not always easy to be put into practice. Here’s what could happen and sometimes did happen. If ad word A reached day or total limit early in the process, the user could lose a lot of given slots (for example A had aim for 428 slots but stopped on 2 because of the day limit) it meant the the user was not happy and the system was not fair. Additionally, to keep the priorities we tried to publish the first ad word as many times as we could before publishing another and so on. With this, the user would sometimes spend all of their points on one ad and this one ad would be distributed across one site.

Our recent update of Manage Ads changed not only the user interface but also the algorithm itself. Now it’s much simpler, more natural, there are no priorities anymore and new natural limitations came along.

Here’s how the new algorithm works:

userAdsBucket = (A, B, C, D)

while userAdsBucket not empty:
    {ad} = get next ad from {bucket}
    find slot for {ad}

    if no more slots or limit reached:
        remove {ad} from {bucket}
        continue algorithm
    end if

    ... do some other stuff
end while

With this algorithm your ads are published in the following order:

A, B, C, D, A, B, C, D, etc.

With this your Volts are spent equally on every ad, ads are published on different websites and different ads (with the same link) might be published on the same website.

Robert.

8
Mar

We’re working on a spider

Hello all,

Right now we’re working on a new crawling mechanism. It’ll help solve a lot of problems that would otherwise arise as the system grows bigger and bigger. We want to keep things simple and the script is no different. We’re not quite happy with how complex it is now but our current plan is to simplify.

Gathering information about websites comes first. Now, the script itself gathers information about pages that have been added to the system. The by-product of it is that lots of redundant info is stored in our text base. Not for long, though. This means your websites will work faster and we’ll be able to fine tune the processes. What’s more, in the future our software will use information from robots.txt to optimize its work. Also, we’ll use robots to change the information about the number of parallel requests, pauses between requests etc.

A more stable distribution process will be another benefit. At the moment, we often deal with a typical race condition and parallel processes may overwrite the information that we’re trying to save. Why does it happen? Some providers turn off locking functions, and there’s no simple method to avoid multiple request performing the same actions like writing to the same file. On the other hand, where locking works fine you’ll see increase in speed as we’ll avoid bottlenecks.

The crawler will also check if your links are displayed correctly. Sites with numerous errors will then be moderated.

Our crawler will be based on Scrapy framework. Scrapy crawler is written in Python and it’ll take little time for us to get into grips with the new language. I’m sure it’ll be fun. Eventually, we want the crawler to become a separate mechanism that might work on any server and will be distributing crawlers through many different servers in cloud systems.

Robert.

8
Mar

So where do I press… oh, here! Boom…

Welcome to VoltRank blog!

It’s a new place where we’re going to tell you what’s up with us.

How is this place different from the already existing VoltRank channels like Facebook and Twitter? We won’t let our marketing guys come near it! We hand it over to our development team which means that it’s going to be technical.

We’ll let you know what are we working on in our labs, share some useful tips and let you peek behind the curtain. There’s a lot more going on than you can see on the surface. We know that a lot of you are more than just webmasters. Some may be even bigger geeks than we are. If so, this is the place for you!