oh, the dangers of over-optimizing a small bit of code

I use the (excellent) Python library dogpile.cache quite extensively. One web project can make hundreds of calls to the cache per page, so it's often one of the first places I look to maximize performance.

I have one bit of code that uses the key_mangler functionality to manipulate the keys used for a Redis data store.

An overly simplified example of using this generates key_mangler functions that prefix keys, and looks something like this:

LIB_1 = {}
for i in 'abcdefg':
    def func(_key):
        tmpl = "%s|%%s" % i
        return tmpl % _key
    LIB_1[i] = func

If we iterate over them, you can see the output:

for i, func in LIB_1.items():
    print(i, func("Hello."))

a a|Hello.
c c|Hello.
b b|Hello.
e e|Hello.
d d|Hello.
g g|Hello.
f f|Hello.

While speeding this up a bit, I lazily moved the tmpl variable out of the block and into this general form:

LIB_2 = {}
for i in 'abcdefg':
    tmpl = "%s|%%s" % i
    def func(_key):
        return tmpl % _key
    LIB_2[i] = func

The result was quite the speedup: twice the speed of the original version. But there was a huge caveat: all my unit tests broke. Digging into things, the following is the output of iterating over the same info:

for i, func in LIB_2.items():
    print(i, func("Hello."))

a g|Hello.
c g|Hello.
b g|Hello.
e g|Hello.
d g|Hello.
g g|Hello.
f g|Hello.

I was lazy when I rewrote the function, and did not expect Python to scope tmpl like that. It makes perfect sense: the closure looks up tmpl when it is called, not when it is defined, so every function ends up seeing the last value the loop assigned. I know Python behaves like this, and I've taken this exact implementation detail into account before; I just didn't expect it at the time.

Since I knew how/why this happened, the fix was simple: hinting the def with the scope I want by binding tmpl as a default argument:

LIB_3 = {}
for i in 'abcdefg':
    tmpl = "%s|%%s" % i
    def func(_key, tmpl=tmpl):
        return tmpl % _key
    LIB_3[i] = func

This third variant runs at about the same speed as the second variant, which is around twice the speed of the original method, where tmpl is assigned on each invocation.

for i, func in LIB_3.items():
    print(i, func("Hello."))

a a|Hello.
c c|Hello.
b b|Hello.
e e|Hello.
d d|Hello.
g g|Hello.
f f|Hello.
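If you want to check the speed claim yourself, a quick timeit comparison is enough. This is just a sketch I'd use as a rough sanity check, not a rigorous benchmark:

import timeit

# compare one generated function from the original and the fixed variants;
# LIB_1 and LIB_3 are assumed to be built as above
f1 = LIB_1['a']
f3 = LIB_3['a']

print(timeit.timeit(lambda: f1("Hello."), number=1000000))
print(timeit.timeit(lambda: f3("Hello."), number=1000000))

The variant that rebuilds tmpl on every call should come out measurably slower than the one that only does the string interpolation.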
2018 US Olympic Medals, Adjusted 2

Last updated: 2018-02-23. The score is 15 to 6

Americans have become increasingly more racist and intolerant of their fellow citizens in the past few years. Rhetoric and policies by Donald Trump, Mike Pence and various leaders of the Republican Party have sought to treat people who are not Straight, White and Christian as second class citizens.

This series will be updated throughout the Olympics to track separate US Olympic Medal counts for Straight White Christian America and citizens who are routinely persecuted for their ethnicity, religion or sexuality.

Medal Counts may be adjusted as more background information on athletes becomes available.

Team Straight White Christian America

Total: 15

Gold

| event | category | athlete |
| ----- | -------- | ------- |
| Alpine Skiing: Giant Slalom | Women's | Mikaela Shiffrin |
| Cross Country: Team Sprint | Women's | Kikkan Randall, Jessie Diggins |
| Freestyle Skiing: Halfpipe | Men's | David Wise |
| Hockey: Tournament | Women's | US Team |
| Snowboarding: Slopestyle | Women's | Jamie Anderson |
| Snowboarding: Slopestyle | Men's | Redmond Gerard |
| Snowboarding: Halfpipe | Men's | Shaun White |

Silver

| event | category | athlete |
| ----- | -------- | ------- |
| Alpine Skiing: Combined | Women's | Mikaela Shiffrin |
| Freestyle Skiing: Slopestyle | Men's | Nick Goepper |
| Freestyle Skiing: Halfpipe | Men's | Alex Ferreira (potential $$) |
| Luge: Singles | Men's | Chris Mazdzer |
| Short Track: 1000m | Men's | John-Henry Krueger |
| Snowboarding: Big Air | Women's | Jamie Anderson |

Bronze

| event | category | athlete |
| ----- | -------- | ------- |
| Alpine Skiing: Downhill | Women's | Lindsey Vonn |
| Freestyle Skiing: Halfpipe | Women's | Brita Sigourney |

Team Routinely Persecuted America

Total: 6

Gold

| event | category | athlete |
| ----- | -------- | ------- |
| Snowboarding: Halfpipe | Women's | Chloe Kim $ |

Silver

| event | category | athlete |
| ----- | -------- | ------- |
| Bobsled | Women's | Lauren Gibbs, Elana Meyers Taylor |

Bronze

| event | category | athlete |
| ----- | -------- | ------- |
| Figure Skating | Team | Nathan Chen $, Alexa Scimeca Knierim, Chris Knierim, Mirai Nagasu $, Adam Rippon, Alex Shibutani, Maia Shibutani, Bradie Tennell |
| Figure Skating: Ice Dance | Team | Alex Shibutani, Maia Shibutani |
| Snowboarding: Halfpipe | Women's | Arielle Gold |
| Speed Skating: Team Pursuit | Women's | Heather Bergsma, Brittany Bowe, Mia Manganello, Carlijn Schoutens $$ |

Global Standings

2018.02.23

A unified Team America is in 4th place overall with 21 medals. Without Routinely Persecuted Americans, White Christian America would drop to 15 medals – a tied 5th place (with France), behind the Netherlands.

| Rank | Country | All Medals |
| ---- | ----------- | ---------- |
| 1 | Norway | 37 |
| 2 | Canada | 27 |
| 3 | Germany | 26 |
| 4 | USA | 21 |
| 5 | Netherlands | 18 |
| 6 | France | 15 |
| 7 | Russia | 14 |

The disqualifiers

Key:

| mark | disqualifier |
| ---- | --------------------- |
| | not white |
| $ | immigrant |
| $$ | child of immigrant(s) |
| | lgbtq |
| | not christian |

Notes (Researched)

* Adam Rippon: LGBTQ icon
* Alex Shibutani: Asian, parents of Chinese descent
* Arielle Gold: Jewish
* Chloe Kim: Asian, parents immigrated from South Korea
* Maia Shibutani: Asian, parents of Chinese descent
* Mirai Nagasu: Asian, parents are Japanese citizens
* Nathan Chen: Asian, parents immigrated from China
* Carlijn Schoutens: Born in NJ to Dutch parents, raised in the Netherlands, moved back to the USA to compete

Notes 2 (Needs more research)

* Alex Ferreira: Alex's father played soccer for the River Plate club in Buenos Aires, Argentina; however, his birthplace has not yet been determined.
a security vulnerability affecting most websites and open source projects, even facebook

While auditing some new changes to an Open Source project I often use, I wrote a test-suite to surface edge cases and discovered something I never noticed before: ASCII control characters made it safely across the entire stack as-is. They didn't raise any alarms or trigger escaping from Form controls, User Generated Content Sanitizers, or HTML Safe Markup rendering.

Form control systems generally only trigger an error with ASCII controls if the input validator is built with a Regular Expression that allows specific characters/ranges. Most just check for valid ASCII characters. ASCII control characters do a lot of useful things, like providing newlines and tabs. Most are irrelevant, but one is particularly problematic: the backspace character. Backspaces are problematic because Web Browsers, Text Editors and Developer IDEs typically consider most ASCII control characters to be non-printing or invisible and don't render them onscreen, but they are still part of the document. If you copy a document or document part to your clipboard, the invisible characters copy too. If you then paste them into an interpreter window, they will be interpreted/executed in real-time, allowing a malicious user to carefully construct an evil payload.

An Example

Consider this trivial example with Python. It looks like we're loading the yosemite library, right?

import yosemite

Wrong. Pasted into a Python interpreter, this will appear:

import os

How? There are ASCII backspace control characters after certain letters. When pasted into an interpreter, they immediately activate. Using the escape sequences, the actual string above looks like:

import y\bose\bm\bi\bt\be

If you want to try this at home, you can generate a payload with this line:

open('payload.txt', 'w').write("import y\bose\bm\bi\bt\be ")

Given a few lines of text and some creativity, it is not hard to construct hidden code that deletes entire drives or does other nefarious activities.

It's an Edge Case

This really only affects situations meeting both of the following two requirements:

* People retrieve code published via User Generated Content (or potentially main site articles)
* The code is copy/pasted directly into an interpreter or terminal (I've tested this against Python, Ruby, and Node.JS)

The target audience for this type of vulnerability is developers and technology organizations. The potential vectors for bad payloads are:

* posts / articles / status
* pastebins
* bug-tracking / ticketing software
* "how to reproduce" instructions on security vulnerability report forms (ironic, I know)

What is and isn't affected

I tested dozens of internet-scale websites; only two filtered out the backspace control character: StackOverflow and Twitter. WordPress.com filters out the backspace, but posts on the open source WordPress software do not.

I focused my search on the types of sites that people would copy/paste code from: coding blogs, engineering team notes, discussion software, bugtrackers, etc. Everything but the two sites mentioned above failed.

Bad payloads were also accepted in:

* Facebook
* Multiple Pastebin and fiddle services

Most Python/Ruby/PHP libraries allowed bad payloads through Form validation or UGC sanitization.
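For what it's worth, the filtering itself is trivial to bolt onto a sanitization pipeline. Here is a minimal sketch of the approach (my own illustration, not a patch any project shipped): strip the ASCII control characters other than tab, newline, and carriage return before storing or rendering user content.

import re

# ASCII control characters 0x00-0x1F and 0x7F, minus tab (0x09), LF (0x0A), CR (0x0D)
_CONTROL_CHARS = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

def strip_control_characters(text):
    """remove non-printing ASCII control characters (including backspace) from UGC"""
    return _CONTROL_CHARS.sub('', text)

print(strip_control_characters("import y\bose\bm\bi\bt\be"))
# prints: import yosemite -- the hidden payload becomes visible and harmless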

Did I file reports with anyone?

I previously filed security reports with a handful of projects and websites. I also offered patches to escape the backspace issue with some projects as well.

I decided to post this publicly because:

* Open Source projects overwhelmingly believed the issue should be addressed by another component.
* Commercial Websites believed this is a social engineering hack and dismissed it.
Protected: Improving the performance of a Python web application
Small gotcha under Python's PYTHONOPTIMIZE feature

A long, long, long time ago I started using interpreter optimizations to help organize code. If a block of code is gated by a constant (or an expression of constants) that evaluates to False, then the block (or line) is optimized away.

if (false){ # this is optimized away }

Perl was very generous, and would allow for constants, true/false, and operations between the two to work.

Thankfully, Python has some optimizations available via the PYTHONOPTIMIZE environment variable (which can also be adjusted via the -O and -OO flags). They control the __debug__ constant, which is True (by default) or False (when optimizations are enabled), and will omit documentation strings if the level is 2 (or -OO) or higher. Using these flags in a production environment allows for very verbose documentation and extra debugging logic in a development environment.
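A quick way to see what each level actually controls is a tiny script like this (my own sketch; run it with no flags, then with -O, then with -OO):

def helper():
    """helper docstring"""
    return True

print(__debug__)       # True by default; False under -O / -OO (PYTHONOPTIMIZE >= 1)
print(helper.__doc__)  # the docstring by default; None under -OO (PYTHONOPTIMIZE >= 2)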

Unfortunately, I implemented this incorrectly in some places. To fine-tune some development debugging routines I had some lines like this:

if __debug__ and DEBUG_CACHE:
    pass

Python's interpreter doesn't optimize that away if DEBUG_CACHE is equal to False, or even if it IS False (I tested under 2.7 and 3.5). I should have realized this (or at least tested for it). I didn't notice this until I profiled my app and saw a bunch of additional statistics logging that should not have been compiled.

The correct way to write the above is:

if __debug__:
    if DEBUG_CACHE:
        pass

Here is a quick test

Try running it with optimizations turned on:

export PYTHONOPTIMIZE=2
python test.py

and with them off

export PYTHONOPTIMIZE=
python test.py

As far as the interpreter is concerned during the optimization phase, `__debug__ and False` could still be True: the compound expression is not folded away, so the block survives.

import dis

def foo():
    """docstring"""
    if __debug__:
        print("debug message")
    return True

def bar():
    """docstring"""
    if __debug__ and False:
        print("debug message")
    return True

# ============
# this will be a lean function
dis.dis(foo)

print "- - - - - - - -"

# this will show extended logic
dis.dis(bar)

The default version of foo should look like this:

21           0 LOAD_CONST               1 ('debug message')
             3 PRINT_ITEM
             4 PRINT_NEWLINE
22           5 LOAD_GLOBAL              0 (True)
             8 RETURN_VALUE

But you should see that the optimized version of foo creates some very lean code:

22           0 LOAD_GLOBAL              0 (True)
             3 RETURN_VALUE

Now let's look at the bar function when unoptimized:

26           0 LOAD_GLOBAL              0 (__debug__)
             3 POP_JUMP_IF_FALSE       20
             6 LOAD_GLOBAL              1 (False)
             9 POP_JUMP_IF_FALSE       20
27          12 LOAD_CONST               1 ('debug message')
            15 PRINT_ITEM
            16 PRINT_NEWLINE
            17 JUMP_FORWARD             0 (to 20)
28          20 LOAD_GLOBAL              2 (True)
            23 RETURN_VALUE

The optimized version of bar generates exactly the same code:

26           0 LOAD_GLOBAL              0 (__debug__)
             3 POP_JUMP_IF_FALSE       20
             6 LOAD_GLOBAL              1 (False)
             9 POP_JUMP_IF_FALSE       20
27          12 LOAD_CONST               1 ('debug message')
            15 PRINT_ITEM
            16 PRINT_NEWLINE
            17 JUMP_FORWARD             0 (to 20)
28          20 LOAD_GLOBAL              2 (True)
            23 RETURN_VALUE

So let's add a new function, biz:

def biz():
    """docstring"""
    if __debug__:
        if False:
            print("debug message")
    return True

The unoptimized code:

33           0 LOAD_GLOBAL              0 (False)
             3 POP_JUMP_IF_FALSE       14
34           6 LOAD_CONST               1 ('debug message')
             9 PRINT_ITEM
            10 PRINT_NEWLINE
            11 JUMP_FORWARD             0 (to 14)
35          14 LOAD_GLOBAL              1 (True)
            17 RETURN_VALUE

And a lot of that gets optimized away:

35           0 LOAD_GLOBAL              0 (True)
             3 RETURN_VALUE

Not many people use/abuse how the interpreter compiles code to their advantage, but if you do, pay attention to constructs like this.

Using __debug__ is a great way to hide logging code in a production environment. The evaluation of __debug__ only happens once, when Python first compiles the code, so there is very little overhead.

deleted dogpile forks

I previously maintained some forks of Mike Bayer's excellent dogpile.cache package that enabled some aggressive caching techniques, additional logging functions, and fixes to unit tests on Redis.

dogpile has since had some upstream changes, so I've decided to delete the forks. Most of the problems I've had have been solved, and compatibility will be better maintained via plugins.

SocialMediaStandards.org

A few months ago I updated SocialMediaStandards.org with details on standard image sizes and best-practices for Social Media Distribution.

If you think I've missed anything, please let me know!

Visit SocialMediaStandards.org.

Optimized Archiving of Historical Time-Series Data for Analytics

The (forthcoming) Aptise platform features a web-indexer that tabulates a lot of statistical data each day for domains across the global internet.

While we don't currently need this data, we may need it in the future. Keeping it in the live database is not really a good idea: it's never used and just becomes a growing stack of general crap that automatically gets backed up and archived every day as part of standard operations. It's best to keep production lean and move this unnecessary data out of sight and out of mind.

An example of this type of data is a daily calculation on the relevance of each given topic to each domain that is tracked. We're primarily concerned with conveying the current relevance, but in the future we will address historical trends.

Let's look at the basic storage requirements of this as a PostgreSQL table:

| Column    | Format  | Size    |
| --------- | ------- | ------- |
| date      | Date    | 8 bytes |
| domain_id | Integer | 4 bytes |
| topic_id  | Integer | 4 bytes |
| count     | Integer | 4 bytes |

Each row is taking 20 bytes, 8 of which are due to date alone.

Storing this in PostgreSQL across thousands of domains and tags every day takes a significant chunk of storage space, and this is only one of many similar tables that primarily hold archival data.

An easy way to save some space for archiving purposes is to segment the data by date, and move the storage of date information out of each row and into the organizational structure.

If we were to keep everything in Postgres, we would create an explicit table for the date. For example:

CREATE TABLE report_table_a__20160101 (
    domain_id INT NOT NULL,
    topic_id INT NOT NULL,
    count INT NOT NULL DEFAULT 0
);

| Column    | Format  | Size    |
| --------- | ------- | ------- |
| domain_id | Integer | 4 bytes |
| topic_id  | Integer | 4 bytes |
| count     | Integer | 4 bytes |

This structure conceptually stores the same data, but instead of repeating the date in every row, we record it only once within the table's name. This simple shift will lead to a nearly 40% reduction in size.

In our use-case, we don't want to keep this in PostgreSQL because the extra data complicates automated backups and storage. Even if we wanted this data live, having it within hundreds of tables is overkill. So for now, we're going to export the data from a single date into a new file.

SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01';

And we're going to save that into a comma-delimited file:

table_a-20160101.csv

I skipped a lot of steps above because I do this in Python, for reasons I'm about to explain.
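For context, the export itself is only a few lines of Python. Here is a rough sketch of what I mean, using psycopg2; the connection settings and filenames are placeholders, not our actual setup:

import csv
import psycopg2

conn = psycopg2.connect("dbname=analytics")  # placeholder connection string

def export_day(day):
    """dump one day of table_a into a date-stamped csv, omitting the date column"""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT domain_id, topic_id, count FROM table_a WHERE date = %s;",
            (day,),
        )
        rows = cur.fetchall()
    fname = "table_a-%s.csv" % day.replace('-', '')
    with open(fname, 'w') as fh:
        csv.writer(fh).writerows(rows)
    return fname

export_day('2016-01-01')  # writes table_a-20160101.csv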

As a raw csv file, my date-specific table is still pretty large at 7556804 bytes, so let's consider ways to compress it:

Using standard zip compression, we can drop that down to 2985257 bytes. That's OK, but not very good. If we use xz compression, it drops to a slightly better 2362719 bytes.
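(These comparisons are easy to script; here's roughly how I eyeball sizes from Python, using the standard zlib and lzma modules. The numbers won't match the zip/xz command line output byte-for-byte because of container overhead, but they're close enough for comparing approaches.)

import zlib
import lzma  # standard library in Python 3.3+

def compressed_sizes(path):
    """return (raw, zlib, xz) byte counts for a file"""
    data = open(path, 'rb').read()
    return len(data), len(zlib.compress(data, 9)), len(lzma.compress(data))

print(compressed_sizes('table_a-20160101.csv'))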

We've already shrunk the data by 40% by eliminating the date column, so these numbers are a pretty decent overall improvement. But considering the type of data we're storing, the compression is just not very good. We've got to do better, and we can.

We can do much better, and actually it's pretty easy (!). All we need to do is understand compression algorithms a bit.

Generally speaking, compression algorithms look for repeating patterns. When we pulled the data out of PostgreSQL, we just had random numbers.

We can help the compression algorithm do its job by giving it better input. One way is through sorting:

SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01' ORDER BY domain_id ASC, topic_id ASC, count ASC;

As a raw csv, this sorted date-specific table is still the same exact 7556804 bytes.

Look what happens when we try to compress it:

Using standard zip compression, we can drop that down to 1867502 bytes. That's pretty good: we're at about 25% the size of the raw file AND about 60% the size of the non-sorted zip. That is a huge difference! If we use xz compression, we drop down to 1280996 bytes. That's even better at 17%.

17% compression is honestly great, and remember this is compressing data that is already 40% smaller because we shifted the date column out. If we consider what the filesize with the date column would be, we're actually at 10% compression. Wonderful.

I'm pretty happy with these numbers, but we can still do better, and without much more work.

As I said above, compression software looks for patterns. Although the sorting helped, we're still a bit hindered because our data is in a row storage format. Consider this example:

1000,1000,1
1000,1000,2
1000,1000,3
1001,1000,1
1001,1000,2
1001,1000,3

There are lots of repeating patterns there, but not as many as if we represented the same information in a column storage format:

1000,1000,1000,1001,1001,1001
1000,1000,1000,1000,1000,1000
1,2,3,1,2,3

This is the same data, but as you can see it's much more machine friendly: there are larger repeating patterns.

This transformation from row to column is an example of transposing an array of data; performing it (and reversing it) is incredibly easy with Python's standard functions.

Let's see what happens when I use transpose_to_csv (below) on the data from my csv file:

def transpose_to_csv(listed_data):
    """given an array of `listed_data` will turn a row-store into a col-store (or vice versa)
    reverse with `transpose_from_csv`"""
    zipped = zip(*listed_data)
    list2 = [','.join([str(i) for i in zippedline]) for zippedline in zipped]
    list2 = '\n'.join(list2)
    return list2

def transpose_from_csv(string_data):
    """given a string of csvdata, will revert the output of `transpose_to_csv`"""
    destringed = string_data.split('\n')
    destringed = [line.split(',') for line in destringed]
    unzipped = zip(*destringed)
    return unzipped
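Here's a minimal round trip with the toy data from above, just to show the shape of things (note that values come back as strings on the way out):

rows = [
    (1000, 1000, 1), (1000, 1000, 2), (1000, 1000, 3),
    (1001, 1000, 1), (1001, 1000, 2), (1001, 1000, 3),
]

col_store = transpose_to_csv(rows)
print(col_store)
# 1000,1000,1000,1001,1001,1001
# 1000,1000,1000,1000,1000,1000
# 1,2,3,1,2,3

restored = transpose_from_csv(col_store)
print(list(restored))  # tuples of strings; cast back to int if needed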

As a raw csv, my transposed file is still the exact same size at 7556804 bytes.

However, if I zip the file it drops down to 1425585 bytes.

And if I use xz compression I'm now down to 804960 bytes.

This is a HUGE savings without much work.

The raw data in Postgres was probably about 12594673 bytes (estimated from the savings; the original data was deleted).
Stripping out the date information and storing it in the filename generated a 7556804 byte csv file, 60% of the original size.
Without thinking about compression, just lazily zipping the file created a file of 2985257 bytes.
But when we thought about compression: we sorted the input, transposed that data into a column store, and applied xz compression; we ended up with a filesize of 804960 bytes, 10.7% of the csv size and 6.4% of the estimated size in PostgreSQL.

This considerably smaller amount of data can now be archived onto something like Amazon's Glacier and worried about at a later date.

This may seem like a trivial amount of data to worry about, but keep in mind that this is a DAILY snapshot, and one of several tables. At 12MB a day in PostgreSQL, one year of data takes over 4GB of space on a system that gets high-priority data backups. This strategy turns that year of snapshots into under 300MB of information that can be warehoused on 3rd party systems. This saves us a lot of time and even more money! In our situation, this strategy is applied to multiple tables. Most importantly, the benefits cascade across our entire system as we free up space and resources.

These numbers could be improved upon further by finding an optimal sort order, or even using custom compression algorithms (such as storing the deltas between columns, then compressing). This was written to illustrate a quick way to easily optimize archived data storage.
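As a sketch of that last idea (not something we run in production): once a column is sorted, storing the first value plus the differences turns long runs of similar numbers into long runs of small numbers and zeros, which compress extremely well.

def deltas(sorted_column):
    """store the first value, then each difference from the previous value"""
    out = [sorted_column[0]]
    for prev, curr in zip(sorted_column, sorted_column[1:]):
        out.append(curr - prev)
    return out

def undeltas(encoded):
    """reverse of deltas(); rebuild the column by accumulating differences"""
    out = [encoded[0]]
    for d in encoded[1:]:
        out.append(out[-1] + d)
    return out

print(deltas([1000, 1000, 1000, 1001, 1001, 1001]))
# [1000, 0, 0, 1, 0, 0]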

The results:

| Format   | Compression | Sorted? | Size     | % csv+date |
| -------- | ----------- | ------- | -------- | ---------- |
| csv+date |             |         | 12594673 | 100        |
| row csv  |             |         | 7556804  | 60         |
| row csv  | zip         |         | 2985257  | 23.7       |
| row csv  | xz          |         | 2362719  | 18.8       |
| row csv  |             | Yes     | 7556804  | 60         |
| row csv  | zip         | Yes     | 1867502  | 14.8       |
| row csv  | xz          | Yes     | 1280996  | 10.2       |
| col csv  |             | Yes     | 7556804  | 60         |
| col csv  | zip         | Yes     | 1425585  | 11.3       |
| col csv  | xz          | Yes     | 804960   | 6.4        |

Note: I purposefully wrote out the ASC on the SQL sorting above, because the sort order (and column order) does actually factor into the compression ratio. On my dataset, this particular column and sort order worked the best, but that could change based on the underlying data.

The Dangers of URL Shorteners

In preparation for the next release, I just did an audit of all the errors that Aptise/Cliquedin encountered while indexing content.

Out of a few million URLs, there were 35k well-formed urls that couldn't be parsed due to critical errors.

Most of these 35k errors are due to URL shorteners. A small percentage of them are from shorteners that have dead/broken links. A much larger percentage of them are from shorteners that just do not seem to perform well at all.

I hate saying bad things about any company, but speaking as a former publisher I feel the need to talk bluntly about this type of stuff. After pulling out some initial findings, I ran more tests on the bad urls from multiple unrelated IP addresses to make sure I wasn't being singled out for suspicious activity. Unfortunately, the behavior was consistent.

The worst offenders are:

* wp.me from WordPress
* ctx.ly from Adobe Social Marketing
* trib.al from SocialFlow when used with custom domains from Bitly

The SocialFlow+Bitly system was overrepresented because of the sheer number of clients and urls they handle, and I understand that may skew things, but they have some interesting architecture decisions that seem to be reliably bad for older content. While I would strongly recommend that people NOT use any of these url shortening services, I more strongly recommend that people do not use SocialFlow's system with a Custom Domain through Bitly. I really hate saying this, but the performance is beyond terrible; it's totally unacceptable.

The basic issue across the board is this: these systems perform very well in redirecting people to newer content (they most likely have the mapping of newer unmasked urls in a cache), but they all start to stall when dealing with older content that probably needs a database query. I'm not talking about waiting a second or two for a redirect to happen; I'm consistently experiencing wait times from these systems that are reliably over 20 seconds long and often longer. When using a desktop/mobile web browser, the requests reliably time out and the browser just gives up.

While wp.me and ctx.ly have a lag as they look up a url to redirect users to a masked url, the SocialFlow + Bitly combo has a design element that works differently (you can trace a chain like this yourself with the sketch after the list):

1. Request a custom shortened url `http://m.example.com/12345`
2. The shortened url is actually served by Bitly, and you will wait x seconds for lookup + redirect
3. Redirected to a different short url on trib.al (SocialFlow) `http://trib.al/abcde`
4. Wait x seconds for the second lookup + redirect
5. Redirected to the end url `http://example.com/VERY+LONG+URL`
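If you want to watch a chain like this yourself, a few lines with the requests library will show every hop and the total wall time; the short url here is just a placeholder:

import time
import requests

url = "http://m.example.com/12345"  # placeholder short url

start = time.time()
r = requests.get(url, allow_redirects=True, timeout=60)
elapsed = time.time() - start

for hop in r.history:
    print(hop.status_code, hop.url)
print(r.status_code, r.url)
print("total seconds: %0.2f" % elapsed)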

All the shorteners could do a better job with speeding up access to older and less popular content. But if I were SocialFlow, I would remove Bitly from the equation and just offer custom domains myself. Their service isn't cheap, and from the viewpoint of a publisher, wait times like this are totally unacceptable.

WARNING: Dreamhost's One-Click Install for Trac still has a security flaw

I first noticed this two years ago (http://www.destructuring.net/2012/03/23/dreamhost-ux-security-flaw/) and contacted Dreamhost. It has yet to be fixed.

If you create a Private subversion repository on Dreamhost ( username + password are required to view ) and then add a One Click Install of Trac to that private repository ( which is marked as Private in their installer ), the Trac instance does not have any security permissions. The entirety of your source code is readable through the Trac browser.

Here's an illustration:

• Private SVN Repository http://svn.2xlp.com/ExampleTracSvn/svn
• Default Trac Install http://svn.2xlp.com/ExampleTracSvn/trac/browser/README.txt

While many people may want to have a Publicly readable repo for ticketing, I think it's safe to say that most people who use a One Click Install are not familiar enough with the intricacies of Trac to know about its permissions system.

If you're affected, the easiest fix you can implement is to add a .htaccess file to your trac directory.

A better fix is to get off Dreamhost's One-Click Install entirely. The Trac One-Click Install is a half-assed and terrible approach.

Dreamhost did something smart with their Subversion install. Your home directory has a `svn` subdirectory which contains some specific files for each subversion repo:

* RepoName/ ( the actual repo )
* RepoName.access (a .htaccess file for your repo )
* RepoName.passwd ( a htpasswd file for the repo's access file )

It's a very smart and elegant solution. The Trac install, however, is anything but.

1. Dreamhost installs one version of the Trac library in your home directory for each trac instance. If you have 5 tracs, you have 5 directories like `~/webroot_svn_2xlp_com_ExampleTracSvn_trac_trac`

2. Dreamhost installs the entire Trac environment and database in a web directory. The configuration files, database, everything, are available in an Apache served directory controlled only by .htaccess files.

A better way to manage Trac is to create a `~/trac` subdirectory of your home folder and centralize all your trac projects there. You can then use .htaccess files and symlinks to expose the relevant directories to the internet at large.

This will guard you against situations where an erroneous .htaccess file renders your contents as raw data (it happens).

If you have more than one Trac installation, you would probably benefit from installing Trac as a Python library and having multiple projects reference it.