Quinthar: July 2008 (a David Barrett blog)

Another YouTube lawsuit, more of the same

There are always two responses to this sort of thing.

One class says "Geez, <prosecution> are idiots for not recognizing the potential for new revenue and partnering with YouTube!"

The other class says "Yep, YouTube is a criminal racket hiding behind a thin veneer of flimsy, untested law -- it's amazing they've gotten away with it for so long."

Granted, both could be right (they're not strictly contradictory). But I tend to align more with the latter camp.

This doesn't mean I think YouTube is morally abject. Rather, I think the law is stupid. (Both the law they're guilty of breaking, and the law they use as a defense.) But the law is the law, and it's frustrating to see YouTube profit from such blatant criminal activity** while so many others -- most of whom were far more creative in either trying to comply with or circumvent the law -- were ground into dust.

- david barrett

** Yes, I realize the jury's out on what fraction of today's traffic is copyright infringing. But there's little debate that YouTube's founding principle was massive copyright infringement, and that only through a stroke of luck and the grace of time has it managed to attract a sufficiently non-criminal userbase to maintain plausible deniability.

How does transactional memory improve disk commits?

Curtis, my top source of scoops, pointed me to this breathless review of Sun's new transactional memory hardware, making bold predictions that Sun will -- and I quote: "horsefuck the database world".

But unless I'm really missing something, how does this help? A database's bottleneck is not thread synchronization, it's atomic writing to disk. Sure, transaction management of in-memory data is costly. But not nearly as costly as writing updates to disk in an infallible manner.

Maybe there are specific database scenarios that really benefit from this. But a read-heavy scenario doesn't have a lot of transactional overhead, and a write-heavy scenario will be bound by disk write performance.

So again, where does this help? I must be totally missing something. Can you help steer me straight?

-david barrett

DVCS, YAVCS or something more?

Curtis pointed me to this interesting article questioning the hype surrounding distributed version control systems (DVCS -- ya, I hadn't heard the acronym before either).

I generally agree with him, but some of the comments were especially illuminating. In particular, I like the comment that it turns the relationship around: rather than a master "pushing" changes out to the masses, everybody has the option of "pulling" from whoever they like. I can see how this is particularly valuable in an open source environment without strong leadership, though less so in a commercial environment or an open-source community with a strong "mainline" (eg, Firefox).

The only feature that jumps out to me is offline commit, merely because I spend a lot of time offline (or did, until I got a laptop with built-in Sprint wireless broadband -- kick ass!). And the proposed "waypoint" addition to SVN would get me 80% there. (The other 20% would be sharing waypoints with other users, but in my experience that's an incredibly rare operation.)

But regarding the claims that DVCS is somehow fundamentally easier to use or less prone to weird boundary conditions (eg, "Git rocks because I hate it when I delete a modified uncommitted file using the OS after renaming the parent directory a prime number of times in a row"), I'm skeptical. Every VCS has dark corners where none dare tread.

Similarly, the claims that "it's hard to set up an SVN repository" are pretty weak: cvsdude? rsync.net? Or even SourceForge? Likewise, setting up a local repository is kinda missing the point in a world of cloud computing: whether you like it or not, your laptop *will* break and your hard drive *will* fail. Anything worth version controlling is worth backing up remotely, and SVN is as good a tool as any for that.

And finally, maybe I'm just really missing something, but I just don't get the value of extreme branch/merge. I don't feel its complexity is what's preventing me from doing it -- I just never feel the need. Indeed, it seems to go against the "continuous integration" ethos of agile development: I'd much rather have a bunch of somewhat-broken code prematurely integrated than somewhat-broken code integrated too late at the very end.

Granted this is probably due to my history of working with small teams on tight codebases (where everybody constantly affects everyone else): I can imagine how branching/merging becomes more valuable as team size and developer-decoupling increases. And granted, I only very rarely submit or accept patches from contributors without direct commit access.

But all told, I agree with the B-List that the whole thing seems over-hyped. It's a tool. There are lots of others. Pick whichever one you like and get to work.

-david barrett

Piracy At Work in Streams and Downloads

One critique of my previous post was that it doesn't account for streaming, such as from YouTube and MySpace. Ok, here goes:

First, to clarify, YouTube music is generally unlicensed and posted illegally, so it should probably go in the pirate column. (The DMCA gives protection to YouTube so long as it removes pirated material, but it doesn't change the fact that it's pirated in the first place.)

MySpace is a great example of legitimate streaming, though.

So even limiting the discussion to just streaming, I wonder about the ratio of pirated tracks streamed from DMCA-protected providers, versus legit tracks from MySpace/webcasting/etc. Do you have any data on this?

PC World suggests over a *billion* videos per day on YouTube. Let's say, what, half of those include unlicensed songs that aren't protected by fair use? 25%? Even at 10%, that means 100M pirated songs per day streamed from YouTube.

Now, how many legit songs are streamed from MySpace? Somewhere I saw 4.5B daily page views. How many of those have songs on them? I really have no idea -- I'm not a MySpace user. I'm going to guess 25%. Furthermore, my understanding is every page automatically starts playing music, so that would suggest like 1.1B legitimate songs started every day from MySpace -- dwarfing YouTube. Then again, how many of those pages actually play the whole song, rather than playing a few seconds before the next click? Again, no real idea, but I'd wager there's a lot of clicking going on, to the tune of 1 in 10 songs started actually being listened to for real. That would put us down to about 110M legitimate streams from MySpace a day, or roughly on par with YouTube's piracy. (In YouTube's defense, I wager most videos are actually played to completion.)

Oh, and to put it into context, my previous estimates put the number of songs bought from iTunes at 5.8M per day, suggesting about 116M downloaded per day from pirate networks. So the number of songs *streamed* per day is in the same ballpark as number of songs *downloaded*, despite the downloaded songs presumably being played again and again.

Anyway, this is all guesswork -- I'd love real data. But by my guesses I'd say piracy accounts for 95% of music downloads and maybe 25-50% of streams. Furthermore, the download market (which piracy unquestionably dominates) results in far more actual songs listened to.
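The back-of-envelope numbers above can be reproduced in a few lines of Python. This is only a sketch of the guesses already stated (10% of YouTube plays pirated, 25% of MySpace page views starting a song, 1 in 10 of those actually listened to); the variable names are made up for illustration and none of the inputs are measured data.

```python
# Daily estimates taken from the text. Every input is a guess, not measured data.
youtube_videos = 1_000_000_000      # PC World's "over a billion" per day
pirated_share = 0.10                # the most conservative guess in the text
youtube_pirated = youtube_videos * pirated_share

myspace_page_views = 4_500_000_000
pages_with_songs = 0.25             # guess: a quarter of pages start a song
listened_share = 0.10               # guess: 1 in 10 started songs is really heard
myspace_started = myspace_page_views * pages_with_songs
myspace_legit = myspace_started * listened_share

itunes_sold = 5_800_000             # earlier estimate of iTunes sales per day
pirate_downloads = 116_000_000      # earlier estimate of pirate downloads per day

# Piracy's share of each channel under these guesses.
stream_piracy = youtube_pirated / (youtube_pirated + myspace_legit)
download_piracy = pirate_downloads / (pirate_downloads + itunes_sold)

print(f"pirated YouTube streams/day: {youtube_pirated:,.0f}")
print(f"legit MySpace streams/day:   {myspace_legit:,.0f}")
print(f"piracy share of streams:     {stream_piracy:.0%}")
print(f"piracy share of downloads:   {download_piracy:.0%}")
```

The unrounded MySpace figures come out slightly above the 1.1B and 110M quoted in the text, and the 10% YouTube guess puts piracy near the top of the 25-50% streaming range. Raising that guess pushes it higher.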

If we said that Toyota accounted for 95% of cars and 25-50% of trucks (with cars outnumbering trucks 10:1), we'd probably say they dominated the auto industry. And thus I still think it's pretty safe to say piracy dominates all online music, streaming and downloaded alike.

-david

Sinking in a Sea of Piracy

This was my response to a post about the oddity of Yahoo's latest of a string of failures in music players and services, comparing it to the broader ecosystem of the music industry:

"I realize it's beating a dead horse, but you left out the largest, most vibrant part of the "ecosystem", the only part worth discussing: piracy.

I still don't understand why -- frankly -- anybody cares what Yahoo and Napster and even iTunes do. They're a largely irrelevant footnote, a strange march of impossibly flawed products targeting tiny niches of occasionally-curious users and manic well-wishers. That's like discussing operating systems exclusively in terms of Minix and OS/360, while occasionally remarking "oh, ya, and there's this thing called Windows, but whatever." **

And in this broader context, it doesn't seem strange at all that Yahoo's latest initiative failed. Just like every other of its kind. The world has very, very clearly said it doesn't want those, at every possible opportunity. History books will remark "At the turn of the 21st century there were a series of attempts to commercialize digital content by enforcing an antiquated notion called 'copyright' -- the right to control who could make copies -- all of which obviously failed."

If there's anything we "need to accept" it's that piracy is the biggest game in town, has been from the start, and will probably be forever.

Piracy *is* online music, and the only lasting success anybody is going to see in this industry is by augmenting or avoiding it. But not by competing head on.

-david

** OS/360 is the iTunes of the bunch: very relevant and commercially successful to a small demographic. Indeed its success is precisely due to a recognition of not being for everyone."

Second Collapse of the Internet Economy Underway?

The first internet bubble popped largely because all business models failed except for ad selling. Is it possible that the last stalwart hope is itself doomed?

TechCrunch reports that Lookery, a company specializing in selling ad inventory on social networks, is barely breaking even despite selling 3 *billion* ads per month. And rather than raising prices to become profitable, they're actually in the uncomfortable position of lowering prices 40% -- from 12.5 cents to 7.5 cents CPM. It reminds me of the (often unintentional) joke "We lose money on every transaction, but we make it up in volume!"

All this has made them so gloomy about the prospects of their core business that they're thinking of switching horses mid-stream and resurrecting that Web 1.0 favorite: selling demographic data. I mean, it worked so well the first time, why won't it work now?

Ok, so you might be saying "Sure, social network ads are crap, but Google's ads are solid, right? After all, they're set by the open market!" I thought that too, until recently I learned that rather than that market being open, it in fact is restricted by a series of minimum bids.

Don't believe me? Search for "Flash" and you'll see it has zero ads. In a totally free market, that means you have no competition, and thus should be able to bid as low as you want to get your ad to appear. But when you try to create an AdWord for the "Flash" keyword, you'll see it sets the minimum price at $0.10. So even if the market (me) only wants to pay $0.01, it's priced 10x higher than the market (I) will bear. Which is why there are no ads on the "Flash" keyword.

Said another way, there is no competitive pricing for the "Flash" keyword. Rather, the price is arbitrarily set by Google.

Now you might say "Well, Google owns the ad inventory, they can sell it for whatever they want; it's still a free market, even if they choose not to sell it for cheap." But wait -- didn't we just say prices are set by auction? Hm...

All this means that the auction only sets prices above a minimum. Which brings us to the $149.86B question: how many of Google's ads have prices set by auction, and how many are just coasting by on the minimum?
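To make the "auction only sets prices above a minimum" point concrete, here is a minimal sketch assuming a simplified one-slot, second-price auction with a floor. This is not Google's actual AdWords mechanism (which isn't described here); `clearing_price` is a hypothetical helper used only to show how a minimum bid can leave a keyword with no ads at all.

```python
from typing import List, Optional


def clearing_price(bids: List[float], minimum_bid: float) -> Optional[float]:
    """Price the winner pays in a simplified one-slot auction with a floor.

    Assumption: the winner pays the runner-up's bid, but never less than
    the minimum. Returns None when no bid reaches the minimum, meaning
    no ad is shown at all.
    """
    eligible = sorted((b for b in bids if b >= minimum_bid), reverse=True)
    if not eligible:
        return None
    runner_up = eligible[1] if len(eligible) > 1 else minimum_bid
    return max(runner_up, minimum_bid)


# A lone bidder willing to pay $0.01 is priced out by a $0.10 floor.
print(clearing_price([0.01], minimum_bid=0.10))        # None
# A lone bidder willing to pay more simply pays the floor.
print(clearing_price([0.50], minimum_bid=0.10))        # 0.1
# Only with real competition does the auction set the price.
print(clearing_price([0.50, 0.30], minimum_bid=0.10))  # 0.3
```

In the first two cases the floor decides the outcome, not the market. The open question above is how much of the inventory falls into those cases.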

Hopefully, most of Google's ads are competitively priced via the auction. This would suggest that they're priced "correctly" and that we're in for no major shocks to ad revenue (and, due to Google's market share, worldwide ad revenue).

But let's say that some high fraction -- 50%? 70%? -- of Google's ads are in fact not competitively priced, but are just set arbitrarily by Google, such as Flash's $0.10 minimum. In this scenario, Google's revenue is no more protected from price declines than Lookery and its 40% "going out of business" sale. (After all, anybody slashing prices while losing money can't have a long future.)

In this scenario -- which might be reality, depending on the data -- Google's pricing is not sustained by competition, but by a near monopolistic control of ad inventory. And in that maybe-reality scenario, the global ad market is over-priced, meaning Google and all other ad-supported online businesses are overpriced, meaning we're in for another massive internet economy collapse if Google ever loses its monopoly and is forced to truly compete on price.

Sound crazy? It's not nearly as crazy as what's already happening in reality: ad arbitrage. It works like this:

You buy a really cheap adword from Google in order to direct a lot of traffic to some site. And then you fill that site with ads with high CPM and CPC (perhaps from other ad networks, it doesn't matter). The result is you buy a click for $0.10, and then turn around and sell it for $1.00. How is that possible? Why isn't everybody doing it?

Everybody was doing it -- in huge quantities -- until Google killed it. How? By raising minimum bids.

That's right, by fiat Google leveraged its near monopoly power and raised prices to stop buyers it didn't like (spammers) from taking advantage of the mysterious imbalance between the price of an AdWord and the price of a click on another network. Specifically, it means that Google clicks should be available to buy for *cheap*, were it not for Google artificially raising prices.

It also means that advertisers on other networks are unwittingly paying *way too much* for what should actually be a really cheap click. In an efficient market, the advertiser should just be buying that original, really-cheap AdWord, rather than the inflated price of an ad on the intermediate spam site.

And all this comes back to the possibility that Google's AdWord inventory is actually overpriced, and it's only sustainable so long as Google enjoys near-monopoly status. Once that status is gone, then all keywords -- even the ones Google chooses to price out of the market -- become competitively priced, at rates far lower than what Google is currently charging. Which means everybody that depends on AdWord revenue suddenly makes less. Meaning the internet economy collapses. Again.

Crazy ramblings? I wonder.

Update: A friend (the same one who told me about ad arbitrage) pointed out that one reason there might be no ads on Flash is because it's a trademark. That could be -- I didn't go all the way through to pay the minimum bid and see what happens. But that doesn't change the fact that there are minimum bids on *all* unused keywords, including such words as "blah" or "quinthar". So the general argument still stands.

Update: So I decided to run my own Flash ad. I gave it a $25/day budget and bid the $0.10 minimum. It immediately showed up, and appeared every time I refreshed for at least several hours. According to Google, it was shown 6,607 times. But here's the interesting thing (which, frankly, completely demolishes my whole theory): it was pulled, despite it only costing me $0.40 (ie, with tons of budget remaining). Why? Because clickthrough was too low -- 0.06%, to be precise. So now I'm changing my theory. The reason there are so few Flash ads isn't that Google has priced the keyword out of the market. Rather, it's because it's difficult to make an ad that achieves sufficient clickthrough on such a general term as Flash. Even if you're willing to pay the minimum, Google isn't willing to show it unless it performs. Whether or not that's a problem (and I'm not sure it is), it's entirely different than what I initially was guessing, and completely undermines my theory of Google using monopolistic pressure to sustain noncompetitive pricing. So... never mind.
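A quick sanity check on those numbers, assuming every click was billed at exactly the $0.10 minimum (the update doesn't say so explicitly, so treat that as an assumption):

```python
cost_per_click = 0.10   # assumed: each click billed at the minimum bid
total_cost = 0.40       # what the campaign cost in total
impressions = 6607      # times the ad was shown, per Google

# Back out the number of clicks, then the clickthrough rate.
clicks = round(total_cost / cost_per_click)
clickthrough = clicks / impressions

print(f"{clicks} clicks / {impressions} impressions = {clickthrough:.2%}")
# prints: 4 clicks / 6607 impressions = 0.06%
```

Four clicks out of 6,607 impressions matches the 0.06% clickthrough quoted above.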

Hard to cry for Lyle Lovett

I'm not normally one to defend the music industry, but I don't know if I agree with this Techdirt article lamenting how Lyle Lovett sold 4.6 million albums but has "never made a dime" from album sales. Specifically, I take issue with this line:

"Of course, the truth is that it's quite rare for any musician to make money from selling their albums, as has been pointed out for years"

While that might technically be true, it ignores that Lyle Lovett almost certainly was paid a healthy advance -- precisely due to the anticipation of future sales.

And lo and behold, the sales rolled in. It says he's sold 4.6M albums. That's over a long period where I'm sure prices have fluctuated, but let's say a $10 average album price, so $46M in revenue. Thus if he had a $10M advance, that means he's earned 21% of revenue from his albums since 1991.

Is that good? Bad? Frankly, that sounds pretty good to me. If he were the founder of a startup, that's like selling for $46M and keeping a 21% stake -- it'd be widely seen as a success. Granted, that'd be a 17 year startup but still. Hard to criticize his deal. I sorta agree with Kurt: even if he's not paid per album, he's still well paid, and especially given he didn't take on any of the risk.

Then again, if his advance were much smaller -- say, $1M, then he's only looking at 2.1% of the album revenue, which isn't very cool at all.
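The arithmetic is simple enough to check directly. This sketch uses the same guesses as above (a $10 average album price and two hypothetical advances); `advance_share` is just an illustrative helper, and nothing here is based on Lovett's actual contract.

```python
def advance_share(advance: float, albums_sold: int, avg_price: float) -> float:
    """Fraction of gross album revenue that the advance represents."""
    return advance / (albums_sold * avg_price)


ALBUMS_SOLD = 4_600_000
AVG_PRICE = 10.00          # assumed average album price, as in the text

# Both advances are hypothetical: the real figure isn't known.
for advance in (10_000_000, 1_000_000):
    share = advance_share(advance, ALBUMS_SOLD, AVG_PRICE)
    print(f"${advance:,} advance = {share:.1%} of album revenue")
```

This prints roughly 21.7% and 2.2%, in line with the 21% and 2.1% figures used in the text.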

Furthermore, that's only counting albums since 1991. He had three albums before that that aren't included in the above analysis...

So, it comes down to the numbers, but I don't think it's clear whether he's being screwed or not. Not to mention, he's still paying down his advance, and one day will probably start making money. Or maybe not. But that doesn't change the fact he was (probably) paid a crapload of money up front for a service that had yet to be performed. We should all be so lucky to be in such an unfortunate position.

ThePirateBay's Flawed Plan to Encrypt the Internet

Any idea what's up with this new plan from ThePirateBay to "encrypt the internet"? Is this trying to replace the TCP stack with an SSL stack -- every time you try to open a TCP connection it instead first tries to (essentially) open an SSL connection and, failing that, falls back onto regular TCP?

Amazingly, Wikipedia doesn't even have a page for it -- crazy!

This from the IPETEE website:

"As an example, when a new TCP connection is established with the 3-way handshake the crypto layer at the "initiator" will hold any pending application data and start the key negotiation process. When a session key has been established, the pending application data is encrypted and sent. All following communication on that TCP stream is then treated as encrypted traffic. Should the remote host not reply with a valid key-negotiation response, the TCP connection is closed, re-opened, and the application data passed through unencrypted."

Ug, so when this rolls out, every single TCP connection will start with a garbage request, fail, and then reconnect? This seems like it could lead to an *exceedingly* slow experience for anybody who installs this, as 99.999% of all TCP connections they attempt will fail, and probably not quickly but rather by timeout.

For example, the servers I write typically accept a TCP connection, and then keep receiving as much as possible until it gets a well-formed request. If it doesn't get a well formed request it'll eventually timeout. But -- and this is key -- a malformed request won't be instantly detected and trigger a disconnect. Indeed, this would be extremely difficult in the general case.

For example, assume your protocol used an HTTP-like structure with \r\n terminated lines, the first of which is just a big string, and each subsequent line is a "Header: value" pair, followed by a blank line (double \r\n). Super basic protocol, like this:

GET / HTTP/1.1
Host: quinthar.com

However, because the parser is general purpose, it doesn't know every possible valid "method" line. The parser takes anything that fits the pattern and passes it to a higher level for processing. This will therefore accept anything like:

POST / HTTP/1.1
Host: quinthar.com

Or even:

The quick brown fox jumped over the
Lazy: log

Normally that's fine -- after all, it's a rare situation for somebody to connect to you with a malformed protocol in the first place, and generally they're going to at least make a good effort so even if it's not totally perfect, leave it to a higher level to figure it out:

Get / Http/1.1
HOST: quinthar.com

But a user with this thing running won't make that good-faith effort, and the result will be a hung connection. Instead, the crypto layer is going to send some binary garbage like this:

sj459dofsfj2q9fdjqre

The parser will look at this and say "ok, it looks weird, but maybe it's good" and it'll wait until it either gets a well-formed message, or its input buffer is exceeded, or a timeout occurs. In the case of IPETEE, it'll probably always be the last one. Meaning every TCP connection opened to my server will first have this incredible timeout before it gives up and falls back to plain TCP.
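Here is a minimal sketch of the kind of lenient parser described above. It is not the actual code from my servers, and it makes no claim about how Apache or IIS are implemented; `parse` and `MAX_BUFFER` are hypothetical names. The point is that without a terminating blank line, the parser can't tell garbage from a slow client.

```python
MAX_BUFFER = 8192  # hypothetical input-buffer limit, in bytes


def parse(buffer: bytes):
    """Try to parse one HTTP-like message from the bytes received so far.

    Returns ("complete", method_line, headers) for a well-formed message,
    ("malformed",) if a complete message doesn't fit the pattern,
    ("overflow",) if the buffer limit is exceeded, and ("need_more",)
    otherwise. The caller keeps receiving on "need_more" until it times out.
    """
    end = buffer.find(b"\r\n\r\n")
    if end == -1:
        # No blank line yet: this could be garbage or just a slow client.
        return ("overflow",) if len(buffer) > MAX_BUFFER else ("need_more",)
    lines = buffer[:end].decode("latin-1").split("\r\n")
    method_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if not sep:
            return ("malformed",)
        headers[name.strip().lower()] = value.strip()
    # The method line is just "a big string", so it is passed up unchecked.
    return ("complete", method_line, headers)


print(parse(b"GET / HTTP/1.1\r\nHost: quinthar.com\r\n\r\n"))
print(parse(b"The quick brown fox jumped over the\r\nLazy: log\r\n\r\n"))
print(parse(b"sj459dofsfj2q9fdjqre"))  # ("need_more",): wait for the timeout
```

The garbage request never produces an error. It just leaves the server waiting for more data, which is why the client only finds out via timeout.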

Ok, so now you're thinking "so what David, your servers suck anyway." That may be, but I note that Apache and IIS behave the same way. This means that virtually every HTTP request you make is going to be incredibly delayed. At least, until everybody in the world adopts this.

And that's just HTTP. I'm guessing this design pattern of protocol parsing is exceedingly common. Which means basically all TCP connections of all kinds would be incredibly delayed.

And all this ignores, of course, that this encryption layer doesn't even attempt to prevent man-in-the-middle attacks, so it only provides protection against passive observation. But any ISP that is seriously curious about what's going on inside could simply intercept each end and insert a sniffer. Let's say they do this in 0.001% of all TCP connections: over time they'll figure out who the pirates are (because BitTorrent opens up so many damn TCP connections, they'll come up pretty quick).
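To put a rough number on "pretty quick": assuming each connection is intercepted independently with the same small probability, the chance of being caught at least once grows fast with the number of connections. The connection counts below are hypothetical, chosen only for illustration.

```python
def chance_of_intercept(sample_rate: float, connections: int) -> float:
    """Probability that at least one of `connections` gets intercepted,
    assuming each is sampled independently with probability `sample_rate`."""
    return 1.0 - (1.0 - sample_rate) ** connections


SAMPLE_RATE = 0.00001  # 0.001% of connections, as in the text

# Hypothetical connection counts for a light user up to a heavy BitTorrent user.
for connections in (1_000, 100_000, 1_000_000):
    p = chance_of_intercept(SAMPLE_RATE, connections)
    print(f"{connections:>9,} connections -> {p:.1%} chance of at least one intercept")
```

Under these assumptions, a client that opens hundreds of thousands of connections is more likely than not to be sampled at least once.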

The upshot is I'm not sure that this is really a good idea. Anybody who installs it will be horribly slowed. Anybody who runs a server will see a crapload of timeouts and dead requests. And any ISP that really cares about observing users can still easily do so.

Encryption is a really hard problem. The only thing worse than no encryption is encryption that doesn't work, as it leaves you worse off than when you started.

-david

FUD and Exaggeration from the WaPo on Minor Security Breach

A new article in the Washington Post entitled "Justice Breyer Is Among Victims in Data Breach Caused by File Sharing" talks about how some idiot accidentally shared 2000 social security numbers of a law firm's high-profile clients. The article irks me for a couple reasons.

First, the leak is really quite small and insignificant, but the article blows it up like it's a huge thing. Sharing 2000 social security numbers of rich dudes is bad. But it's nothing compared to thousands of hacked ATMs stealing card numbers *with PINs*, and sending them to a Russian hacker who has been draining bank accounts and has stolen at least $5 million *so far*, and hasn't yet been stopped. A little context please? (And the context provided in the article comparing it to a few other insignificant leaks isn't exactly helpful.)

But what bothers me even more is this completely false statement:

Robert Boback, chief executive of Tiversa, the company hired by Wagner to help contain the data breach, said such breaches are hardly rare. About 40 to 60 percent of all data leaks take place outside of a company's secured network, usually as a result of employees or contractors installing file-sharing software on company computers.

First, I don't even know what that means: how can it both be "outside a company's secured network" and "on company computers"? Or does "secured network" mean "the subset of the network that happens to not leak yet"? (Or does "network" mean "the office internet connection", without including the computers that connect to it?)

Regardless, it claims 40-60% of "all data leaks" are "usually as a result of ... file-sharing software". Where does that data come from? The only really exhaustive study I know on the subject was the Verizon one, and it came to a completely different conclusion:

Specifically, the words "p2p" and "file-sharing" and "limewire" don't appear anywhere in it. Furthermore, it says only 18% of leaks are due to insiders, and of those, only 3% were "inadvertent disclosure" (which I think would include accidentally sharing something on Limewire).

The upshot is the Verizon study suggests the exact opposite of this article: rather than accidental file sharing being a significant source of leakage, it accounts for at most maybe 0.54% of leaks (3% of the 18% due to insiders).
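For the record, that upper bound is just the product of the two figures quoted from the Verizon study. It assumes (generously to the article) that every "inadvertent disclosure" by an insider was a file-sharing accident.

```python
insider_share = 0.18            # share of leaks due to insiders
inadvertent_of_insider = 0.03   # share of insider leaks that were inadvertent

# Upper bound on leaks that could be accidental file sharing.
upper_bound = insider_share * inadvertent_of_insider
print(f"{upper_bound:.2%}")  # prints: 0.54%
```

Either way, it is two orders of magnitude below the 40-60% claimed in the article.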

So... what's up with the anti-P2P FUD?

Building a 1GB bootable qemu image using debootstrap

I've been a longtime qemu fan, and a new debootstrap groupie. The following script was built with the help of the debian-user list and combines these two affections -- with some fancy footwork -- to build a "from scratch" bootable qemu image. There are still some kinks to be worked out (specifically, I don't know where to get "real" stage1, stage2, and e2fs_stage1_5 files, so the script copies them from the host system), and right now it is fixed at 1GB (though that could be easily corrected), but otherwise it seems to work surprisingly well. Behold:

echo "Creating 1GB file of zeros in $1.raw"
dd if=/dev/zero of=$1.raw bs=1024 count=1048576

echo "Formatting $1.raw with ext2 filesystem"
/sbin/parted $1.raw mklabel msdos
/sbin/parted $1.raw mkpart primary ext2 0 954
/sbin/parted $1.raw mkpart extended 954 1069
/sbin/parted $1.raw mkpart logical linux-swap 954 1069
/sbin/parted $1.raw set 1 boot on
/sbin/parted $1.raw mkfs 1 ext2

echo "Mounting $1.raw on $1.mount"
mkdir -p $1.mount
sudo mount -o loop,offset=16384 -t ext2 $1.raw $1.mount

echo "Installing Etch into $1.mount"
sudo debootstrap --arch i386 etch $1.mount http://ftp.us.debian.org/debian

echo "Setting up host networking in $1.mount for apt"
sudo cp /etc/resolv.conf $1.mount/etc
sudo cp /etc/hosts $1.mount/etc

echo "Installing kernel into $1.mount"
sudo chroot $1.mount apt-get update
sudo chroot $1.mount apt-get -y install gnupg
sudo chroot $1.mount apt-get update
echo "do_symlinks = yes
relative_links = yes
do_bootloader = yes
do_bootfloppy = no
do_initrd = yes
link_in_boot = no" > /tmp/kernel-img.conf
sudo mv /tmp/kernel-img.conf $1.mount/etc
sudo chroot $1.mount apt-get -y install linux-image-2.6-686

echo "Manually installing grub into $1.mount"
sudo mkdir -p $1.mount/boot/grub
sudo cp /boot/grub/stage1 $1.mount/boot/grub
sudo cp /boot/grub/stage2 $1.mount/boot/grub
sudo cp /boot/grub/e2fs_stage1_5 $1.mount/boot/grub
echo "default 0
timeout 0
title Linux
root (hd0,0)
kernel /boot/vmlinuz-2.6.18-6-686 root=/dev/hda1 ro
initrd /boot/initrd.img-2.6.18-6-686" > /tmp/menu.lst
sudo mv /tmp/menu.lst $1.mount/boot/grub
echo "device (hd0) $1.raw
root (hd0,0)
setup (hd0)
quit" > /tmp/grub.input
sudo grub --device-map=/dev/null < /tmp/grub.input

echo "Configuring qemu networking"
echo "auto lo
iface lo inet loopback
allow-hotplug eth0
iface eth0 inet dhcp
" > /tmp/interfaces
sudo mv /tmp/interfaces $1.mount/etc/network

echo "Starting sshd and granting $USER a root key"
sudo chroot $1.mount apt-get -y install ssh
sudo chroot $1.mount /etc/init.d/ssh stop
sudo mkdir -p $1.mount/root/.ssh
sudo cp ~/.ssh/id_rsa.pub $1.mount/root/.ssh/authorized_keys
sudo chmod -R 755 $1.mount/root/.ssh

echo "Dismounting $1.mount"
sudo umount $1.mount

echo "Done. To start, run:"
echo ""
echo " sudo qemu -kernel-kqemu -redir tcp:2222::22 $1.raw"
echo ""
echo "To SSH in, run:"
echo ""
echo " ssh -p 2222 root@localhost"
echo ""

Bright Nets == Dark Futures (for the RIAA, at least)

I hadn't heard of this implementation before -- or the term "brightnet" -- but the general idea is to split up files into randomly XOR'd chunks, and then share those chunks via P2P. (I think some of the blocks are also pure random data, to confuse things further.) This means a single collection of blocks can generate multiple files, and there's no way to know from the outside which you want.

The technique thwarts the BitTorrent attack where you know someone has the file by the mere fact that you can download it from them. In this model, merely being able to download a block from a peer doesn't mean the peer has all the other blocks of that file, nor does it mean that that peer is using the block for illegal purposes. The same exact block could be used to legally construct a public domain song, or illegally construct an unlicensed copy of a copyrighted song.
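A toy illustration of the XOR idea, assuming fixed-size blocks and a single shared random block. This is not the actual brightnet implementation (which I haven't studied); the block size, the sample "songs" and the helper names are all made up for the example.

```python
import os

BLOCK_SIZE = 32  # hypothetical block size, in bytes


def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def pad(data: bytes) -> bytes:
    """Pad a short payload with zero bytes up to one block."""
    return data.ljust(BLOCK_SIZE, b"\0")


public_song = pad(b"public domain song")
copyrighted_song = pad(b"unlicensed copy")

# One block of pure random data, shared on the network.
shared = os.urandom(BLOCK_SIZE)

# Two more blocks, each of which also looks random on its own.
block_for_public = xor(public_song, shared)
block_for_copyrighted = xor(copyrighted_song, shared)

# The same shared block helps reconstruct either file.
assert xor(shared, block_for_public) == public_song
assert xor(shared, block_for_copyrighted) == copyrighted_song
```

Anyone serving `shared` is serving random bytes that are equally part of the public domain song and the unlicensed one, which is the ambiguity described above.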

Anyway, just more proof (as if we needed it) that the end of copyright is near. It'll still exist on the books, but just become increasingly unenforceable -- one of those quaint anachronisms that we'll scoff at while reminiscing with our grandchildren.

-david

PS: I alluded to this concept in a post to a mailing list on 6/26, but I had no idea any implementation was so advanced. My original post follows, and is in response to a poster claiming that even without the "making available" argument, the RIAA has plenty of tools to wage its anti-pirate campaign:

This seems to depend on three things:

1) Licensed copies can be distinguished from unlicensed copies.

2) It's possible to know who you downloaded a given file from.

3) Running a P2P service is generally regarded as shady activity.

All these might be generally true now, but I think the trends work against all three. Once an enforcement regime that depends on any or all of the above comes into force, the pirates will just switch systems.

Granted, I think you're right: with enough work and forensic analysis and circumstantial evidence you'll be able to prove it to a jury. But it'll get really expensive to do this -- especially because pirate systems no longer advertise everything you've ever downloaded, and thus it's impossible to distinguish between a one-time and hard-core pirate (without just downloading an incredible amount of pirated material and looking for repeat offenders -- though with changing IP addresses and no permanent identifier, that gets hard).

The upshot is it might be a rather Pyrrhic strategy where the cost of suing a group of people exceeds the damages you get from the subset of people you win against.

----

Though not super related, it's a fun exercise to think how to develop a system that evades the above 3 forensic trails. I'd toss out:

1) Converge pirate networks on perfect duplicates of legitimate copies that are available somewhere online. Even if there are commercials embedded, come up with "metadata" that notes where the commercials are and program players to automatically skip over them. Create MP3 ripping tools that explicitly create binary identical files even when ripped by different people, thereby enabling the argument that you ripped it and threw the CD away.

2) Use onionskin routing to obscure the trail to the actual host of the content. Use file sharding such that everybody hosts a tiny fraction but nobody hosts the whole thing. XOR file shards such that the only way you can get a particular file shard is to combine two entirely different ones, so nobody is hosting even subsets of the file directly.

3) Build a P2P system that has both legitimate and illegitimate purposes. Have it implicitly "share" your entire hard drive, but it only actually responds to files with a given hash (thus any private information is implicitly protected because nobody knows its hash).

Taken all together, (1) makes your pirated content look potentially legitimate, (2) hides you when others download from you, and (3) lets you argue you're not a pirate but just enjoying a legitimate P2P network and -- golly, you didn't know it could be used for piracy! None of these are rock-solid defenses, but it's not really protecting against a rock-solid attack, either. The RIAA campaigns today are at best a break-even endeavor (when all the destruction of public sentiment is weighed in as a cost) -- if the cost could be magnified 2x, 5x, or 10x, then even they will give up.
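The "only responds to files with a given hash" idea in (3) can be sketched as a lookup keyed by content hash with no way to list what is there. This is a toy under stated assumptions: `HashOnlyShare` is a hypothetical class, SHA-256 is an arbitrary choice, and the privacy argument only holds for files whose contents nobody else can guess.

```python
import hashlib
from typing import Dict, Optional


class HashOnlyShare:
    """Toy share that answers requests by content hash and offers no listing."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        # Index by hash of the content; file names are never exposed.
        self._by_hash = {
            hashlib.sha256(data).hexdigest(): data for data in files.values()
        }

    def request(self, digest: str) -> Optional[bytes]:
        """Return the file with this SHA-256 digest, or None if unknown."""
        return self._by_hash.get(digest)


share = HashOnlyShare({"song.mp3": b"song bytes", "taxes.txt": b"my private data"})

# A peer who already knows the hash of a widely distributed file gets it.
known = hashlib.sha256(b"song bytes").hexdigest()
print(share.request(known) == b"song bytes")   # True

# A peer guessing at hashes gets nothing, and can't enumerate the drive.
print(share.request("0" * 64))                 # None
```

The private file is technically "shared" too, but only to someone who could already compute its hash, meaning someone who already has the file.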

-david