Fetch articles' URLs from Google Reader
Sometimes websites have no working archive section. This was the case with a website we were researching. Our solution was Google Reader, in its final days.
Google Reader keeps a cache of websites for quite a long time; we just have to extract it from there. But since the API is closed, we'll use a dirty workaround.
1. Run Chrome with `--disable-web-security`
We'll use it to get around the Same-Origin policy, so a script running on the Reader page can talk to our local server.
2. Run a server that will store the data
Mine was running under http://localhost/keeper/
3. Open the desired web page in Reader.
Set the options (top panel) to Show all and Sort descending.
Now open the developer console and paste this code:
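The original script is not preserved in this copy of the post. As a rough sketch of what such a script has to do: collect the links of the entries loaded so far, ship each new one to the local keeper, scroll so Reader loads the next batch, and repeat on a timer. The `a.entry-title-link` and `#entries` selectors, the `?url=` parameter, and the `delay` variable are all assumptions, not necessarily Reader's actual DOM:

```javascript
var delay = 1000; // ms between sweeps; this is the knob for tuning speed
var seen = {};    // URLs already shipped

// Pure helper: filter out URLs we already sent (testable outside the browser).
function collectNewUrls(hrefs, seen) {
  var fresh = [];
  for (var i = 0; i < hrefs.length; i++) {
    var u = hrefs[i];
    if (!seen[u]) { seen[u] = true; fresh.push(u); }
  }
  return fresh;
}

function ship(url) {
  // An Image GET is the simplest fire-and-forget request; it reaches
  // localhost only because Chrome runs with --disable-web-security.
  new Image().src = 'http://localhost/keeper/?url=' + encodeURIComponent(url);
}

function sweep() {
  // 'a.entry-title-link' and '#entries' are assumed selectors.
  var anchors = document.querySelectorAll('a.entry-title-link');
  var hrefs = [];
  for (var i = 0; i < anchors.length; i++) hrefs.push(anchors[i].href);
  collectNewUrls(hrefs, seen).forEach(ship);
  var pane = document.getElementById('entries');
  if (pane) pane.scrollTop = pane.scrollHeight; // trigger loading the next batch
  setTimeout(sweep, delay);
}

if (typeof document !== 'undefined') sweep(); // only runs inside the browser
```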
4. Sit back and wait
To increase the speed, just type a smaller interval into the console.
You can put quite a small number, but take care, as it has its limits.
Update
In the end, it turned out that the previous solution was killing the browser after ~10k articles.
I tried removing DOM elements, but it turned out that a callback raised errors for the removed elements, which killed the browser even faster.
So while digging for another solution, I found out that Reader still uses the good old API (initially I was confused, as I thought Reader received its data as scripts to be executed). Using the same script with small modifications, and having figured out how to do pagination, I was able to increase the speed from 6k articles per hour to 180k per hour. So my task of getting links to 140k articles was done in under an hour.
Here is the code (the prototype.map and init functions stay the same):
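The snippet itself is missing from this copy of the post. As a sketch of the pagination idea: Reader's historical, unofficial stream-contents endpoint returned a JSON page of items plus a `continuation` token, which you fed back as the `c` parameter to fetch the next page. The helper names below are made up for illustration:

```javascript
// Sketch of Reader API pagination -- endpoint shape from the historical,
// unofficial API; function names are hypothetical.
function nextPageUrl(feedUrl, continuation) {
  var base = 'http://www.google.com/reader/api/0/stream/contents/feed/'
           + encodeURIComponent(feedUrl) + '?n=1000'; // n = items per page
  return continuation ? base + '&c=' + encodeURIComponent(continuation) : base;
}

// Extract article links from one JSON page; page.continuation (when present)
// is the cursor for the next request.
function parsePage(page) {
  var urls = [];
  (page.items || []).forEach(function (item) {
    if (item.alternate && item.alternate[0]) {
      urls.push(item.alternate[0].href);
    }
  });
  return { urls: urls, continuation: page.continuation };
}
```

Looping fetch → `parsePage` → `nextPageUrl`, with up to a thousand items per request instead of one scrolled screenful at a time, is what makes the difference between 6k and 180k articles per hour.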