Get a list of all email addresses that ever wrote to you

Many people don’t understand BCC (blind carbon copy): when they send an email to multiple recipients, they expose every address to every receiver. That is a convenient leak for anyone who collects email addresses.

Curious how many unique email addresses I had in my Gmail account, I took the following steps to find out:

Back up all your available emails to your PC. This may take a while if you have many emails. I used IMAPSize for this: a simple desktop client that downloads all the emails (from the folders you choose) and saves them as .eml files.
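
If you would rather script the backup than use a desktop client, here is a minimal sketch using Python's standard imaplib module. It is not what IMAPSize does internally, just an alternative way to end up with .eml files. The function names, the output folder and the credentials shown in the usage comment are hypothetical, and Gmail may require IMAP access to be enabled and an app password.

```python
import imaplib
import os

def open_gmail(user, password):
    """Log in to Gmail over IMAP (assumes IMAP access is enabled for the account)."""
    conn = imaplib.IMAP4_SSL('imap.gmail.com')
    conn.login(user, password)
    return conn

def backup_folder(conn, folder, out_dir):
    """Save every message of `folder` as a numbered .eml file in `out_dir`."""
    os.makedirs(out_dir, exist_ok=True)
    # Read-only, so the backup does not mark messages as seen
    conn.select(folder, readonly=True)
    status, data = conn.search(None, 'ALL')
    if status != 'OK':
        raise RuntimeError('search failed for folder ' + folder)
    saved = 0
    for num in data[0].split():
        status, msg_data = conn.fetch(num, '(RFC822)')
        if status != 'OK':
            continue
        raw = msg_data[0][1]  # raw message bytes
        with open(os.path.join(out_dir, num.decode() + '.eml'), 'wb') as fp:
            fp.write(raw)
        saved += 1
    return saved

# Usage (hypothetical account and paths):
# conn = open_gmail('you@example.com', 'app-password')
# print(backup_folder(conn, 'INBOX', os.path.join('backup', 'gmail', 'INBOX')))
# conn.logout()
```

Fetching messages one by one is slow on a large mailbox, so expect this to take as long as the desktop client would.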

Parse your backed-up emails. For this purpose I wrote a simple Python script:

# Written for Python 3
# Email parser, used to read the message headers
from email.parser import BytesParser
# Regular expressions, used to find the addresses
import re
# File-system access
import os
# Time functions, used to name the output file
from time import gmtime, strftime
# itemgetter, used to sort the results by count
from operator import itemgetter

# All found emails, with email as key, and number of matches as value
foundemails = {}
SAVE_ORDERED = True
EMAILS_PATH = 'IMAPSize_037\\backup\\gmail'

# Mail search pattern
mailpattern = re.compile(r'[\w\-\.]+@[\w\-\.]+\.+[a-zA-Z]{1,4}')
mailheaders = ['to', 'from', 'cc', 'bcc']

def listFiles(dir):
    # Walk the directory tree and scan every file found in it
    for root, subdirs, files in os.walk(dir):
        for item in files:
            path = os.path.join(os.path.abspath(root), item)
            print(path)
            searchForEmails(path)

def searchForEmails(file):
    # Parse only the headers; read bytes so unusual encodings do not break parsing
    with open(file, 'rb') as fp:
        headers = BytesParser().parse(fp, headersonly=True)
    for head in mailheaders:
        value = headers[head]
        if value is not None:
            for address in mailpattern.findall(str(value)):
                # Count every occurrence of the address
                foundemails[address] = foundemails.get(address, 0) + 1

listFiles(EMAILS_PATH)

# Write results in file
with open('emails_' + strftime("%H-%M-%S", gmtime()) + '.txt', 'w', encoding='utf-8') as f:
    if SAVE_ORDERED:
        # sorting by creating a list of tuples ordered by count, highest first
        foundemails_sorted = sorted(foundemails.items(), key=itemgetter(1), reverse=True)
        for key, value in foundemails_sorted:
            f.write(key + "\t" + str(value) + "\n")
    else:
        for key in foundemails:
            f.write(key + "\t" + str(foundemails[key]) + "\n")

You just have to set EMAILS_PATH and run the script. It creates a file containing all the email addresses and the number of times each one occurs, like this:

(address hidden)	17905
(address hidden)	13459
(address hidden)	6980
(address hidden)	3480
(address hidden)	3071
(address hidden)	2563
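
Since each output line is an address, a tab and a count, answering the original question (how many unique addresses?) takes only a few lines. The sketch below is one possible way to do it; the function name and the sample data are made up, and merging addresses that differ only in letter case is my assumption, not something the script above does.

```python
def load_counts(lines):
    """Parse 'address<TAB>count' lines into a dict, merging case variants."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        address, _, count = line.rpartition('\t')
        key = address.lower()  # assumption: treat Bob@x.com and bob@x.com as one
        counts[key] = counts.get(key, 0) + int(count)
    return counts

# Made-up sample in the same format as the script's output;
# with a real file, pass open('emails_....txt', encoding='utf-8') instead
sample = [
    "Alice@Example.com\t2\n",
    "alice@example.com\t12\n",
    "bob@example.org\t3\n",
]
counts = load_counts(sample)
print(len(counts), "unique addresses,", sum(counts.values()), "occurrences")
# prints: 2 unique addresses, 17 occurrences
```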

If you are subscribed to any groups, their addresses will also be listed here. You can filter them out if you need to.
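
One simple way to do that filtering is to drop every address whose domain belongs to a known group service. This is only a sketch: the domain list and the function names are hypothetical examples, and you would adjust the list to whatever groups show up in your own results.

```python
# Hypothetical list of mailing-list domains; extend it to match your mailbox
GROUP_DOMAINS = ('googlegroups.com', 'yahoogroups.com')

def is_group_address(address):
    """True if the address belongs to one of GROUP_DOMAINS or a subdomain of it."""
    domain = address.lower().rpartition('@')[2]
    return any(domain == d or domain.endswith('.' + d) for d in GROUP_DOMAINS)

def filter_people(counts):
    """Return a copy of the address -> count dict without group addresses."""
    return {addr: n for addr, n in counts.items() if not is_group_address(addr)}

# Made-up data shaped like the foundemails dict from the script
found = {
    'friend@example.com': 5,
    'my-list@googlegroups.com': 40,
    'someone@notgooglegroups.com': 1,
}
print(filter_people(found))
```

Matching on the exact domain (or a subdomain) rather than on a plain suffix avoids dropping unrelated domains that merely end with the same letters.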

The regexp used here is very primitive. If you need to find only emails that follow the LDH rule (letters, digits, hyphen), you can use the following regexp instead (it is more expensive to run): (\w{1}(?:[\w\-\.]+\w{1})*@(?:(?:(?:\w(?:\-\w)*)+\.)+\w{1,4}))
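
To see the difference between the two patterns, here is a small comparison on two made-up strings. The variable names are mine; the patterns are copied from the script and from the paragraph above. The stricter pattern rejects a local part that starts or ends with a hyphen, while the simple one accepts it.

```python
import re

# Pattern used in the script
simple = re.compile(r'[\w\-\.]+@[\w\-\.]+\.+[a-zA-Z]{1,4}')
# Stricter pattern from the paragraph above
strict = re.compile(r'(\w{1}(?:[\w\-\.]+\w{1})*@(?:(?:(?:\w(?:\-\w)*)+\.)+\w{1,4}))')

# Made-up test strings
samples = ['john.doe@example.com', '-john.doe-@example.com']

for text in samples:
    print(text, simple.findall(text), strict.findall(text))
```

The nested repetitions in the stricter pattern are what make it slower: when a candidate fails, the regex engine has to try many ways of splitting the same characters before giving up.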