Drop Down Menu

Drop Down MenusCSS Drop Down MenuPure CSS Dropdown Menu

Extracting selected information from thousands of emails into a csv file using mysql query

Sometimes things are not going to plan. Say you have a mobile application that thousand of users downloaded but the data collecting part was not set up correctly. Now you have an email account with thousands of emails containing excerpts of information (survey results) you now need to put somehow together.
What can you do in this situation? Is this information usable or you should not lose time with it?
Mobile applications are usually linked to an account and I assume you have access to such account.
Then when you look at the feedback, you will probably see hundreds and maybe thousands of messages with different subjects.

You first need to decide what you're looking for.

Let's say we're looking for our demographics survey results and we know that this information is emails having Demographics Survey as Subject.
The message will have some kind of information that might look like this:

Thunderbird has a very nice tool (Add-on) that extracts information from emails to a csv file. It's called. It's called ImportExportTools.
Once installed, it will show under the Tools menu in Thunderbird.

Thunderbird's great ImportExportTools Add-on

Go to Search and Export in that menu and select Subject -> contains -> 'Your text' (Demographics Survey in this case), select Spreadheet (CSV) at the bottom and click Export and Open.

At the bottom of the window you should see number of messages quickly being exported to your file ...

Now you should have your file called messages.csv and you can start working on it.
When you open it, it will be a mess of information. You need to find out whether the information you're looking for has been extracted within the same rows. 

Import this file into Excel or Open Office. Open
This is a critical part of the whole process and one that you almost certainly spend most time with. 

In Excel go Import -> csv File -> Select the file 
Choose Delimited

Just open it with any settings for now. (You will do it many more times after this.) 

Check if the information you're after is located on same rows. If the file is large, it might take a while to open.

If they are, repeat the steps until you get here and click next on the previous screen.

In our case, we located that information here:
So that's what we need to extract. This row repeated every hundred lines or so and all I needed is keeping just those rows and nothing else.

There are more ways of doing and I would be quite happy to see if Excel can do it but I decided to do it with MySql query.

What I cared for most were three columns: age, gender and place (called council here) and background so I deleted all columns to the right of those four.

Then I checked how many characters the largest cell had, using LEN function in Excel. I found that would be about 100 characters.

Having saved this file as a csv, I had a raw data set that could now be imported into Mysql database.

Importing the raw csv file into Mysql using phpMyAdmin.

You can easily set up Mysql databse with XAMPP.
There on locahost you will find phpMyAdmin. 

In phpMyAdmin go to Databases -> Create new database

Now create a table with 4 fields. You can call it Table.

Simply call them First, Second, Third, Fourth with the following settings. Here you need the information about the longest cell (that was 100 characters)

Now you have a database with one table containing 4 fields:

Now the 'raw' csv file needs to be imported into it. Click Import and Select file.
There are default settings:
They need to be changed as follows
Change semi-colon for comma
Delete quotes:

Click Go and wait.
If your csv file is large, this might take a while. For a moment it might look as if nothing is happening but then you will get something....
Chances are you won't do this at the first try

You can easily get an error here. It will most likely be the "Invalid field count in CSV input on line..." error. That means that Mysql doesn't like the data columns in your import file.
I got this error this time

If that happens, look for the problem in the next cell to the one in the error message. In this case, a fifth column (field) with information on row 497122 was superfluous. That needs to be deleted. Mysql expects only 4 columns, not five.

Repeat steps above until you get the following. You can empty the table before you Import again or Replace information instead (in which case you need to tick the Replace box).

You need a success message:

Now click Browse and you will see your information in the database.
It will be quite a mess but don't worry about it. 

You will now run a query that will clean it up.

You know that now you only need rows containing "gender", "age", "council", "background" in four columns respectively.

But because they are lined on the same rows, you only need to search for those starting with gender in the first row. My searched entries actually look more funny because they have this form:

but it could be whatever string, it doesn't really matter.

So, I am after "gender" with ANYTHING BEFORE and ANYTHING AFTER it.
While still in the 'Browse' Tab, click Edit (that is top right) and the following window will open
In that window run the following query
SELECT * FROM `Table` WHERE `First` LIKE "%gender%"
That is:

You will get only filtered entries with %gender% in it

As you can see, some of the entries are not perfect. You'd have to go back to the first csv file import in Excel and improve the Import settings there until the colums are separated better. This is however only for demnostration purposes.

What you need to do as the last step is to export that information to a new csv file.

The Export link is at the bottom of that page.

When you click it you get the following
Select CSV for MS Excel and click Go
Your table will download as csv file Table.csv

And that's more or less it but it's never easy. Repeat the steps and be patient until you get a sensible set of data.

No comments:

Post a Comment