Monday, November 7, 2022

Data Mining Data Preparation

So far, I have touched only a little on data mining in related topics like social media content management and customer experiencecustomer relationship management (CRM). Although this post does not delve into data mining management, it deals with a preliminary step of the data mining process of social listening data; data preparation. 

When people ask questions to search engines, marketers 'listen'. Social listening tools inform marketers of specific questions their customers ask about the industry, brands and products associated with them and their competitors. Such a tool removes a lot of the guesswork from content planning for social media. However, this process requires the handling of CSV files or MS Excel files with large otherwise intimidating data files. This post is an extremely basic introduction for cleaning up and organizing data sets in MS Excel.

I recommend including data mining data preparation rules in your brand guidelines. Among other things, the techniques that can help to clean up your data include removing duplicate cases or valuessplitting text in single cells so that only some text is reassigned to other columnsremoving extraneous or reassigning textisolate new ideas caught within redundant phrasesswitching the order of words, using certain words and not others in specific contexts. 



Remove duplicate values 

CSV files downloaded from social listening sites (like AnswerSocrates.com) will likely have around 3 columns and hundreds of rows of data. This process removes duplications when all rows and columns, ie when values of all variables are simultaneously identical.

Seller

questions

Mary

2 apples?

Mary

1 apple?

John

2 apples?


  1. If you need to keep the original (for security), make a copy.
  2. Remove columns that do not interest you. For instance, most people are only interested in one column (with search engine questions) and therefore delete the other columns. 
  3. In the 'Data' tab > Data Tools > Remove Duplicates 
    1. Optional but highly recommended: Select 'my data has headers' and manually select the columns whose values should be used in the process. If you select only one column (like 'questions'), cases / rows with duplicate values (like '2 apples?') will be removed. 
Excel automatically selects all available columns and rows. In that case and using the example above, it will not recognize any duplicates because duplicates in such a scenario means that values for variables and cases are identical. So it will not remove the 2 Mary cases nor the 2 identical questions because the program assumes them to be unique).


Split text within columns to reallocate parts into separate columns

MS Excel calls this function 'text to column'. If you are doing social listening research for a spa and get 300 long search engine questions results containing the same keyword phrase 'healing massages', this function will separate the recurring text from the rest of the text to allow you to quickly analyze the data regarding the frequency of questions with the keyword (ie its popularity), concepts associated with that keyword and so on. 

MS Excel can break apart text within words and reallocate some parts of the text to separate columns. In short, rather than have only 1 column with long sentences, you can end up with 2 columns. This tidies up the data to make it easier to analyze and more readily identify noteworthy content.


Questions

2 apples for sale

Pears for Tom

2 apples for Mindy

Cherries for sale

many apples for sale


Here are 2 steps.

  1. As a preliminary step, you can add a symbol like '&' immediately after the keyword so that MS Excel can easily identify where it should separate the text. Select the column containing the text > Ctlr F (to find text) > Select the 'Replace' tab (to see not only the 'find' but also 'replace' option) > enter the keyword into the 'find' field and then the keyword followed by the '&' symbol , like 'keyword&' > Click 'replace all'. [NB the symbol will disappear once the separation occurs. it only serves the purpose of a separation trigger. Other triggers might be spaces that might be commas, spaces between text characters, etc].
  2. Select all of the cells with text to be altered > Data tab > Data Tools > Text to Columns > 'Delimited' > Next > In the 'other' field, enter the symbol you added in the previous text.
  3. Optional. Notice the preview of the delimitation. By default, the text before the separator remains in the original column while everything after that point is shifted to the adjacent column. If you wish, you can specify otherwise. For instance, if '&' after 'apples' was the separator and you wanted to see the original text as well as the 2 new columns, ie a total of 3 columns; with the separated keyword in column 2 and the remaining text in column 3; you will need to enter column 2 in the 'destination' field. Rather than $A$1 (the default), enter "=$B:$B"

         

        YouTube video: MS Excel - Text to Columns by Tutorials Point 


Reassign words and Remove extraneous text

Over time, you will notice a list of extraneous words that you can remove without altering the meaning of each entry. 

I use MS Excel's Find & Replace functions to reassign words or phrases. This technique applies one rule at a time. This applies if you prefer consistent use of certain words that mean the same thing. Example(s). if questions ask using synonymous variations like my type of skin, type of skin, skin type, my skin, kind of skin, you might prefer to change all of these to the most economical form 'skin type' and reserve the word 'kind of' for ingredients or other concepts. 

  1. Scroll through the data to verify that your rules can apply without altering the intended meanings of original phrases. If so, create an ordered list of rules for your brand guidelines (to save yourself or team members from reinventing the wheel the next time).
  2. Decide on a single rule like find 'what are' and replace it with nothing (or a blank space)
  3. Optional. Select the column(s) to which the rule should apply.
  4. Ctrl F (to find) > select the 'replace' tab > 
  5. Enter offending text into the 'find' field and also enter the new replacement in the 'replace' field empty.  
  6. Click OK


I use MS Excel's Find & Replace functions to remove extraneous text. This technique applies one rule at a time. These rules can clean up the data to make analysis easier. Example(s). If you bake vegan cakes, you will eventually realize that you can remove words like 'what are', 'what is', 'how are', "can I", "can you", "why is", "why are".

  1. Follow the same steps immediately above with the only following exception. 
Regarding the 'replace' field, leave it blank.


Isolate / collect new words caught within redundant  phrases 

If my business bakes 'vegan cake', 'vegan cake' will be a key phrase that recurs endlessly in data. In many cases, it can be entirely ignored from the data. However, since English allows speakers to insert other text like 'chocolate', 'strawberry', 'diabetic' and so on within that key phrase, like 'vegan chocolate cake', you will run the risk of losing the new inserted idea if you ignored the 'vegan cake' phrase. For this reason, you will need to collect the new text (in this example 'chocolate') from within the phrase. 

You can first isolate the entire phrase (using the technique of 'text to column' technique above), redundant and new part still in tact. There are several ways of  then achieving the extraction.

Option 1. If you can manage without the redundant phrase 'vegan cake', one option is to simply delete 'vegan' and then 'cake' using the technique of removing redundant text (above).


Option2. If you need to keep the redundant phrase 'vegan cake', do the following. 

  1. Using the find and replace technique (above), you can apply the rule Replace 'cake' with 'vegan cake' to insert 'vegan' before 'cake' in 'vegan chocolate cake'. This will result in 'vegan chocolate vegan cake'
  2. Use the technique of 'text to column' (above) to place a separating symbol like '&' BEFORE 'vegan cake'. This will result in 2 columns, the second of which will have isolated 'vegan chocolate' from 'vegan cake' .
  3. Use the find and replace technique (above) to remove 'vegan' from 'vegan chocolate'



Color code a cell according to values

Conditioning formatting can be used to color code your cells according to segments or other meaningful aspects of data before you can begin to sort and analyze them.

In this example, I want to highlight the background of all cells with a segment color when they contain a segment-specific keyword (like 'massage'). 

  1. Highlight all the cells to which the formatting rule should apply
  2. within the 'Home' tab > Conditional Formatting > New Rule > In the 'select a rule type', select 'Format cells that contain' > In 'Edit the rule description', select 'specific text', 'containing' and type in the keyword (like 'massage') > Click 'Format' button > select 'Fill' (for background color) > click the color you want > OK


Color code a row (a case) according to the value of 1 column (ie a variable)

This formatting rule will allow you to easily identify cases belonging to segments by background color along their entire row. MS Excel can be set to verify whether a rule is true or false. If it is true, the conditioning color format will apply. The converse is true otherwise. 

The rule is that if the region is "South", the background of the entire row should light up. However, when you apply that rule to all cells within row 2, it would not know which column is 'region'. Since all of the cells will receive the same formula, you need to give an absolute value, in this case the 'H' column. To do this, the formula should have the '$' symbol immediately before the 'H'.


1 Employee ID

First Name

Country

H

 

Region

Sales [USD]

2 ZWC1011

John

Austria

South

5971

3 ZWC1014

Charles

Denmark

Southwest

2790

4 ZWC1017

Joseph

Brazil

West

1634

5 ZWC1020

Robert

Sweden

East

8038


  1. Highlight the entire row or all of the cells to be affected by this rule (in this case, an entire row)
  2. From within the the 'Home' tab > Conditional formatting > New Rule > Select 'formula to determine which cells to format' > In the 'Edit rule description' cell, type =$h2="south". > 'Format' button > In the 'fill' tabl, select the color button you want > OK.



CONTENT RELATED TO MS EXCEL: BASIC DATA PREPARATION

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.