I wondered initially whether I should go thru the Tableau video on the Prep. Then I decided to see how our coach Carl Allchin did it by opening his solution in the Prep Builder. It is quite easy to read. I just repeated what he did for the project and validated my result against his. Voila I learnt the Tableau Prep Builder this way.
My project at work is different in that each conversation is much longer. In the #PreppinData project, each text field has a maximum of 20 words. In my work project, there are each conversation has about 200 words. A linear splitting would generate ~200 fields before pivot. I tried the linear approach, which results in the builder crash. I am told that Prep Builder can handle lots of rows. But its performance will be seriously deteriorated if the number of columns becomes large.
Then I realized how inefficient the linear approach could be, regardless of the crash.
The Divide and Conquer Approach
Here is the idea. Assuming the extra spaces have been removed, there are only a single space between words.
- Split the a string into 2 halves.
- Pivot the two halves
- Split the above half-sized field by half again
- Pivot the resulting quarter-sized fields
....
until each field contains a single word at most.
Theoretically, this dichotomy approach will requires O(Log2N) steps of split+pivots. N is the maximum number of words per text field.
This approach can be generalized to become a divide and conquer one. Split the string into M sub-strings of N/M long. Then pivot them...until each field contains only a single word.
In the particular #PreppinData Week 9 project. N=20.
-I chose to divide it into 3 sections of length 8. Last section has fewer than 8 words.
-Pivot the 3 sections
-Split the resulting string field to 8 fields of one single word.
-Pivot them to become a field of one single word
Here it's basically done.
This is the resulting workflow:
In comparison to the two solutions based on linear approach, the divide and conquer approach needs much less "Changes", the equivalent of calculated fields.
- Dynamic Solution: 6 stages + 45 changes
- Alternate Solution: 4 stages + 31 changes
- Divide and Conquer: 4 stages + 24 changes
The Divide and Conquer solution will need much less number of changes. The workflow can downloaded here.
Look into the Gory Details
Note one of the changes is being used to calculate the max length of the text fields.
The max is 20 in this project. After learning this, we can devise a special a divide and conquer solution. In general the dichotomic approach will always work in any cases.
3 changes are dedicated to splitting the text field into 3 sections of 8 or less words.
Add a comment