Workflow Usability

Introduction: 2011 Wings Summer Internships on Workflow Usability

This document describes the activities of three student interns who worked on the Wings project in the Summer of 2011 at ISI. Over a period of a week, the interns:

  • became familiar with workflows as a software paradigm
  • learned to use Wings and run simple workflows to analyze data (e.g., compare sets of html files to see if they are on the same topic)
  • learned to use pre-existing workflows for advanced text analytics (e.g., workflows for document clustering and topic detection)
  • programmed new workflow components to improve existing workflows
  • analyzed Twitter data to detect topic trends by applying pre-existing advanced text analytic workflows

The rest of this document contains their report describing these activities and their findings.

This work was supported in part by the grant Towards Shared Repositories of Computational Workflows, funded by the National Science Foundation with award number IIS-0948429 from September 2009 to August 2011.

Day 1: Learning Simple Text Processing Workflows

Goals

  1. [x] Learn what ISI is
  2. [x] Learn how to use Wings
  3. [x] Experiment with the comparison software, see if it works
  4. [x] Read the tutorial on Wings

Accomplishments

  1. We accomplished this with the help of a pamphlet. ISI is a research facility that is part of USC; its projects focus on areas of Artificial Intelligence such as Machine Translation, Natural Language Technology, Robotics, Semantic Workflows for e-Science, Information Integration, and more. The pamphlet also listed the faculty working on artificial intelligence, which of course included Yolanda Gil, the professor giving us the opportunity for this internship.
  2. We received a great introduction to Wings from Yolanda Gil. Wings allows professors and students to compare documents for text analytics by running them through a workflow (see below). Two documents run through the workflow in parallel as it breaks the text up so that each word sits on its own line. Next, the workflow removes trivial words such as 'and', 'the', and 'or' that would not benefit the comparison, and it also removes special characters such as '?', '!', '©', etc. Afterwards, the workflow collapses repeated words into counts, so instead of 'car car' it writes 'car 2', showing simply that the word 'car' showed up twice in the article. Finally, the workflow compares the word lists. The more similar two documents are, the closer their similarity number is to 0.0. (A small sketch of the word-counting step appears below the workflow figure.)
  3. We compared the Wikipedia articles on George Washington and on the potato to find their 'similarity'. Surprisingly, according to the program they had more in common than Harry Potter and the Twilight series. This proved to be more valuable than just a small laugh: it helped us understand how the comparison worked, and it showed us that the comparison part of Wings wasn't reaching its full potential, leaving us interested in possibly remodeling it.
  4. Done at the end of the day. We learned how to operate Wings fully, from the basics to the complexities.

The workflow that we used is shown below.

(Figure: the Day 1 text comparison workflow.)
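
To make the middle steps concrete, here is a small sketch of the word-counting idea. This is our own illustration, not the actual Wings component, and it skips the trivial-word and special-character steps: it just splits a plain-text file into words and collapses repeats into 'car 2' style lines.

    import java.io.*;
    import java.util.*;

    // Illustration only: build a word bag ("word count" pairs) from a text file.
    public class WordBagSketch {
        public static void main(String[] args) throws IOException {
            Map<String, Integer> counts = new TreeMap<String, Integer>();
            BufferedReader reader = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.toLowerCase().split("[^a-z]+")) {
                    if (word.isEmpty()) continue;
                    Integer c = counts.get(word);
                    counts.put(word, c == null ? 1 : c + 1);
                }
            }
            reader.close();
            // Print "car 2" style lines: the word followed by how often it appeared.
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                System.out.println(e.getKey() + " " + e.getValue());
            }
        }
    }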

Day 2: Learning Workflows for Text Analytics

Goals

  1. Summarize the workflows that Matheus had.
  2. Edit and Save the steps of the Workflows
  3. Run Matheus' workflow with different percentages.
  4. Make a graph with all the different runs, and explain it.
  5. Establish a better equation to compare the frequency of words in Java for Wings.
  6. Write the Line Count in Java for Wings.

Accomplishments

  1. This goal was accomplished in the 10 o'clock meeting. The difference between the default workflow (above) and Matheus' workflow (below) was that suffixes were cut off of the words, so that 'computers' and 'computing' both become 'comput'. This makes it easier to compare the words in the documents by focusing on their roots instead of their specific forms (a rough illustration of the idea appears after the workflow figure below).
  2. In the 10 o'clock meeting we added a step called Vocabular.
  3. Before executing the workflow, there was an option to choose among several Execution Templates, and we were told to select the apparently more accurate Naive Bayes template. Using Naive Bayes, we tested several different percentages, which stood for the percentage of included instances.
  4. We added all the different runs and percentages that we ended up with into a graph; an explanation and an image of the graph appear below.
  5. The method that Wings had been using to compare the frequency of words in two documents is the Difference method. As an example, suppose Document 1 is a Prius article and Document 2 is a Parking Lot article. The Prius article has 500 words overall, 25 of which are 'car', while the 6000-word Parking Lot article contains the word 'car' 100 times. The word 'car' occurs less often in the Prius article than in the Parking Lot article, but only because the Parking Lot article is longer; 'car' is actually more important to the Prius article, yet its raw count makes it look unimportant next to the Parking Lot article. The Difference method cannot capture the importance of 'car' in the Prius article, since the calculation is simply |25-100| = 75. Therefore, that method doesn't work well. The Percentage method fixes this by giving the word 'car' 5% importance in the Prius article (25/500) and about 1.7% importance in the Parking Lot article (100/6000). We agreed that the Percentage method would be the best way to compare the words in the documents Wings is given to analyze, since it measures the importance of every word within its own article instead of comparing a raw count from a 500-word article against one from a 6000-word article, which lessens its value. (A small sketch of both methods appears after this list.)
  6. Prior to finishing the program, the work day ended and we were forced to save the rest of the goal for the next day.
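
As a small, self-contained illustration of the two methods from item 5 (using only the made-up Prius and Parking Lot numbers above, not real workflow data):

    // Made-up numbers from item 5: 'car' appears 25 times in a 500-word Prius
    // article and 100 times in a 6000-word Parking Lot article.
    public class FrequencyMethods {
        public static void main(String[] args) {
            int priusCar = 25, priusTotal = 500;
            int lotCar = 100, lotTotal = 6000;

            // Difference method: |25 - 100| = 75, which hides how important
            // 'car' is to the shorter article.
            int difference = Math.abs(priusCar - lotCar);

            // Percentage method: 25/500 = 5% versus 100/6000, roughly 1.7%,
            // which shows that 'car' matters more in the Prius article.
            double priusPct = 100.0 * priusCar / priusTotal;
            double lotPct = 100.0 * lotCar / lotTotal;

            System.out.println("Difference method: " + difference);
            System.out.printf("Percentage method: %.1f%% vs %.1f%%%n", priusPct, lotPct);
        }
    }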

The workflow that we used is shown below.

(Figure: the Day 2 text analytics workflow.)
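
The stemming component itself is not reproduced in this report; the sketch below is only a rough illustration of the idea of suffix stripping (the real workflow presumably uses a proper stemmer), showing how 'computers' and 'computing' can be reduced to the same root.

    // Rough illustration of suffix stripping, not the stemmer used in the workflow:
    // both "computers" and "computing" reduce to "comput".
    public class StemSketch {
        static String stem(String word) {
            String[] suffixes = { "ers", "ing", "ed", "er", "es", "s" };
            for (String suffix : suffixes) {
                if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                    return word.substring(0, word.length() - suffix.length());
                }
            }
            return word;
        }

        public static void main(String[] args) {
            System.out.println(stem("computers"));  // prints "comput"
            System.out.println(stem("computing"));  // prints "comput"
        }
    }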


Explanation of Graph

Graph.png

The above graph shows the relationship between the percentage of included instances and the percentage of correctly classified instances. It demonstrates that, for the data set we used, the percentage of correctly classified instances levels off at around 30% included instances. Andrew and Angela ran workflows for every multiple of five percent, Megha ran a couple more, and all of the data was recorded to create this graph.

Day 3: Improving the Workflows

Goals

  1. [x] Finish Programming the Line Count program for Wings
  2. [x] Finish Programming the Comparison program for Wings
  3. [x] Edit the Wiki
  4. [x] Come up with individual project ideas.

Accomplishments

  1. Done; there was only a little left over from yesterday to complete. All three interns contributed to the process, whether typing, scanning for errors, or researching programming functions.
  2. The existing word database files proved unhelpful, so while two interns programmed with Matheus' help, the third created three word database documents in TextEdit. One was a word database from a fake elementary school article, the next was from a fake college article, and the third, made from a fake cat article, had nothing in common with the first two. On Thursday we finished and edited the Compare program and uploaded it to Wings. It worked, and the output gives the percentage likelihood (similarity) of the two input files, using an algorithm we came up with.


Here is the new comparer Java component:

    import java.io.*;

    // Compares two "word count" files (one "word frequency" pair per line) and
    // prints how similar the two documents are as a percentage.
    public class Comparer {
        public static void main(String[] args) {
            double sim = 0;       // accumulated weighted differences between the two documents
            double wrongSim = 0;  // the worst possible score, used to normalize sim
            double totalSim = 0;  // final similarity percentage
            int wordCount1 = 0;   // total word occurrences in file 1
            int wordCount2 = 0;   // total word occurrences in file 2
            File file1 = new File(args[0]);
            File file2 = new File(args[1]);
            int wordCountFile1 = 0;  // number of lines (distinct words) in file 1
            int wordCountFile2 = 0;  // number of lines (distinct words) in file 2
            int counter = 0;
            String line;
            String[] wordArray = new String[2];
            Word[] words1;
            Word[] words2;
            try {
                BufferedReader wordCountReader1 = new BufferedReader(new FileReader(file1));
                BufferedReader wordCountReader2 = new BufferedReader(new FileReader(file2));
                BufferedReader wordReader1 = new BufferedReader(new FileReader(file1));
                BufferedReader wordReader2 = new BufferedReader(new FileReader(file2));
                // Count the lines in each file so the Word arrays can be sized.
                while (wordCountReader1.readLine() != null) {
                    wordCountFile1++;
                }
                words1 = new Word[wordCountFile1];
                while (wordCountReader2.readLine() != null) {
                    wordCountFile2++;
                }
                words2 = new Word[wordCountFile2];
                // Parse each "word frequency" line into a Word object.
                while ((line = wordReader1.readLine()) != null) {
                    wordArray = line.split(" ");
                    words1[counter] = new Word(wordArray[0], Integer.parseInt(wordArray[1]));
                    counter++;
                }
                counter = 0;
                while ((line = wordReader2.readLine()) != null) {
                    wordArray = line.split(" ");
                    words2[counter] = new Word(wordArray[0], Integer.parseInt(wordArray[1]));
                    counter++;
                }
                // Turn raw frequencies into percentages of each document's total word count.
                for (int i = 0; i < words1.length; i++) {
                    wordCount1 += words1[i].frequency;
                }
                for (int i = 0; i < words1.length; i++) {
                    words1[i].setPercentage(wordCount1);
                }
                for (int i = 0; i < words2.length; i++) {
                    wordCount2 += words2[i].frequency;
                }
                for (int i = 0; i < words2.length; i++) {
                    words2[i].setPercentage(wordCount2);
                }
                boolean matchFound;
                // Words in file 1: add the scaled percentage difference when the word also
                // appears in file 2, or its full percentage when it does not.
                for (int i = 0; i < words1.length; i++) {
                    matchFound = false;
                    for (int j = 0; j < words2.length && matchFound == false; j++) {
                        if ((words1[i].name).equals(words2[j].name)) {
                            sim += 100 * (Math.sqrt(Math.abs((words1[i].percentage - words2[j].percentage) * (words1[i].percentage - words2[j].percentage)))) / (Math.sqrt(wordCountFile1 + wordCountFile2));
                            matchFound = true;
                        }
                    }
                    if (!matchFound) {
                        sim += 100 * (words1[i].percentage) / (Math.sqrt(wordCountFile1 + wordCountFile2));
                    }
                }
                // Words that appear only in file 2 also count against the similarity.
                for (int i = 0; i < words2.length; i++) {
                    matchFound = false;
                    for (int j = 0; j < words1.length && matchFound == false; j++) {
                        if ((words2[i].name).equals(words1[j].name)) {
                            matchFound = true;
                        }
                    }
                    if (!matchFound) {
                        sim += 100 * (words2[i].percentage) / (Math.sqrt(wordCountFile1 + wordCountFile2));
                    }
                }
                matchFound = false;
                // wrongSim is the worst case: every word counted at its full percentage.
                for (int i = 0; i < words1.length; i++) {
                    wrongSim += 100 * (words1[i].percentage) / (Math.sqrt(wordCountFile1 + wordCountFile2));
                }
                for (int i = 0; i < words2.length; i++) {
                    wrongSim += 100 * (words2[i].percentage) / (Math.sqrt(wordCountFile1 + wordCountFile2));
                }
                // Normalize: identical documents score 100%, documents sharing no words score 0%.
                totalSim = 100 - ((sim / wrongSim) * 100);
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.out.print(totalSim + "% similar");
        }
    }
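
The Word helper class used by Comparer is not included in the report. A minimal version consistent with how Comparer uses it (the fields and the setPercentage method are inferred from the calls above; the original class may differ) would be:

    // Inferred from how Comparer uses it; the original Word class is not shown in the report.
    public class Word {
        public String name;        // the word itself
        public int frequency;      // how many times it appears in the document
        public double percentage;  // frequency as a percentage of the document's word count

        public Word(String name, int frequency) {
            this.name = name;
            this.frequency = frequency;
        }

        public void setPercentage(int totalWords) {
            this.percentage = 100.0 * frequency / totalWords;
        }
    }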


This is the compare component written by the creator of the workflow. As you can see, that program, and especially its algorithm, is very different from ours.


We typed this up on Megha's computer and inserted it into the workflow on Wings. We deleted the old compare component and uploaded our new one, called compareNew. Megha then ran it using the Wikipedia HTML files for Chicago and Los Angeles, with the parameter set to 4. The output using compareNew was 55.182837% similar; using the same data with the old compare component, we got 7.249548. Clearly, our output gives a better sense of what the similarity is. Even though a user would not know the algorithm without looking at the code, the user can see that the result is on a fixed scale: two identical documents score 100% (every percentage difference is zero), the score can never go higher, and two documents that share no words score 0%. The old output left us confused about what it was measuring and what the number represented, so this was one of the issues that we fixed.

Day 4: Individual Projects

Goals

  1. [x] Work on Individual Projects
  2. [x] Learn about Twitter Dataset

Fix HTML Programming on Wings

by Andrew Friedman

Originally, the HTML tag removal component in Wordbags worked, but it left behind large amounts of excess notation and scripts from the page source. To fix this, the entire component was rewritten, and it is now more effective and better organized. It begins by removing the entire header (in an HTML document, the first section of markup is tagged <head>...</head>). Because this section has no relevance to the topic of the article, the program removes the tags as well as everything contained in them. It then searches the document for the character "<" (signifying the beginning of an HTML tag) or the characters "/" followed by "*" (signifying a comment), and writes a new document that does not contain those tags and comments. Finally, the program searches the document for any remaining scripts and HTML, and outputs a file. It is not perfect (there are many ways to format HTML, and some are unpredictable), but it is certainly better than the original.

My code (Java):
    import java.io.*;
    public class htmlToText {
         public static void main(String[] args){
            Remover remover = new Remover();
            File htmlFile = new File(args[0]);
            String line;
            boolean endFound = false;
            char[] charArray;
            String originalDocument = "";
            String finalDocument = "";
            try {
                BufferedReader reader = new BufferedReader(new FileReader(htmlFile));
                while ((line=reader.readLine())!=null) {
                    originalDocument += line;
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            charArray = originalDocument.toCharArray();
            String s = remover.removeHead(charArray);
            charArray = s.toCharArray();
            for (int i = 0;i < charArray.length;i++) {
                if (charArray[i]=='<' || (charArray[i] == '/' && charArray[i+1] == '*')) {
                    int j = i+1;
                    while (!endFound) {
                        if (charArray[j] == '>' || (charArray[i] == '*' && charArray[i+1] == '/')) {
                            endFound = true;
                            j++;
                            i++;
                        } else {
                            j++;
                            i++;
                        }
                    }
                    endFound = false;
                } else {
                    if (charArray[i] != '\t') {
                        finalDocument += charArray[i];
                    }
                }
            }
            // Remove any remaining <script> sections and keep the cleaned text.
            finalDocument = remover.removeScripts(finalDocument);
            char[] tempArray = finalDocument.toCharArray();
            charArray = null;
            finalDocument = "";
            for (int i = 0; i < tempArray.length;i++) {
                if (tempArray[i]==' ') {
                    tempArray[i] = '\n';
                }
            }
            for (int i = tempArray.length-2; i >= 0;i--) {
                if (tempArray[i]=='\n' && tempArray[i+1]=='\n') {
                    tempArray[i+1] = '\u0000';
                }
            }
            if (tempArray[0] == '\n') {
                tempArray[0] = '\u0000';
            }
            boolean backBreak = false;
            // Trim trailing newlines and spaces from the end of the document.
            if (tempArray[tempArray.length-1] != '\n' && tempArray[tempArray.length-1] != ' ') {
                backBreak = true;
            } else {
                int c = tempArray.length-2;
                while (!backBreak && c >= 0) {
                    if (tempArray[c] == '\n' || tempArray[c] == ' ') {
                        tempArray[c] = '\u0000';
                        c--;
                    } else {
                        backBreak = true;
                    }
                }
            }
            for (int i = 0; i < tempArray.length;i++) {
                if (tempArray[i] == ('&') && tempArray[i+1] == 'n' && tempArray[i+2] == 'b' && tempArray[i+3] == 's' && tempArray[i+4] == 'p' && tempArray[i+5] == ';') {
                    i+=5;
                } else {
                    finalDocument += tempArray[i];
                }
            }
            System.out.print(finalDocument.toLowerCase());
        }
    }

This is the original code, written by the creator of the workflow we used.

    public class Remover {
        public Remover() {
        }
        public String removeHead(char[] c) {
            char[] charArray = c;
            boolean endFound = false;
            String newDoc = "";
            for (int i = 0; i<charArray.length; i++) {
                if (charArray[i] == '<' && charArray[i+1] == 'h' && charArray[i+2] == 'e' && charArray[i+3] == 'a' && charArray[i+4] == 'd' && charArray[i+5] == '>') {
                    int j = i;
                    while (!endFound) {
                        if (charArray[j] == '<' && charArray[j+1] == '/' && charArray[j+2] == 'h' && charArray[j+3] == 'e' && charArray[j+4] == 'a' && charArray[j+5] == 'd' && charArray[j+6] == '>') {
                            j+=6;
                            endFound = true;
                        } else {
                            j++;
                        }
                    }
                    i = j;
                } else {
                    newDoc += charArray[i];
                }
            }
            return newDoc;
        }
        public String removeScripts(String s) {
            char[] charArray = s.toCharArray();
            boolean endFound = false;
            String newDoc = "";
            for (int i = 0; i<charArray.length; i++) {
                if (charArray[i] == '<' && charArray[i+1] == 's' && charArray[i+2] == 'c' && charArray[i+3] == 'r' && charArray[i+4] == 'i' && charArray[i+5] == 'p') {
                    int j = i;
                    while (!endFound) {
                        if (charArray[j] == '<' && charArray[j+1] == '/' && charArray[j+2] == 's' && charArray[j+3] == 'c' && charArray[j+4] == 'r' && charArray[j+5] == 'i' && charArray[j+6] == 'p') {
                            j+=6;
                            endFound = true;
                        } else {
                            j++;
                        }
                    }
                    i = j;
                } else {
                    newDoc += charArray[i];
                }
            }
            return newDoc;
        }
    }

Create Several Graphs based on options other than 'Naive Bayes' and Compare

by Angela Knight

LibLinear and LibSVM are implementations of closely related algorithms, so their results were similar, but for this particular comparison LibLinear proved better: its high, its low, and its average were all above LibSVM's. NaiveBayes and kNN use different algorithms, which explains why their results were so different from the Lib results.

LibLiner.jpg LibSVM.jpg

The scores below are percentages of correctly classified instances; the value in parentheses is the percentage of included instances at which that score occurred.

    Classifier   High            Low             Average
    LibLinear    86.55 (at 55%)  83.95 (at 1%)   85.75
    LibSVM       85.11 (at 60%)  83.3 (at 15%)   84.52


NaiveBayes.jpg KNN.jpg

    Classifier   High            Low             Average
    NaiveBayes   83.88 (at 50%)  18.37 (at 1%)   77.15
    kNN          77.95 (at 1%)   67.1 (at 55%)   68.83

Add to the Special Characters and Trivial Word Lists as well as fix their Functions

by Megha Srivastava

The special character file in the Wings WordBags workflow, which lists the special characters to be removed from the input files because they are irrelevant, only included 10 special characters. I added to it by looking up more special characters. The trivial words file was also modified to include more trivial English words (like might, maybe, is, are, etc.) to make the program more accurate.


Special Characters: Each character is surrounded by .* because that is how the workflow processes the characters. Some are preceded by a backslash; I looked up on the internet which special characters require a backslash before them so that they can be processed (a small illustration of why appears after the character list). Below are the characters that I added:


.*‘.* .*’.* .*‚.* .*“.* .*”.* .*„.* .*†.* .*‡.* .*‰.* .*‹.* .*›.* .*♠.* .*♣.* .*♥.* .*♦.* .*‾.* .*←.* .*↑.* .*→.* .*↓.* .*™.* .*!.* .*“.* .*#.* .*$.* .*%.* .*&.* .*‘.* .*\(.* .*\).* .**.* .*+.* .*,.* .*-.* .*..* .*/.* .*:.* .*;.* .*<.* .*=.* .*>.* .*?.* .*@.* .*\[.* .*\.* .*\].* .*_.* .*`.* .*{.* .*|.* .*}.* .*~.* .*–.* .*—.* .*¡.* .*¢.* .*£.* .*¤.* .*¥.* .*¦.* .*§.* .*¨.* .*©.* .*ª.* .*«.* .*¬.* .*®.* .*¯.* .*°.* .*±.* .*².* .*³.* .*´.* .*µ.* .*¶.* .*·.* .*¸.* .*¹.* .*º.* .*».* .*¼.* .*½.* .*¾.* .*¿.* .*À.* .*Á.* .*Â.* .*Ã.* .*Ä.* .*Å.* .*Æ.* .*Ç.* .*È.* .*É.* .*Ê.* .*Ë.* .*Ì.* .*Í.* .*Î.* .*Ï.* .*Ð.* .*Ñ.* .*Ò.* .*Ó.* .*Ô.* .*Õ.* .*Ö.* .*×.* .*Ø.* .*Ù.* .*Ú.* .*Û.* .*Ü.* .*Ý.* .*Þ.* .*ß.* .*à.* .*á.* .*â.* .*ã.* .*ä.* .*å.* .*æ.* .*ç.* .*è.* .*é.* .*ê.* .*ë.* .*ì.* .*í.* .*î.* .*ï.* .*ð.* .*ñ.* .*ò.* .*ó.* .*ô.* .*õ.* .*ö.* .*÷.* .*ø.* .*ù.* .*ú.* .*û.* .*ü.* .*ý.* .*þ.* .*ÿ.*
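
As a small illustration of why some characters need the backslash (this example is ours, written in Java even though the workflow component is in Python, and it is not part of the workflow): in regular-expression syntax, '(' opens a group, so the unescaped pattern is not valid, while the escaped pattern simply matches any string containing the character.

    public class EscapeExample {
        public static void main(String[] args) {
            // ".*\\(.*" is the regex .*\(.* and matches any string that contains "(".
            System.out.println("a (small) test".matches(".*\\(.*"));   // true
            System.out.println("no parenthesis".matches(".*\\(.*"));   // false

            // The unescaped pattern ".*(.*" would throw a PatternSyntaxException,
            // because "(" opens a group that is never closed.
        }
    }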


English Trivial Words: Sorted alphabetically, I added a considerable number of words using internet and Microsoft Word resources and by looking at sample sentences to see which words don't mean much but are commonly used. I also added each letter by itself, in the hope that the workflow would remove single standing letters, since the parameter program did not seem to work. Here is the list of words:


a able about across after all almost also am among appear an and any are as at b be because being become became been but by c can cannot could come d dear did do does e either else ever every each f for from g get got h had has have he her hers him his how however i if in into is it its, j just k l least let like likely m many may me might most must my n neither no nor not o of off often on only or other our own p part r remain rather s said say says seem shall she should since so some t than that the their them then there these they this tis to too u us v w was wants was we were what when where which while who whom why will with would x y yet you your z


In the old workflow, the program that removes trivial words was designed to remove the whole line whenever a trivial word was found. By that logic, if a line consisted of an important sentence, the entire sentence would disappear if it contained the word "the". This skewed the results greatly. By ensuring that the input document had one word per line, the problem was fixed, since removing a whole line then became equivalent to removing a single word.
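
A small illustration of the problem (our own example in Java, not the actual Python component): when a pattern like .*the.* is applied per line, a whole sentence disappears because it contains 'the', whereas with one word per line only the trivial word itself is lost.

    public class LineRemovalExample {
        public static void main(String[] args) {
            String pattern = ".*the.*";

            // One sentence per line: the entire sentence matches, so the whole line is dropped.
            String sentence = "the workflow removes important words";
            System.out.println(sentence.matches(pattern));  // true, so the line would be removed

            // One word per line: only the trivial word itself matches and is dropped.
            String[] words = { "the", "workflow", "removes", "important", "words" };
            for (String word : words) {
                if (!word.matches(pattern)) {
                    System.out.println(word);  // prints workflow, removes, important, words
                }
            }
        }
    }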


Adding more characters and words to ignore lessened the measured likeness between the two files. With the old workflow's compare component this produced a larger number, but with our new Compare component the number became smaller, since we report the percentage likelihood.


This is the Python code for the component that removes a pattern; it was uploaded by the creator of the workflow that we used.


As one can tell from the photo of the Python code, it has the major problem we discussed above: a whole line gets removed when a character or word is found. After discussing it, we fixed this by giving the input one word per line.


Another way I could have changed this program is that, instead of having a file of special characters, we could simply check that only the letters a-z and A-Z were used. I used this method in a later assignment (Detecting Twitter Trends with Text Analytics Workflows), and it is definitely much simpler than having another input file. Also, by finding what was wrong with the parameter program and fixing it, we could ensure that many trivial words would be removed, since they are composed of few letters.



A comparison of the words left when using different files for the list of trivial English words.


I was able to catch the program's mistake of deleting a whole line whenever a trivial word or character was found by running the same data and parameter setting with both trivial word lists. Many words that should not have been deleted were, as can be seen in the picture.


Both texts in the picture show the list of words left in the Los Angeles file for comparison, after going through all but the last step of the workflow. The one on the left is the text remaining after using my trivial word file, while the one on the right is the text remaining after using the creator's word file. As you can see, words like "American" were deleted; after checking, I found that a trivial word like "the", "many", or "an" must have been on the same line, and so "American" was deleted along with it.

Day 4/5: Detecting Twitter Trends with Text Analytics Workflows

We each worked on a different part of preparing the Twitter file for the workflow.


Angela

My goal was to separate the English tweets from the Spanish, French, and Dutch tweets by creating a list of stop words (and, a, them, etc.) in those languages. After I created that list, I wrote a program that reads the tweets line by line; if any word from the foreign-language stop-word list appears in a line (a tweet), that line is flagged with a Boolean English = false. At the end of the while-loop, the program creates a .txt file containing only the tweets where English = true.
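
The program itself is not reproduced in the report; the sketch below follows the approach described above (the file names and the tiny stop-word list are placeholders, not the real internship data).

    import java.io.*;
    import java.util.*;

    // Sketch of the approach described above; file names and the stop-word list
    // are placeholders, not the real data.
    public class EnglishFilter {
        public static void main(String[] args) throws IOException {
            Set<String> foreignStopWords = new HashSet<String>(
                    Arrays.asList("el", "la", "los", "le", "les", "une", "het", "een"));

            BufferedReader reader = new BufferedReader(new FileReader("tweets.txt"));
            PrintWriter writer = new PrintWriter(new FileWriter("english_tweets.txt"));
            String line;
            while ((line = reader.readLine()) != null) {
                boolean english = true;
                // If any foreign stop word appears in the tweet, flag it as not English.
                for (String word : line.toLowerCase().split("\\s+")) {
                    if (foreignStopWords.contains(word)) {
                        english = false;
                        break;
                    }
                }
                if (english) {
                    writer.println(line);  // keep only the tweets flagged as English
                }
            }
            reader.close();
            writer.close();
        }
    }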


Andrew

I did the second part of the project, which was simply to remove the HTML tags (<a>...</a>) that would obscure the data. Then, with Matheus' help, I optimized the program, ultimately giving it the ability to process the entire dataset (approx. 252,000 tweets) in a matter of seconds. Here is the code:
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;

    public class htmlToText {
        public static void main(String[] args) {
            File file = new File("/Users/internship/Desktop/twitter_haiti.txt");
            String line;
            try {
                BufferedReader reader = new BufferedReader(new FileReader(file));
                while ((line = reader.readLine()) != null) {
                    // Strip <a ...>...</a> tags, and everything between them, from the tweet.
                    if (line.contains("</a>")) {
                        line = line.replaceAll("<a.*>.*</a>", "");
                    }
                    System.out.println(line);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Megha

I did the last step, which was writing a program to remove special characters, urls, numbers, etc. It generates a file with the date, a space, the Twitter user id, another space, and the "cleaned out" tweet (letters only), all on a single line. This finalizes the data so that it can go through the workflow.


While writing this program with Matheus' help, I learned about De Morgan's Laws for AND and OR. De Morgan's Laws state that, given two operands A and B, (NOT A) OR (NOT B) = NOT (A AND B). In Java this is the same as: !A || !B = !(A && B). Learning this was helpful when writing "if..else" statements.
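
A tiny illustration of the rule in Java (the two flags are a made-up example, not taken from the actual program):

    public class DeMorganExample {
        public static void main(String[] args) {
            boolean isLetterOnly = false;
            boolean isNonEmpty = true;

            // De Morgan's Law: !A || !B is the same as !(A && B).
            boolean skip1 = !isLetterOnly || !isNonEmpty;
            boolean skip2 = !(isLetterOnly && isNonEmpty);

            System.out.println(skip1 == skip2);  // always prints true, for any A and B
        }
    }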


This is the program:


    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;

    public class Match {
        public static void main(String[] args) {
            File t = new File("/Users/internship/Desktop/twitter_haiti.txt");
            try {
                String line = " ";
                BufferedReader r = new BufferedReader(new FileReader(t));
                while ((line = r.readLine()) != null) {
                    // Collapse runs of spaces into a single space.
                    line = line.replaceAll("[ ]+", " ");
                    String text = " ";
                    // Each line is tab-separated: the date, the user id, then the tweet text.
                    String[] lineSplit = line.split("\t");
                    int length = lineSplit.length - 2;
                    for (int i = 0; i < length; i++) {
                        text = text + lineSplit[i + 2];
                    }
                    String[] textSplit = text.split(" ");
                    // If there is any tweet text, print the date and the user id first.
                    if (length != 0) {
                        if (!text.equals("[ ]+")) {
                            System.out.print(lineSplit[0] + "\t" + lineSplit[1] + "\t");
                        }
                    }
                    // Keep only tokens made of letters and hyphens; this drops urls,
                    // numbers, and tokens containing special characters.
                    for (int j = 0; j < textSplit.length; j++) {
                        if (textSplit[j].matches("[-a-zA-Z]*")) {
                            text = textSplit[j];
                            text = text.replaceAll("[ ]+", "");
                            if (!(text.equals(""))) {
                                System.out.print(text + " ");
                            }
                        }
                    }
                    System.out.print("\n");
                }
                r.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

After Running the Twitter Dataset Through the Workflow

Angela and Megha ran the tweets through the three programs and then through the workflow (see below) on the last day, with Matheus' help. The tweets first went through Angela's program, which removed the foreign-language tweets. After saving the result into a document, we ran it through Andrew's program, which removed the HTML tags. The resulting output then went through Megha's code, which removed all the special characters and numbers from the tweets. The output of her program finally went through the workflow, and we looked through the output files; one of them was the Topic Distribution Plot displayed below.

This is the workflow to get the topics.
This is the workflow to generate plots.
This is the topic distribution plot.