What would be your list of essential text functions?

Well, here’s what I’m thinking:

  1. Regular expression replace (nice for cleaning)
  2. Regular expression match (nice for filtering out garbage)
  3. Regular expression split (very useful)
  4. Word difference (i.e. Levenshtein edit distance)
  5. N-grams (word or character; a regex split would give the same thing, but a dedicated function might make it simpler)
  6. Sentence splitting (similar story to n-grams)
  7. Various language flavors of the Snowball stemmer (not perfect, but simple)
  8. WordNet parser (the confidence level that the words in a phrase are particular language component types, e.g. parts of speech)
  9. HTML parser (there seems to be heavy fanboy mania for Beautiful Soup, but I don’t know how it really compares to Tidy-style tools)
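
To make the first few items concrete, here is a rough Python sketch using only the standard library. The function names (clean, looks_like_garbage, tokenize, levenshtein, ngrams) and the particular patterns are made up for illustration; what counts as "garbage" or a token boundary will depend on your data.

```python
import re


def clean(text):
    """Regex replace: collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def looks_like_garbage(text):
    """Regex match: flag strings with no ASCII letters (an assumed rule)."""
    return re.search(r"[A-Za-z]", text) is None


def tokenize(text):
    """Regex split: break on runs of non-word characters, dropping empties."""
    return [t for t in re.split(r"\W+", text) if t]


def levenshtein(a, b):
    """Edit distance where insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def ngrams(items, n):
    """Sliding n-grams over any sequence: a token list or a plain string."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]
```

Passing a token list to ngrams gives word n-grams, and passing a string gives character n-grams, which is why a separate function can be simpler than doing everything through regex split.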

Beyond the basics, these are supplementary functions I’ve seen pop up in different projects over the last couple of months:

  1. Shingling and hashing to identify potential duplicates
  2. Conditional random fields for attribute extraction
  3. Miscellaneous scoring on: ratio of the target word to the corpus, position of the target word relative to the beginning/end, target word count, etc.
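
For the first and third of these, a minimal sketch might look like the following. It is one possible reading, not a reference implementation: the names (shingles, hashed_shingles, jaccard, score_target), the 3-word shingle size, the 64-bit BLAKE2b hash and the exact scoring features are all assumptions. Conditional random fields are left out because they need a trained model and a third-party library.

```python
import hashlib
import re


def shingles(text, k=3):
    """Set of overlapping k-word shingles from lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hashed_shingles(text, k=3):
    """Hash each shingle down to a 64-bit integer for compact comparison."""
    return {
        int.from_bytes(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest(), "big"
        )
        for s in shingles(text, k)
    }


def jaccard(a, b):
    """Set overlap in [0, 1]; higher suggests a likely duplicate."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_target(target, doc_tokens, corpus_tokens):
    """A few simple features for one target word (assumed definitions)."""
    positions = [i for i, t in enumerate(doc_tokens) if t == target]
    return {
        # how often the word occurs in this document
        "count": len(positions),
        # share of all corpus tokens that are the target word
        "corpus_ratio": (corpus_tokens.count(target) / len(corpus_tokens)
                         if corpus_tokens else 0.0),
        # tokens between the document start and the first occurrence
        "first_from_start": positions[0] if positions else None,
        # tokens between the last occurrence and the document end
        "last_from_end": (len(doc_tokens) - 1 - positions[-1]
                          if positions else None),
    }
```

Two documents whose hashed shingle sets have a Jaccard score above some threshold you pick can then be flagged for a closer duplicate check.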