What would be your list of essential text functions?
Well, here’s what I’m thinking:
- Regular expression replace (nice for cleaning)
- Regular expression match (nice for filtering out garbage)
- Regular expression split (very useful)
- word distance (i.e., Levenshtein edit distance)
- n-grams (word or character; a regex split would give the same thing, but a dedicated function might make it simpler)
- sentence splitting (similar story to n-grams)
- Snowball stemmer in its various language flavors (not perfect, but simple)
- WordNet parser (the confidence level that each word in a phrase is a given part of speech)
- HTML parser (Beautiful Soup seems to have a heavy fan following, but I don’t know how it really compares to Tidy-style tools)
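To make the first few items concrete, here is a rough sketch of the regex trio, sentence splitting, n-grams, and Levenshtein distance using only the Python standard library. All function names are my own placeholders, the "garbage" rule and the sentence-boundary pattern are deliberately naive assumptions, and the stemmer, WordNet, and HTML items are left out because they need third-party libraries.

```python
import re

def clean(text):
    """Regex replace: collapse runs of whitespace and strip the ends."""
    return re.sub(r"\s+", " ", text).strip()

def looks_like_garbage(token):
    """Regex match: flag tokens with no ASCII letters or digits (assumed rule)."""
    return re.fullmatch(r"[^A-Za-z0-9]+", token) is not None

def split_words(text):
    """Regex split: break on any run of non-word characters."""
    return [w for w in re.split(r"\W+", text) if w]

def split_sentences(text):
    """Naive sentence split: break on whitespace after ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def ngrams(items, n):
    """Sliding n-grams over any sequence (a word list or a string)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

Passing a list of words to `ngrams` gives word n-grams and passing a string gives character n-grams, which is roughly why I think one small helper might be simpler than a regex split for each case.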
Beyond the basics, these are supplementary functions I’ve seen pop up in different projects over the last couple of months:
- Shingle and hash for potential duplication identification
- Conditional random fields for attribute extraction
- Miscellaneous scoring: ratio of the target word's count to its count in the corpus, the target word's position relative to the beginning/end, the target word's count, etc.
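For the shingle-and-hash and scoring items, a minimal sketch could look like the following. The names, the 3-word shingle size, the SHA-256 truncation to 64 bits, and the exact definition of each score are all my assumptions rather than anything taken from those projects. Conditional random fields are left out since they need a dedicated library.

```python
import hashlib
import re

def shingles(text, k=3):
    """Set of k-word shingles from lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def hashed_shingles(text, k=3):
    """Hash each shingle to a 64-bit integer so the sets stay compact."""
    return {
        int.from_bytes(hashlib.sha256(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text, k)
    }

def jaccard(a, b):
    """Set overlap in [0, 1]; values near 1 suggest near-duplicates."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def target_word_scores(target, words, corpus_words):
    """Simple per-document scores for one target word (assumed definitions)."""
    positions = [i for i, w in enumerate(words) if w == target]
    count = len(positions)
    corpus_count = sum(1 for w in corpus_words if w == target)
    return {
        "count": count,
        # Share of the corpus-wide occurrences that fall in this document.
        "corpus_ratio": count / corpus_count if corpus_count else 0.0,
        # Distance of the first hit from the start, last hit from the end.
        "from_start": positions[0] if positions else None,
        "from_end": len(words) - 1 - positions[-1] if positions else None,
    }
```

Comparing `jaccard(hashed_shingles(a), hashed_shingles(b))` against a threshold is the basic idea. For large collections you would probably want MinHash or similar on top so you are not comparing every pair of documents.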