Pure Python utility for phone number extraction
Phone Number Matcher
phone-number-matcher is a Python utility that extracts phone numbers from URLs and text for the DIG project. Our DIG project needs an extractor that is compatible with Spark and that favors precision over recall, so a phone number validator based on Google's libphonenumber is added at the end of the extraction process.
It contains five components:
- Preprocessor: remove digits that cannot belong to a phone number
- Tokenizer: remove punctuation and tokenize the original content
- Cleaner: correct misspelled number words and replace number words with digits
- Extractor: extract candidate phone numbers
- Validator: check whether each extracted phone number is valid
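The five components can be pictured as a chain of small functions. The following is only a minimal sketch of that idea, not the library's actual code: every function name, the word tables, and the date rule in `preprocess` are assumptions made for illustration, and `validate` is a simplified US-style check standing in for libphonenumber.

```python
import re

# Hypothetical lookup tables; the library's real vocabulary is not shown here.
NUMBER_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
MISSPELLINGS = {"fuor": "four", "thre": "three", "sevn": "seven", "nien": "nine"}


def preprocess(text):
    """Drop digit runs that cannot be phone numbers (assumed rule: ISO dates)."""
    return re.sub(r"\b\d{4}-\d{2}-\d{2}\b", " ", text)


def tokenize(text):
    """Replace punctuation with spaces, lowercase, and split into tokens."""
    return re.sub(r"[^0-9a-zA-Z]+", " ", text).lower().split()


def clean(tokens):
    """Fix misspelled number words, then map number words to digits."""
    fixed = (MISSPELLINGS.get(tok, tok) for tok in tokens)
    return [NUMBER_WORDS.get(tok, tok) for tok in fixed]


def extract(tokens):
    """Join runs of adjacent all-digit tokens into candidate digit strings."""
    candidates, current = [], ""
    for tok in tokens:
        if tok.isdigit():
            current += tok
        else:
            if current:
                candidates.append(current)
            current = ""
    if current:
        candidates.append(current)
    return candidates


def validate(candidate):
    """Simplified stand-in for libphonenumber: 10-digit US-style numbers only."""
    if len(candidate) == 11 and candidate.startswith("1"):
        candidate = candidate[1:]
    return (
        len(candidate) == 10
        and candidate[0] in "23456789"
        and candidate[3] in "23456789"
    )


def match_phone_numbers(text):
    """Run the five stages in order and keep only validated candidates."""
    tokens = clean(tokenize(preprocess(text)))
    return [c for c in extract(tokens) if validate(c)]
```

Because validation runs last, a digit run that survives the earlier stages but does not look like a phone number is dropped, which is where the preference for precision over recall shows up.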
This library can be used in a Spark environment, where it can process around 55 GB of data in about 25 minutes.
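One common way to run a pure-Python extractor under Spark is to apply it per partition. The sketch below shows only the partition-level function, which can be exercised without a cluster. `extract_partition` and `toy_extractor` are hypothetical names, and the `(doc_id, text)` record layout is an assumption, not the library's documented input format.

```python
import re


def toy_extractor(text):
    """Hypothetical stand-in for the library's extractor: keep 10-digit results."""
    digits = re.sub(r"\D", "", text)
    return [digits] if len(digits) == 10 else []


def extract_partition(records, extractor):
    """Apply an extractor to every (doc_id, text) record in one partition."""
    for doc_id, text in records:
        for number in extractor(text):
            yield doc_id, number


# With PySpark available, an RDD of (doc_id, text) pairs could be processed as:
#   rdd.mapPartitions(lambda part: extract_partition(part, toy_extractor))
# Locally, the same function works on any iterator of records:
sample = iter([("a", "call 213-555-0199"), ("b", "no number here")])
local_result = list(extract_partition(sample, toy_extractor))
```

Keeping the extractor a plain function with no native dependencies is what makes it straightforward to ship to Spark workers.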
Phone number formats in the original dataset are highly irregular, so punctuation cannot be relied on in a regular expression to extract useful information. (Samples of these irregular formats can be found here.) Instead, I strip punctuation and clean the data before extraction, and add an extra validation step afterward.
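To make the trade-off concrete, the sketch below compares a punctuation-dependent regular expression with a strip-then-validate approach. The sample string is made up for illustration and is not taken from the dataset. `STRICT_PATTERN`, `regex_only`, `strip_then_validate`, and `looks_valid` are hypothetical names, and `looks_valid` is a crude stand-in for the libphonenumber-based validator.

```python
import re

# Punctuation-dependent pattern: expects forms like 213-555-0199 or (213) 555-0199.
STRICT_PATTERN = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")


def regex_only(text):
    """Return matches of the strict, punctuation-aware pattern."""
    return STRICT_PATTERN.findall(text)


def looks_valid(digits):
    """Crude stand-in for real validation: 10 digits, not starting with 0 or 1."""
    return len(digits) == 10 and digits[0] not in "01"


def strip_then_validate(text):
    """Drop punctuation and whitespace, collect digit runs, keep valid ones."""
    compact = re.sub(r"[^0-9a-zA-Z]", "", text)
    return [run for run in re.findall(r"\d+", compact) if looks_valid(run)]


messy = "call 2*1*3--555..0199 now"
strict_hits = regex_only(messy)            # the strict pattern finds nothing
cleaned_hits = strip_then_validate(messy)  # the digits are recovered
```

Stripping punctuation makes matching far more permissive, which is why the validation step afterward is needed to keep precision high.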