Wrangle Report - WeRateDogs
This report summarises the data wrangling I performed to gather data from several sources, analyse it, and draw interesting insights from the WeRateDogs Twitter account.
WeRateDogs is a Twitter account that rates people’s dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because “they’re good dogs Brent.” WeRateDogs has over 4 million followers and has received international media coverage.
The goals of the project include:
- Wrangle data from multiple sources, including the Twitter API, the Udacity server, and a CSV file, through the following steps:
- Data Gathering
- Data Assessment
- Data Cleaning
- Storing and Acting on Wrangled Data
Data Gathering
In this step, data had to be gathered from three sources:
- Gathering the data from the WeRateDogs Twitter archive of 2017, which contains almost 6000 tweets. It can be found in the twitter_archive_enhanced.csv file.
- Fetching a TSV file with image predictions of the dog breed from the Udacity server via this [Link]
- Collecting the retweet and favourite data for the tweets using the Twitter API
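The three sources above could be loaded along the following lines. This is a minimal sketch, not the notebook's actual code: the function name load_sources is hypothetical, the download from the Udacity server and the Twitter API calls are left out, and the API responses are assumed to have been saved as one JSON document per line with the fields id, retweet_count and favorite_count.

```python
import io
import json

import pandas as pd


def load_sources(archive_csv, predictions_tsv, tweet_json_lines):
    """Load the three raw sources into DataFrames.

    Each argument may be a file path or an already opened file-like object.
    """
    archive = pd.read_csv(archive_csv)
    # The image predictions are tab-separated, so the separator is set explicitly.
    predictions = pd.read_csv(predictions_tsv, sep="\t")

    # Assumed layout: one JSON document per line, saved from the API responses.
    if isinstance(tweet_json_lines, str):
        handle = open(tweet_json_lines, encoding="utf-8")
    else:
        handle = tweet_json_lines

    rows = []
    with handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })

    tweet_data = pd.DataFrame(
        rows, columns=["tweet_id", "retweet_count", "favorite_count"]
    )
    return archive, predictions, tweet_data
```

Keeping only the three fields needed for the analysis at load time also avoids carrying the many unused API columns into the later steps.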
Data Assessment
In this step I assessed the data for quality and tidiness. The quality assessment covers:
- Completeness
- Validity
- Accuracy
- Consistency
I have found the following issues related to the data:
Twitter Archive :
- Quality Issues
- Empty values in in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, and some missing values in expanded_urls
- The row with tweet_id 835246439529840640 has 0 as its denominator
- The timestamp column is not stored as a datetime type
- Columns not needed for analysis: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, source
- Invalid names in the name column, such as None, a, etc.
- Tidiness Issues
- The doggo, floofer, pupper, and puppo values are in separate columns instead of a single column
- A new column ‘rating’ is needed, calculated as rating_numerator/rating_denominator
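Several of the Twitter archive issues listed above can be surfaced programmatically. The sketch below is an illustration only: the helper name assess_archive is hypothetical, and it assumes the column names name, timestamp and rating_denominator. Lowercase names such as "a" are treated as suspicious on the assumption that real names start with a capital letter.

```python
import pandas as pd


def assess_archive(archive):
    """Return a few counts that point at quality issues in the archive."""
    # Some pandas versions parse the text "None" as a missing value,
    # so both representations are treated the same here.
    names = archive["name"].fillna("None")
    return {
        "missing_per_column": archive.isnull().sum().to_dict(),
        "zero_denominators": int((archive["rating_denominator"] == 0).sum()),
        "suspicious_names": int(((names == "None") | names.str.islower()).sum()),
        "timestamp_is_datetime": bool(
            pd.api.types.is_datetime64_any_dtype(archive["timestamp"])
        ),
    }
```

The returned dictionary is meant as a quick checklist; visual inspection of the rows is still needed to confirm each issue.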
Image Prediction :
- Quality Issues
- Images with no dog prediction: p1_dog, p2_dog, and p3_dog are all false in 324 rows
- Underscores are present in the dog breed prediction names
- Invalid names such as terrapin and suit, which are not dog breeds, exist
- Dog breed names are not standardized: the first letter is sometimes capitalized and sometimes lowercase
- Tidiness Issues
- Make a single column for dog breed and place the top dog breed prediction there
- Combine the image prediction and Twitter archive dataframes
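One way to build the single breed column described above is sketched here. It assumes that p1, p2 and p3 are ordered from most to least confident, that the p1_dog, p2_dog and p3_dog columns are booleans, and that rows without any dog prediction get the placeholder "Unknown"; the function name add_breed_column and the placeholder are choices made for this illustration.

```python
import pandas as pd


def add_breed_column(predictions):
    """Add a 'breed' column holding the highest-ranked prediction that is a dog."""
    out = predictions.copy()
    # Start from the placeholder and overwrite from the lowest rank upwards,
    # so that p1 wins over p2, and p2 wins over p3.
    breed = pd.Series("Unknown", index=out.index)
    for rank in (3, 2, 1):
        breed = out[f"p{rank}"].where(out[f"p{rank}_dog"], breed)
    # Standardize the names: underscores become spaces, first letter upper-case.
    out["breed"] = breed.str.replace("_", " ", regex=False).str.capitalize()
    return out
```

With this approach a non-dog top prediction such as terrapin is skipped in favour of the next prediction that is flagged as a dog.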
Tweets data from API :
- Quality Issues
- Columns which are not necessary for analysis are present
- Tidiness Issues
- Combine the tweet data and Twitter archive dataframes
Cleaning Data
In this part I cleaned the data and fixed the issues found in the assessment step. The following operations were performed:
- Remove from tweets_archive the row with tweet_id 835246439529840640, as it has 0 as its denominator.
- Change the timestamp type to datetime in tweets_archive_copy
- Change None and a in tweets_archive_copy.name to ‘Unknown’
- Drop columns which are not necessary. Namely, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp
- Remove underscores from the prediction names using the str.replace function of pandas
- Make a new column for breed that combines the results of p1, p2, p3, selecting the value according to p1_dog, p2_dog, p3_dog
- Remove unnecessary columns from the image_pred_copy using pandas drop function
- Capitalize the first letter of breed name and rest lower using pandas lower() and capitalize() functions
- Remove IDs which are not in the archive from tweet_data_copy
- Combine doggo, floofer, pupper, and puppo into a single column. The melt function cannot be used here because a dog is not guaranteed to belong to one of the types: the columns also hold None values, and after melting, the None values cannot be removed along with the other unneeded values.
- Drop the columns with doggo, floofer, pupper, puppo
- Merge the tweet archive and image_predictions using a left join
- Merge tweets_archive and tweet_data into a single table using an inner join
- Make a new column rating, calculated as rating_numerator/rating_denominator
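The stage, rating and merge steps above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: the function names are hypothetical, tweet_id is assumed to be the join key, and a tweet carrying more than one stage is joined with a comma, which is a choice made here and not something the steps above specify.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(archive):
    """Collapse the four stage columns into a single 'stage' column."""
    out = archive.copy()
    # Treat the literal "None" and real missing values alike as "no stage".
    flags = out[STAGES].fillna("None").ne("None")
    out["stage"] = flags.apply(
        lambda row: ", ".join(s for s in STAGES if row[s]) or "None", axis=1
    )
    return out.drop(columns=STAGES)


def add_rating(archive):
    """Drop zero-denominator rows, then compute numerator / denominator."""
    out = archive[archive["rating_denominator"] != 0].copy()
    out["rating"] = out["rating_numerator"] / out["rating_denominator"]
    return out


def merge_all(archive, predictions, tweet_data):
    """Left join the predictions, then inner join the API data."""
    merged = archive.merge(predictions, on="tweet_id", how="left")
    return merged.merge(tweet_data, on="tweet_id", how="inner")
```

The left join keeps archive tweets that have no image prediction, while the inner join keeps only tweets for which retweet and favourite counts were retrieved.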