PolyU X RADICA BigDatathon 2018 - UBS Challenge (26-27 May 2018)


Over 6,000 Weibo posts (unstructured data), extracted from 8 KOLs (key opinion leaders) between 1 Feb and 30 Apr 2018, were provided. These posts first had to be grouped into categories of interest for easier analysis.
Challengers are required to:
A) Classify these Weibo posts into 13 categories of interest according to the category IDs below:
0. Stock
1. Bond
2. Oil
3. Gold
4. Real Estate
5. Chinese Art (painting/ drawing/ calligraphy)
6. Western Art (painting/ drawing/ calligraphy)
7. Jewelry
8. Artifacts
9. Golf
10. Car
11. Overseas Education
12. Young Children Education
B) Rank the 13 interest categories (a) within each city and (b) across the cities by popularity, based on each post's likes, retweets and comments on the social network and the city of its commentators. The most popular interest should come first.
Analysis & Solutions:
Data Preprocessing:
- Remove Duplication of Data
- 7,655 posts were provided by the organizer. However, some records were duplicated or had missing values. Because time was limited, low-quality records were deleted to obtain results quickly.
- Text Preprocessing
- Stopword removal: words or phrases that carry little meaning, such as "but" and "and", were removed.
- Punctuation, white space and other special characters were removed.
- Sentences were tokenized into sequential phrases (bi-grams, tri-grams, ...). This retained the order of the words, which is important for interpreting the complete meaning of a sentence.
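The preprocessing steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the code used in the challenge: the stopword list, the function names (`deduplicate`, `tokenize`, `ngrams`) and the English sample post are all hypothetical. Real Weibo posts are in Chinese and would need a Chinese word segmenter before the n-gram step.

```python
import re

# Hypothetical stopword list for illustration; the list actually used is not given.
STOPWORDS = {"but", "and", "the", "a", "is"}


def deduplicate(posts):
    """Drop missing/empty posts and exact duplicates, keeping first occurrences in order."""
    seen = set()
    unique = []
    for post in posts:
        if post is None:
            continue
        text = post.strip()
        if not text or text in seen:
            continue
        seen.add(text)
        unique.append(text)
    return unique


def tokenize(text):
    """Lowercase, replace punctuation/special characters with spaces, split, drop stopwords."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def ngrams(tokens, n):
    """Return consecutive n-token phrases so that word order is preserved."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy input: one duplicate, one missing value, one blank post.
posts = ["Gold prices rise, but oil is flat!", "Gold prices rise, but oil is flat!", None, "  "]
unique_posts = deduplicate(posts)
tokens = tokenize(unique_posts[0])
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

Note that removing stopwords before building n-grams means a bi-gram can span a removed word ("rise oil" in the toy example); whether that is desirable depends on the data.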
Data Modelling:
- Data were split into a training set and a test set.
- The organizer provided training labels for a sample of 10% of the original dataset.
- Some categories had very few examples in the training dataset.
- Data manipulation was done to get a more balanced training set. The data quality had to be checked carefully beforehand to ensure that the labels of the training sample were mostly correct.
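The post does not say which kind of data manipulation was used to balance the training set. One common option is random oversampling, sketched below purely as an assumption: examples from rare categories are duplicated at random until every category has as many examples as the largest one. The function name `oversample` and the toy posts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample(samples, seed=0):
    """Randomly duplicate minority-class samples until all classes match the largest one.

    `samples` is a list of (text, category_id) pairs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies with replacement; k is 0 for the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy labelled sample: category 0 (Stock) dominates, 9 (Golf) and 3 (Gold) are rare.
train = [
    ("post a", 0), ("post b", 0), ("post c", 0), ("post d", 0),
    ("post e", 9),
    ("post f", 3), ("post g", 3),
]
balanced = oversample(train)
counts = Counter(label for _, label in balanced)
print(counts)
```

If this approach is used, it should be applied to the training portion only, after the split, so that copies of the same post do not end up in both the training and the validation data.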
- Model: 1-Dimensional Convolutional Neural Network (1D CNN)
- Well suited to sequential data
- Word embedding: Wikipedia
- Model Evaluation: Results (based on validation set)
- Accuracy = ~90%
- Loss = ~0.1
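To illustrate why a 1D convolution suits sequential text, the sketch below shows the core operation of one convolutional filter in plain Python: the filter slides over consecutive token embeddings, so each output depends on a short window of neighbouring words. It is a simplified forward pass only, not the model from the challenge; the function names, the 2-dimensional toy embeddings and the filter weights are all made up for illustration.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide one filter over consecutive token vectors (no padding, stride 1), then apply ReLU.

    sequence: list of token embeddings, each a list of floats
    kernel:   list of `width` weight vectors, each as long as an embedding
    """
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        total = bias
        for offset in range(width):
            token_vec = sequence[start + offset]
            weights = kernel[offset]
            total += sum(x * w for x, w in zip(token_vec, weights))
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


def global_max_pool(feature_map):
    """Keep the strongest response of the filter anywhere in the post."""
    return max(feature_map)


# Toy post of 4 tokens with 2-dimensional embeddings (hypothetical values).
post = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter of width 2, i.e. it looks at two consecutive tokens at a time.
kernel = [[1.0, 0.0], [0.0, 1.0]]

feature_map = conv1d(post, kernel)
pooled = global_max_pool(feature_map)
print(feature_map, pooled)
```

A full classifier would learn many such filters, pool each one, and feed the pooled values into a final layer with one output per category.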
For more details: https://github.com/cydalytics/Weibo_Posts_Topic_Classification
Dashboard (Power BI):
*Some of the visuals are not publicly accessible.
https://app.powerbi.com/view?r=eyJrIjoiZWYwYjNhYmMtMmZmYy00ZDI5LWIxMjItZDFiZmI2MjczMzU0IiwidCI6IjI1MDczMGZkLWM3MjYtNDBlZS05OTEyLTQwNjcyMWRjNDg2YSIsImMiOjEwfQ%3D%3D
Demonstration:
^ This page is an overview of the words or phrases used by the given KOLs. You can use the filter to see which words the KOLs use most and which phrases usually receive more likes, comments or retweets.
^ This page is an overview of the category analysis. We proposed a special formula for the score: 1 mark for each like, 3 marks for each comment and 6 marks for each retweet. Thus, if a Weibo user gave a like, a comment and a retweet to a particular post, this counted as 10 marks.
Across China, Overseas Education received the highest score among all the categories. This suggests that it was the category Chinese Weibo users cared about most.
When a filter is applied (in the picture, Sichuan is selected), the composition of the score is shown on the left-hand side. The distribution shows that "Golf" was a more popular topic among Sichuan's users.
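The scoring formula and the two rankings (across all regions and within each region) can be sketched as follows. The weights come from the formula above; everything else is an assumption for illustration: the record layout, the function names, and the toy numbers, which use placeholder regions and are not results from the challenge. The sketch also assumes interactions have already been attributed to a region.

```python
from collections import defaultdict


def post_score(likes, comments, retweets):
    """1 mark per like, 3 marks per comment, 6 marks per retweet."""
    return 1 * likes + 3 * comments + 6 * retweets


def rank_categories(records):
    """Return (overall ranking, ranking per region), most popular category first."""
    overall = defaultdict(int)
    per_region = defaultdict(lambda: defaultdict(int))
    for rec in records:
        score = post_score(rec["likes"], rec["comments"], rec["retweets"])
        overall[rec["category"]] += score
        per_region[rec["region"]][rec["category"]] += score

    def ranked(totals):
        # Highest score first; ties are broken alphabetically for a stable order.
        return sorted(totals, key=lambda cat: (-totals[cat], cat))

    return ranked(overall), {region: ranked(t) for region, t in per_region.items()}


# Toy interaction counts (made up), already attributed to a region.
records = [
    {"region": "Region A", "category": "Golf", "likes": 10, "comments": 2, "retweets": 1},
    {"region": "Region A", "category": "Stock", "likes": 5, "comments": 1, "retweets": 0},
    {"region": "Region B", "category": "Stock", "likes": 20, "comments": 5, "retweets": 2},
    {"region": "Region B", "category": "Golf", "likes": 3, "comments": 0, "retweets": 0},
]
overall_rank, region_rank = rank_categories(records)
print(overall_rank)
print(region_rank)
```

In the toy data, "Golf" ranks first in Region A even though "Stock" leads overall, which is the same kind of regional difference the dashboard filter is meant to reveal.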
^ This page allows you to compare categories, regions or cities, or any other combination of filters.
Best Presentation
