Filtering Out Noise - ShadowDragon.io

Filtering Out Noise


When starting a project in OIMonitor, you are tempted to cast a wide net and grab from every source. This will bring in a lot of hits, and you may even think, “Wow, this is awesome! I am awesome! My setup worked!” These would all be lies. Take, for example, a project I recently set up to track mentions of infectious diseases across what I consider to be “news” sources: RSS, 4Chan, 8Chan, forums, Reddit, etc. I set up an alert, a pretty easy one, with all of my subjects (entities), the sources above, and the words “infect” and “outbreak”. I did not add “infection” or “infected” because OIMonitor automatically stems the words it sees (takes off the -ed, -ing, and -ion endings) to make searching more intuitive.
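To make the stemming idea concrete, here is a minimal sketch in Python. It is not OIMonitor's implementation, which is not shown here; the names `naive_stem` and `matches_keyword` are hypothetical, and the only rule assumed is the one described above: strip a trailing -ed, -ing, or -ion before comparing.

```python
import re

# The three endings named in the text; a real stemmer handles far more.
SUFFIXES = ("ion", "ing", "ed")


def naive_stem(word: str) -> str:
    """Lowercase a word and strip one known suffix, keeping a stem of 3+ letters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


def matches_keyword(text: str, keyword: str) -> bool:
    """Return True if any word in the text stems to the given keyword."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(naive_stem(word) == keyword for word in words)


if __name__ == "__main__":
    for word in ("infected", "infection", "infecting", "outbreak"):
        print(word, "->", naive_stem(word))
```

Under these assumptions the single keyword “infect” covers “infected”, “infection”, and “infecting”, which is why the alert only needs the root word.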

Figure: Creating an Alert in OIMonitor

I set up my entities to be the names of various infectious diseases collected from Wikipedia (Influenza, AIDS, SARS, Pneumonia, Plague, etc.), with the common names for each disease as their aliases.

Figure: A list of infectious diseases as OIMonitor Entities

All that was left was to watch the hits come in, and boy did they! I got 120 hits a day for the first couple of days. But after looking over the project, it seemed that the subjects had not been set, so the alert was collecting anything containing “infect” or “outbreak”, whether or not it mentioned a disease. Easy enough fix: just save the alert with the Subjects set to “ALL”. That caused a decrease in the number of hits, woohoo!
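The drop in hits is easy to see with a small model of the matching logic. The sketch below is an assumption about how such an alert behaves, not OIMonitor's actual code: the `Alert` class, its `require_subject` flag, and the sample artifacts are all invented for illustration, and setting Subjects to “ALL” is assumed to mean that a hit must mention at least one entity in addition to a keyword.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical stand-in for an alert: keywords plus optional subjects."""

    keywords: list[str]
    subjects: list[str] = field(default_factory=list)
    require_subject: bool = False

    def matches(self, text: str) -> bool:
        lowered = text.lower()
        has_keyword = any(k in lowered for k in self.keywords)
        if not self.require_subject:
            # Subjects not set: any keyword match counts as a hit.
            return has_keyword
        has_subject = any(s.lower() in lowered for s in self.subjects)
        return has_keyword and has_subject


# Invented sample artifacts, for illustration only.
ARTIFACTS = [
    "New plague outbreak reported in Madagascar",
    "Malware outbreak infects thousands of routers",
    "How to keep a cut from getting infected",
]

KEYWORDS = ["infect", "outbreak"]
DISEASES = ["Influenza", "AIDS", "SARS", "Pneumonia", "Plague"]

loose = Alert(keywords=KEYWORDS)
strict = Alert(keywords=KEYWORDS, subjects=DISEASES, require_subject=True)

loose_hits = [a for a in ARTIFACTS if loose.matches(a)]
strict_hits = [a for a in ARTIFACTS if strict.matches(a)]

if __name__ == "__main__":
    print(len(loose_hits), "hits without subjects")
    print(len(strict_hits), "hits with subjects required")
```

With the subjects required, only the artifact that names a disease survives, which mirrors the decrease seen in the project.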

Figure: The number of hits in a project over time.

But there was still some junk coming in. For days I was getting hits that matched my alert but were not truly what I was looking for. These were artifacts with phrases like “I hope you get infected with AIDS” and “We need another outbreak of plague.” I tracked down these false positives and found that they all came from two sources: 4Chan and 8Chan.

Figures: 4Chan Chatter 1 and 4Chan Chatter 2

It seems that there is a lot of chatter on those sources related to infectious diseases, but not about their spread. To cut down on this noise, I created a second alert with just the 4Chan and 8Chan chatter.

Retweets

Another source of unwanted noise in a project can come from Twitter. If an infectious disease is really running amok in the world, people are bound to talk about it on Twitter; that is why it was included as a source in the alert. However, if the outbreak is really getting big, people will retweet the original tweet to give it more visibility. While this is great, it raises a big problem for our alert: we will get the same tweet multiple times, just with RT @TwitterHandleHere at the beginning. We can get rid of these by adding an exclude keyword of type regex with the pattern “RT @.*”. This will catch any tweet that is a retweet and filter it out of the results.
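The effect of that exclude pattern can be checked outside OIMonitor with a few lines of Python. This is only a sketch: it assumes the pattern is applied as an unanchored search over the tweet text, which may not be how OIMonitor evaluates it, and the function name `drop_retweets` and the sample tweets are made up for the example.

```python
import re

# The exclude pattern from the text, applied as an unanchored search (assumption).
RETWEET = re.compile(r"RT @.*")

# A stricter variant (not from the original setup) that only matches
# when the tweet actually starts with "RT @handle".
ANCHORED_RETWEET = re.compile(r"^RT @\w+")


def drop_retweets(tweets, pattern=RETWEET):
    """Return only the tweets that the exclude pattern does not match."""
    return [tweet for tweet in tweets if not pattern.search(tweet)]


# Invented sample tweets, for illustration only.
TWEETS = [
    "Plague outbreak confirmed in the region",
    "RT @TwitterHandleHere Plague outbreak confirmed in the region",
    "Everyone keeps posting RT @TwitterHandleHere about the outbreak",
]

if __name__ == "__main__":
    print(drop_retweets(TWEETS))
    print(drop_retweets(TWEETS, ANCHORED_RETWEET))
```

Note that the unanchored pattern also drops the third tweet, where “RT @” appears mid-sentence. If that turns out to be too aggressive, anchoring the pattern to the start of the text is one option to consider.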

Brooks M
